Can tokenization free up more data for training AI models?

Companies expanding their AI deployments are faced with the dilemma of how to leverage more valuable internal data to train models without compromising sensitive information.

A recent study by Capital One Software and consulting firm PwC suggests that business leaders can have their data science cake and eat it, too. The study, published on March 23, points to tokenization as an approach that preserves not only data efficacy but also privacy and security. Capital One Software is a B2B software business operating within a $53.4 billion financial services company.

Data tokenization replaces sensitive data with a token that preserves the data format. It is one of several options for protecting sensitive information; others include data masking and redaction, which hide or remove portions of the data. However, hiding critical data often prevents companies from using their most useful data to train AI models. The data is safe, but the model’s predictive power is reduced.
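The difference between masking and format-preserving tokenization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by Databolt or any other product: the key, the function names and the sample values are all hypothetical, and the keyed-hash approach shown here is one-way and is not a standardized format-preserving encryption scheme.

```python
import hashlib
import hmac

# Hypothetical demo key; a real system would keep keys in a managed secret store.
SECRET_KEY = b"demo-key-not-for-production"


def mask(value: str, keep_last: int = 4) -> str:
    """Hide every digit except the last `keep_last`, leaving separators in place."""
    total_digits = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - keep_last else "X")
        else:
            out.append(ch)
    return "".join(out)


def tokenize(value: str) -> str:
    """Swap each digit for a keyed pseudorandom digit, keeping the layout.

    Deterministic: the same input and key always produce the same token.
    Illustrative only: one-way, slightly biased, and not collision-proof.
    """
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


# Made-up identifiers that happen to share their last four digits.
for ssn in ("123-45-6789", "987-65-6789"):
    print(ssn, "| masked:", mask(ssn), "| tokenized:", tokenize(ssn))
```

In this sketch the two masked values become identical, so a model can no longer tell the records apart, while the two tokens keep the original layout and remain different from each other.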

Against this backdrop, Capital One Software and PwC compared models trained on masked and tokenized data sets against a baseline trained on the original plaintext data. The study found that models trained on tokenized data retained 99.7% of their predictive performance relative to the baseline. Furthermore, the model trained on the tokenized data was almost twice as accurate as the model trained on the masked data set.
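A toy experiment can illustrate why a model might keep its accuracy on tokenized data and lose it on masked data. The sketch below does not reproduce the study's data, models or figures; the synthetic codes, the lookup "model" and the helper names are all assumptions made for illustration.

```python
import hashlib
import hmac
from collections import Counter, defaultdict

KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def tokenize(code: str) -> str:
    """Deterministic keyed token: distinct codes stay distinct (barring collisions)."""
    return hmac.new(KEY, code.encode(), hashlib.sha256).hexdigest()[:12]


def mask(code: str) -> str:
    """Keep only the last character, so many codes collapse into one value."""
    return "X" * (len(code) - 1) + code[-1]


def fit(rows):
    """'Train' a lookup model: remember the most common label per feature value."""
    counts = defaultdict(Counter)
    for feature, label in rows:
        counts[feature][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


def accuracy(model, rows, default=0):
    """Share of rows whose label matches the model's stored prediction."""
    hits = sum(model.get(feature, default) == label for feature, label in rows)
    return hits / len(rows)


# Synthetic data: 40 codes, each seen 4 times; the label depends on the exact code.
codes = [f"{i:06d}" for i in range(100, 140)]
rows = [(code, int(int(code) % 3 == 0)) for code in codes for _ in range(4)]
train, test = rows[0::2], rows[1::2]


def evaluate(transform):
    """Apply the same protection to train and test data, then score the model."""
    model = fit([(transform(f), y) for f, y in train])
    return accuracy(model, [(transform(f), y) for f, y in test])


acc_plain = evaluate(lambda f: f)
acc_token = evaluate(tokenize)
acc_mask = evaluate(mask)
print(f"plaintext={acc_plain:.3f} tokenized={acc_token:.3f} masked={acc_mask:.3f}")
```

Because the token mapping is deterministic and keeps distinct codes distinct, the lookup model sees the same structure it would see in plaintext. Masking merges codes with different labels into one value, which is where the accuracy goes.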

Vince Goveas, director of product management at Capital One Software, said the researchers had a hunch that tokenization would improve performance, but the results were still surprising.

“We were expecting an improvement, but we didn’t expect this level of improvement,” he said of tokenization’s advantage over masked data.

Data protection and the value of AI

For this evaluation, the researchers tokenized the data using Capital One Software’s Databolt service. According to Capital One Software, Databolt uses cryptographic algorithms to generate tokens on the fly. Launched in 2025, Databolt is powered by Capital One’s internal tokenization engine.

The study shows that the trade-off between protecting sensitive data and extracting maximum value from AI “no longer needs to exist,” said Mir Kashifuddin, PwC’s data risk and privacy practice leader.

“By using tokenization to protect sensitive information while preserving data structure, organizations can train highly effective AI models without exposing [personally identifiable information] or [protected health information],” he explained.

This research has implications for companies in regulated industries. Kashifuddin said the combination of data protection and AI performance allows businesses to “innovate with confidence” while meeting customers’ privacy, security and regulatory expectations.

The study cited healthcare, insurance and financial services as sectors where restrictions on sensitive data are hindering AI efforts.

Capital One’s AI plans and use of tokenization

Data confidentiality is a top consideration in Capital One’s own AI plans, Goveas said. “With the advent of AI, we wanted to be able to train models using a lot of internal data, because this is valuable to enterprises [making] data-driven decisions,” he said. “The biggest barrier to entry was that we didn’t want to compromise privacy and security.”

Capital One has been tokenizing data for several years. That history and the company’s model training goals drove the research project, Goveas said. The company wanted to understand how tokenized data performs during model training.

Goveas summarized the key question regarding tokenized data: “If I chunk data into a model for training, will it provide a meaningful output to the data scientist and analyst community?”

For Capital One, the research verified that models trained on tokenized internal data performed well without compromising security and privacy. Goveas pointed to another benefit: Using tokenized data speeds up data sourcing and preparation processes that previously required numerous checks and approvals.

“The time to value has been significantly reduced,” he said.

Trends in the data tokenization market

Although this research focused on training AI models, tokenization has broader applicability. Prominent examples include payment processing use cases such as e-commerce and mobile wallets.

According to The Business Research Company, the overall tokenization market is expected to reach $5.19 billion in 2026, up from $4.1 billion in 2025. The market research firm released its forecast on March 10, predicting a compound annual growth rate of 26.4%. The company reported that market growth drivers include widespread adoption of digital payment platforms, data breach incidents and rising regulatory compliance demands.

The Business Research Company predicts that the tokenization market will continue to grow at 26.3% annually, reaching $13.2 billion by 2030. Expected contributors to this phase of growth include increased adoption of zero trust security, increased interest in privacy-enhancing technologies and “broader application of tokenization beyond payment data,” the company said.
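The forecast figures can be sanity-checked with simple compounding. The short Python sketch below uses only the numbers quoted above; the function name is hypothetical, and small gaps against the published rates are to be expected because the dollar figures are rounded.

```python
def project(value: float, annual_rate: float, years: int) -> float:
    """Compound `value` forward by `annual_rate` for `years` years."""
    return value * (1 + annual_rate) ** years


# $5.19 billion in 2026, growing 26.3% a year through 2030 (four years).
projected_2030 = project(5.19, 0.263, 2030 - 2026)
print(f"Projected 2030 market: ${projected_2030:.1f} billion")

# Growth implied by the rounded 2025 and 2026 figures.
implied_growth = 5.19 / 4.1 - 1
print(f"Implied 2025-2026 growth: {implied_growth:.1%}")
```

Compounding $5.19 billion at 26.3% for four years gives roughly $13.2 billion, which matches the 2030 forecast.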

As a cybersecurity technology, tokenization falls under the data security category. But it also borders on attack surface management, in that it seeks to reduce an organization’s exposure by replacing sensitive data with tokens.

Deployment issues: infrastructure and change management

Adopting enterprise technology often hinges on infrastructure requirements and an organization’s ability to respond to change. Companies that are already training models for generative AI applications are likely to have the necessary infrastructure up and running, Goveas noted.

That said, tokenization adopters should be aware of organizational considerations. “There are aspects of change management that organizations must go through, and it starts at the top,” Goveas explained. “It has to be a leader-driven priority.” A top-down approach is essential to make privacy and security a priority for companies, rather than an afterthought, he said.

Creating a change management process starts with identifying sensitive data and determining how to classify it, Goveas said. This task also requires pinpointing the specific data elements that data science and analytics teams need, he added.

“The barrier to entry is identifying, preparing and labeling the data, [and] you classify it. Then you protect it,” he said.
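The "classify it, then protect it" sequence can be sketched as a small pipeline. Everything here is a hypothetical illustration rather than a description of Capital One's process: the field names, the classification labels, the key and the token format are assumptions.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only

# Steps 1 and 2: identify the fields and label each one.
# This classification map is a made-up example.
CLASSIFICATION = {
    "ssn": "sensitive",
    "email": "sensitive",
    "balance": "non_sensitive",
    "region": "non_sensitive",
}


def tokenize(value: str) -> str:
    """Deterministic keyed token; illustrative, one-way, not a product API."""
    return "tok_" + hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]


def protect(record: dict) -> dict:
    """Step 3: tokenize fields labeled sensitive; reject unclassified fields."""
    out = {}
    for field, value in record.items():
        label = CLASSIFICATION.get(field)
        if label is None:
            # Fail closed: data that nobody has classified is not released.
            raise KeyError(f"unclassified field: {field}")
        out[field] = tokenize(str(value)) if label == "sensitive" else value
    return out


record = {"ssn": "123-45-6789", "email": "a@example.com",
          "balance": 1250.0, "region": "NE"}
print(protect(record))
```

Rejecting unclassified fields is a design choice in this sketch: it forces the identification and labeling work to happen before any data reaches a training pipeline.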

John Moore is a freelance writer who has covered business and technology topics for 40 years. He focuses on enterprise IT strategy, AI deployment, data management and partner ecosystems.


