Can tokenization free up more data for training AI models?

Companies expanding their AI deployments face a dilemma: how to use more of their valuable internal data to train models without compromising sensitive information.

A recent study by Capital One Software and consulting firm PwC suggests that business leaders can have their data science cake and eat it, too. The study, published on March 23, points to tokenization as an approach that preserves not only data efficacy but also privacy and security. Capital One Software is a B2B software business operating within a $53.4 billion financial services company.

Data tokenization replaces sensitive data with a token that preserves the data format. It is one of several options for protecting sensitive information; others include data masking and redaction, which hide or remove portions of the data. However, hiding critical data often prevents companies from using their most useful data to train AI models. The data is safe, but the predictive power of the model is reduced.
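The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used in the study: `tokenize`, `mask`, and `SECRET_KEY` are hypothetical names, and the keyed-hash substitution shown here is one-way and can occasionally map two inputs to the same token. Production tokenization typically relies on a token vault or a format-preserving encryption scheme so that authorized systems can recover the original value.

```python
import hashlib
import hmac

# Hypothetical demo key; a real system would fetch this from a key manager.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Swap each ASCII digit or letter for a keyed pseudorandom one.

    Length, separators, and character classes are kept, and the same
    input always yields the same token.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if ch.isascii() and ch.isdigit():
            out.append(str(b % 10))
        elif ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr(base + b % 26))
        else:
            out.append(ch)  # keep separators such as "-" unchanged
    return "".join(out)


def mask(value: str, visible: int = 4) -> str:
    """Hide every alphanumeric character except the last `visible` ones."""
    total = sum(c.isalnum() for c in value)
    seen = 0
    out = []
    for c in value:
        if c.isalnum():
            seen += 1
            out.append(c if seen > total - visible else "*")
        else:
            out.append(c)
    return "".join(out)


if __name__ == "__main__":
    print(tokenize("123-45-6789"))  # same NNN-NN-NNNN shape, different digits
    print(mask("123-45-6789"))      # ***-**-6789
    print(mask("987-65-6789"))      # ***-**-6789
```

In this sketch the two masked values come out identical, so a model can no longer tell the two records apart. The tokenized values keep their shape and stay consistent across records, which is the property that lets joins and per-entity features survive.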

Against this backdrop, Capital One Software and PwC compared models trained on masked and tokenized data sets against a baseline trained on the original plaintext data. The study found that models trained on tokenized data retained 99.7% of the baseline's predictive performance. Furthermore, the model trained on the tokenized data was almost twice as accurate as the model trained on the masked data set.

Vince Goveas, director of product management at Capital One Software, said the researchers had a hunch that tokenization would improve performance, but the results were still surprising.

Comparing tokenized data with masked data, he said, “We were expecting an improvement, but we didn’t expect this level of improvement.”