Researchers use machine learning to enhance data search



If you’ve ever shopped online, searched a database, compressed a file, or signed a digital document, you’ve probably relied on something called hashing. Hashing underpins a wide range of applications, including blockchain technology, cryptography, and image processing.

An international team of researchers from MIT, Harvard University, and the Technical University of Munich (TUM) has found a new way to retrieve data from large databases by using machine learning to speed up hash functions. Specifically, they found a way to avoid the data-retrieval slowdowns caused by so-called hash collisions.


To retrieve data from a large database, a hash function mathematically transforms a key — a specific piece of data such as a string — into a compressed representative value called a hash value. The hash value also determines where the data is stored, acting as a code that points to that location.

However, this system has limitations. Traditional hash functions assign data to locations effectively at random, and sometimes a hash function produces the same hash value for different pieces of data, causing a collision. Collisions often mean slower performance, because it takes longer to retrieve the data you need.
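A toy example shows how collisions arise. This sketch (the table size and key names are invented for illustration, not from the paper) hashes 20 keys into an 8-slot table; by the pigeonhole principle, at least 12 keys must share a slot with an earlier one, and each shared slot means a longer lookup:

```python
# Illustrative sketch: a tiny hash table where keys collide.
# The table size and key names are made up for demonstration.

TABLE_SIZE = 8

def bucket(key: str) -> int:
    """Map a key to a slot index using Python's built-in hash."""
    return hash(key) % TABLE_SIZE

slots = {}          # slot index -> list of keys stored there (chaining)
collisions = 0
for key in [f"user-{i}" for i in range(20)]:
    b = bucket(key)
    if b in slots:
        collisions += 1      # another key already occupies this slot
    slots.setdefault(b, []).append(key)

# 20 keys into 8 slots forces at least 12 collisions (pigeonhole),
# and every extra key in a slot makes lookups walk a longer chain.
assert collisions >= 20 - TABLE_SIZE
```

Resolving each collision (here, by chaining keys in a list) is exactly the extra work that slows retrieval down.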

Many techniques have been developed to handle and prevent such collisions, including a class of hash functions known as perfect hash functions. However, a perfect hash function must be built specifically for each dataset, which can significantly increase computation time.
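The idea behind a perfect hash function can be sketched by brute force: for a fixed set of keys, search for a seed under which a seeded hash maps every key to a different slot. The key set, table size, and search strategy below are invented for illustration — real perfect-hashing constructions are far more sophisticated — but the sketch shows why such a function is dataset-specific: change the keys, and the search must be redone.

```python
import hashlib

def seeded_hash(key: str, seed: int, m: int) -> int:
    """Deterministic seeded hash of a string into m slots."""
    data = seed.to_bytes(8, "little") + key.encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % m

def find_perfect_seed(keys, m: int) -> int:
    """Search for a seed under which no two keys collide."""
    for seed in range(1_000_000):
        if len({seeded_hash(k, seed, m) for k in keys}) == len(keys):
            return seed
    raise RuntimeError("no collision-free seed found")

keys = ["apple", "pear", "plum", "cherry", "fig"]
seed = find_perfect_seed(keys, m=8)

# Collision-free for THIS key set only; a new dataset needs a new search.
assert len({seeded_hash(k, seed, 8) for k in keys}) == len(keys)
```

The per-dataset construction cost is the trade-off the researchers set out to avoid.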

To address this problem, the researchers built what are known as learned models by running machine learning algorithms on sample datasets, so that a model captures specific properties of the data. They found that this AI-assisted approach improves computational efficiency and reduces the likelihood of collisions compared with traditional hashing.

“On one end, traditional hash functions are fast to compute, but suffer from collisions that can degrade query performance,” the team wrote in a paper presented at the 2023 International Conference on Very Large Data Bases (VLDB).

“Perfect hash functions, on the other hand, avoid collisions, but are difficult to build and not scalable in the sense that the size of the function representation grows with the size of the input data. Learned models could potentially offer a better trade-off.”

To create the learned model, the team used machine learning algorithms to estimate how values are distributed in a sample dataset. A data distribution describes all possible values in the dataset and how often each one appears. Knowing the shape of the distribution makes it possible to estimate where a particular value sits within the dataset, and machine learning speeds this up by predicting a key’s position more quickly.
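The core idea can be sketched with a single linear model of the cumulative distribution function (CDF): if the model predicts what fraction of keys is smaller than a given key, that fraction can be scaled to a slot index. This is a minimal illustration with invented numbers, not the team’s implementation:

```python
import random

random.seed(0)
# Sample of numeric keys drawn from a uniform distribution.
sample = sorted(random.uniform(0.0, 1000.0) for _ in range(1_000))
m = 256                              # number of table slots
lo, hi = sample[0], sample[-1]

def learned_slot(key: float) -> int:
    """Hash by estimated CDF: predicted rank fraction -> slot index."""
    cdf = (key - lo) / (hi - lo)     # linear model of P(X <= key)
    return min(m - 1, max(0, int(cdf * m)))

# Because the model tracks the data distribution, keys spread evenly
# across the slots instead of piling up, so collisions become rarer.
assert learned_slot(lo) == 0 and learned_slot(hi) == m - 1
```

A traditional hash ignores the distribution entirely; here the model itself does the placing, which is what lets a good fit reduce collisions.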

The team’s experiments demonstrated that learned models could cut the probability of hash collisions from 30% to 15% compared with traditional hash functions. Moreover, a learned model reduced computation time by about 30% and was easier to build and run than a perfect hash function.

However, there are limitations: the team also noted that a learned model can actually produce more hash collisions if the data distribution is spread too widely. In addition, the team explored how different configurations of the learned model affect performance, using varying numbers of linear submodels — as in the recursive model index and radix spline index — to approximate the data distribution. Using more of these smaller submodels improved accuracy, but also increased the time it takes to fetch data.
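A two-stage sketch in the spirit of a recursive model index (all numbers and structure here are illustrative, not the published design): a root model routes each key to one of several linear submodels, each fit to its own slice of the sorted sample. More submodels track a non-uniform distribution more closely, but every lookup then pays for the extra routing step — the accuracy/latency trade-off the team observed.

```python
import random

random.seed(1)
data = sorted(random.gauss(500, 150) for _ in range(2_000))
n, num_sub = len(data), 16
lo, hi = data[0], data[-1]

def root(key: float) -> int:
    """Root model: linear split of the key range into submodel regions."""
    i = int((key - lo) / (hi - lo) * num_sub)
    return min(num_sub - 1, max(0, i))

# Fit each submodel as a line through its slice's first and last (key, rank).
bounds = [None] * num_sub
for i, k in enumerate(data):
    s = root(k)
    if bounds[s] is None:
        bounds[s] = [k, k, i, i]
    else:
        bounds[s][1], bounds[s][3] = k, i   # extend slice end

def predict_rank(key: float) -> float:
    """Estimate a key's position via its submodel's linear fit."""
    b = bounds[root(key)]
    if b is None:
        return 0.0
    k0, k1, r0, r1 = b
    if k1 == k0:
        return float(r0)
    return r0 + (key - k0) / (k1 - k0) * (r1 - r0)

# Slice endpoints are fit exactly; interior keys land near their true rank.
assert predict_rank(data[0]) == 0 and predict_rank(data[-1]) == n - 1
```

Raising `num_sub` shrinks each slice, so the piecewise-linear fit hugs the Gaussian CDF more tightly — at the cost of more model state to traverse per query.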

“At a certain threshold of submodels, we get enough information to construct the necessary approximation for the hash function,” explained co-author and MIT CSAIL postdoctoral fellow Ibrahim Sabek in a press release.

The team hopes that research on learned models will help other experts improve hash functions for other categories of information. The team would also like to explore how learned models can be adapted to dynamic databases, where data is inserted or deleted, without compromising accuracy.

“We want to encourage the community to use machine learning within more fundamental data structures and algorithms,” said Sabek. “All sorts of core data structures offer opportunities to use machine learning to capture data properties and improve performance. There is still much more that we can explore.”




