Is your AI benchmark lying to you?

Machine Learning


Anshul Kundaje sums up his frustration with the use of artificial intelligence in science in three words: “propagating bad benchmarks.”

Kundaje studies computational genomics at Stanford University in California. He is keen to adopt any form of artificial intelligence (AI) that will help to accelerate progress in his field, and countless researchers have stepped up to provide tools for the purpose. But with some researchers making dubious claims about the AI models they have developed, it is becoming increasingly difficult to identify the most effective ones. Those claims can take months to check, and they often turn out to be false, primarily because the benchmarks used to demonstrate and compare the performance of these tools are not fit for purpose.

By then, it's often too late. Kundaje and his colleagues are left playing whack-a-mole, debunking flawed benchmarks after they have been adopted and “improved” by enthusiastic but naive users. “In the meantime, everyone is using these [benchmarks] to make all kinds of incorrect inferences and false predictions,” he says.

This is just one reason why a growing number of scientists worry that, unless benchmarks are fundamentally improved, AI systems designed to accelerate scientific progress will have the opposite effect.

A benchmark is a test that can be used to compare the performance of different methods, much as the standard length of a meter provides a way to evaluate the accuracy of a ruler. “It's a standardization, a definition of what progress means,” says Max Welling, a machine-learning researcher and co-founder of CuspAI, an AI company based in Cambridge, UK. A good benchmark allows users to choose the best method for a particular application, and to determine whether a more traditional algorithm would produce better results. “But the first question is, what does ‘better’ mean?” Welling says.

That's a surprisingly deep question. Does “better” mean faster? Cheaper? More accurate? If you are buying a car, you might weigh up a wide range of factors, including acceleration, fuel efficiency and safety. Benchmarking AI tools is no different: in some applications, for example, speed may matter less than accuracy.

But it's even more complicated than that. If a benchmark is badly designed, the information it provides can be misleading. If the benchmark has a “leak”, meaning it relies on data that were also used to train the algorithm, then it becomes more of a memory test than a test of problem-solving ability. Or the test may simply be irrelevant to your needs: it could be overly specific, for example, and unable to address the broad range of questions that interest you.
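
To see what such a leak looks like in practice, here is a minimal sketch, in Python, of how one might check whether a benchmark's test examples also appear in an algorithm's training data. The function name and the toy DNA sequences are illustrative assumptions; they are not taken from any of the benchmarks discussed in this article.

```python
# Minimal sketch of a train/test leakage check, assuming benchmark examples
# can be compared directly (for instance, DNA sequences stored as strings).
# All names and data here are hypothetical illustrations.

def leaked_fraction(train_examples, test_examples):
    """Return the fraction of test examples that also appear in the training set."""
    train_set = set(train_examples)
    overlap = [example for example in test_examples if example in train_set]
    return len(overlap) / len(test_examples) if test_examples else 0.0

if __name__ == "__main__":
    train = ["ACGTACGT", "TTGACCGA", "GGCATTCA"]
    test = ["ACGTACGT", "CATGCATG"]  # the first sequence leaks from the training data
    print(f"Leaked fraction: {leaked_fraction(train, test):.0%}")  # prints 50%
```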

This is an issue that Kundaje and his colleagues identified in DNA language models (DNALMs), which AI developers argue can help to uncover interesting regulatory mechanisms in the genome. Approximately 1.5% of the human genome is made up of protein-coding sequences, which provide the templates for producing RNA (transcription) and then proteins (translation). A further 5–20% of the genome consists of non-coding regulatory elements that control gene transcription and translation. Getting DNALMs right would help researchers to interpret and discover functional sequences, predict the outcome of modifying those sequences, and redesign them to have specific, desirable properties.

So far, however, DNALMs have not achieved these goals. According to Kundaje and his colleagues, that's because they are not being evaluated on the right tasks: they are designed to compare favorably on benchmark tests, many of which assess usefulness for surrogate goals that the models can meet, rather than for important biological applications1. The situation is analogous to schools that “teach to the test”: students (or AI tools) become adept at passing the test, but can do little else.

Kundaje and his colleagues at Stanford have identified such shortcomings in several popular DNALM benchmarks, data sets and metrics. For example, one key task is assessing a model's ability to rank functional genetic variants: changes in DNA sequence that can affect disease risk or the molecular functioning of a cell. Some DNALMs have simply not been evaluated on this task, whereas others use flawed benchmark data sets that fail to account for “linkage disequilibrium”, the non-random association of genetic variants.

That makes it difficult to isolate truly functional variants, a flaw that leads to unrealistic estimates of these models' capabilities. It's a rookie error, says Kundaje. “This doesn't require deep domain knowledge. It's genetics 101.”

Transparency and hype

Inadequate benchmarks cause similar teaching-to-the-test problems across a range of scientific fields. But failures don't happen just because it is difficult to create a good benchmark; it's also because there is often no pressure to do better, according to Nick McGreivy, who completed a PhD on applications of AI in physics at Princeton University in New Jersey last year.

Most people who use AI in science seem content to let the developers of AI tools assess their usefulness using their own criteria. That's like letting a drug company decide whether a medicine should go to market, says McGreivy. “The same people who evaluate the performance of AI models also benefit from those evaluations,” he says. This means that the research can be biased even when it is not intentionally fraudulent.

Lorena Barba, a mechanical and aerospace engineer at George Washington University in Washington DC, has a similar perspective. Science suffers, she says, from “low transparency, hidden failures, data leakage, excessive generalization, gatekeeping and hype”.

Barba's own field is fluid dynamics, which involves studying problems such as how to smooth the flow of air over an aircraft's wings to improve fuel efficiency. Doing that means solving partial differential equations (PDEs), which is not easy. Most PDEs cannot be solved exactly; instead, solutions must be approximated numerically through an (expertly guided) process of trial and error.

The mathematical tools that achieve this are called standard solvers. They are reasonably effective, but they also demand significant computational resources. So many people in fluid mechanics hope that AI, and in particular machine-learning approaches, will help them to do more with fewer resources.

Machine learning is the form of AI that has advanced most in the past five years, largely because of the availability of training data. It involves feeding data to algorithms that find patterns in those data and use them to make predictions. The parameters of the algorithm can be fine-tuned to optimize the usefulness of its predictions.
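
As a rough illustration of that description, the following Python sketch (using NumPy, with invented data) feeds noisy measurements to a simple algorithm and fine-tunes its two parameters so that its predictions match the pattern in the data. It is a deliberately minimal example, not a model of the kind used in the research described here.

```python
# Minimal sketch: feed data to an algorithm, let it find a pattern,
# and tune its parameters to improve predictions. Data are invented.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=200)  # noisy linear "pattern"

# Model: y_hat = w * x + b. Tune w and b by nudging them against the
# prediction error, step by step (a simple form of gradient descent).
w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(500):
    error = (w * x + b) - y
    w -= learning_rate * np.mean(error * x)  # step that reduces squared error w.r.t. w
    b -= learning_rate * np.mean(error)      # step that reduces squared error w.r.t. b

print(f"learned w={w:.2f}, b={b:.2f}")       # close to the true values 3.0 and 0.5
```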

In theory, machine learning could provide solutions to PDEs faster, and using fewer computing resources, than traditional methods. The question is: how can you trust the output of the model you are validating if you cannot trust that the benchmark used to evaluate its performance is meaningful or reliable?


Nick McGreivy has found that some published claims of improvements from AI models are misleading. Credit: Nick McGreivy

McGreivy and his colleague Ammar Hakim, a computational physicist at Princeton University, analysed published “improvements” over standard solvers and found that 79% of the papers they examined made problematic claims2. Many of these stemmed from benchmarking against what they call weak baselines, which can result in unfair comparisons. A machine-learning PDE solver might look more efficient in terms of computing resources (such as shorter runtimes) than a standard solver, but unless its solutions have comparable accuracy the comparison is meaningless. The researchers suggest that comparisons should be made either at comparable accuracy or at equal runtimes.
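
That suggestion can be made concrete with a small sketch. The Python example below assumes a toy differential equation and two textbook solvers, forward Euler and fourth-order Runge–Kutta, standing in for a weaker and a stronger baseline, and compares them by the time each needs to reach the same accuracy rather than by runtime alone. None of this code comes from the papers discussed in this article.

```python
# Sketch of a "fair comparison": two solvers for the same problem are compared
# at matched accuracy, not on speed alone. The problem, solvers and tolerance
# are illustrative assumptions.

import time
import numpy as np

def euler_solve(n_steps):
    """Solve dy/dt = -y, y(0) = 1 on [0, 1] with forward Euler; return y(1)."""
    dt = 1.0 / n_steps
    y = 1.0
    for _ in range(n_steps):
        y += dt * (-y)
    return y

def rk4_solve(n_steps):
    """Same problem with the classical fourth-order Runge-Kutta method."""
    dt = 1.0 / n_steps
    y = 1.0
    for _ in range(n_steps):
        k1 = -y
        k2 = -(y + 0.5 * dt * k1)
        k3 = -(y + 0.5 * dt * k2)
        k4 = -(y + dt * k3)
        y += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return y

def time_to_accuracy(solver, target_error, exact=np.exp(-1.0)):
    """Double the resolution until the error beats target_error; report the runtime."""
    n = 2
    while True:
        start = time.perf_counter()
        err = abs(solver(n) - exact)
        elapsed = time.perf_counter() - start
        if err <= target_error:
            return n, err, elapsed
        n *= 2

for name, solver in [("Euler", euler_solve), ("RK4", rk4_solve)]:
    n, err, elapsed = time_to_accuracy(solver, target_error=1e-6)
    print(f"{name:5s}: {n:8d} steps, error {err:.1e}, {elapsed * 1000:.2f} ms")
```

On this toy problem, the higher-order method hits the accuracy target with a tiny fraction of the steps, which is precisely the kind of difference that a runtime-only comparison against a weak baseline can hide.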

Another source of weak baselines is comparing AI applications with relatively inefficient non-AI numerical methods. For example, in 2021, data scientist Sifan Wang, now at Yale University in New Haven, Connecticut, and Paris Perdikaris, a computer scientist at the University of Pennsylvania in Philadelphia, claimed that a machine-learning-based solver for a certain class of differential equations was 10 times faster than traditional numerical solvers3. However, the pair did not compare it with state-of-the-art numerical solvers, as Chris Rackauckas, a computer scientist at the Massachusetts Institute of Technology in Cambridge, pointed out in a video.

“To be fair to [Perdikaris], after I pointed this out, they edited the paper,” says Rackauckas. However, he says, the original paper is the only version accessible without a paywall, and it still creates false hopes about the promise of AI in this field.

There are many such misleading claims, McGreivy warns. The scientific literature is “not a reliable source of information for assessing the success of machine learning at solving PDEs”, he says. In fact, he remains unconvinced that machine learning has much to offer the field at all. “In PDE research, machine learning is still a solution in search of a problem,” he says.

Johannes Brandstetter, a machine-learning researcher at Johannes Kepler University in Linz, Austria, and a co-founder of Emmi AI, a start-up that builds AI-driven physics simulations, is more optimistic. He points to the Critical Assessment of protein Structure Prediction (CASP) competition, which helped machine-learning methods to excel at predicting 3D protein structures from amino-acid sequences4.


