Artificial intelligence has measurement problems

San Francisco: There is a problem with leading artificial intelligence tools like ChatGPT, Gemini, and Claude: we don't really know how smart they are. That's because, unlike companies that make cars, pharmaceuticals, and baby formula, AI companies don't have to submit their products for testing before making them available to the public.
Users have to rely on the claims of AI companies, which often use vague phrases like “improved functionality” to describe how a model differs from one version to the next. Models are updated frequently, so a chatbot that struggles with a task one day may be mysteriously good at it the next. Sloppy measurement also creates safety risks. Without better tests for AI models, it's difficult to know which capabilities are improving faster than expected, and which products may actually pose a threat of harm.
In this year's AI Index, a major annual report published by Stanford University's Institute for Human-Centered Artificial Intelligence, the authors say inadequate measurement is one of the biggest challenges facing AI researchers. “The lack of standardized assessments makes it extremely difficult to systematically compare the limitations and risks of different AI models,” said Nestor Maslej, the report's editor-in-chief.

For many years, the most common way to measure AI was the Turing test, an exercise proposed by the mathematician Alan Turing in 1950 that tests whether a computer program can trick a human into mistaking its responses for a human's. But since today's AI systems can pass the Turing test with flying colors, researchers have had to come up with tougher assessments.
One of the most common tests given to AI models today, a kind of SAT for chatbots, is known as Massive Multitask Language Understanding (MMLU).
Released in 2020, MMLU consists of a collection of approximately 16,000 multiple-choice questions covering dozens of academic subjects, from abstract algebra to law to medicine. It is meant to be a kind of general intelligence test: the more questions a chatbot answers correctly, the smarter it is considered to be.
This has become the gold standard for AI companies vying for dominance. (When Google released its most advanced AI model, Gemini Ultra, earlier this year, it boasted that the model had scored 90% on the MMLU, the highest score ever recorded.)
Dan Hendrycks, an AI safety researcher who helped develop MMLU while in graduate school at the University of California, Berkeley, said that MMLU “probably has a shelf life of another year or two,” but that it will soon need to be replaced by different and more difficult tests. AI systems are becoming too smart for current tests, and designing new tests is becoming increasingly difficult.
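To make the scoring described above concrete, here is a minimal sketch of how a multiple-choice benchmark can be graded: the score is simply the fraction of questions answered correctly. The `Question` type and the `accuracy` function are hypothetical names chosen for illustration, and this is not the official MMLU evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """One multiple-choice item (hypothetical structure, for illustration)."""
    prompt: str
    choices: list[str]
    answer: int  # index of the correct choice


def accuracy(questions: list[Question], predictions: list[int]) -> float:
    """Return the fraction of questions whose predicted index matches the key."""
    if not questions:
        raise ValueError("need at least one question")
    if len(questions) != len(predictions):
        raise ValueError("one prediction is required per question")
    correct = sum(1 for q, p in zip(questions, predictions) if q.answer == p)
    return correct / len(questions)
```

Under this kind of scoring, a reported result of 90% means the model picked the keyed answer on nine out of every ten questions. It says nothing about how the model arrived at those answers.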
There are many other tests, with names like TruthfulQA and HellaSwag, that aim to capture other aspects of AI performance. However, these tests measure only a small portion of an AI system's capabilities. And none of them are designed to answer the more subjective questions that many users have: Is this chatbot enjoyable to talk to? Is it better suited to automating mundane office tasks or to creative brainstorming? How strict are its safety guardrails?
There is also an issue known as “data contamination,” where an AI model's training data contains a benchmark's questions and answers, effectively allowing the model to cheat. And since there is no independent testing or auditing process for these models, AI companies are essentially grading their own homework. In other words, AI measurement is a mess. A combination of sloppy testing, apples-to-oranges comparisons, and self-serving hype has left users, regulators, and AI developers themselves in the dark.
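The data contamination problem mentioned above can be illustrated with a simple heuristic: check how many of a benchmark question's word sequences (n-grams) also appear verbatim in a sample of training text. This is a simplified sketch under stated assumptions, not the method any particular AI company uses; the function names and the default window of 8 words are hypothetical choices.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training text.

    A value near 1.0 suggests the item may have been seen during training.
    Items shorter than n words have no n-grams and return 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    train_grams = ngrams(training_text, n)
    return len(item_grams & train_grams) / len(item_grams)
```

A check like this catches only verbatim copies. Paraphrased or translated versions of a test question would slip through, which is one reason contamination is hard to rule out from the outside.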
“Despite the appearance of science, most developers actually judge models based on mood and instinct,” says Nathan Benaich, an AI investor at Air Street Capital. “That may be fine for now, but as the power and social relevance of these models grows, it will no longer be enough.”
The solution is likely to be a combination of public and private efforts.
Governments can and should come up with robust testing programs that measure both the raw capabilities and the safety risks of AI models, and they should fund grants and research projects aimed at producing new, high-quality assessments.
In an executive order on AI last year, the White House directed several federal agencies, including the National Institute of Standards and Technology, to create and oversee new ways to evaluate AI systems.