Can AI benchmarks be trusted?

  • AI companies have long used benchmarks to promote their products and services as the best in the industry and claim they are better than their competitors.
  • AI benchmarks provide a measure of the technical capabilities of large language models, but are they reliable differentiators for the models that underpin generative AI tools?

The advent of the generative AI era raises pertinent questions. Which large language model (LLM) is the best? And, more importantly, how do you measure it?

AI benchmarking is difficult given that LLM tools require testing for accuracy, veracity, relevance, context, and other subjective parameters, as opposed to hardware, where computational speed is the defining criterion.

Over the years, several AI benchmarks have been created as technical tests designed to evaluate specific capabilities such as question answering, reasoning, coding, text generation, and image generation.

AI benchmarks also enable objective comparisons, evaluate specific capabilities (such as summarization and inference), test generalization and robustness in handling complex language structures, and track progress over time.

AI companies have used these tests to promote their products and services as the best in the industry and to claim they are better than their competitors. Recently released LLMs have already surpassed humans on several benchmarks; on others, they still fall short of us.

For example, Gemini Ultra topped the Massive Multitask Language Understanding (MMLU) benchmark with a score of 90%, followed by Claude 3 Opus (88.2%), Leeroo (86.64%), and GPT-4 (86.4%). MMLU is a 57-subject knowledge test that covers elementary mathematics, U.S. history, computer science, law, and more.
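
To make those percentages concrete, here is a minimal sketch of how an MMLU-style score is typically computed: each item is a multiple-choice question, and the reported figure is simply the fraction of items the model answers correctly. The `query_model` function and the sample item below are hypothetical placeholders, not part of any official evaluation harness.

```python
# Minimal sketch of MMLU-style scoring. `query_model` is a hypothetical
# stand-in for whatever API call an evaluator actually uses.

mmlu_items = [
    {
        "question": "What is the time complexity of binary search?",
        "choices": {"A": "O(n)", "B": "O(log n)", "C": "O(n log n)", "D": "O(1)"},
        "answer": "B",
    },
    # ... thousands more items across 57 subjects ...
]

def query_model(prompt: str) -> str:
    """Hypothetical model call; assume it returns a single letter A-D."""
    raise NotImplementedError

def mmlu_accuracy(items) -> float:
    # Accuracy is the fraction of questions where the model picks the correct letter.
    correct = 0
    for item in items:
        choices = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        prompt = f"{item['question']}\n{choices}\nAnswer:"
        if query_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)
```

Under this kind of scoring, a 90% result simply means the model picked the correct option on roughly nine out of ten questions across all 57 subjects.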

Claude 3 Opus, on the other hand, scored just over 50% in scientific reasoning on the graduate-level GPQA benchmark. GPT-4 Turbo (knowledge cutoff of April 2024) scored 46.5%, and GPT-4 Turbo (January 2024) scored over 43%.

So there is some truth to the claim that AI tools are approaching what we imagined. However, since AI benchmarks provide task-specific evaluations, their usefulness for comparing domain-independent, general-purpose applications is limited. So are they really on par with what we expected?

Spiceworks News & Insights investigates why AI benchmarks are evaluated inconsistently and why they can be inappropriate for comparisons.

AI benchmark limitations

AI benchmarking presents multiple challenges when it comes to making general comparisons between LLMs. These include:

1. Lack of standardization

Ralph Meyer, Manager of Engines and Algorithms at Hyland, told Spiceworks that AI benchmarking lacks proper standardization because of the diversity of applications and requirements, the lack of consensus on evaluation criteria, especially for responsible AI capabilities (transparency, explainability, and data privacy), and resource constraints.

“AI systems are being applied to a wide range of domains and tasks, each with unique requirements and nuances. Developing standardized benchmarks that can accurately capture the performance and limitations of AI models across all these diverse applications is a big challenge,” Meyer said.

“Evaluating state-of-the-art AI models can be prohibitively expensive and time-consuming, especially for independent researchers and small organizations. Open-source benchmarks (e.g., the benchmarks mentioned above) can be widely adopted. However, this carries the added risk of contaminating the training dataset with information used for or related to a particular benchmark.”

Rakesh Yadav, founder and CEO of Aidaptive, hopes to see the standardization of AI benchmarks in some areas. “I predict that over the next few years, we will see AI benchmarks established for at least a limited set of use cases, and eventually a standard process for continually adapting benchmarks alongside innovation.”

2. Most AI benchmarks are outdated

The breakneck speed of LLM development over the past few years has made it difficult for benchmarks to keep up with the latest advances and features. By the time a benchmark is developed and adopted, new models have already outgrown its scope. “This could lead to discrepancies in ratings,” Meyer added.

For example, a report co-authored by the state-run China Institute of Science and Technology Information notes that U.S. organizations released 11 LLMs in 2020, 30 in 2021, and 37 in 2022. Over the same period, Chinese companies released 2, 30, and 28 LLMs in 2020, 2021, and 2022, respectively.

By May 2023, U.S. companies had deployed 18 LLMs, and Chinese companies had also launched 18 LLMs.

“We need modern benchmarks that can assess the end-to-end performance of AI systems in real-world applications, including pre-processing, post-processing, and interactions with other systems and humans. This would help bridge the gap between benchmark results and the broader requirements for deploying AI solutions in complex and dynamic environments,” said Meyer.

“Overall, existing benchmarks have played an important role in advancing AI research and development, but rapid advances in the field, particularly in generative AI, have created a need for new, transparent, and more comprehensive benchmarks that can better assess the capabilities and limitations of modern AI models.”

3. Vested interests

Yadav reiterated that current AI benchmarks are created by organizations with specific commercial objectives. Most prominent technology companies have invested billions of dollars in AI research and companies that build AI tools and services. “Currently, these benchmarks are built by companies with profit-based motives and are inherently biased towards their own business needs (and rightfully so),” Yadav said.

“Ideally, government-funded benchmarks and standards would be established by an unbiased consortium of large companies, with ongoing research to ensure these standards are updated in line with new developments. However, this is a developing field with intense innovation.”

4. Benchmark-specific problems

The picture that AI benchmarks paint is often distorted, given that certain prompt engineering techniques can manipulate the results. An LLM's response is what gets measured as its performance, and that response is determined by how the prompts are constructed.

Google was criticized for claiming that Gemini Ultra outperformed OpenAI's GPT-4. The criticism (and some ridicule) stemmed from the fact that the company used the Chain-of-Thought (CoT@32) prompt engineering technique to obtain a higher MMLU benchmark score, rather than standard 5-shot prompting.
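
To illustrate why the prompting choice matters, here is a minimal sketch of the two styles, assuming a hypothetical `query_model` call; it is not Google's actual evaluation harness. Five-shot prompting prepends five solved examples to the question, while CoT@32 samples many step-by-step reasoning chains and takes a majority vote over the answers, which typically lifts scores.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical model call; assume it returns the model's answer as a string."""
    raise NotImplementedError  # stand-in for a real API call

# Five solved examples would normally go here; one is shown for brevity.
solved_examples = [
    ("Which planet is closest to the sun?\nA. Venus\nB. Mercury\nC. Mars\nD. Earth", "B"),
]

def five_shot_prompt(question: str) -> str:
    # Standard few-shot format: solved examples first, then the new question.
    shots = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in solved_examples)
    return f"{shots}\n\n{question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # Chain-of-thought format: ask the model to reason before answering.
    return f"{question}\nLet's think step by step, then state the final answer."

def cot_at_32(question: str) -> str:
    # Sample 32 reasoning chains and keep the most common final answer
    # (majority vote), which usually scores higher than a single attempt.
    answers = [query_model(cot_prompt(question)) for _ in range(32)]
    return Counter(answers).most_common(1)[0][0]
```

Because the two approaches can yield noticeably different scores from the same model, comparing one vendor's CoT@32 number against another vendor's 5-shot number is not an apples-to-apples comparison.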