Benchmarking AI agents is misleading, study warns


AI agents are an emerging research direction with potential real-world applications. These agents use underlying models such as large language models (LLMs) and vision-language models (VLMs) to take natural language instructions and pursue complex goals autonomously or semi-autonomously. AI agents can use a variety of tools, such as browsers, search engines, and code compilers, to verify their actions and reason about their goals.

However, a recent analysis by researchers at Princeton University reveals that current methods for benchmarking and evaluating agents have several shortcomings that can give a misleading picture of how useful agents are in real-world applications.

Their findings highlight that benchmarking agents comes with distinct challenges, and that agents cannot be evaluated the same way as their underlying models.

Cost vs. accuracy tradeoff

One major problem the researchers highlight in their work is the lack of cost control in agent evaluation. AI agents rely on probabilistic language models that can produce different results when running the same query multiple times, and agent pipelines often make many model calls per task, which can make them much more expensive to run than a single model invocation.


To improve accuracy, some agent systems generate multiple responses and use mechanisms such as majority voting or external validators to select the best answer. Sampling hundreds or thousands of responses can raise the agent's accuracy, but at a significant computational cost. In research environments, where the goal is to maximize accuracy, inference cost is not necessarily a concern.
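The sampling-and-voting idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the sample answers are invented:

```python
from collections import Counter

def majority_vote(responses):
    """Pick the most frequent answer among sampled responses."""
    counts = Counter(responses)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five hypothetical samples from a probabilistic model for the same query.
# Each extra sample adds accuracy at the price of another model call.
samples = ["42", "42", "41", "42", "40"]
print(majority_vote(samples))  # -> "42"
```

The cost grows linearly with the number of samples, which is exactly the tradeoff the researchers argue should be reported alongside accuracy.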

However, real applications have a limited budget for each query, so cost control matters when evaluating agents. Otherwise, researchers may be motivated to develop very costly agents just to reach the top of the leaderboard. The Princeton researchers propose visualizing evaluation results as a Pareto curve of accuracy versus inference cost, and then jointly optimizing agents on these two metrics.
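A Pareto view of agents over (accuracy, cost) can be sketched as follows. The agent names and numbers here are invented for illustration, not taken from the paper:

```python
def pareto_frontier(agents):
    """Keep agents not dominated on (accuracy up, cost down).

    agents: list of (name, accuracy, cost) tuples.
    """
    frontier = []
    for name, acc, cost in agents:
        dominated = any(
            o_acc >= acc and o_cost <= cost and (o_acc > acc or o_cost < cost)
            for _, o_acc, o_cost in agents
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return frontier

# Hypothetical agents: the expensive voting agent is slightly less accurate
# than a cheaper retry strategy, so it falls off the frontier.
agents = [
    ("single-call", 0.62, 0.01),
    ("retry-5", 0.71, 0.05),
    ("vote-100", 0.70, 1.00),
]
print(pareto_frontier(agents))  # excludes "vote-100"
```

Plotting such a frontier makes it immediately visible when two agents reach similar accuracy at costs that differ by orders of magnitude, as the researchers observed.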

The researchers evaluated the accuracy and cost tradeoffs of different prompting techniques and agent patterns presented in different papers.

“For substantially similar accuracy, costs can vary by almost two orders of magnitude,” the researchers wrote. “However, the cost of running these agents is not the primary metric reported in any of these papers.”

The researchers claim that optimizing both metrics results in “low-cost agents while maintaining accuracy.” Joint optimization also allows researchers and developers to balance the fixed and variable costs of running an agent. For example, they could spend more on optimizing the agent's design while reducing variable costs by using fewer in-context learning examples in the agent's prompts.

The researchers tested their joint optimization on HotpotQA, a popular question-answering benchmark, and found that their joint optimization formulation achieves an optimal balance between accuracy and inference cost.

“Evaluation of useful agents should take cost into account, even if one is ultimately not concerned with cost and is only interested in identifying innovative agent designs,” the researchers write. “Accuracy alone cannot identify progress, because it can be improved by scientifically meaningless methods such as retrying.”

Model Development and Downstream Applications

Another issue researchers point out is the difference between evaluating models for research purposes and developing downstream applications. In research, accuracy is often the main focus, and inference costs are largely ignored. However, when developing real-world applications with AI agents, inference costs play a key role in deciding which models and techniques to use.

Inference costs for AI agents are difficult to assess. Different providers may charge different amounts for the same model, API prices change periodically, and costs may vary with developer decisions; for example, some platforms charge different rates for bulk API calls.
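A per-query cost estimate from token counts and per-token prices can be sketched like this. The prices and token counts below are hypothetical placeholders, since real prices vary by provider and change over time:

```python
def query_cost(prompt_tokens, completion_tokens, price_in, price_out):
    """Cost of one model call, with prices in USD per million tokens."""
    return prompt_tokens / 1e6 * price_in + completion_tokens / 1e6 * price_out

# Hypothetical: 2,000 prompt tokens, 500 completion tokens,
# at $5 / $15 per million input / output tokens.
cost_single = query_cost(2000, 500, 5.00, 15.00)

# Sampling 100 responses multiplies the variable cost roughly 100x.
cost_voted = 100 * cost_single
print(f"${cost_single:.4f} vs ${cost_voted:.2f}")  # -> $0.0175 vs $1.75
```

Normalizing comparisons by token pricing in this way is the kind of adjustment the researchers' website performs across models.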

To address this issue, the researchers created a website that adjusts model comparisons for token pricing.

They also conducted a case study of NovelQA, a benchmark for question answering over very long texts. They found that benchmarks designed for model evaluation can be misleading when used for downstream evaluation. For example, the original NovelQA study makes retrieval-augmented generation (RAG) appear much worse relative to long-context models than it would be in a real-world scenario. The researchers' evaluation showed that the accuracy of the RAG and long-context approaches was roughly comparable, while the long-context model was 20 times more expensive.

Overfitting is a problem

When learning new tasks, machine learning (ML) models often find shortcuts that allow them to score highly on benchmarks. A typical pitfall is overfitting, where a model latches onto patterns specific to the benchmark and delivers results that do not carry over to the real world. The researchers found that overfitting is a serious problem for agent benchmarks, because they tend to be small, typically consisting of only a few hundred examples. The problem is more severe than data contamination in the training of the underlying model, because knowledge of the test examples can be programmed directly into the agent.

To address this issue, the researchers suggest that benchmark developers create and maintain a hold-out test set of examples that cannot be memorized during training and can only be solved through a genuine understanding of the target task. After analyzing 17 benchmarks, they found that many lacked a suitable hold-out dataset, allowing agents to take shortcuts, even unintentionally.
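Carving a benchmark into a public set and a secret hold-out set can be sketched as a simple seeded split. This is a generic illustration of the practice, not the procedure used for any specific benchmark:

```python
import random

def split_benchmark(examples, holdout_frac=0.3, seed=0):
    """Split examples into a public dev set and a secret hold-out set."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    return shuffled[cut:], shuffled[:cut]  # (public, hold-out)

examples = [f"task-{i}" for i in range(100)]
public, holdout = split_benchmark(examples)
assert not set(public) & set(holdout)  # no leakage between the two sets
```

The hold-out portion would be kept secret by the benchmark maintainers, so agent developers cannot hardcode knowledge of its examples.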

“Surprisingly, we find that many agent benchmarks do not include hold-out test sets,” the researchers write. “In addition to creating test sets, benchmark developers should consider keeping test sets secret to prevent LLM contamination and agent overfitting.”

The researchers also found that different types of hold-out samples are needed depending on the desired level of generality of the task the agent is meant to accomplish.

“Benchmark developers must do their best to ensure that shortcuts are not possible,” the researchers wrote. “We believe this is the responsibility of benchmark developers, not agent developers, since designing a benchmark that does not allow shortcuts is much easier than checking for each agent whether it takes shortcuts.”

The researchers examined WebArena, a benchmark that evaluates the performance of AI agents in solving problems on different websites. They found several shortcuts in its tasks that allow agents to overfit in ways that would easily break with small changes in the real world. For example, an agent could hardcode assumptions about the structure of web addresses, without accounting for the fact that URLs might change in the future or differ across websites.
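The URL-structure shortcut can be illustrated with a toy contrast between guessing a URL and discovering it on the page. The site, path, and link data below are entirely made up:

```python
# Shortcut: the agent hardcodes the URL pattern it observed in the benchmark.
# This breaks as soon as the site reorganizes its paths, or on any other site.
def brittle_order_url(order_id):
    return f"http://shop.example/orders/{order_id}"

# More robust: locate the link on the live page instead of guessing its shape.
def find_order_link(page_links, order_id):
    """page_links: list of (anchor_text, href) pairs scraped from the page."""
    for text, href in page_links:
        if str(order_id) in text:
            return href
    return None

links = [("Order #117", "/account/orders/117?view=full")]
print(find_order_link(links, 117))  # -> "/account/orders/117?view=full"
```

An agent that succeeds only via the first approach inflates its benchmark score without any transferable capability, which is the kind of over-optimism the researchers warn about.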

These errors can inflate accuracy estimates and lead to over-optimism about the agents' capabilities, the researchers warn.

Because AI agents are a new field, the researcher and developer community still has a lot to learn about how to test the limits of these new systems that may soon become a key part of everyday applications.

“Benchmarking AI agents is new and best practices have yet to be established, making it difficult to distinguish true progress from hype,” the researchers wrote. “Our contention is that agents are sufficiently different from models that benchmarking practice needs to be rethought.”
