MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Machine Learning


Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks that assess their capabilities as human-like agents. Existing benchmarks, while useful, tend to focus on specific application scenarios and emphasize task completion, without dissecting the underlying skills that drive those outcomes. This lack of granularity makes it difficult to pinpoint where failures originate. Furthermore, setting up these environments requires considerable effort and can lead to reliability and reproducibility issues, especially for interactive tasks. To address these limitations, we present the Massive Multitask Agent Understanding (MMAU) benchmark, which consists of comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains (tool use, directed acyclic graph (DAG) QA, data science and machine learning coding, contest-level programming, and mathematics) and covers five key capabilities: understanding, reasoning, planning, problem solving, and self-correction. With a total of 20 meticulously designed tasks comprising more than 3K distinct prompts, MMAU provides a comprehensive framework for assessing the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide an in-depth analysis of their performance. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents, but also makes their performance more interpretable.


