Insider Brief
- Stanford University researchers introduced a new method for assessing large language models (LLMs) that reduces costs and increases fairness by assigning difficulty scores to benchmark questions, according to a paper presented at the International Conference on Machine Learning.
- Funded by the MacArthur Foundation, Stanford HAI and Google Inc., the method uses item response theory, a concept from standardized testing, to adaptively select a subset of questions, cutting assessment costs by up to 80% while providing more accurate comparisons between AI models.
- Applied to 22 datasets and 172 models across medicine, mathematics and law, the system improves the integrity of AI assessments by identifying and removing previously seen questions, supporting more transparent and reliable AI development and enabling model safety metrics to be tracked over time.
A new way to evaluate artificial intelligence models promises to reduce costs and improve fairness, according to researchers at Stanford University, who developed the approach with funding from the MacArthur Foundation, Stanford HAI and Google Inc.
The method, detailed in a paper presented at the International Conference on Machine Learning, introduces an adaptive question-selection system that scores the difficulty of benchmark questions and compares the performance of language models more accurately.
As AI developers release ever more advanced language models, they often claim improved performance based on benchmark tests. These assessments, which draw on large banks of test questions, usually require extensive human review, which can be time-consuming and expensive. According to the Stanford researchers, the assessment process can cost as much as or more than training the model itself. Practical constraints also mean developers typically use only a subset of questions, which can distort results when easy questions are overrepresented.
The Stanford team, led by Sanmi Koyejo, an assistant professor of computer science in Stanford's School of Engineering, developed a system that assigns difficulty scores to benchmark questions using item response theory, a concept long used in standardized testing. This lets evaluators compare model results on a level playing field, accounting for the difficulty of each question and reducing the likelihood of misleading results from overly easy test sets.
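Item response theory models the probability that a test-taker, here a language model, answers a question correctly as a function of the model's ability and the question's difficulty. As a rough illustration of the idea, not the paper's implementation, and with purely illustrative parameter values, a minimal sketch of the one-parameter (Rasch) variant looks like this:

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) item response model: probability that a model
    with the given ability answers a question of the given
    difficulty correctly. Both sit on a shared logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Illustrative values (not from the paper): the same mid-ability
# model on an easy question versus a hard one.
print(p_correct(ability=1.0, difficulty=-1.0))  # ~0.88 on an easy item
print(p_correct(ability=1.0, difficulty=2.0))   # ~0.27 on a hard item
```

Because ability and difficulty share one scale, a model that aces only easy questions no longer looks equivalent to one that handles hard ones.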
“The important observation we're making is that we have to account for how difficult the question is,” Koyejo, the study's principal investigator, said.
The researchers applied the approach to 22 datasets and 172 language models, demonstrating its adaptability across domains including medicine, mathematics and law. The system uses AI-generated questions calibrated to a target difficulty level, which reduces costs and automates replenishment of the question bank. The method also identifies and removes previously seen, or “contaminated,” questions from the dataset, improving the integrity of the evaluation.
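Calibrated difficulties are also what make adaptive selection possible: at each step, the evaluator can pick the question expected to be most informative about the model's current ability estimate. The sketch below is a hypothetical illustration rather than the authors' code, using the standard maximum-information rule from adaptive testing:

```python
import math

def item_information(ability: float, difficulty: float) -> float:
    """Fisher information of a Rasch item at the current ability
    estimate: p * (1 - p), largest when difficulty matches ability."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)

def pick_next_question(ability: float, difficulties: list[float]) -> int:
    """Choose the question whose difficulty is most informative
    given the model's current estimated ability."""
    return max(range(len(difficulties)),
               key=lambda i: item_information(ability, difficulties[i]))

# Illustrative bank of calibrated difficulties (not from the paper).
bank = [-2.0, -0.5, 1.2, 3.0]
print(pick_next_question(ability=1.0, difficulties=bank))  # -> 2 (1.2 best matches 1.0)
```

Selecting near-matched questions is what lets an adaptive evaluation reach a stable estimate with far fewer questions than running the full bank.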
Co-author Sang Truong, a doctoral candidate at Stanford, emphasized that the adaptive method can reduce evaluation costs by up to 80% in some cases while providing more consistent comparisons. The system also detected subtle changes in safety metrics across versions of GPT-3.5, highlighting its ability to track performance shifts over time. Safety in this context refers to the model's robustness against manipulation, exploitation and other vulnerabilities, the researchers noted.
The researchers argue that better assessment tools benefit both AI developers and end users by improving diagnostics and providing a more transparent view of AI models. By reducing costs and increasing fairness in model evaluation, the system helps accelerate AI development while increasing trust in the technology.
In addition to Stanford, contributors to the research include collaborators from the University of California, Berkeley, and the University of Illinois Urbana-Champaign. Koyejo and co-author Bo Li are partnering with Virtue AI, which supported the project.
“And for everyone else,” Koyejo said, “that means faster advancements and greater confidence in the rapidly evolving tools of artificial intelligence.”
