Evaluating AI language models just got more effective and efficient



As new versions of artificial intelligence language models are released with increasing frequency, each is claimed to offer improved performance. Demonstrating that a new model actually outperforms the last one, however, remains an elusive and expensive challenge for the field.

Typically, developers expose new models to a set of benchmark questions to gauge their abilities and to build confidence that the new models are actually superior. Potentially hundreds of thousands of such benchmark questions sit in question banks, and the answers must be reviewed by humans, adding time and cost to the process. Because practical constraints make it impossible to ask every model every benchmark question, developers choose a subset, introducing the risk of overestimating improvements based on easier questions. Stanford University researchers have now introduced a cost-effective way to make these assessments, in new research presented at the International Conference on Machine Learning.

“The important observation we're making is that we have to account for how difficult the questions are,” said Sanmi Koyejo, assistant professor of computer science in the School of Engineering, who led the research. “Some models may do better or worse just by the luck of the draw. We predict that and adjust for it to make a less biased comparison.”

“This assessment process can often cost as much as or more than the training itself,” added co-author Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Laboratory (SAIL). “We have built an infrastructure that allows us to adaptively select a subset of questions based on difficulty.”

Apples and oranges

To achieve their goals, Koyejo, Truong, and colleagues borrowed a decades-old concept from education known as item response theory, which takes into account the difficulty of the questions when scoring test takers. Koyejo compares it to standardized tests such as the SAT and to how other types of adaptive testing work: each correct or incorrect answer changes the question that follows.
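As a rough illustration of the idea (not the authors' implementation), a common item response formulation such as the two-parameter logistic model expresses the probability that a test taker of a given ability answers an item of a given difficulty correctly. The function and the numbers below are purely illustrative:

```python
import math

def p_correct(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Two-parameter logistic IRT model: probability that a test taker with
    ability `theta` answers an item of the given `difficulty` correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# A stronger model (theta = 2.0) vs. a weaker model (theta = 0.0) on a hard item (difficulty = 1.5)
print(round(p_correct(2.0, 1.5), 2))   # ~0.62
print(round(p_correct(0.0, 1.5), 2))   # ~0.18
```

Under such a model, a correct answer on a hard question says more about a model's ability than a correct answer on an easy one, which is what makes difficulty-aware scoring fairer than raw accuracy.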

The researchers use language models to analyze questions and score them on difficulty, reducing costs by half and sometimes by more than 80 percent. These difficulty scores then allow the researchers to compare the relative performance of two models on a common scale.
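A minimal sketch of how such a comparison could work, assuming item difficulties are already calibrated: each model's latent ability is estimated by maximum likelihood from its right and wrong answers on a subset of questions, so two models can be placed on the same scale even when raw accuracy would be misleading. All names and values here are illustrative, not taken from the paper:

```python
import math

def p_correct(theta, difficulty, discrimination=1.0):
    # 2PL item response curve (same form as above)
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def estimate_ability(responses, difficulties):
    """Grid-search maximum-likelihood estimate of a model's ability,
    given its right/wrong answers on items of known difficulty."""
    def log_lik(theta):
        ll = 0.0
        for correct, b in zip(responses, difficulties):
            p = p_correct(theta, b)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    grid = [i / 100.0 for i in range(-400, 401)]  # search theta in [-4, 4]
    return max(grid, key=log_lik)

# Illustrative data: model A answers harder items correctly more often than model B.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
model_a = [1, 1, 1, 1, 1, 0]
model_b = [1, 1, 1, 0, 0, 0]
print(estimate_ability(model_a, difficulties))  # higher ability estimate
print(estimate_ability(model_b, difficulties))  # lower ability estimate
```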

To build large, diverse, and calibrated question banks in a cost-effective way, the researchers use AI to create question generators that can be tuned to a desired level of difficulty. This helps automate both the replenishment of question banks and the culling of “contaminated” questions from the database.
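The article does not describe how contamination is detected, but one plausible heuristic, sketched here purely as an assumption, is to flag items that models answer correctly far more often than their estimated abilities and the item's calibrated difficulty would predict:

```python
import math

def p_correct(theta, difficulty):
    # 2PL-style item response curve (discrimination fixed at 1 for brevity)
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def flag_suspect_items(abilities, difficulties, answers, threshold=0.35):
    """Flag items that models answer correctly far more often than their
    estimated abilities predict -- one heuristic signal of contamination.
    `answers[m][i]` is 1 if model m answered item i correctly, else 0."""
    suspects = []
    for i, b in enumerate(difficulties):
        expected = sum(p_correct(theta, b) for theta in abilities) / len(abilities)
        observed = sum(answers[m][i] for m in range(len(abilities))) / len(abilities)
        if observed - expected > threshold:
            suspects.append(i)
    return suspects

# Illustrative numbers: item 2 is "too easy" relative to its calibrated difficulty.
abilities = [0.0, 0.5, -0.5, 1.0]
difficulties = [0.0, 1.0, 2.0]
answers = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 1]]
print(flag_suspect_items(abilities, difficulties, answers))  # [2]
```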

It's fast and fair

With better-designed questions, the authors say, others in the field can produce better performance ratings using a much smaller subset of queries. The approach is faster, fairer, and cheaper.

The new approach works across knowledge domains, from medicine and mathematics to law. The team tested the system against 22 datasets and 172 language models and found it easy to adapt to both new models and new questions. Their approach was able to capture a subtle shift in the safety of GPT-3.5 over time, which initially declined in several versions tested in 2023. Language model safety is a measure of how robust a model is against data manipulation, adversarial attacks, exploitation, and other risks.

Where reliably evaluating language models was once costly and inconsistent, the new item response theory approach puts rigorous, scalable, and adaptive assessments within reach. For developers, that means better diagnostics and more accurate performance ratings. For users, that means fairer and more transparent model evaluations.

“And for everyone else,” Koyejo said, “that means faster advances and greater confidence in the rapidly evolving tools of artificial intelligence.”
