Learning from other domains to advance AI evaluation and testing


As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies are resurfacing. Which opportunities, capabilities, risks, and impacts should be assessed? Who should conduct the assessment, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know whether the results are reliable?

Recent research and reports from Microsoft, the UK AI Security Institute, the New York Times, and MIT Technology Review all highlight gaps in how AI models and systems are evaluated. These gaps also form the backdrop of recent international expert consensus reports: the first International AI Safety Report (2025) and the Singapore Consensus (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust.

Today we are launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry, to share insights from domains that have grappled with questions of testing and measurement. Over four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to explore which technical and regulatory practices have helped those fields bridge evaluation gaps and earn public trust.

Spotlight: Blog Post

MedFuzz: Exploring the robustness of LLMs on medical challenge problems

MedFuzz tests LLMs by violating benchmark assumptions, revealing vulnerabilities that, once addressed, can enhance real-world accuracy.


We are also sharing written case studies from the experts, along with top-level lessons we can apply to AI. At the conclusion of the podcast series, we will offer Microsoft's deeper reflections on next steps toward more robust and reliable approaches to AI evaluation.

Lessons from 8 case studies

Our research into risk assessment, testing, and assurance models in other domains began in December 2024, when Microsoft's Office of Responsible AI convened independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In assembling this group, we drew on the learnings and feedback from our e-book, Global Governance: Goals and Lessons for AI, in which we studied high-level goals and institutional approaches that had previously been used for cross-border governance.

While risk assessment and testing approaches vary widely from case study to case study, one consistent top-level takeaway emerged: evaluation frameworks always reflect trade-offs among policy objectives, such as safety, efficiency, and innovation.

Experts across all eight domains noted that policymakers have had to weigh such trade-offs in designing evaluation frameworks, which must account for both the limitations of current science and the need for agility in the face of uncertainty. They similarly agreed that evaluation frameworks often reflect the “DNA” of the historical moment in which they were created, as cybersecurity expert Stewart Baker put it, and that this matters because frameworks are difficult to scale or unwind later.

Stringent pre-deployment testing regimes, such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals, provide strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.

In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment (such as cybersecurity and bank stress testing) rely on more adaptive governance frameworks, in which testing is used to generate actionable insights about risk rather than primarily as a trigger for regulatory enforcement.

Furthermore, in pharmaceuticals, where interdependencies are also at play and testing is concentrated pre-deployment, experts highlighted potential trade-offs between pre-deployment efficacy assessments and post-market surveillance of downstream risks.

These variations stem from differences in, among other factors, risk profiles, technology types, the maturity of evaluation science, and where expertise sits within the ecosystem of evaluators. Recognizing the context from which each domain's approach originated helps inform which takeaways apply to AI.

Applying risk assessment and governance lessons to AI

While no analogy maps perfectly onto the AI context, the genome editing and nanoscience case studies offer interesting insights for general-purpose technologies like AI, where risks vary greatly depending on how the technology is applied.

Experts highlighted the benefits of governance frameworks that are flexible and tailored to the context of particular applications and use cases. In these domains, it is difficult to define risk thresholds and design evaluation frameworks in the abstract; risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.

These and other insights also helped us distill the qualities essential to ensuring that testing serves as a reliable governance tool across domains:

  1. Rigor in defining what is being evaluated and why it matters. This requires detailed specification of what is being measured and an understanding of how the deployment context affects outcomes.
  2. Standardization of how tests are conducted to achieve valid and reliable results. This requires methodological guidance and technical standards that ensure quality and consistency.
  3. Interpretability of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how test results are understood, contextualized, and used.

Towards a strong foundation for AI testing

Establishing a robust foundation for AI evaluation and testing requires efforts to improve rigor, standardization, and interpretability, while accommodating rapid technological advances and evolving scientific understanding.

Taking lessons from other general-purpose technologies, this foundational work must be pursued for both AI models and AI systems. Testing models remains important, and reliable evaluation tools that provide assurances of system performance will enable broader adoption of AI, including in higher-risk scenarios. A strong feedback loop between evaluations of AI models and systems could not only accelerate progress on methodological challenges, but also help identify the most appropriate and efficient points along the AI development and deployment lifecycle at which to assess opportunities, capabilities, risks, and impacts.

Acknowledgments

We would like to thank the following external experts for their contributions to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Geronimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.

Case studies

Civil Aviation: Testing in aircraft design and manufacturing, by Paul Alp

Cybersecurity: Cybersecurity standards and testing – lessons for AI safety and security, by Stewart Baker

Financial Services (bank stress testing): The evolving use of bank stress tests, by Kathryn Judge

Genome Editing: Governance of genome editing in human therapeutics and agricultural applications, by Alta Charo and Andy Greenfield

Medical Devices: Testing of medical devices: regulatory requirements, evolution, and lessons for AI governance, by Mateo Aboy and Timo Minssen

Nanoscience: The regulatory environment of nanoscience and nanotechnology, and its application to future AI regulation, by Jennifer Dionne

Nuclear Energy: Testing in the nuclear industry, by Pablo Cantero and Geronimo Poletto Antonacci

Pharmaceuticals: The history and evolution of testing in pharmaceutical regulation, by Daniel Benamouzig and Daniel Carpenter




