Our approach to analyzing and mitigating future risks posed by advanced AI models
Google DeepMind has always pushed the boundaries of AI, developing models that have transformed our understanding of what's possible. We believe that the AI technologies on the horizon will provide society with invaluable tools to help tackle critical global challenges, such as climate change, drug discovery, and economic productivity. At the same time, as we continue to advance the frontier of AI capabilities, we recognize that these breakthroughs may eventually come with new risks beyond those posed by present-day models.
Today, we are introducing the Frontier Safety Framework: a set of protocols for proactively identifying future AI capabilities that could cause severe harm and putting in place mechanisms to detect and mitigate them. Our framework focuses on severe risks resulting from powerful capabilities at the model level, such as exceptional agency or sophisticated cyber capabilities. It is designed to complement our alignment research, which trains models to act in accordance with human values and societal goals, and Google's existing suite of AI responsibility and safety practices.
The framework is exploratory, and we expect it to evolve significantly as we learn from its implementation, deepen our understanding of AI risks and evaluations, and collaborate with industry, academia, and government. Although these risks are beyond the reach of present-day models, we hope that implementing and improving the framework will help us prepare to address them. We aim to have this initial framework fully implemented by early 2025.
The framework
The first version of the framework, announced today, builds on our research on evaluating critical capabilities in frontier models and follows the emerging approach of responsible capability scaling. The framework has three main components:
- Identifying capabilities a model may have with the potential to cause severe harm. To do this, we research the paths through which a model could cause severe harm in high-risk domains, and then determine the minimal level of capabilities a model must have to play a role in causing such harm. We call these "Critical Capability Levels" (CCLs), and they guide our evaluation and mitigation approach.
- Evaluating our frontier models periodically to detect when they reach these Critical Capability Levels. To do this, we are developing a suite of model evaluations, called "early warning evaluations," that alert us when a model is approaching a CCL, and we run them frequently enough to have notice before that threshold is reached.
- Applying a mitigation plan when a model passes our early warning evaluations. This should take into account the overall balance of benefits and risks, as well as the intended deployment contexts. The mitigations focus primarily on security (preventing the exfiltration of models) and deployment (preventing misuse of critical capabilities). A minimal sketch of how these components might fit together follows this list.
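To make the interplay between these components concrete, here is a minimal illustrative sketch, not DeepMind's actual implementation: the risk domains, score scale, threshold values, and the `model.evaluate` interface are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class CriticalCapabilityLevel:
    """A hypothetical Critical Capability Level (CCL) for one risk domain."""
    domain: str             # e.g. "autonomy", "biosecurity", "cybersecurity", "ml_rnd"
    ccl_threshold: float    # capability score at which severe harm becomes plausible
    alert_threshold: float  # lower "early warning" score that triggers a review

# Placeholder CCLs; real definitions and scales would come from domain-specific research.
CCLS = [
    CriticalCapabilityLevel("autonomy", ccl_threshold=0.8, alert_threshold=0.6),
    CriticalCapabilityLevel("cybersecurity", ccl_threshold=0.7, alert_threshold=0.5),
]

def run_early_warning_eval(model, domain: str) -> float:
    """Stand-in for an early warning evaluation: returns a capability score in [0, 1]."""
    return model.evaluate(domain)  # assumed interface, for illustration only

def periodic_check(model) -> list[str]:
    """Run evaluations against every CCL and report domains needing a mitigation review."""
    flagged = []
    for ccl in CCLS:
        score = run_early_warning_eval(model, ccl.domain)
        if score >= ccl.alert_threshold:
            # The model is approaching (or past) this CCL: trigger a review of the
            # mitigation plan, weighing benefits, risks, and the deployment context.
            flagged.append(ccl.domain)
    return flagged
```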
Risk areas and mitigation levels
Our initial set of Critical Capability Levels is based on investigation of four domains: autonomy, biosecurity, cybersecurity, and machine learning research and development (R&D). Our initial research suggests that the capabilities of future foundation models are most likely to pose severe risks in these domains.
For autonomy, cybersecurity, and biosecurity, our primary aim is to assess the degree to which threat actors could use a model with advanced capabilities to carry out harmful activities with severe consequences. For machine learning R&D, the focus is on whether models with such capabilities would enable the proliferation of models with other critical capabilities, or permit a rapid and unmanageable escalation of AI capabilities. As we learn more about these and other risk domains, we expect these CCLs to evolve, and for CCLs to be added at higher levels or in other risk domains.
We have also outlined a set of security and deployment mitigations that allow us to tailor the strength of the mitigations to each CCL. Higher-level security mitigations provide stronger protection against the exfiltration of model weights, and higher-level deployment mitigations provide tighter management of critical capabilities. These measures, however, may also slow the rate of innovation and reduce the broad accessibility of capabilities. Striking the optimal balance between mitigating risks and fostering access and innovation is paramount to the responsible development of AI. By weighing the overall benefits against the risks and taking into account the context of model development and deployment, we aim to ensure AI progresses responsibly, unlocking transformative potential while safeguarding against unintended consequences.
Investing in the science
The research underlying the framework is nascent and progressing quickly. We have invested significantly in our Frontier Safety team, which coordinated the cross-functional effort behind the framework. Their mission is to advance the science of frontier risk assessment and to refine the framework based on our improved knowledge.
The team developed an evaluation suite to assess risks from critical capabilities and road-tested it on our state-of-the-art models, with a particular emphasis on autonomous LLM agents. Their recent paper describing these evaluations also explores mechanisms that could form part of a future "early warning system." It describes technical approaches for assessing how close a model is to succeeding at tasks it currently fails at, and includes predictions about future capabilities from a dedicated forecasting team.
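One plausible way to quantify how close an agent is to succeeding at a task it still fails end-to-end is to break the task into intermediate milestones and score partial progress. The sketch below illustrates that general idea only; it is not the methodology of the paper referenced above, and the milestone counts are hypothetical.

```python
def partial_progress(milestones_completed: list[bool]) -> float:
    """Fraction of a task's ordered milestones an agent completed before failing overall.

    A score near 1.0 on a task the agent still fails end-to-end suggests the
    capability threshold may be approaching, even before outright success.
    """
    if not milestones_completed:
        return 0.0
    return sum(milestones_completed) / len(milestones_completed)

# Example: an agent completed 3 of 5 hypothetical sub-steps of a task it failed overall.
print(partial_progress([True, True, True, False, False]))  # 0.6
```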
Staying true to our AI principles
We will review and evolve the framework regularly. In particular, as we pilot the framework and deepen our understanding of risk domains, CCLs, and deployment contexts, we will continue to calibrate specific mitigations to CCLs.
At the heart of our work are Google's AI Principles, which commit us to pursuing widespread benefit while mitigating risks. As our systems improve and their capabilities increase, measures like the Frontier Safety Framework will help ensure our practices continue to meet these commitments.
We look forward to working with stakeholders across industry, academia, and government to develop and refine the framework. We hope that sharing our approach will facilitate work with others to agree on standards and best practices for evaluating the safety of future generations of AI models.