Anthropic talks AI red teaming

The security and safety of AI tools is a topic that is becoming increasingly important as the technology expands its impact on the world. Achieving this goal requires a multifaceted approach, but red teaming techniques can play a key role in ensuring the security of AI tools.

Specifically, red teaming is the process of testing a system to identify vulnerabilities, a non-malicious process aimed at finding problems before hackers do.

Anthropic recently published an article outlining the insights the company has gained in the process of testing AI systems, which they hope will spark a discussion about how to properly red team with AI and how the world needs a more standardized practice of red teaming.

New technology, new rules

One of the big issues with AI security in general, and with technology in general, is the current lack of standardized practices. Specifically, Anthropic noted that the lack of standardization “complicates the situation.”

For example, Anthropic points out that developers may use different techniques to evaluate the same type of threat model. Using the same techniques doesn't solve the problem, as developers may run the red team process differently.

Moreover, the solutions to many of these problems are not as simple as they may seem. There are currently no industry-wide disclosure standards. An article in Tech Policy Press discusses the Pandora's box vs. protective shield dilemma: Sharing the results of red team efforts in academic papers has many benefits, but doing so can provide adversaries with a blueprint for exploitation.

While this is a general discussion that should be had in the AI space going forward, Anthropic outlined the specific red teaming techniques it tried.

Domain-specific red teaming

- Trust and Security: Policy Vulnerability Testing
- National Security: Red Teaming the Frontline Threats
- Regional: Multilingual and Multicultural Red Teams

Using language models for red teaming

New Red Teaming Techniques

Open-Ended General Red Team

- Crowdsourced Red Teaming for Common Hazards
- Community-based red team exercises against common risks and system limitations

Anthropic dives deep into each of these topics, but the company's focus on red teaming in new modalities is particularly interesting: AI is focused on text inputs, rather than other forms of media like photos, videos, scientific graphs, etc. Red teaming in these multimodal environments is challenging, but helps identify risks and failure modes.

Anthropic's Claude 3 family of models is multimodal, providing users with more flexible applications but also introducing new risks in the form of fraud, threats to child safety and violent extremism.

Before deploying Claude 3, Anthropic asked its reliability and safety team to conduct red team testing of the system for both text- and image-based risks, and also worked with external red team members to evaluate the ability of Claude 3 to reject harmful input.

While multimodal red teaming clearly has the benefit of catching failure modes before public deployment, Anthropic also points out the benefits that can be gained from end-to-end system testing. Many AI models are actually systems of interrelated components and capabilities. This includes models, harm classifiers, prompt-based interventions, and more. Multimodal red teaming is an effective way to stress test the end-to-end resiliency of AI systems and understand overlapping safety features.

Of course, red teaming with a multimodal approach does not come without challenges. First, security teams need deep expertise in high-risk areas such as dangerous weapons, which is a rare skill. Additionally, multimodal red teaming exercises may require viewing graphic images instead of reading text-only content, which poses risks to the health of red team members and requires additional safety considerations.

Red teaming is a complex process, and multimodality is just one topic Anthropic covered in its broader report, but it's clear that the world needs a standardized approach to AI safety and security.

Source link