Cisco: One prompt may not break most AI models, but a conversation will
Rashmi Ramesh (@rashmiramesh_)
February 23, 2026

Researchers have found that the models underpinning enterprise artificial intelligence deployments almost always fold under sustained adversarial pressure.
In its latest State of AI Security report, Cisco tested eight open-weight large language models against multi-turn jailbreak attacks, in which a series of prompts gradually coaxes a model into producing content its guardrails are designed to block. The attacks succeeded 92.78% of the time.
Single-turn tests, where the attacker enters a single prompt, had a much lower success rate.
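The gap between the two test types follows partly from how success is counted. The sketch below shows one plausible way to tally an attack success rate, assuming each attempt has been scored turn by turn by some judge (the `Conversation` structure and the True/False judgments are hypothetical, not Cisco's harness). A multi-turn attempt counts as a success if any turn gets past the guardrails, which gives the attacker several chances per attempt. The data are toy values for illustration, not Cisco's results.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """One attack attempt: per-turn judgments (True = guardrail bypassed).

    Hypothetical structure for illustration only.
    """
    turn_bypassed: list[bool]


def attack_success_rate(conversations: list[Conversation]) -> float:
    """Fraction of attempts in which at least one turn bypassed the guardrails."""
    if not conversations:
        raise ValueError("no conversations to score")
    successes = sum(1 for c in conversations if any(c.turn_bypassed))
    return successes / len(conversations)


# Toy single-turn attempts: exactly one judged turn each.
single_turn = [
    Conversation([False]),
    Conversation([True]),
    Conversation([False]),
    Conversation([False]),
]

# Toy multi-turn attempts: pressure escalates, and success may come on a later turn.
multi_turn = [
    Conversation([False, False, True]),
    Conversation([False, True]),
    Conversation([False, False, False, True]),
    Conversation([False, False, False]),
]

print(f"single-turn ASR: {attack_success_rate(single_turn):.2%}")
print(f"multi-turn ASR:  {attack_success_rate(multi_turn):.2%}")
```

With these toy inputs the script prints 25.00% for single-turn and 75.00% for multi-turn attempts.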
Open-weight models are AI systems whose underlying parameters are publicly released, allowing developers to download, modify, and deploy them independently rather than accessing them through commercial APIs. Hugging Face, the leading public repository for these models, has recorded more than 400 million downloads. According to the report, that accessibility drives adoption but also concentrates risks that many enterprise deployments do not fully account for.
Cisco evaluated open-weight releases from Meta, Google, Microsoft, Mistral, Alibaba, DeepSeek, Zhipu AI, and OpenAI using a black-box approach, meaning the researchers had no knowledge of each model's internal architecture or existing safety configuration before testing. The results were broadly consistent across vendors, suggesting a systemic pattern rather than isolated model failures.
Amy Chang, who leads AI threat intelligence and security research at Cisco, told Information Security Media Group that the findings point to a problem deeper than any single model architecture.
“Despite advances in generative AI capabilities since ChatGPT was first launched in 2022, there remains limited consensus on standards for safe and secure AI development and deployment,” she said. The pace at which theoretical attack demonstrations turn into real-world exploits shows that “the attack surface continues to expand and rapidly outpaces the maturity of an organization’s defenses.”
The report distinguishes between models developed with alignment as a central objective and models for which alignment is treated as a post-training task left to the implementer. Alignment refers to the process of training a model to follow intended guidelines and refuse harmful requests. Meta's Llama models showed the largest gap between single-turn and multi-turn vulnerability. Meta's own documentation acknowledges that developers are "in the driver's seat to tailor safety to their use case" after training, an approach that shifts the security burden to those deploying the model. Google's Gemma-3-1B-IT, whose development placed alignment more centrally, showed more consistent resistance to both types of attack.
The findings also have independent support. A late 2025 paper co-authored by researchers from OpenAI, Anthropic, and Google DeepMind found that adaptive attacks, which iteratively refine their approach based on previous failures, bypassed published defenses with success rates above 90% for most of the systems tested. Many of those defenses had originally been reported to have near-zero attack success rates.
The jailbreak findings come as AI systems move from generating text to executing actions, which raises the potential impact of a compromised model accordingly.
Cisco's report documents the first publicly known incident in which a nation-state actor repurposed an AI coding tool for cyberespionage. A Chinese state-backed group designated GTG-1002 allegedly jailbroke an AI coding assistant and used its autonomous capabilities to automate 80% to 90% of the attack chain, with human operators providing only strategic direction. The model scanned for open ports, identified vulnerabilities, wrote scripts to exploit them, and navigated file systems to find sensitive data, tasks that previously took teams of human operators hours or even days.
Chang said the impact of AI automation on attack outcomes will vary by campaign. She cautioned that incident telemetry across the industry is too inconsistent to support confident quantitative generalizations about metrics such as dwell time and data exfiltration volume.
Asked whether agents lower the skill barrier for complex intrusions or simply speed up advanced actors, Chang said both are true. "For sophisticated attackers, agents can maximize efficiency by automating repetitive steps such as scanning and script generation," she said. The growing availability of purpose-built AI tools "certainly lowers the barrier to performing intrusions that may be considered 'good enough,' depending on the attacker's motivations," she added.
This ties into a vulnerability category the report flags as increasingly critical: excessive agency. Security experts use the term for AI systems that are granted broad autonomous authority over tools, data, and processes. If that authority is abused or misdirected, it can cause damage at a scale and speed that human oversight cannot keep up with. The Open Worldwide Application Security Project, whose Top 10 for LLM Applications is widely referenced as an industry benchmark, lists excessive agency among its top risks, describing an AI system's ability to take consequential actions without sufficient human review as a serious flaw.
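One common mitigation for excessive agency is to put a human review step in front of consequential actions. The sketch below illustrates that idea under stated assumptions: the `GatedDispatcher` class, the tool names, and the `approve` callback are all hypothetical, and the callback merely stands in for whatever review process (a ticket, a UI prompt) an organization actually uses. Tools are deny-by-default, and only calls marked consequential are routed through review.

```python
from typing import Callable

ToolFn = Callable[..., str]


class ApprovalRequired(Exception):
    """Raised when a consequential action is not approved by the reviewer."""


class GatedDispatcher:
    """Hypothetical tool dispatcher that gates consequential calls on human review."""

    def __init__(self, approve: Callable[[str, dict], bool]) -> None:
        # `approve` stands in for a real human-review step.
        self._approve = approve
        self._tools: dict[str, tuple[ToolFn, bool]] = {}

    def register(self, name: str, fn: ToolFn, consequential: bool) -> None:
        self._tools[name] = (fn, consequential)

    def call(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            # Deny by default: the agent can only use explicitly registered tools.
            raise KeyError(f"unknown tool: {name}")
        fn, consequential = self._tools[name]
        if consequential and not self._approve(name, kwargs):
            raise ApprovalRequired(f"{name} blocked pending human review")
        return fn(**kwargs)


def read_file(path: str) -> str:
    return f"contents of {path}"


def delete_file(path: str) -> str:
    return f"deleted {path}"


# Toy reviewer that rejects everything; a real one would ask a person.
dispatcher = GatedDispatcher(approve=lambda name, args: False)
dispatcher.register("read_file", read_file, consequential=False)
dispatcher.register("delete_file", delete_file, consequential=True)

print(dispatcher.call("read_file", path="/tmp/report.txt"))
try:
    dispatcher.call("delete_file", path="/tmp/report.txt")
except ApprovalRequired as exc:
    print(exc)
```

Read-only calls pass straight through, while the deletion is stopped until a reviewer approves it. Deciding which tools count as consequential is the hard part and is left to the deployer here.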
Chang identified what the report calls the "connective tissue" between AI models and the external tools and data they access as a particularly exposed area. Protocols that allow AI models to connect to external tools and data sources (most notably the Model Context Protocol, the open standard Anthropic introduced in late 2024) and agentic workflows create "a large, unmonitored attack surface," Chang said.
Cisco's report lists several real-world exploits of this infrastructure discovered in 2025: a tool-poisoning attack that extracted private chat histories; a remote code execution flaw that let an attacker run shell commands on a victim's machine by tricking it into connecting to a malicious server; and a supply chain attack in which a forged package blind-copied every email sent through a compromised agent to an attacker-controlled address.
Chang said that when an AI agent with elevated access is compromised, there is rarely a single point of failure that allowed it to happen in the first place. "The combination of each of these elements typically creates a recipe for a breach," she said, pointing to identity governance, permission boundaries, monitoring and visibility, and change management as interconnected contributors.
Detecting a compromise is also a challenge because agent hijacking generates different behavioral signals than traditional credential theft. "The control plane often takes the form of prompts, context, and tool selection behavior rather than stolen credentials," she said. Security operations teams may catch fragments, such as anomalous API call patterns or unusual data transfer rates, but the signals differ enough from traditional intrusion indicators that standard tools can miss the bigger picture.
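As an illustration of the kind of fragment a security team might catch, the sketch below compares an agent's per-tool call counts in the current time window against earlier windows and flags tools that are new or used far more than usual. It is a toy per-tool z-score check under assumed inputs: the tool names, the window data, the threshold of 3.0, and the `min_sigma` floor of 1.0 are all hypothetical choices. As Chang notes, signals like these are only pieces of the picture and would not on their own reveal a hijacked agent.

```python
from collections import Counter
from statistics import mean, pstdev


def flag_anomalies(
    baseline_windows: list[Counter],
    current: Counter,
    z_threshold: float = 3.0,
    min_sigma: float = 1.0,
) -> list[str]:
    """Flag tools whose call counts in `current` look unusual versus past windows.

    Each Counter maps a tool/API name to how many times an agent called it in
    one time window. Toy z-score check for illustration, not a production detector.
    """
    flags: list[str] = []
    seen = set().union(*baseline_windows)  # every tool observed in the baseline
    for tool, count in current.items():
        if tool not in seen:
            # A tool the agent has never used before is worth a look on its own.
            flags.append(f"new tool: {tool} ({count} calls)")
            continue
        history = [w.get(tool, 0) for w in baseline_windows]
        mu = mean(history)
        # Floor the spread so a near-constant baseline doesn't flag tiny changes.
        sigma = max(pstdev(history), min_sigma)
        z = (count - mu) / sigma
        if z > z_threshold:
            flags.append(f"{tool}: {count} calls vs baseline mean {mu:.1f} (z={z:.1f})")
    return flags


baseline = [
    Counter(read_file=10, search=5),
    Counter(read_file=12, search=4),
    Counter(read_file=11, search=6),
]
current = Counter(read_file=11, search=40, send_email=3)

for flag in flag_anomalies(baseline, current):
    print(flag)
```

Here the surge in `search` calls and the first-ever use of `send_email` are flagged, while `read_file` stays within its normal range.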
Regarding attacker economics, Chang said AI-driven automation will lower attackers' costs while increasing their potential returns: it shortens the time needed to scan for vulnerabilities, speeds payload development, and allows threat actors to pursue a broader range of targets simultaneously.
