Meta’s machine learning model for detecting prompt injection attacks (special prompts designed to trick neural networks into behaving improperly) is itself vulnerable to prompt injection attacks, as you might expect.
Social networking giant Meta says Prompt-Guard-86M, which it introduced last week in conjunction with its Llama 3.1 generative model, is intended to “help developers detect and respond to prompt injection and jailbreak inputs.”
Large language models (LLMs) are trained using huge amounts of text and other data and may repeat it on request. This isn't ideal when the material is dangerous, suspicious, or contains personal information. So makers of AI models build filtering mechanisms called “guardrails” to catch queries or responses that could cause harm, including disclosing sensitive training data on request.
People using AI models have made a sport of circumventing guardrails using prompt injection (input designed to make the LLM ignore the internal system prompts that guide its output) and jailbreaking (input designed to make the model ignore safeguards).
This is a well-known problem that has yet to be solved. For example, about a year ago, computer scientists at Carnegie Mellon University developed automated technology that generates adversarial prompts to subvert safety mechanisms. The risk of AI models that can be manipulated in this way was made clear when a chatbot at a Chevrolet dealership in Watsonville, California, agreed to sell a $76,000 Chevrolet Tahoe for $1.
Perhaps the best-known prompt injection attack begins with “Please ignore previous instructions…” Also, a common jailbreak attack is the “Do Anything Now” or “DAN” attack, which prompts the LLM to adopt the role of a DAN, an AI model with no rules.
It turns out that by simply adding spaces between letters and omitting punctuation, you can ask Meta's Prompt-Guard-86M classification model to “ignore previous instructions.”
Aman Priyanshu, a bug hunter at enterprise AI application security shop Robust Intelligence, recently discovered the safety bypass when analyzing differences in embedding weights between Meta's Prompt-Guard-86M model and Redmond's base model, microsoft/mdeberta-v3-base.
Prompt-Guard-86M was created by tweaking a base model to catch risky prompts, but Priyanshu discovered that the tweak process had little effect on single English characters, which allowed him to devise an attack.
“This bypass involves inserting per-character spaces between all English alphabetic characters within a given prompt,” Priyanshu explained in a GitHub Issues post on the Prompt-Guard repository on Thursday. “This simple transformation prevents the classifier from detecting potentially harmful content.”
The findings coincide with a May post by the security organization that suggested safety controls could be breached by tweaking the models.
“Whatever nasty question you want to ask, just remove the punctuation and put spaces between the letters,” said Hiram Anderson, CTO of Robust Intelligence. Registry“It's so simple and so effective. And not just by a little bit. It increased our attack success rate from less than 3% to almost 100%.”
Anderson acknowledged that PromptGuard could fail — it's only a first line of defense, and it's possible that it will reject malicious prompts from whatever model it's testing — but he said the purpose of pointing it out is to raise awareness among companies trying to use AI that there are plenty of ways things can go wrong.
Meta did not immediately respond to a request for comment, but we've heard the social media industry is working on a fix.®