Understanding adversarial attacks against machine learning and AI

Machine Learning


These modifications range from simple input transformations that alter model inference (commonly known as model evasion attacks) to carefully crafted inputs that exploit vulnerabilities in the model’s inference process and produce erroneous or harmful outputs outside of its intended functionality. Some attacks in this class have undesirable effects on the model, such as allowing the production of undesired output or forcing the model into an iterative execution loop.

For LLM, model input operations are most commonly implemented as prompt injection techniques. Malicious attackers attempt to overcome LLM usage controls by concatenating commands and input prompts designed to minimize the chance that LLM will refuse to respond. Because these systems are typically supported by reference content databases, they are also at risk of indirect prompt injection attacks, where a model’s prompt-enriched queries to these databases can provide malicious content.

Similarly, as with the development of agent AI systems, as models are increasingly linked together and able to call APIs on each other, model input manipulation attacks can also originate from system inputs, tool responses, or other agent systems.

Some techniques in this class can modify malicious inputs to activate functionality that was previously hidden within the model. See Training Data Poisoning and Working with Model Artifacts. This could indicate a linked attack chain of sophisticated actors.



Source link