Prompt injection attacks exploit the fundamental architecture of LLMs by embedding malicious instructions within user inputs or external data sources. These attacks hijack the AI system's intended goals, causing it to execute attacker-controlled instructions instead of its programmed objectives. This category encompasses both direct manipulation through user input and indirect attacks via poisoned data sources, representing one of the most significant security challenges for deployed AI systems.
Attackers craft explicit commands within user input to override or replace the AI system's operational directives. Common patterns include phrases like "ignore previous instructions" or "you are now in developer mode." This represents the most straightforward form of prompt injection, targeting the model's instruction-following capabilities directly.
Malicious instructions are disguised through encoding techniques, character substitution, or linguistic tricks to evade detection mechanisms while preserving attack functionality. Methods include leetspeak, unicode homoglyphs, base64 encoding, language mixing, and semantic obfuscation through synonyms or paraphrasing.
In multi-agent systems, attackers inject malicious instructions through one agent's output that are then trusted and executed by downstream agents. This exploits the inherent trust relationships between cooperating agents, where outputs from one component become trusted inputs to another.
Malicious instructions embedded within external data sources such as documents, web pages, emails, or API responses are retrieved and processed by the AI system. These poisoned sources inject instructions that override the model's behavior without the user's awareness, exploiting RAG systems and data retrieval workflows.
Hidden or encoded instructions within external data sources designed to evade content scanning and input validation while remaining interpretable by the AI model. This combines indirect injection with evasion techniques to maximize attack success probability.
Exploitation of inter-agent communication channels through poisoned external content that propagates between agents. One agent retrieves compromised data which then flows through the multi-agent workflow, affecting multiple downstream components.
Attackers gradually shift the AI system's operational objectives over multiple interaction turns through carefully crafted prompts. Contradictory or concealed objectives are embedded within conversations, slowly steering the model away from its intended behavior toward attacker-defined goals.
Attackers compromise external components that AI agents depend on, including tools, prompt templates, resources, or dependencies. Malicious objectives are injected through these trusted supply chain elements, redirecting agent behavior at a foundational level.
Malicious instructions, prompts, or data are embedded within images using techniques like steganography, adversarial patches, or hidden text. Vision-language models extract and interpret these hidden payloads, enabling attacks that bypass text-based content filters.
Modification of visual content through pixel-level changes, structural alterations, or pattern overlays to influence how AI models perceive and process images. Unlike embedded text injection, this targets the model's visual interpretation directly to cause misclassification or altered decision-making.
Inaudible or unintelligible voice commands embedded within audio streams using ultrasonic frequencies, backmasking, or steganographic techniques. Automatic speech recognition models interpret these hidden signals as valid instructions while remaining imperceptible to human listeners.
Harmful content or malicious instructions embedded within video streams through specific frames, QR-like visual triggers, or temporal patterns. These attacks exploit multimodal model processing of video content to bypass guardrails and inject commands.