AI-MANY-SHOT-JAILBREAK-2024
LLM security · Many-shot jailbreaking
Summary
Anthropic showed that prepending a prompt with a large number of fabricated dialogues, in which an assistant answers harmful questions, exploits in-context learning to override safety training. With only a few fabricated dialogues the model still refuses, but scaling to 256 or more overwhelms the safeguards, and effectiveness grows as a power law in the number of examples. The technique works against Anthropic's own models and those of its peers, and larger, more capable models are more vulnerable because they are better at in-context learning. It is a research jailbreak technique, made possible by the expanded context windows of modern LLMs.
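The power-law claim can be made concrete with a small fitting sketch. The summary does not say which metric follows the power law, so the code below assumes a hypothetical "resistance" score that decays as c * n^(-alpha) with the number of shots n. The data points are synthetic and generated inside the example; they are not measurements from the research.

```python
import math


def fit_power_law(shots, values):
    """Least-squares fit of values ~ c * shots**(-alpha), done in log-log space."""
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line through the log-log points.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (c, alpha)


# Synthetic, illustrative data only: a hypothetical resistance score
# generated from c = 2.0 and alpha = 0.5. Not real measurements.
shots = [1, 4, 16, 64, 256]
resistance = [2.0 * n ** -0.5 for n in shots]

c, alpha = fit_power_law(shots, resistance)
```

A power law appears as a straight line on log-log axes, which is why an ordinary linear fit on the logarithms recovers the exponent.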
How to avoid it in your code
- Apply input guardrails that detect long sequences of fabricated harmful dialogues.
- Use prompt classification and context-window limits to blunt many-shot in-context attacks.
- Enforce output content filtering independent of the in-context conversation.
- Apply least privilege so a jailbroken model cannot reach sensitive tools or data.
- Keep models up to date so that they include the vendor's latest defenses against in-context attacks.
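The first two items above can be sketched as a pre-model screening step. This is a minimal illustration, not a complete defense: the speaker labels, the turn threshold and the length cap are all assumptions chosen for the example, and a production guardrail would more plausibly use a trained classifier than a regular expression.

```python
import re

# Assumed speaker labels that mark an embedded dialogue turn (hypothetical list).
TURN_PATTERN = re.compile(
    r"^\s*(user|human|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE
)

MAX_EMBEDDED_TURNS = 16    # assumed threshold; tune per application
MAX_PROMPT_CHARS = 20_000  # assumed cap standing in for a context-window limit


def count_embedded_turns(prompt: str) -> int:
    """Count lines that look like 'User:' / 'Assistant:' dialogue turns."""
    return len(TURN_PATTERN.findall(prompt))


def screen_prompt(prompt: str) -> tuple[bool, str]:
    """Return (allowed, reason) before the prompt is sent to the model."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "prompt exceeds length cap"
    turns = count_embedded_turns(prompt)
    if turns > MAX_EMBEDDED_TURNS:
        return False, f"prompt embeds {turns} dialogue turns"
    return True, "ok"


# Usage: a plain question passes, a prompt stuffed with faux dialogues does not.
benign = "What is the capital of France?"
many_shot = "\n".join(
    f"User: question {i}\nAssistant: answer {i}" for i in range(20)
)
print(screen_prompt(benign))
print(screen_prompt(many_shot))
```

A check like this only narrows the input side. Output filtering and least-privilege tool access still matter, because an attacker can reformat the dialogues to evade a pattern match.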
Related vulnerabilities
- HIGH · AI-SHADOWLEAK-2025
ShadowLeak is a server-side zero-click indirect prompt-injection attack against ChatGPT's Deep Research agent, discovered by Radware. An attacker emails the victim a message with instructions hidden in the HTML using white-on-white text and tiny fonts; when the user runs Deep Research over their inbox, the agent autonomously follows the hidden instructions and exfiltrates personal and inbox data. The distinguishing trait is that exfiltration occurs entirely server-side within OpenAI's cloud infrastructure, making it invisible to local and enterprise network defenses. The Gmail proof of concept generalizes to any Deep Research connector; OpenAI fixed it before public disclosure with no evidence of in-the-wild exploitation.
- MEDIUM · AI-GEMINI-WORKSPACE-2025
Marco Figueroa of Mozilla's 0DIN program documented a Gemini for Workspace flaw in which an attacker hides instructions inside an email using HTML tags styled with font-size zero or white-on-white text, making them invisible to the recipient. When the user clicks "Summarize this email", Gemini processes the raw HTML and treats the hidden directive as a high-priority instruction, appending an attacker-crafted fake security warning, for example one containing a fake support phone number, that appears to come from Google. No links or attachments are required, which enables credential harvesting and vishing at scale through indirect prompt injection.
- HIGH · AI-SLACK-PROMPT-INJECTION-2024
PromptArmor disclosed an indirect prompt-injection data-exfiltration flaw in Slack AI. An attacker with only the ability to post in a public channel plants adversarial instructions; when any Slack AI user later queries the assistant, the model ingests the planted text and follows it. The injection makes Slack AI render a deceptive Markdown link whose URL encodes private-channel data in the query string, so clicking it exfiltrates the secret to the attacker's server. A subsequent Slack update that added files from channels and DMs to AI answers widened the attack surface.
- HIGH · AI-LIVING-OFF-COPILOT-2024
At Black Hat USA 2024, Michael Bargury of Zenity presented Living off Microsoft Copilot, demonstrating how indirect prompt injection, RAG poisoning and phantom references let an attacker manipulate Microsoft 365 Copilot to exfiltrate sensitive enterprise data, bypass Data Loss Prevention controls, and conduct AI-driven spear-phishing and social engineering. Zenity released red-team tooling including LOLCopilot, CopilotHunter and PowerPwn v3. This was a red-team research demonstration against the live product rather than a single patched CVE.
- HIGH · AI-SKELETON-KEY-2024
Skeleton Key, disclosed by Microsoft's Mark Russinovich, is a multi-turn jailbreak that convinces a model to augment rather than replace its safety guidelines, agreeing to answer any request but prefixing potentially harmful output with a warning instead of refusing. Once the model accepts this behavior change, it complies with otherwise-restricted requests across categories such as explosives, bioweapons, self-harm and violence. Microsoft tested it against models from Meta, Google, OpenAI, Mistral, Anthropic and Cohere, with most complying fully. It is a jailbreak technique rather than an exploited product vulnerability.
- HIGH · CVE-2024-5565
The Vanna.AI text-to-SQL library exposes an ask() method that, with visualization enabled by default, chains LLM output from SQL to Python code that builds a Plotly visualization, and then runs that code with exec(). An attacker who supplies crafted natural-language input can use prompt injection to replace the intended Plotly code, so that arbitrary Python executes on the host, yielding remote code execution. The flaw, discovered by JFrog, affects versions up to and including 0.5.5. It is fixed in 0.5.6 and can also be mitigated by disabling visualization for external input.