AI-DATA-MODEL-POISONING-2023

LLM security · Training-data / RAG poisoning

Summary

Training-data and RAG poisoning is a vulnerability class in which an attacker injects malicious or backdoored data into a model's pre-training set, fine-tuning corpus or retrieval-augmented generation (RAG) knowledge base so that the model emits attacker-chosen outputs, often only when a specific trigger is present. The same effect can be achieved by editing model weights directly, and the change can be surgical: Mithril Security's PoisonGPT demonstration (July 9, 2023) used Rank-One Model Editing (ROME) to overwrite a single factual association in GPT-J-6B so that it asserted Yuri Gagarin was the first man on the Moon, while remaining within roughly 0.1% of the original model's benchmark accuracy and therefore hard to detect with standard evaluation. Mithril distributed the model on Hugging Face under the typosquatted name 'EleuterAI' (missing the 'h') to mimic the legitimate EleutherAI lab, illustrating the supply-chain reach of such an attack. RAG poisoning is analogous: the attacker seeds malicious documents into a vector store, and retrieval injects them into the model's context at query time. The class maps to OWASP LLM04:2025 Data and Model Poisoning.
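The typosquatting step can be illustrated with a short sketch. It compares a publisher namespace against an allowlist and flags near-misses such as 'EleuterAI'. The allowlist contents, the function name check_publisher and the 0.85 similarity threshold are assumptions made for illustration; they are not part of the PoisonGPT write-up or of any Hugging Face API.

```python
from difflib import SequenceMatcher

# Hypothetical allowlist of publisher namespaces that have been vetted.
TRUSTED_PUBLISHERS = {"EleutherAI"}


def check_publisher(name: str, threshold: float = 0.85) -> str:
    """Classify a publisher namespace as 'trusted', 'suspicious' or 'unknown'.

    'suspicious' means the name is not on the allowlist but is very similar
    to an entry that is, which is the typical shape of a typosquat.
    The threshold is an illustrative assumption, not a tuned value.
    """
    if name in TRUSTED_PUBLISHERS:
        return "trusted"
    for trusted in TRUSTED_PUBLISHERS:
        ratio = SequenceMatcher(None, name.lower(), trusted.lower()).ratio()
        if ratio >= threshold:
            return "suspicious"
    return "unknown"


if __name__ == "__main__":
    for candidate in ("EleutherAI", "EleuterAI", "openai"):
        print(candidate, "->", check_publisher(candidate))
```

A check like this only catches names that resemble an allowlisted one. An exact-match allowlist remains the stronger control, with the similarity test serving as an early warning.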

How to avoid it in your code

Verify model and dataset provenance via signing, checksums or attestation before use.
Source models from trusted publishers and guard against typosquatted repository names.
Vet, sanitize and access-control documents ingested into RAG knowledge bases.
Track data lineage and use anomaly detection on training and fine-tuning corpora.
Red-team models with trigger and backdoor probes beyond standard accuracy benchmarks.
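
Two of the recommendations above, checksum verification and access control on RAG ingestion, can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the pinned digest is expected to come from a trusted channel such as a signed release manifest, and the names verify_artifact, admit_to_knowledge_base, Document and ALLOWED_SOURCES are hypothetical, chosen for this example.

```python
import hashlib
import hmac
from dataclasses import dataclass
from pathlib import Path

# Hypothetical set of document origins allowed into the knowledge base.
ALLOWED_SOURCES = {"internal-wiki", "product-docs"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Compare a model or dataset file against a pinned digest.

    The expected digest is assumed to have been obtained out of band;
    a digest fetched from the same place as the file proves little.
    """
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())


@dataclass(frozen=True)
class Document:
    source: str  # where the document came from (assumed to be recorded at ingestion)
    text: str


def admit_to_knowledge_base(doc: Document) -> bool:
    """Admit only non-empty documents from an allowlisted source."""
    return doc.source in ALLOWED_SOURCES and bool(doc.text.strip())
```

A matching checksum shows only that the file is the one the publisher released. It does not show that the publisher's weights are free of backdoors, which is why the trigger and backdoor probes in the last item are still needed.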
