How to Benchmark a Prompt-Injection Detector Honestly
Most prompt-injection detector benchmarks are broken before the first request. Here is a test design that produces a number you can actually trust.
A vendor tells you their prompt-injection detector has 98% accuracy. That number is almost always meaningless, and not because the vendor is lying. It is meaningless because “accuracy” on an unspecified corpus, at an unspecified threshold, against an unspecified attacker, with no false-positive rate, is not a measurement. It is a marketing artifact.
This post is the test design we use before publishing any detector benchmark. It is built to stop you from fooling yourself.
The four numbers, never one
A prompt-injection detector is a binary classifier. Reporting a single accuracy figure throws away the information that decides whether the thing is deployable. Always report:
- True-positive rate (recall) — of the injections that should be caught, what fraction were?
- False-positive rate — of benign inputs, what fraction were wrongly flagged?
- Latency at p95 — how long does a verdict take under load, not in a single-shot demo?
- Cost per 1,000 calls — including any model the detector itself invokes.
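As a rough sketch of how the four numbers can come out of one evaluation run, the snippet below aggregates per-request records. The `Verdict` record, its field names, and the nearest-rank percentile are assumptions made for illustration, not part of any particular detector's API.

```python
import math
from dataclasses import dataclass


@dataclass
class Verdict:
    """One evaluated request (hypothetical record layout)."""
    is_attack: bool     # ground-truth label
    flagged: bool       # detector decision at the chosen threshold
    latency_ms: float   # wall-clock time for this verdict, measured under load
    cost_usd: float     # everything billed for this call, including model calls


def percentile(values, pct):
    """Nearest-rank percentile, no interpolation."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def four_numbers(verdicts):
    """Return recall, FPR, p95 latency, and cost per 1,000 calls."""
    attacks = [v for v in verdicts if v.is_attack]
    benign = [v for v in verdicts if not v.is_attack]
    if not attacks or not benign:
        raise ValueError("need both attack and benign samples")
    return {
        "recall": sum(v.flagged for v in attacks) / len(attacks),
        "fpr": sum(v.flagged for v in benign) / len(benign),
        "p95_latency_ms": percentile([v.latency_ms for v in verdicts], 95),
        "cost_per_1k_usd": 1000 * sum(v.cost_usd for v in verdicts) / len(verdicts),
    }
```

Recall and FPR use different denominators (attacks and benign inputs respectively), which is why neither can be recovered from a single accuracy figure.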
A detector with 99% recall and a 25% false-positive rate is unusable in any high-traffic product: it will flag a quarter of legitimate inputs. A detector with 80% recall and a 0.5% false-positive rate is often the better production choice. The single-accuracy framing hides exactly this trade-off, which is why vendors prefer it.
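To make the trade-off concrete, here is a small worked calculation. The traffic volume (100,000 requests) and the 1% attack rate are assumptions chosen for illustration; only the recall and false-positive figures come from the text above.

```python
def expected_flags(requests, attack_rate, recall, fpr):
    """Expected true and false flags for a given traffic mix."""
    attacks = requests * attack_rate
    benign = requests - attacks
    true_flags = attacks * recall
    false_flags = benign * fpr
    return {
        "true_flags": round(true_flags),
        "false_flags": round(false_flags),
        # Share of all flags that are real attacks.
        "precision": true_flags / (true_flags + false_flags),
    }


# Assumed traffic: 100,000 requests, 1% of them attacks.
loud = expected_flags(100_000, 0.01, recall=0.99, fpr=0.25)
quiet = expected_flags(100_000, 0.01, recall=0.80, fpr=0.005)
```

Under those assumed numbers, the 99%-recall detector catches 990 attacks but raises 24,750 false alarms, so fewer than 4% of its flags are real. The 80%-recall detector catches 800 attacks with 495 false alarms, so most of its flags are real.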
The benign corpus is the load-bearing piece
Most published detector benchmarks spend all their effort on the attack corpus and grab the benign corpus as an afterthought, usually short, clean, well-formed English sentences. This pushes the reported false-positive rate well below what production traffic would show, because real benign traffic is messy.
A credible benign corpus has to include the inputs that look adversarial but are not:
- Security researchers and developers legitimately discussing prompt injection (“how do I defend against ‘ignore previous instructions’ attacks?”)
- Documents that quote system-like text: config files, logs, chat transcripts, code with string literals
- Multilingual and code-switched input
- Long, structured inputs (RAG chunks, pasted articles) where an injection would actually hide
If your benign set is 500 tidy sentences, your reported false-positive rate is fiction. Roughly half of our benign corpus is deliberately injection-adjacent, because that is where real detectors break.
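One way to keep the benign corpus honest is to report its composition and its false-positive rate split by whether an input is injection-adjacent. The sketch below assumes each benign sample carries a hypothetical `is_adjacent` tag assigned when the corpus is built.

```python
def benign_breakdown(results):
    """results: iterable of (is_adjacent, flagged) pairs; every input is benign."""
    counts = {True: [0, 0], False: [0, 0]}  # is_adjacent -> [flagged, total]
    for is_adjacent, flagged in results:
        bucket = counts[bool(is_adjacent)]
        bucket[0] += int(flagged)
        bucket[1] += 1
    total = counts[True][1] + counts[False][1]
    if total == 0:
        raise ValueError("empty benign corpus")

    def rate(bucket):
        # None when the corpus has no samples of that kind.
        return bucket[0] / bucket[1] if bucket[1] else None

    return {
        "adjacent_share": counts[True][1] / total,
        "fpr_adjacent": rate(counts[True]),
        "fpr_tidy": rate(counts[False]),
        "fpr_overall": (counts[True][0] + counts[False][0]) / total,
    }
```

If `adjacent_share` is near zero, `fpr_overall` is mostly measuring the tidy sentences and says little about production behavior.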
The attacker has to be at least as good as a bored teenager
The attack corpus has the opposite failure mode: it is too easy. Benchmarks built from a fixed list of 2023-era “ignore all previous instructions” strings test memorization of a static blocklist, not detection. Any adversary iterates.
Tier the attack corpus and report per-tier:
- Tier 1 — direct, naive. Plain instruction-override strings. A useful floor; near-100% here is table stakes, not a selling point.
- Tier 2 — obfuscated. Base64, leetspeak, unicode homoglyphs, translation, payload splitting across turns.
- Tier 3 — indirect. Injection delivered through retrieved content, tool output, or document context (the Greshake et al. class). This is the one that matters for agents, and the one most detectors handle worst.
- Tier 4 — adaptive. Payloads written after inspecting the detector’s behavior. This is what a real attacker does. Even a small adaptive set exposes detectors that only pattern-match.
A detector that scores 99% on Tier 1 and 40% on Tier 3 is being sold on the 99%. Report both or the benchmark is dishonest.
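A minimal sketch of per-tier reporting follows. The tier keys are hypothetical labels for the four tiers above, and the pooled figure is computed only to show how much it depends on the tier mix.

```python
from collections import Counter

TIERS = ("direct", "obfuscated", "indirect", "adaptive")  # hypothetical labels


def recall_by_tier(attack_results):
    """attack_results: iterable of (tier, flagged) pairs; every input is an attack."""
    caught, total = Counter(), Counter()
    for tier, flagged in attack_results:
        total[tier] += 1
        caught[tier] += int(flagged)
    if not total:
        raise ValueError("empty attack corpus")
    per_tier = {t: caught[t] / total[t] for t in TIERS if total[t]}
    # Pooled recall is dominated by whichever tier has the most samples.
    pooled = sum(caught.values()) / sum(total.values())
    return per_tier, pooled


# Illustrative corpus: 100 direct payloads (99 caught), 10 indirect (4 caught).
sample = (
    [("direct", True)] * 99 + [("direct", False)]
    + [("indirect", True)] * 4 + [("indirect", False)] * 6
)
per_tier, pooled = recall_by_tier(sample)
```

In this made-up corpus the pooled recall lands above 93% even though indirect recall is 40%, because the easy tier supplies most of the samples.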
Threshold transparency
Almost every detector exposes a score and a threshold, explicitly or implicitly. Recall and false-positive rate trade off continuously as you move it. A vendor can hit any recall number they want by lowering the threshold — at the cost of a false-positive rate they will not print.
The only honest presentation is the curve: sweep the threshold, plot recall against false-positive rate, and report the operating point you would actually deploy (typically the highest recall achievable at or below a 1% false-positive rate). A single (recall, FPR) pair with no stated threshold is not reproducible.
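The sweep can be sketched as below, assuming the detector returns a score where higher means more likely an injection and an input is flagged when its score is at or above the threshold. The function names are illustrative, and the quadratic loop favors clarity over speed.

```python
def sweep(scores, labels):
    """Return (threshold, recall, fpr) for each distinct score used as threshold.

    labels: 1 for attack, 0 for benign. Flagged means score >= threshold.
    """
    n_attack = sum(labels)
    n_benign = len(labels) - n_attack
    if n_attack == 0 or n_benign == 0:
        raise ValueError("need both attack and benign samples")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        points.append((threshold, tp / n_attack, fp / n_benign))
    return points


def operating_point(points, max_fpr=0.01):
    """Highest-recall point whose FPR stays at or below max_fpr, or None."""
    eligible = [p for p in points if p[2] <= max_fpr]
    if not eligible:
        return None
    # Prefer higher recall; break ties with the lower FPR.
    return max(eligible, key=lambda p: (p[1], -p[2]))
```

Publishing the full `points` list alongside the chosen operating point lets a reader check what a different FPR budget would have produced.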
Contamination and the moving target
Public prompt-injection corpora (the ones in many GitHub repos) have been scraped into training and tuning data. A detector fine-tuned on public injection datasets will score implausibly well on a benchmark drawn from those same datasets — it has seen the test. Mitigate by:
- Holding out a private corpus the detector vendor has never seen
- Generating fresh adaptive payloads per evaluation run
- Treating any near-ceiling Tier 1–2 result with the same suspicion you would treat a 100% on a public LLM benchmark
Contamination is not a one-time cleanup. It is an ongoing condition of evaluating tools trained on public data.
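A cheap first pass at contamination is to drop test payloads that also appear in public corpora after light normalization. The sketch below is an assumption about how such a filter might look. It only catches matches that are identical after case, whitespace, and Unicode normalization, so paraphrased or lightly edited copies still need fuzzy matching or manual review.

```python
import hashlib
import unicodedata


def fingerprint(text):
    """Hash of the text after Unicode, case, and whitespace normalization."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def split_contaminated(test_payloads, public_payloads):
    """Separate test payloads into (clean, contaminated) lists."""
    seen = {fingerprint(p) for p in public_payloads}
    clean, contaminated = [], []
    for payload in test_payloads:
        if fingerprint(payload) in seen:
            contaminated.append(payload)
        else:
            clean.append(payload)
    return clean, contaminated
```

Reporting results on the clean and contaminated subsets separately shows how much of a headline score comes from payloads the detector may already have seen.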
A minimal protocol
- Build a benign corpus where ~50% is injection-adjacent-but-legitimate.
- Build a tiered attack corpus (direct → obfuscated → indirect → adaptive).
- Hold out a private slice the tool has never seen.
- Sweep the threshold; record the full recall/FPR curve.
- Report recall per attack tier, plus FPR on the benign corpus, p95 latency, and cost per 1k calls.
- Re-generate the adaptive tier each run; never reuse it.
The protocol is a week of work to stand up and an afternoon to re-run per tool. It produces a number that survives contact with a real adversary, which is the only kind worth publishing.
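The checklist above can also be enforced mechanically before a result is published. The sketch below assumes the benchmark report is a plain dictionary. Every key name and the 0.5 cutoff (a strict reading of "~50%") are hypothetical choices for illustration.

```python
REQUIRED_TIERS = ("direct", "obfuscated", "indirect", "adaptive")
REQUIRED_NUMBERS = ("fpr", "p95_latency_ms", "cost_per_1k_usd", "threshold")


def protocol_gaps(report, min_adjacent_share=0.5):
    """Return the protocol steps that a report dict fails to document."""
    gaps = []
    if report.get("benign_adjacent_share", 0.0) < min_adjacent_share:
        gaps.append("benign corpus is not ~50% injection-adjacent")
    tiers = report.get("recall_by_tier", {})
    missing = [t for t in REQUIRED_TIERS if t not in tiers]
    if missing:
        gaps.append("no recall reported for tiers: " + ", ".join(missing))
    if not report.get("held_out_private", False):
        gaps.append("no private held-out slice")
    if len(report.get("curve", [])) < 2:
        gaps.append("no recall/FPR curve from a threshold sweep")
    for name in REQUIRED_NUMBERS:
        if name not in report:
            gaps.append("missing " + name)
    # Treat an unstated adaptive-set status as reuse.
    if report.get("adaptive_set_reused", True):
        gaps.append("adaptive tier was not regenerated for this run")
    return gaps
```

An empty list means every step was at least recorded. It does not vouch for the quality of the corpora themselves.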
Where this fits in the network
aisecbench.com runs this protocol against shipping detectors and publishes the curves, not just the headline. For tooling reviews that pair with these benchmarks, see Best LLM scanners and AI content moderation tools. For the attack side of the corpus design, the offensive writeups at aisec.blog document the techniques our Tier 3–4 sets are built from.
What we don’t do
- Report a single accuracy number with no false-positive rate
- Benchmark only against static 2023-era injection strings
- Use a tidy benign corpus and call the resulting FPR realistic
- Trust vendor numbers on public corpora the vendor’s model was trained on
- Hide the threshold
The benchmark-vs-reality gap for prompt-injection detection is almost entirely a test-design problem. Fix the test and the honest tools start to look very different from the marketed ones.