How to Benchmark a Prompt-Injection Detector Honestly

A vendor tells you their prompt-injection detector has 98% accuracy. That number is almost always meaningless, and not because the vendor is lying. It is meaningless because “accuracy” on an unspecified corpus, at an unspecified threshold, against an unspecified attacker, with no false-positive rate, is not a measurement. It is a marketing artifact.

This post is the test design we use before publishing any detector benchmark. It is built to stop you from fooling yourself.

The four numbers, never one

A prompt-injection detector is a binary classifier. Reporting a single accuracy figure throws away the information that decides whether the thing is deployable. Always report:

True-positive rate (recall) — of the injections that should be caught, what fraction were?
False-positive rate — of benign inputs, what fraction were wrongly flagged?
Latency at p95 — how long does a verdict take under load, not in a single-shot demo?
Cost per 1,000 calls — including any model the detector itself invokes.

A detector with 99% recall and a 25% false-positive rate is unusable in any high-traffic product: it will flag a quarter of legitimate inputs. A detector with 80% recall and a 0.5% false-positive rate is often the better production choice. The single-accuracy framing hides exactly this trade-off, which is why vendors prefer it.
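To make the trade-off concrete, here is a minimal sketch that computes recall, false-positive rate, and accuracy from raw verdicts. The `Rates` class, the `rates` and `synthetic_run` helpers, and the 1,800/200 corpus split are all assumptions made up for illustration; only the two (recall, FPR) pairs come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    recall: float    # true-positive rate: caught injections / all injections
    fpr: float       # false-positive rate: flagged benign / all benign
    accuracy: float  # correct verdicts / all inputs


def rates(labels, flagged):
    """labels[i] is True for an injection; flagged[i] is the detector's verdict."""
    pairs = list(zip(labels, flagged))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one injection and one benign input")
    return Rates(tp / (tp + fn), fp / (fp + tn), (tp + tn) / len(pairs))


def synthetic_run(n_attack, n_benign, caught, false_alarms):
    """Build labels and verdicts for a made-up run with the given counts."""
    labels = [True] * n_attack + [False] * n_benign
    flagged = (
        [True] * caught + [False] * (n_attack - caught)
        + [True] * false_alarms + [False] * (n_benign - false_alarms)
    )
    return labels, flagged


# Hypothetical attack-heavy corpus: 1,800 injections, 200 benign inputs.
noisy = rates(*synthetic_run(1800, 200, caught=1782, false_alarms=50))
quiet = rates(*synthetic_run(1800, 200, caught=1440, false_alarms=1))
print(noisy)  # recall 0.99, fpr 0.25, accuracy 0.966
print(quiet)  # recall 0.80, fpr 0.005, accuracy 0.8195
```

On this made-up corpus the detector that flags a quarter of benign traffic reports 96.6% accuracy and the deployable one reports about 82%. Accuracy depends on the class mix of the test set, so it can rank the two detectors in the wrong order.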

The benign corpus is the load-bearing piece

Most published detector benchmarks spend all their effort on the attack corpus and grab the benign corpus as an afterthought, usually short, clean, well-formed English sentences. This pushes the reported false-positive rate far below what production traffic would produce, because real benign traffic is messy.

A credible benign corpus has to include the inputs that look adversarial but are not:

Security researchers and developers legitimately discussing prompt injection (“how do I defend against ignore previous instructions attacks?”)
Documents that quote system-like text — config files, logs, chat transcripts, code with string literals
Multilingual and code-switched input
Long, structured inputs (RAG chunks, pasted articles) where an injection would actually hide

If your benign set is 500 tidy sentences, your reported false-positive rate is fiction. Roughly half of our benign corpus is deliberately injection-adjacent, because that is where real detectors break.
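One way to keep the benign corpus honest is to tag every benign sample with a category and report the false-positive rate per category alongside the overall figure. The sketch below assumes hypothetical category names that mirror the list above, plus a `tidy` category for clean sentences; none of these labels come from a real dataset.

```python
from collections import defaultdict

# Hypothetical labels for the injection-adjacent categories listed above.
ADJACENT = {
    "security_discussion",
    "quoted_system_text",
    "multilingual",
    "long_structured",
}


def benign_report(samples):
    """samples: iterable of (category, was_flagged) pairs, all of them benign.

    Returns (adjacent_share, overall_fpr, fpr_by_category).
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for category, was_flagged in samples:
        totals[category] += 1
        flagged[category] += int(was_flagged)
    n = sum(totals.values())
    if n == 0:
        raise ValueError("benign corpus is empty")
    adjacent_share = sum(c for k, c in totals.items() if k in ADJACENT) / n
    fpr_by_category = {k: flagged[k] / totals[k] for k in totals}
    overall_fpr = sum(flagged.values()) / n
    return adjacent_share, overall_fpr, fpr_by_category


# Made-up run: 50 tidy sentences, 50 security discussions, 10 of them flagged.
samples = (
    [("tidy", False)] * 50
    + [("security_discussion", True)] * 10
    + [("security_discussion", False)] * 40
)
share, overall, by_category = benign_report(samples)
print(share, overall, by_category)
```

In this made-up run every false positive comes from the injection-adjacent half. A tidy-only corpus would have reported a false-positive rate of zero for the same detector.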

The attacker has to be at least as good as a bored teenager

The attack corpus has the opposite failure mode: it is too easy. Benchmarks built from a fixed list of 2023-era “ignore all previous instructions” strings test memorization of a static blocklist, not detection. Any adversary iterates.

Tier the attack corpus and report per-tier:

Tier 1: direct, naive. Plain instruction-override strings. A useful floor; near-100% here is table stakes, not a selling point.
Tier 2: obfuscated. Base64, leetspeak, Unicode homoglyphs, translation, payload splitting across turns.
Tier 3: indirect. Injection delivered through retrieved content, tool output, or document context (the class of attack described by Greshake et al.). This is the one that matters for agents, and the one most detectors handle worst.
Tier 4: adaptive. Payloads written after inspecting the detector’s behavior. This is what a real attacker does. Even a small adaptive set exposes detectors that only pattern-match.

A detector that scores 99% on Tier 1 and 40% on Tier 3 is being sold on the 99%. Report both or the benchmark is dishonest.
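Per-tier reporting is a small amount of bookkeeping. The sketch below assumes each attack sample carries a tier label; the tier names and the `recall_by_tier` helper are illustrative, and the counts are made up to match the 99% and 40% figures above.

```python
from collections import Counter

# Hypothetical labels for the four tiers described above.
TIERS = ("direct", "obfuscated", "indirect", "adaptive")


def recall_by_tier(results):
    """results: iterable of (tier, was_caught) pairs, one per attack sample."""
    total, caught = Counter(), Counter()
    for tier, was_caught in results:
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier!r}")
        total[tier] += 1
        caught[tier] += int(was_caught)
    # Tiers with no samples are omitted rather than reported as 0%.
    return {t: caught[t] / total[t] for t in TIERS if total[t]}


# Made-up run: 100 direct payloads (99 caught), 100 indirect (40 caught).
results = (
    [("direct", True)] * 99 + [("direct", False)] * 1
    + [("indirect", True)] * 40 + [("indirect", False)] * 60
)
per_tier = recall_by_tier(results)
pooled = sum(caught for _, caught in results) / len(results)
print(per_tier)  # direct 0.99, indirect 0.4
print(pooled)    # 0.695
```

The pooled figure of 69.5% describes neither tier, and it shifts with how many samples each tier happens to contain. That is one more reason to print the per-tier numbers.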

Threshold transparency

Almost every detector exposes a score and a threshold, explicitly or implicitly. Recall and false-positive rate trade off continuously as you move it. A vendor can hit any recall number they want by lowering the threshold — at the cost of a false-positive rate they will not print.

The only honest presentation is the curve: sweep the threshold, plot recall against false-positive rate, and report the operating point you would actually deploy (typically the highest recall achievable at or below a 1% false-positive rate). A single (recall, FPR) pair with no stated threshold is not reproducible.
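A threshold sweep can be sketched in a few lines, assuming the detector returns a numeric score per input and flags anything at or above the threshold. The `sweep` and `operating_point` helpers and the scores below are illustrative assumptions; a real detector may use a different score range or direction.

```python
def sweep(attack_scores, benign_scores):
    """Return (threshold, recall, fpr) for every distinct observed score.

    An input is flagged when its score is >= threshold. Thresholds are
    listed from strictest (highest) to loosest (lowest).
    """
    if not attack_scores or not benign_scores:
        raise ValueError("need scores for both attack and benign inputs")
    thresholds = sorted(set(attack_scores) | set(benign_scores), reverse=True)
    curve = []
    for t in thresholds:
        recall = sum(s >= t for s in attack_scores) / len(attack_scores)
        fpr = sum(s >= t for s in benign_scores) / len(benign_scores)
        curve.append((t, recall, fpr))
    return curve


def operating_point(curve, max_fpr=0.01):
    """Highest-recall point with fpr <= max_fpr, or None if there is none.

    Ties on recall are broken in favor of the stricter threshold.
    """
    eligible = [point for point in curve if point[2] <= max_fpr]
    if not eligible:
        return None
    return max(eligible, key=lambda point: (point[1], point[0]))


# Made-up scores: 4 attacks, 200 benign inputs.
attack_scores = [0.9, 0.8, 0.6, 0.3]
benign_scores = [0.7, 0.5, 0.5] + [0.1] * 197
curve = sweep(attack_scores, benign_scores)
print(operating_point(curve))  # (0.6, 0.75, 0.005)
```

Publishing the whole `curve` along with the chosen point lets a reader pick a different false-positive budget and read off the recall they would get.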

Contamination and the moving target

Public prompt-injection corpora (the ones in many GitHub repos) have been scraped into training and tuning data. A detector fine-tuned on public injection datasets will score implausibly well on a benchmark drawn from those same datasets — it has seen the test. Mitigate by:

Holding out a private corpus the detector vendor has never seen
Generating fresh adaptive payloads per evaluation run
Treating any near-ceiling Tier 1–2 result with the same suspicion you would bring to a 100% score on a public LLM benchmark

Contamination is not a one-time cleanup. It is an ongoing condition of evaluating tools trained on public data.
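A cheap first check is to fingerprint every evaluation payload and look for matches in the public corpora you know about. The sketch below assumes both sets are plain lists of strings and uses a normalization chosen for illustration (Unicode NFKC, case folding, whitespace collapsing). It only catches near-verbatim copies; paraphrased or re-encoded payloads need fuzzier matching, so a clean result here does not prove the holdout is uncontaminated.

```python
import hashlib
import re
import unicodedata


def fingerprint(text):
    """Hash of the text after NFKC normalization, case folding, and
    whitespace collapsing, so trivial reformatting does not hide a copy."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def contaminated(eval_set, public_corpus):
    """Return the eval payloads whose fingerprint appears in public_corpus."""
    seen = {fingerprint(t) for t in public_corpus}
    return [t for t in eval_set if fingerprint(t) in seen]


# Made-up example: the first eval payload is a reformatted public string.
public_corpus = ["Ignore all previous instructions."]
eval_set = [
    "  ignore ALL previous\ninstructions. ",
    "Summarize the attached log file.",
]
print(contaminated(eval_set, public_corpus))
```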

A minimal protocol

Build a benign corpus where ~50% is injection-adjacent-but-legitimate.
Build a tiered attack corpus (direct → obfuscated → indirect → adaptive).
Hold out a private slice the tool has never seen.
Sweep the threshold; record the full recall/FPR curve.
Report recall per attack tier, plus FPR on the benign corpus, p95 latency, and cost per 1k.
Re-generate the adaptive tier each run; never reuse it.

The protocol is a week of work to stand up and an afternoon to re-run per tool. It produces a number that survives contact with a real adversary, which is the only kind worth publishing.
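The latency and cost figures in the reporting step are simple to compute once the run is logged. The sketch below assumes you have recorded one latency per call while the detector was under concurrent load, and that you know the total spend for the run; the `p95` helper uses the nearest-rank percentile definition, which is one of several conventions in use.

```python
def p95(latencies_ms):
    """95th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n), computed in integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_1k(total_cost_usd, n_calls):
    """Total spend for the run (including any model the detector itself
    invokes), scaled to 1,000 calls."""
    if n_calls <= 0:
        raise ValueError("n_calls must be positive")
    return 1000 * total_cost_usd / n_calls


# Made-up run: latencies of 1..100 ms, and $0.42 spent over 600 calls.
print(p95(list(range(1, 101))))  # 95
print(cost_per_1k(0.42, 600))
```

Latencies from a single-shot demo will understate the p95, so collect them from the same concurrent run that produces the recall and false-positive numbers.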

Where this fits in the network

aisecbench.com runs this protocol against shipping detectors and publishes the curves, not just the headline. For tooling reviews that pair with these benchmarks, see Best LLM scanners and AI content moderation tools. For the attack side of the corpus design, the offensive writeups at aisec.blog document the techniques our Tier 3–4 sets are built from.

What we don’t do

Report a single accuracy number with no false-positive rate
Benchmark only against static 2023-era injection strings
Use a tidy benign corpus and call the resulting FPR realistic
Trust vendor numbers on public corpora the vendor’s model was trained on
Hide the threshold

The benchmark-vs-reality gap for prompt-injection detection is almost entirely a test-design problem. Fix the test and the honest tools start to look very different from the marketed ones.
