Reproducible LLM Scanner Benchmarks: What Everyone Forgets to Pin
An LLM security scanner benchmark that isn't pinned to a model version, a seed, and a corpus hash isn't reproducible. Here's the full list of what to pin and why.
The defining property of a benchmark is that someone else can run it and get your number back, within noise. Most published LLM security scanner results fail this test — not because the scanner is bad, but because the evaluation pinned the scanner version and nothing else. The model drifted, the seed floated, the corpus changed, and the number is now unreproducible. This post is the full list of things you have to pin, and what happens if you don’t.
Why LLM scanner benchmarks are unusually fragile
A scanner like garak or PyRIT sends adversarial probes to a target model and judges the responses. There are at least four sources of variance, and a benchmark that pins only the scanner controls one of them:
- The scanner’s probe set and version
- The target model and its exact version (this changes under you)
- Sampling nondeterminism (temperature, top-p, seed) in both the target and any judge model
- The judge itself — many scanners use an LLM to grade responses, which adds its own drift
Pin one, vary three, and “the scanner detected 62% of jailbreaks” becomes a number that means nothing next month.
Pin the target model version, not the model name
gpt-4o, claude-sonnet, llama-3-70b-instruct are not versions. They are aliases that point at different weights over time. Provider-hosted models are silently updated; a benchmark that records the alias and not the dated snapshot is recording a moving target.
Pin and record:
- The exact dated model snapshot or revision string (e.g. a provider’s -2026-xx-xx snapshot, or a Hugging Face commit hash for open weights)
- The serving stack and quantization for self-hosted models: an FP16 and a 4-bit quant of the “same” model have measurably different jailbreak susceptibility
- The system prompt and any safety settings applied, verbatim
If you cannot pin a provider model to a dated snapshot, the honest move is to state that the result is valid only for the run window and to re-run on a schedule. Pretending a floating alias is reproducible is the single most common error in this space.
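One way to catch a floating alias before it reaches a results file is a check at record time. The following is a minimal sketch, assuming a pinned identifier either ends in a date stamp or carries a 40-character commit hash after an `@`. The `is_pinned` helper and both patterns are hypothetical; provider naming schemes differ, so adapt them to the providers you actually use.

```python
import re

# Assumed conventions (hypothetical, adjust per provider):
#   hosted snapshot -> identifier ends in -YYYY-MM-DD or -YYYYMMDD
#   open weights    -> "repo@<40-hex commit hash>"
DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")
COMMIT_PIN = re.compile(r"@[0-9a-f]{40}$")


def is_pinned(model_id: str) -> bool:
    """Return True if the identifier looks like a fixed snapshot, not an alias."""
    return bool(DATED_SUFFIX.search(model_id) or COMMIT_PIN.search(model_id))


if __name__ == "__main__":
    for candidate in ("gpt-4o", "example-model-2025-01-15"):
        status = "pinned" if is_pinned(candidate) else "ALIAS: do not publish"
        print(f"{candidate}: {status}")
```

A pattern check only proves the identifier looks dated. It does not prove the provider serves fixed weights behind it, so the run window is still worth recording.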
Pin sampling, and report variance anyway
Even with everything else fixed, an LLM target at temperature > 0 produces different outputs per run. A scanner result is therefore a distribution, not a point. Two practices:
- Pin the seed and sampling parameters where the API allows it, and record them.
- Where determinism is impossible (most hosted APIs do not honor seeds across the stack), run the suite N times and report the mean and the spread, not a single run. A scanner result with no variance estimate is a coin flip presented as a measurement.
We run scanner suites a minimum of five times per configuration and publish the range. A 62% ± 1% result and a 62% ± 14% result are different findings, and only one of them supports a ranking.
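The mean-and-spread report can be sketched in a few lines of standard-library Python. The `summarize_runs` helper and the two sample series are illustrative assumptions, not measured results; they are chosen so both series share a mean of 0.62 and differ only in spread.

```python
import statistics


def summarize_runs(rates: list[float]) -> dict[str, float]:
    """Summarize per-run attack success rates as a mean plus spread."""
    if len(rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "n": len(rates),
        "mean": statistics.mean(rates),
        "stdev": statistics.stdev(rates),  # sample standard deviation
        "min": min(rates),
        "max": max(rates),
    }


if __name__ == "__main__":
    # Illustrative numbers only: same mean, very different spread.
    stable = [0.61, 0.62, 0.63, 0.62, 0.62]
    noisy = [0.48, 0.76, 0.62, 0.55, 0.69]
    for name, series in (("stable", stable), ("noisy", noisy)):
        s = summarize_runs(series)
        print(f"{name}: mean={s['mean']:.2f} range=[{s['min']:.2f}, {s['max']:.2f}]")
```

Publishing the per-run values alongside the summary lets a reproducer check whether their own number falls inside your observed range.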
Pin the judge — it drifts too
When a scanner uses an LLM to decide whether a response constitutes a successful jailbreak, the judge is part of the measurement instrument. If the judge model updates between your run and someone else’s, the same target responses get scored differently and the benchmark is not reproducible even if the target was perfectly pinned.
- Record the judge model’s dated version exactly as you record the target’s.
- Where possible, calibrate the judge against a human-labeled sample and report the agreement rate. A judge with 80% agreement against humans caps the precision of every number it produces, and readers deserve to know the ceiling.
- Prefer judges that are deterministic given the input (rubric-based or low-temperature) and document the rubric.
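The judge calibration step can be sketched as follows, assuming binary labels where `True` means "successful jailbreak". The function names are hypothetical. Cohen's kappa is included as an optional chance-corrected companion to the raw agreement rate; it is an addition here, not something the scanners themselves report.

```python
import math


def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of samples where the judge label matches the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for chance, for binary labels."""
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    p_judge = sum(judge) / n  # how often the judge says "jailbreak"
    p_human = sum(human) / n  # how often humans say "jailbreak"
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if math.isclose(p_chance, 1.0):
        return float("nan")  # undefined when both raters use a single label
    return (p_observed - p_chance) / (1 - p_chance)
```

Raw agreement can look high when nearly every sample is labeled the same way, which is why reporting the label balance or a chance-corrected figure next to it helps readers judge the ceiling.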
Pin the corpus with a hash
Scanner probe sets evolve between releases. A benchmark that says “ran garak’s jailbreak probes” without a version and a content hash is unreproducible the moment the upstream probe set changes — which it does, frequently and by design. Record:
- The scanner version (a release tag or commit hash, not “latest”)
- A hash or manifest of the exact probe set used, including any custom probes you added
- The full configuration file, committed alongside the results
If you added custom probes (you should, for domain coverage), they are part of the corpus and must be published or hashed. A private corpus is fine for contamination control; an undocumented corpus is not a benchmark.
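A content hash over the probe set can be produced with the standard library alone. This is a minimal sketch, assuming the probes live as files under one directory; the function names and manifest layout are hypothetical and not part of any scanner's own tooling.

```python
import hashlib
from pathlib import Path


def file_digests(files: dict[str, bytes]) -> dict[str, str]:
    """Map each relative path to the SHA-256 of its content, sorted by path."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in sorted(files.items())}


def corpus_digest(files: dict[str, bytes]) -> str:
    """Single SHA-256 over the sorted per-file manifest lines."""
    combined = hashlib.sha256()
    for name, digest in file_digests(files).items():
        combined.update(f"{digest}  {name}\n".encode("utf-8"))
    return combined.hexdigest()


def load_probe_dir(root: Path) -> dict[str, bytes]:
    """Read every file under root, keyed by its POSIX-style relative path."""
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in root.rglob("*")
        if p.is_file()
    }
```

Sorting by path makes the digest independent of filesystem traversal order, and the per-file digests let a reproducer see which probe changed when the combined hash does not match. A private corpus can publish only the output of `corpus_digest`.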
Report the metrics scanners hide
Scanner output is often a single “vulnerability rate.” That is the prompt-injection-detector mistake in another costume. For an LLM scanner benchmark, report at least:
- Attack success rate — fraction of probes that elicited unsafe output
- Refusal rate on benign control probes — scanners should include benign controls; if they don’t, add them, because an over-refusing model can look “secure” for the wrong reason
- Per-category breakdown — injection vs. data exfiltration vs. harmful content vs. PII leakage have different remediations and must not be averaged into one number
- Wall-clock and cost — a full scanner suite against a hosted model can run for hours and cost real money; reproducers need to know what they are committing to
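The first three metrics above can be computed from per-probe records without ever forming a single aggregate. The sketch below assumes each record carries a category, a benign-control flag, and two judged outcomes. The `ProbeResult` shape is hypothetical; real scanner output will need mapping into it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    category: str            # e.g. "injection", "pii_leakage"
    is_benign_control: bool  # True for benign control probes
    unsafe_output: bool      # judged as a successful attack
    refused: bool            # target declined to answer


def per_category_report(results: list[ProbeResult]) -> dict[str, dict[str, Optional[float]]]:
    """Attack success rate and benign-control refusal rate, per category."""
    counts = defaultdict(lambda: {"attacks": 0, "successes": 0, "controls": 0, "refusals": 0})
    for r in results:
        bucket = counts[r.category]
        if r.is_benign_control:
            bucket["controls"] += 1
            bucket["refusals"] += int(r.refused)
        else:
            bucket["attacks"] += 1
            bucket["successes"] += int(r.unsafe_output)
    report = {}
    for category, c in sorted(counts.items()):
        report[category] = {
            # None marks "not measured", which is different from a rate of zero.
            "attack_success_rate": c["successes"] / c["attacks"] if c["attacks"] else None,
            "benign_refusal_rate": c["refusals"] / c["controls"] if c["controls"] else None,
        }
    return report
```

Returning `None` for a category with no benign controls keeps a missing measurement visible instead of letting it pass as a 0% refusal rate.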
A reproducibility checklist
Before publishing any LLM scanner benchmark, confirm every line:
- Target model pinned to a dated snapshot or weight hash, recorded.
- Serving stack and quantization recorded for self-hosted targets.
- System prompt and safety settings recorded verbatim.
- Sampling parameters pinned where possible; suite run N≥5 times; mean and spread reported.
- Judge model version pinned; judge–human agreement reported.
- Scanner version pinned to a tag/commit; probe set hashed; config committed.
- Custom probes published or hashed.
- Per-category results plus benign-control refusal rate reported, never a lone aggregate.
A benchmark that passes this checklist can be re-run by a stranger and land on your number, within noise. One that doesn’t is an anecdote with a chart.
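The manifest-level items of the checklist can also be enforced mechanically before publication. This is a rough sketch, assuming the pinning manifest is a flat dictionary. Every field name is hypothetical, and the per-category reporting item is left to the results file, not the manifest.

```python
# Hypothetical manifest field names, one per pinned item in the checklist.
REQUIRED_FIELDS = (
    "target_model_snapshot",
    "serving_stack",          # for hosted targets, record what the provider exposes
    "system_prompt",
    "sampling_params",
    "n_runs",
    "judge_model_snapshot",
    "judge_human_agreement",
    "scanner_version",
    "probe_set_sha256",       # covers custom probes as well
    "config_path",
)
MIN_RUNS = 5


def checklist_gaps(manifest: dict) -> list[str]:
    """Return the checklist items the manifest fails; empty means publishable."""
    gaps = [name for name in REQUIRED_FIELDS if manifest.get(name) in (None, "", [], {})]
    runs = manifest.get("n_runs")
    if isinstance(runs, int) and runs < MIN_RUNS:
        gaps.append(f"n_runs={runs} is below the minimum of {MIN_RUNS}")
    return gaps
```

Running a check like this in CI on the results directory turns the checklist from a convention into a gate.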
Where this fits in the network
aisecbench.com publishes scanner benchmarks with the full pinning manifest attached, so the result can be reproduced rather than trusted. For scanner tool reviews, see Best LLM scanners; for the detector and classifier benchmarks that use the same discipline, see our prompt-injection detector benchmark design and jailbreak classifier benchmark design.
What we don’t do
- Pin the scanner and let the target model float on an alias
- Report a single run with no variance estimate
- Use an LLM judge without recording its version or its human-agreement rate
- Cite “garak default probes” with no version or hash
- Collapse exfiltration, injection, and harmful-content rates into one number
Reproducibility in LLM scanner benchmarking is not a nice-to-have. It is the difference between a measurement and a screenshot. Pin everything that can drift, report the spread, and the rankings stop changing every time someone re-runs them.