Reproducible LLM Scanner Benchmarks: What Everyone Forgets to Pin

The defining property of a benchmark is that someone else can run it and get your number back, within noise. Most published LLM security scanner results fail this test — not because the scanner is bad, but because the evaluation pinned the scanner version and nothing else. The model drifted, the seed floated, the corpus changed, and the number is now unreproducible. This post is the full list of things you have to pin, and what happens if you don’t.

Why LLM scanner benchmarks are unusually fragile

A scanner like garak or PyRIT sends adversarial probes to a target model and judges the responses. There are at least four sources of variance, and a benchmark that pins only the scanner controls one of them:

The scanner’s probe set and version
The target model and its exact version (this changes under you)
Sampling nondeterminism (temperature, top-p, seed) in both the target and any judge model
The judge itself — many scanners use an LLM to grade responses, which adds its own drift

Pin one, vary three, and “the scanner detected 62% of jailbreaks” becomes a number that means nothing next month.

Pin the target model version, not the model name

gpt-4o, claude-sonnet, and llama-3-70b-instruct are not versions. They are names that can point at different weights over time: a provider can repoint a hosted alias, sometimes without notice, and an open-weights repository can receive new commits. A benchmark that records the alias and not the dated snapshot or revision is recording a moving target.

Pin and record:

The exact dated model snapshot or revision string (e.g. a provider’s -2026-xx-xx snapshot, or a Hugging Face commit hash for open weights)
The serving stack and quantization for self-hosted models, because an FP16 and a 4-bit quant of the “same” model can differ measurably in jailbreak susceptibility
The system prompt and any safety settings applied, verbatim

If you cannot pin a provider model to a dated snapshot, the honest move is to state that the result is valid only for the run window and to re-run on a schedule. Pretending a floating alias is reproducible is the single most common error in this space.
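One way to catch a floating alias before it reaches a results table is a small guard in the harness. The sketch below is a heuristic under stated assumptions, not any provider's official naming scheme: it treats a trailing date or a 40-character hex revision as evidence of a pin, and the function name looks_pinned is our own.

```python
import re
from typing import Optional

# Assumed conventions, not any provider's official scheme: a trailing date
# (YYYY-MM-DD or YYYYMMDD) marks a dated snapshot, and a 40-character hex
# string marks a git-style commit hash for open weights.
DATE_SUFFIX = re.compile(r"(?:\d{4}-\d{2}-\d{2}|\d{8})$")
COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def looks_pinned(model_id: str, revision: Optional[str] = None) -> bool:
    """Return True if the identifier carries a dated suffix or a commit hash."""
    if revision is not None and COMMIT_HASH.fullmatch(revision):
        return True
    return DATE_SUFFIX.search(model_id) is not None


for name in ("gpt-4o", "claude-sonnet", "llama-3-70b-instruct"):
    print(name, looks_pinned(name))  # each alias from the text prints False
```

A check like this only flags identifiers that are obviously unpinned. It cannot prove that a dated name still serves the same weights, so the run window should still be recorded.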

Pin sampling, and report variance anyway

Even with everything else fixed, an LLM target at temperature > 0 produces different outputs per run. A scanner result is therefore a distribution, not a point. Two practices:

Pin the seed and sampling parameters where the API allows it, and record them.
Where determinism is impossible (many hosted APIs either expose no seed or treat it as best-effort), run the suite N times and report the mean and the spread, not a single run. A scanner result with no variance estimate is a coin flip presented as a measurement.

We run scanner suites a minimum of five times per configuration and publish the range. A 62% ± 1% result and a 62% ± 14% result are different findings, and only one of them supports a ranking.
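Reporting the spread takes only a few lines once the per-run rates are collected. The sketch below uses Python's statistics module. The rates are made-up illustrative values, and the overlap test is one simple, conservative way to decide whether two configurations can be ranked, not a formal significance test.

```python
import statistics


def summarize_runs(rates):
    """Summarize attack success rates from repeated runs of one configuration."""
    if len(rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "n": len(rates),
        "mean": statistics.mean(rates),
        "stdev": statistics.stdev(rates),  # sample standard deviation
        "min": min(rates),
        "max": max(rates),
    }


def ranges_overlap(a, b):
    """True if the observed min-max ranges of two summaries overlap."""
    return a["min"] <= b["max"] and b["min"] <= a["max"]


# Hypothetical rates from five runs each of two scanner configurations.
tight = summarize_runs([0.61, 0.63, 0.62, 0.60, 0.64])
noisy = summarize_runs([0.48, 0.76, 0.62, 0.55, 0.69])
print(tight)
print(noisy)
print("ranking supported:", not ranges_overlap(tight, noisy))
```

Both hypothetical configurations have the same mean, which is exactly the case where a single-run comparison would mislead.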

Pin the judge — it drifts too

When a scanner uses an LLM to decide whether a response constitutes a successful jailbreak, the judge is part of the measurement instrument. If the judge model updates between your run and someone else’s, the same target responses get scored differently and the benchmark is not reproducible even if the target was perfectly pinned.

Record the judge model’s dated version exactly as you record the target’s.
Where possible, calibrate the judge against a human-labeled sample and report the agreement rate. A judge that agrees with humans 80% of the time disagrees with them on one response in five, which limits how fine a difference any number it produces can resolve, and readers deserve to know the ceiling.
Prefer judges that are deterministic given the input (rubric-based or low-temperature) and document the rubric.
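The agreement rate itself is simple to compute once a sample has both human and judge verdicts. A minimal sketch, assuming each verdict is a boolean meaning "this response is a successful jailbreak" (the function name and return keys are our own):

```python
def judge_agreement(human, judge):
    """Compare judge verdicts with human labels on the same responses.

    Both arguments are equal-length sequences of booleans, where True
    means the response was labeled a successful jailbreak.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    return {
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        # Judge flagged a jailbreak that the human did not.
        "judge_only": sum(j and not h for h, j in pairs),
        # Human flagged a jailbreak that the judge missed.
        "human_only": sum(h and not j for h, j in pairs),
    }


# Hypothetical labels for five responses.
human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]
print(judge_agreement(human_labels, judge_labels))
```

Splitting the disagreements by direction shows whether the judge tends to inflate or deflate the attack success rate, which a single agreement figure hides.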

Pin the corpus with a hash

Scanner probe sets evolve between releases. A benchmark that says “ran garak’s jailbreak probes” without a version and a content hash is unreproducible the moment the upstream probe set changes — which it does, frequently and by design. Record:

The scanner version (a release tag or commit hash, not “latest”)
A hash or manifest of the exact probe set used, including any custom probes you added
The full configuration file, committed alongside the results

If you added custom probes (you should, for domain coverage), they are part of the corpus and must be published or hashed. A private corpus is fine for contamination control; an undocumented corpus is not a benchmark.
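A content hash for the probe set can be produced with the standard library alone. The sketch below assumes the probes, including custom ones, live as files under a single directory. That layout and the manifest shape are our assumptions, not something any particular scanner prescribes.

```python
import hashlib
from pathlib import Path


def corpus_manifest(root):
    """Hash every file under root and derive one digest for the whole corpus."""
    root = Path(root)
    # Sort by POSIX-style relative path so the result is platform independent.
    entries = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    files = {rel: hashlib.sha256(p.read_bytes()).hexdigest() for rel, p in entries}
    combined = hashlib.sha256()
    for rel, digest in files.items():
        # Bind each path to its content so renames also change the digest.
        combined.update(f"{rel}\0{digest}\n".encode("utf-8"))
    return {"files": files, "corpus_sha256": combined.hexdigest()}
```

Publishing corpus_sha256 next to the results lets a reproducer confirm they hold the same probes. A private corpus can publish the digest alone and keep the per-file listing internal.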

Report the metrics scanners hide

Scanner output is often a single “vulnerability rate.” That is the prompt-injection-detector mistake in another costume. For an LLM scanner benchmark, report at least the following:

Attack success rate — fraction of probes that elicited unsafe output
Refusal rate on benign control probes — scanners should include benign controls; if they don’t, add them, because an over-refusing model can look “secure” for the wrong reason
Per-category breakdown — injection vs. data exfiltration vs. harmful content vs. PII leakage have different remediations and must not be averaged into one number
Wall-clock and cost — a full scanner suite against a hosted model can run for hours and cost real money; reproducers need to know what they are committing to
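The first three metrics above can be computed from a flat list of per-probe results. The record shape below (ProbeResult and its fields) is hypothetical and would need mapping from whatever a given scanner actually emits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    category: str         # e.g. "injection", "pii_leakage"
    benign_control: bool  # True for a harmless control probe
    unsafe_output: bool   # judge verdict; meaningful for adversarial probes
    refused: bool         # target declined to answer


def per_category_report(results):
    """Attack success rate and benign refusal rate, kept separate per category."""
    counts = defaultdict(
        lambda: {"attacks": 0, "successes": 0, "controls": 0, "refusals": 0}
    )
    for r in results:
        c = counts[r.category]
        if r.benign_control:
            c["controls"] += 1
            c["refusals"] += int(r.refused)
        else:
            c["attacks"] += 1
            c["successes"] += int(r.unsafe_output)
    report = {}
    for category, c in sorted(counts.items()):
        report[category] = {
            "n_attacks": c["attacks"],
            "n_controls": c["controls"],
            # None rather than 0.0 when a category has no probes of that kind.
            "attack_success_rate": c["successes"] / c["attacks"] if c["attacks"] else None,
            "benign_refusal_rate": c["refusals"] / c["controls"] if c["controls"] else None,
        }
    return report
```

Returning None for a category with no benign controls makes the gap visible in the published table instead of silently reading as a 0% refusal rate.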

A reproducibility checklist

Before publishing any LLM scanner benchmark, confirm every line:

Target model pinned to a dated snapshot or weight hash, recorded.
Serving stack and quantization recorded for self-hosted targets.
System prompt and safety settings recorded verbatim.
Sampling parameters pinned where possible; suite run N≥5 times; mean and spread reported.
Judge model version pinned; judge–human agreement reported.
Scanner version pinned to a tag/commit; probe set hashed; config committed.
Custom probes published or hashed.
Per-category results plus benign-control refusal rate reported, never a lone aggregate.

A benchmark that passes this checklist can be re-run by a stranger and land on your number, within noise. One that doesn’t is an anecdote with a chart.
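The checklist is easy to enforce mechanically if the pinning manifest is a plain dictionary. The sketch below is one possible shape: every key name is our own invention for illustration, and the placeholder values in the example are not real results.

```python
REQUIRED = (
    "target_model_snapshot", "system_prompt", "safety_settings",
    "sampling_params", "judge_model_snapshot", "judge_human_agreement",
    "scanner_version", "probe_set_sha256", "config_path",
    "per_category_results", "benign_refusal_rate",
)
SELF_HOSTED_REQUIRED = ("serving_stack", "quantization")
MIN_RUNS = 5
EMPTY = (None, "", [], {})


def checklist_failures(manifest):
    """Return a list of checklist items the manifest does not satisfy."""
    required = list(REQUIRED)
    if manifest.get("self_hosted"):
        required.extend(SELF_HOSTED_REQUIRED)
    failures = [f"missing: {key}" for key in required if manifest.get(key) in EMPTY]
    if manifest.get("n_runs", 0) < MIN_RUNS:
        failures.append(f"fewer than {MIN_RUNS} runs")
    if manifest.get("custom_probes") and not manifest.get("custom_probes_sha256"):
        failures.append("custom probes present but not hashed")
    return failures


# Hypothetical manifest with placeholder values.
example = {
    "self_hosted": False,
    "target_model_snapshot": "example-model-2026-01-15",
    "system_prompt": "You are a helpful assistant.",
    "safety_settings": {"moderation": "default"},
    "sampling_params": {"temperature": 0.7, "top_p": 1.0, "seed": 1234},
    "judge_model_snapshot": "example-judge-2026-01-15",
    "judge_human_agreement": 0.8,
    "scanner_version": "v0.0.0-example",
    "probe_set_sha256": "0" * 64,
    "config_path": "configs/run.yaml",
    "per_category_results": {"injection": {"attack_success_rate": 0.5}},
    "benign_refusal_rate": 0.1,
    "n_runs": 5,
    "custom_probes": True,
    "custom_probes_sha256": "1" * 64,
}
print(checklist_failures(example))  # prints [] for this complete manifest
```

Running a check like this in CI before publishing turns the checklist from a convention into a gate.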

Where this fits in the network

aisecbench.com publishes scanner benchmarks with the full pinning manifest attached, so the result can be reproduced rather than trusted. For scanner tool reviews, see Best LLM scanners; for the detector and classifier benchmarks that use the same discipline, see our prompt-injection detector benchmark design and jailbreak classifier benchmark design.

What we don’t do

Pin the scanner and let the target model float on an alias
Report a single run with no variance estimate
Use an LLM judge without recording its version or its human-agreement rate
Cite “garak default probes” with no version or hash
Collapse exfiltration, injection, and harmful-content rates into one number

Reproducibility in LLM scanner benchmarking is not a nice-to-have. It is the difference between a measurement and a screenshot. Pin everything that can drift, report the spread, and the rankings stop changing every time someone re-runs them.
