Designing a Reproducible AI-Security Eval Harness

The methodology posts on this site keep arriving at the same demand: pin the corpus, pin the target, pin the judge, report the spread. That demand is unenforceable if your evaluation is a notebook someone ran once and a screenshot of the output. Reproducibility is an architecture problem. This post describes a harness design that makes the discipline mechanical instead of aspirational: four separable components, each independently pinnable, wired together by a config you commit.

The four components that must be separable

A reproducible AI-security eval has exactly four moving parts, and the single biggest design mistake is letting them bleed into each other:

The corpus — the attack and benign prompts, plus their expected-behavior labels.
The target — the model (or agent) under test, behind an adapter.
The judge — whatever decides success/refusal/harm for each response.
The report — the aggregation and the manifest of everything that produced the numbers.

When these are separable, you can swap one and re-run without touching the others: re-judge old responses with a new judge, re-run the same corpus against a new model snapshot, add a benign slice without re-engineering the target adapter. When they are tangled — a script that hardcodes the model name inside the prompt loop and grades with an inline regex — nothing is reproducible because nothing is isolatable.
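
As a minimal sketch of that separation, assuming hypothetical names throughout (Prompt, Response, Verdict, collect, and grade are illustrative and belong to neither garak nor PyRIT), the target and the judge can be plain callables that meet only through stored responses:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Prompt:
    prompt_id: str
    text: str
    label: str  # "harmful" or "benign"


@dataclass(frozen=True)
class Response:
    prompt_id: str
    text: str


@dataclass(frozen=True)
class Verdict:
    prompt_id: str
    complied: bool


# Hypothetical interfaces: a target maps prompt text to response text,
# and a judge maps a (prompt, response) pair to a verdict.
Target = Callable[[str], str]
Judge = Callable[[Prompt, Response], Verdict]


def collect(corpus: Iterable[Prompt], target: Target) -> List[Response]:
    """Query the target once per prompt and keep the raw responses."""
    return [Response(p.prompt_id, target(p.text)) for p in corpus]


def grade(corpus: Iterable[Prompt], responses: Iterable[Response], judge: Judge) -> List[Verdict]:
    """Score stored responses; the target is never called here."""
    by_id = {r.prompt_id: r for r in responses}
    return [judge(p, by_id[p.prompt_id]) for p in corpus]


# Stub components, for illustration only.
corpus = [Prompt("p1", "attack prompt", "harmful"), Prompt("p2", "benign prompt", "benign")]
calls = []


def stub_target(text: str) -> str:
    calls.append(text)
    return "I can't help with that." if "attack" in text else "Sure, here you go."


def refusal_judge(p: Prompt, r: Response) -> Verdict:
    return Verdict(p.prompt_id, complied=not r.text.startswith("I can't"))


def stricter_judge(p: Prompt, r: Response) -> Verdict:
    return Verdict(p.prompt_id, complied="Sure" in r.text)


responses = collect(corpus, stub_target)
first = grade(corpus, responses, refusal_judge)
second = grade(corpus, responses, stricter_judge)  # re-judge, no new target calls
```

Because grade only reads stored responses, swapping the judge costs no additional target calls, which is the property the paragraph above describes.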

Both of the mature open harnesses are built around this separation, which is why they are worth adopting rather than reinventing. NVIDIA’s garak splits probes (corpus), generators (target adapters), and detectors (judge) as first-class, independently configurable objects. Microsoft’s PyRIT (MIT-licensed) separates datasets, targets, orchestrators, converters, and scorers, so the attack logic, the target, and the scoring are independently swappable. If you build your own harness, copy this shape.

Pin the corpus with a manifest, not a memory

The corpus is the easiest thing to think you’ve controlled and the easiest to lose. “We used the standard jailbreak prompts” is not a pin. Commit a manifest that records:

The source set and its version: a release tag or commit hash for HarmBench, JailbreakBench, or whichever published set you anchored on.
A content hash of the exact prompts used, including any custom or domain-specific prompts you added.
The labels: which behaviors are harmful, which are benign, and the policy mapping behind that call.

A custom prompt you added for domain coverage is part of the corpus and must be hashed or published. A private corpus is fine for contamination control; an undocumented one is not an eval.
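
One way to produce such a manifest is sketched below. The field names, the canonical-JSON serialization, and the placeholder values in angle brackets are assumptions made for illustration, not a format defined by HarmBench or JailbreakBench:

```python
import hashlib
import json
from collections import Counter


def corpus_hash(prompts):
    """SHA-256 over a canonical serialization, independent of prompt order."""
    canonical = sorted(prompts, key=lambda p: p["id"])
    blob = json.dumps(canonical, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def build_corpus_manifest(source, source_version, prompts, policy):
    """Assemble the three things the manifest must record: version, hash, labels."""
    return {
        "source": source,
        "source_version": source_version,
        "prompt_count": len(prompts),
        "label_counts": dict(Counter(p["label"] for p in prompts)),
        "policy": policy,
        "content_sha256": corpus_hash(prompts),
    }


# Placeholder prompts; a real corpus would include the custom additions too.
prompts = [
    {"id": "custom-001", "text": "placeholder domain-specific prompt", "label": "harmful"},
    {"id": "benign-001", "text": "placeholder benign prompt", "label": "benign"},
]

manifest = build_corpus_manifest(
    source="HarmBench",
    source_version="<release tag or commit hash>",
    prompts=prompts,
    policy="<policy document and version>",
)
print(json.dumps(manifest, indent=2))
```

Sorting by id before hashing means reordering the file does not change the hash, while editing a single character of any prompt or label does. A private corpus can publish only the hash and counts and still be pinned.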

Pin the target as a snapshot, behind an adapter

The target adapter (garak’s generator, PyRIT’s target) is what keeps model-specific plumbing out of your eval logic. Behind it, pin:

The exact dated model snapshot or weight commit hash — never a floating alias like gpt-4o or claude-sonnet, which point at different weights over time.
For self-hosted targets, the serving stack and quantization: an FP16 build and a 4-bit quant of the “same” model can differ measurably in attack susceptibility.
The system prompt, safety settings, and sampling parameters, verbatim.
For agents, the full scaffold — tool schemas, the ReAct/planner loop, retrieval config — because the scaffold is part of what’s being attacked.

If a hosted model can’t be pinned to a dated snapshot, the harness should record that the result is valid only for the run window and re-run on a schedule. Pretending a floating alias is reproducible is one of the most common errors in this space.
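
A harness can make the floating-alias rule mechanical. The sketch below is a heuristic under stated assumptions: TargetPin and target_record are hypothetical names, and the check treats a model ID as pinned only if it ends in a date suffix or is accompanied by a 40-character weight commit hash. Provider naming schemes vary, so the pattern would need adjusting per provider:

```python
import re
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumed conventions: dated IDs end in -YYYY-MM-DD or -YYYYMMDD;
# weight commits are full 40-character hex hashes.
_DATED = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")
_COMMIT = re.compile(r"^[0-9a-f]{40}$")


@dataclass(frozen=True)
class TargetPin:
    model_id: str
    system_prompt: str
    sampling: dict
    weight_commit: Optional[str] = None   # self-hosted targets
    serving_stack: Optional[str] = None   # self-hosted targets
    quantization: Optional[str] = None    # self-hosted targets

    def is_pinned(self) -> bool:
        if self.weight_commit and _COMMIT.match(self.weight_commit):
            return True
        return bool(_DATED.search(self.model_id))


def target_record(pin: TargetPin, now: Optional[datetime] = None) -> dict:
    """Manifest entry for the target; unpinned targets are marked as such."""
    record = asdict(pin)
    record["pinned"] = pin.is_pinned()
    if not record["pinned"]:
        # Floating alias: the result only describes this run window.
        record["validity"] = "run window only"
        record["run_started_at"] = (now or datetime.now(timezone.utc)).isoformat()
    return record


floating = TargetPin("gpt-4o", "You are a helpful assistant.", {"temperature": 0.0})
dated = TargetPin("gpt-4o-2024-08-06", "You are a helpful assistant.", {"temperature": 0.0})

print(target_record(floating)["pinned"])  # False
print(target_record(dated)["pinned"])     # True
```

The point of the design is that an unpinned target does not fail silently: the record itself says the number expires with the run window.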

Pin the judge and record its ceiling

The judge is part of the instrument, and a harness that hardcodes the judge inside the run loop can’t re-judge old responses when the judge improves. Keep it a separate component, and:

Pin the judge to a dated version or weight hash, exactly like the target. Use a released standardized judge (the HarmBench classifier, JailbreakBench’s judge) when you want cross-work comparability.
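Calibrate against a human-labeled sample and record the agreement rate. A judge that agrees with human labels 84% of the time mislabels roughly one response in six, so a difference between two systems that is small relative to that error rate can be judge noise rather than signal; the report must carry that ceiling.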
Store raw target responses, not just the judge’s verdicts. If responses are saved, a better judge can re-score the same run later — which is impossible if you only kept the pass/fail flags.
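
The calibration and re-judging steps above can be sketched as follows. The function names, the JSONL record fields (prompt_id, response), and the keyword judge are illustrative assumptions; the calibration data is synthetic and constructed to give 42 matches out of 50:

```python
import io
import json


def agreement_rate(judge_labels, human_labels):
    """Fraction of calibration items where the judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def rejudge(stored_jsonl, judge):
    """Re-score stored raw responses with a new judge; the target is not called."""
    verdicts = {}
    for line in stored_jsonl:
        record = json.loads(line)
        verdicts[record["prompt_id"]] = judge(record["response"])
    return verdicts


# Synthetic calibration sample: 50 human labels, judge disagrees on 8.
human = [True] * 25 + [False] * 25
judge_out = [True] * 21 + [False] * 4 + [False] * 21 + [True] * 4
rate = agreement_rate(judge_out, human)

judge_record = {
    "judge": "<judge name>",
    "version": "<dated version or weight hash>",
    "human_agreement": rate,
    "calibration_n": len(human),
}

# Raw responses saved from an earlier run (an in-memory stand-in for a JSONL file).
stored = io.StringIO(
    '{"prompt_id": "p1", "response": "I cannot help with that."}\n'
    '{"prompt_id": "p2", "response": "Step 1: gather the materials."}\n'
)


def keyword_judge(text):
    """Toy judge: treats any response without 'cannot' as compliance."""
    return "cannot" not in text.lower()


verdicts = rejudge(stored, keyword_judge)
print(judge_record["human_agreement"], verdicts)
```

Raw agreement is the simplest calibration statistic; a chance-corrected measure may be more informative on imbalanced samples, but that choice is outside what this sketch covers.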

The report is a manifest plus a distribution

The output of the harness is not a single number. It is:

A distribution. Non-deterministic targets produce different outputs per run; run the suite N≥5 times and report the mean and the spread (standard deviation or min-max range). A result with no variance estimate is a coin flip presented as a measurement.
Stratified results. Per attack category and technique family, with the benign-refusal counterpart, never a lone aggregate.
A run manifest. Corpus version + hash, target snapshot + serving config, judge version + human-agreement rate, full harness config, library versions, and a timestamp. The manifest is the thing that lets a stranger reconstruct the run.

Commit the manifest alongside the numbers. A chart without its manifest is an anecdote.
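
A minimal sketch of the distribution and stratification steps, assuming a hypothetical record shape (category, label, complied) and toy data rather than real results. Harmful prompts feed attack success rate, benign prompts feed the benign refusal rate, and each is summarized across runs:

```python
from collections import defaultdict
from statistics import mean, stdev


def run_rates(records):
    """Per-stratum rates for one run.

    Harmful prompts contribute to attack success rate ("asr");
    benign prompts contribute to the benign refusal rate.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [hits, total]
    for r in records:
        if r["label"] == "harmful":
            key, hit = (r["category"], "asr"), r["complied"]
        else:
            key, hit = (r["category"], "benign_refusal"), not r["complied"]
        counts[key][0] += int(hit)
        counts[key][1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


def summarize(runs):
    """Mean and spread of each stratum's rate across repeated runs."""
    per_key = defaultdict(list)
    for records in runs:
        for key, rate in run_rates(records).items():
            per_key[key].append(rate)
    return {
        key: {
            "n_runs": len(rates),
            "mean": mean(rates),
            "stdev": stdev(rates) if len(rates) > 1 else None,
            "min": min(rates),
            "max": max(rates),
        }
        for key, rates in per_key.items()
    }


def toy_run(harmful_hits, benign_refusals, size=4):
    """Toy data: `size` harmful and `size` benign prompts in one category."""
    harmful = [{"category": "roleplay", "label": "harmful", "complied": i < harmful_hits}
               for i in range(size)]
    benign = [{"category": "roleplay", "label": "benign", "complied": i >= benign_refusals}
              for i in range(size)]
    return harmful + benign


runs = [toy_run(1, 0), toy_run(2, 1), toy_run(1, 0), toy_run(0, 0), toy_run(1, 0)]
summary = summarize(runs)
for key, stats in summary.items():
    print(key, stats)
```

Keeping the benign refusal rate in the same table as the attack success rate makes it harder to report a low ASR that was bought by refusing everything.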

Wire it with config, run it in CI

The final piece is operational: the entire run should be driven by a committed config file, not by ad-hoc CLI flags typed at a prompt. Both garak (generator option files, probe specs) and PyRIT (programmatic orchestrators) support config-as-code. Put the config in version control, run the eval in CI on a schedule and on every model or system-prompt change, and version the JSONL/report artifacts next to the model artifacts. An eval that lives only in someone’s shell history dies when they leave.
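
As a sketch of what "driven by a committed config" can look like, the example below loads a JSON config, rejects it if any of the four component sections is missing, and derives a run manifest from it. The config schema, the section names, and the functions are hypothetical; this is not garak's or PyRIT's own config format, and the angle-bracket values are placeholders:

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from importlib import metadata

REQUIRED_SECTIONS = ("corpus", "target", "judge", "report")

# Stand-in for a config file committed to version control.
CONFIG_TEXT = """
{
  "corpus": {"source": "HarmBench", "version": "<tag or commit>", "content_sha256": "<hash>"},
  "target": {"model_id": "<dated snapshot>", "sampling": {"temperature": 0.0}},
  "judge": {"name": "<judge>", "version": "<dated version>", "human_agreement": null},
  "report": {"n_runs": 5, "stratify_by": ["category", "technique"]}
}
"""


def load_config(text):
    """Parse the config and refuse to run if a component section is absent."""
    config = json.loads(text)
    missing = [s for s in REQUIRED_SECTIONS if s not in config]
    if missing:
        raise ValueError(f"config is missing sections: {missing}")
    return config


def installed_version(package):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def run_manifest(config, packages=()):
    """Everything a stranger needs to identify the run that produced a result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return {
        "config": config,
        "config_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "python": platform.python_version(),
        "libraries": {name: installed_version(name) for name in packages},
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


config = load_config(CONFIG_TEXT)
manifest = run_manifest(config, packages=("garak",))
print(json.dumps(manifest, indent=2))
```

In CI, the same entry point would read the config from the repository checkout and write the manifest next to the report artifacts, so a config change and the results it produced land in the same commit history.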

A harness checklist

Corpus, target, judge, and report are separable components.
Corpus pinned by source version + content hash; custom prompts hashed or published.
Target pinned to a dated snapshot / weight hash; serving stack, quant, system prompt, sampling, and agent scaffold recorded.
Judge pinned to a dated version; human-agreement rate recorded; raw responses stored for re-judging.
Suite run N≥5 times; mean and spread reported; results stratified with the benign counterpart.
A run manifest committed with every result.
The whole run driven by a committed config and executed in CI.

Where this fits in the network

aisecbench.com publishes results from a harness built on exactly this separation, with the run manifest attached so the number can be reproduced rather than trusted. It operationalizes the discipline from our reproducible LLM scanner benchmarks and runs the corpora behind our jailbreak resistance ASR methodology and red-team eval methodology. For the scanners and harnesses themselves (garak, PyRIT, and the rest), see Best LLM scanners.

What we don’t do

Run an eval from a notebook with the target name and judge hardcoded in the loop.
Pin the corpus by memory (“the standard prompts”) instead of a version and hash.
Discard raw responses and keep only pass/fail flags.
Report a single non-deterministic run with no spread.
Publish numbers without the manifest that produced them.

Reproducibility is not a virtue you bolt on after the run. It is the harness architecture: four separable, independently pinnable components and a committed manifest. Build it that way and the rankings stop changing every time someone re-runs them.
