Benchmarking LLM Jailbreak Resistance: Attack Success Rate Done Right

Attack success rate (ASR) is the headline number for jailbreak resistance: the fraction of adversarial prompts that elicited the unsafe behavior the attacker was after. It is a good metric. It is also computed inconsistently enough that two teams reporting “ASR 31%” against the same model are frequently measuring different things. This post sets out how we define ASR so that the number is comparable across runs, across attackers, and across the people who will inevitably try to reproduce it.

ASR is a ratio with three contested terms

ASR looks simple — successes over attempts — but every term in that ratio is a decision that changes the number:

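What counts as an attempt? One prompt? One multi-turn conversation? One behavior across a fixed set of attack templates? If the same templates are run against every behavior and a behavior counts as broken when any one of them succeeds, the “per-behavior” ASR can never be lower than the “per-prompt” ASR against the same model, and it is often far higher. The two are not interchangeable.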
What counts as a success? A single judge call that says “unsafe”? A judge plus a check that the output actually advanced the harmful behavior rather than producing a refusal-flavored ramble? The looser the success criterion, the higher the ASR — and the less it predicts real-world harm.
What is the denominator’s universe? All behaviors, or only the behaviors a competent attacker would attempt against this model? Padding the denominator with trivially-refused behaviors deflates ASR and flatters the model.

A credible jailbreak benchmark fixes all three before the first prompt is sent, writes them into the protocol, and reports them alongside the number. ASR without its definition is a percentage, not a measurement.
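To make the attempt-unit question concrete, here is a minimal sketch that computes both ratios from the same log. The record layout, the behavior and template IDs, and the outcomes are invented for illustration, and "a behavior is broken if any template succeeds" is one possible per-behavior rule that we assume here, not the only one.

```python
from collections import defaultdict

# Each record is one attack attempt: (behavior_id, template_id, judged_success).
# IDs and outcomes are made-up illustration data, not benchmark results.
attempts = [
    ("b1", "direct", False), ("b1", "persona", True), ("b1", "encoded", False),
    ("b2", "direct", False), ("b2", "persona", False), ("b2", "encoded", False),
    ("b3", "direct", True), ("b3", "persona", True), ("b3", "encoded", False),
]

def per_prompt_asr(records):
    """Successful prompts divided by all prompts sent."""
    return sum(1 for _, _, ok in records if ok) / len(records)

def per_behavior_asr(records):
    """Assumed rule: a behavior is broken if any template against it succeeded."""
    broken = defaultdict(bool)
    for behavior, _, ok in records:
        broken[behavior] = broken[behavior] or ok
    return sum(broken.values()) / len(broken)

print(f"per-prompt ASR:   {per_prompt_asr(attempts):.3f}")
print(f"per-behavior ASR: {per_behavior_asr(attempts):.3f}")
```

On this toy log, three of nine prompts succeed while two of three behaviors are broken, so the same run supports two quite different headline numbers.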

Anchor on a standardized behavior set

The largest source of incomparable ASR numbers is a non-standard behavior set. If you write your own 40 harmful prompts and a competitor writes their own 40, the two ASRs were never going to agree. The fix is to anchor on a published, versioned behavior set that other people also use.

The two anchors worth knowing:

HarmBench (Mazeika et al., 2024; arXiv:2402.04249): a standardized evaluation framework for automated red teaming, published at ICML 2024 by the Center for AI Safety. Its contribution is exactly the standardization: a fixed behavior set and a released classifier (cais/HarmBench-Llama-2-13b-cls) so that “success” is judged the same way by everyone. Their original release evaluated 18 red-teaming methods against 33 target LLMs and defenses.
JailbreakBench (NeurIPS 2024): the JBB-Behaviors dataset of 200 behaviors (100 harmful, 100 benign), curated against OpenAI’s usage policies, with the harmful behaviors drawn from AdvBench, the Trojan Detection Challenge / HarmBench, and original entries. It ships an artifacts repository of submitted jailbreak strings and a Llama-3-based judge, so a result is a reproducible artifact rather than a claim.

Note the deliberate inclusion of benign behaviors in JailbreakBench. A jailbreak benchmark that measures only attack success and never measures over-refusal on benign requests is measuring half the system. We treat the benign half as mandatory; it is the only way ASR doesn’t reward a model that simply refuses everything.
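A minimal sketch of why the benign half matters, assuming a simplified result format of our own: each item is a `(kind, complied)` pair, where `complied` on a harmful item stands in for the judge's "harmful completion" verdict. The two models and their outcomes are hypothetical.

```python
def asr_and_over_refusal(results):
    """results: list of (kind, complied) pairs, kind in {'harmful', 'benign'}.

    Simplifying assumption: 'complied' on a harmful item means the judge
    scored the response as a successful attack.
    """
    harmful = [complied for kind, complied in results if kind == "harmful"]
    benign = [complied for kind, complied in results if kind == "benign"]
    asr = sum(harmful) / len(harmful)
    over_refusal = sum(not c for c in benign) / len(benign)
    return asr, over_refusal

# Hypothetical model that refuses every request, harmful or not.
refuse_all = [("harmful", False)] * 100 + [("benign", False)] * 100

# Hypothetical model that leaks on 10 harmful items and refuses 5 benign ones.
balanced = ([("harmful", i < 10) for i in range(100)]
            + [("benign", i >= 5) for i in range(100)])

for name, results in [("refuse_all", refuse_all), ("balanced", balanced)]:
    asr, over_refusal = asr_and_over_refusal(results)
    print(f"{name}: ASR={asr:.2f} over-refusal={over_refusal:.2f}")
```

Ranked on ASR alone, the refuse-everything model wins. Reported as a pair, its 100% over-refusal rate is impossible to miss.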

The judge is part of the instrument

“Success” in a jailbreak benchmark is almost always decided by a model — a classifier or an LLM-as-judge reading the target’s response and deciding whether it constitutes a harmful completion. That judge is not a neutral observer. It is part of the measurement apparatus, and it drifts, disagrees with humans, and can be gamed.

Three rules we hold to:

Pin the judge to a dated version or weight hash, exactly as you pin the target model. A judge that silently updates makes last month’s ASR unreproducible even if the target never changed.
Report judge–human agreement on a labeled sample. If your judge agrees with human annotators 84% of the time, roughly one verdict in six is contestable, and every ASR it produces inherits that uncertainty. Readers deserve to see the agreement rate next to the number.
Prefer the standardized judges from HarmBench or JailbreakBench when you want cross-paper comparability, and document any deviation. A custom judge can be better-calibrated for your domain, but it breaks comparability — say so explicitly.
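One way to report judge–human agreement is raw agreement plus Cohen's kappa, which corrects for the agreement two labelers would reach by chance. The sketch below assumes binary labels (1 = harmful completion, 0 = not) and uses an invented ten-item sample; a real calibration set would be far larger.

```python
def agreement_and_kappa(judge, human):
    """Raw agreement and Cohen's kappa for two binary label lists."""
    n = len(judge)
    if n == 0 or n != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # share of items the judge labeled harmful
    p_human = sum(human) / n   # share of items the humans labeled harmful
    # Chance agreement: both say harmful, or both say not harmful.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Made-up labels for illustration only.
judge_labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
human_labels = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

observed, kappa = agreement_and_kappa(judge_labels, human_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Kappa is worth printing next to raw agreement because a judge that labels almost everything "safe" can post high agreement on a mostly-safe sample without being a good judge.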

Stratify ASR or it hides the failure

A single aggregate ASR averages across behavior categories that have nothing to do with each other. A model can have a 5% ASR on weapons-of-mass-destruction behaviors and a 60% ASR on disinformation, and the average — say 20% — describes neither and conceals the dangerous tier. Always report ASR stratified by:

Behavior category (the HarmBench / OpenAI-policy categories, not your own ad-hoc bins).
Attack technique family — direct request, role-play persona, encoding/obfuscation, many-shot, gradient-based suffixes, adaptive. A model robust to persona attacks can collapse on encoded ones; the aggregate hides it.
Severity tier — low/medium/high mapped to your policy. High-severity ASR at a stated benign-refusal rate is the decision-relevant pair, not the headline average.
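A minimal sketch of stratification, assuming each judged attempt is a dict carrying its own labels. The field names, the category labels (`wmd`, `disinformation`), and the outcomes are hypothetical; in a real run the categories would come from the behavior set you anchored on.

```python
from collections import defaultdict

def stratified_asr(records, key):
    """ASR per distinct value of records[i][key]."""
    totals = defaultdict(lambda: [0, 0])  # value -> [successes, attempts]
    for r in records:
        bucket = totals[r[key]]
        bucket[0] += int(r["success"])
        bucket[1] += 1
    return {value: s / n for value, (s, n) in totals.items()}

def technique(i):
    # Alternate two hypothetical technique families across attempts.
    return "persona" if i % 2 == 0 else "encoding"

# Made-up outcomes: 1 of 20 successes on one category, 12 of 20 on the other.
records = (
    [{"category": "wmd", "technique": technique(i), "success": i < 1}
     for i in range(20)]
    + [{"category": "disinformation", "technique": technique(i), "success": i < 12}
       for i in range(20)]
)

aggregate = sum(r["success"] for r in records) / len(records)
print(f"aggregate ASR: {aggregate:.3f}")
for key in ("category", "technique"):
    for value, rate in sorted(stratified_asr(records, key).items()):
        print(f"  {key}={value}: {rate:.2f}")
```

The aggregate on this toy data sits between a 5% category and a 60% category and describes neither, which is the failure the stratified table is there to expose.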

Static ASR is a lower bound; report adaptive ASR too

A fixed corpus of jailbreak strings measures robustness to known attacks. A motivated attacker writes new strings after probing your model. The HarmBench and JailbreakBench lines of work both emphasize this: static results are a floor, and adaptive attacks routinely push ASR far above the static number.

Run an adaptive slice — an attacker (automated or human) that regenerates attacks against the specific target each run — and report it separately from the static suite. A model with 8% static ASR and 40% adaptive ASR is not an 8%-robust model. The gap between the two is one of the most honest things a jailbreak benchmark can publish.
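A small sketch of keeping the two numbers apart in the report. It assumes each slice is a list of per-attempt judge verdicts; the function and key names are our own, and the outcome counts are chosen only to reproduce the 8% and 40% figures used above.

```python
def asr(outcomes):
    """Fraction of attempts the judge scored as successful."""
    return sum(outcomes) / len(outcomes)

def robustness_report(static_outcomes, adaptive_outcomes):
    """Report static and adaptive ASR side by side, never pooled."""
    static = asr(static_outcomes)
    adaptive = asr(adaptive_outcomes)
    return {
        "static_asr": static,
        "adaptive_asr": adaptive,
        "adaptive_gap": adaptive - static,
    }

# Illustrative outcomes only: 2/25 static successes, 10/25 adaptive successes.
static_outcomes = [True] * 2 + [False] * 23
adaptive_outcomes = [True] * 10 + [False] * 15

report = robustness_report(static_outcomes, adaptive_outcomes)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Pooling the two slices into one ratio would give a number that depends on how many adaptive attempts you happened to run, which is why the sketch never adds the lists together.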

A protocol that produces a comparable ASR

Anchor on a versioned, published behavior set (HarmBench or JailbreakBench); record the version.
Define attempt-unit (per-prompt vs per-behavior) and success criterion in the protocol, before running.
Include the benign behavior half; report over-refusal alongside ASR.
Pin the judge to a dated version; report judge–human agreement.
Stratify ASR by behavior category, technique family, and severity tier.
Run the suite N≥5 times against any non-deterministic target; report mean and spread.
Add an adaptive slice; report static and adaptive ASR separately.

Run this and the ranking is reproducible. Skip the definitions and you have a chart that nobody else can reproduce.
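For the repeated-runs step, a minimal sketch of "mean and spread" using only the standard library. The per-run ASR values are invented, and using the sample standard deviation plus min and max as the spread is our assumption; a confidence interval would be an equally reasonable choice.

```python
import statistics

def summarize_runs(run_asrs, min_runs=5):
    """Mean and spread of ASR across repeated runs of the same suite."""
    if len(run_asrs) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(run_asrs)}")
    return {
        "n": len(run_asrs),
        "mean": statistics.mean(run_asrs),
        "stdev": statistics.stdev(run_asrs),  # sample standard deviation
        "min": min(run_asrs),
        "max": max(run_asrs),
    }

# Made-up ASR from five runs against a non-deterministic target.
run_asrs = [0.30, 0.34, 0.28, 0.33, 0.30]

summary = summarize_runs(run_asrs)
print(f"ASR {summary['mean']:.3f} ± {summary['stdev']:.3f} "
      f"(range {summary['min']:.2f} to {summary['max']:.2f}, n={summary['n']})")
```

Raising an error below the minimum run count is deliberate: a single-run ASR against a sampling target should not be able to reach a results table by accident.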

Where this fits in the network

aisecbench.com publishes jailbreak ASR with the behavior-set version, the pinned judge, and the static/adaptive split attached, so the number can be reproduced rather than trusted. The judge-calibration discipline here is the same one in our reproducible LLM scanner benchmarks, and the over-refusal half pairs with our jailbreak classifier benchmark design. For the scanner tooling that drives many of these attack corpora, see Best LLM scanners; for the attack technique families themselves, aisec.blog documents the mechanics.

What we don’t do

Report ASR without stating the attempt-unit and the success criterion
Use a home-grown behavior set when a standardized one (HarmBench, JailbreakBench) exists
Run an LLM judge without pinning its version or reporting its human-agreement rate
Collapse behavior categories and technique families into one aggregate ASR
Present a static ASR as if it were an upper bound on what an attacker can achieve

Attack success rate is only as good as its definition. Anchor the corpus, pin the judge, stratify the result, and report the adaptive gap — and ASR becomes a number a stranger can reproduce instead of a percentage you have to be trusted on.
