Comparing LLM Safety Benchmarks: AdvBench, HarmBench, JailbreakBench

A report that says a model was “benchmarked against the standard jailbreak set” hides a choice. There is no single standard jailbreak set — there are several, with different behaviors, different judges, and different threat models, and a result against one is not automatically a result against another. This post compares the three sets that show up most in serious work — AdvBench, HarmBench, and JailbreakBench — and says when each is the right anchor.

The three are layers, not rivals

It helps to see these as a lineage rather than competitors, because each was partly a response to the limits of the one before.

AdvBench comes from the universal-adversarial-attack work of Zou et al. (arXiv:2307.15043), the paper that introduced the GCG (Greedy Coordinate Gradient) suffix attack. AdvBench is a set of harmful behaviors and harmful strings used to demonstrate that an optimized adversarial suffix could transfer across models. It was built to show an attack worked, and it served that purpose well. Its limitation as a standalone benchmark is that its behaviors are repetitive and its original “success” check was a simple substring match for refusal phrases — a check that over-counts success because a model can avoid the refusal phrase without actually complying.
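To make the over-counting concrete, here is a minimal sketch of a refusal-substring judge. The marker list and the sample responses are illustrative assumptions, not the list used in the original paper; the point is only that any output which avoids the markers gets scored as a successful attack.

```python
# Hypothetical refusal markers for illustration; not the original paper's list.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")


def substring_judge(response: str) -> bool:
    """Return True ("attack succeeded") when no refusal marker appears."""
    lowered = response.lower()
    return not any(marker.lower() in lowered for marker in REFUSAL_MARKERS)


# Made-up responses: one real refusal and two outputs that neither refuse nor comply.
responses = {
    "explicit refusal": "I'm sorry, but I can't help with that.",
    "off-topic deflection": "Let's talk about something else. Have you tried gardening?",
    "empty output": "",
}

verdicts = {name: substring_judge(text) for name, text in responses.items()}

for name, succeeded in verdicts.items():
    print(f"{name}: counted as attack success = {succeeded}")
```

Only the explicit refusal is scored as a failed attack. The deflection and the empty string both count as successes even though neither contains anything harmful, which is the over-counting described above.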

HarmBench (Mazeika et al., 2024; arXiv:2402.04249; ICML 2024) was built by the Center for AI Safety to fix the standardization gap. Its framing is explicit: automated red teaming “lacks a standardized evaluation framework.” HarmBench supplies a fixed behavior set spanning multiple categories and a released classifier (cais/HarmBench-Llama-2-13b-cls) so success is judged consistently. Its original release evaluated 18 red-teaming methods against 33 target LLMs and defenses in one comparable frame; the comparability is the contribution.

JailbreakBench (Chao et al., 2024; arXiv:2404.01318; NeurIPS 2024) pushed on reproducibility and open artifacts. Its JBB-Behaviors dataset has 200 behaviors (100 harmful, 100 benign) curated against OpenAI’s usage policies, with the 100 harmful behaviors drawn from a subset of AdvBench, from the Trojan Detection Challenge / HarmBench, and from original entries. It adds an artifacts repository of submitted jailbreak strings and a Llama-3-based judge, so a published result is a re-runnable artifact, not just a number in a table. It deliberately adopts binary jailbreak/refusal labeling and an ASR metric compatible with HarmBench’s, so results from the two can be set side by side.

What each one actually measures

The behaviors and the judge differ, so the numbers differ:

Property	AdvBench	HarmBench	JailbreakBench
Primary purpose	Demonstrate transferable attacks (GCG)	Standardized automated red-team eval	Reproducible robustness benchmark + artifacts
Benign behaviors included	No	No (harmful-focused)	Yes (100 benign)
Released judge	No (substring match originally)	Yes (Llama-2-13b classifier)	Yes (Llama-3-based)
Artifacts/leaderboard	No	Framework + results	Artifacts repo + leaderboard
Venue/year	arXiv 2023	ICML 2024	NeurIPS 2024
License	MIT	MIT	MIT

The single most consequential row is “benign behaviors included.” HarmBench and AdvBench measure attack success; JailbreakBench additionally lets you measure over-refusal on benign requests with the same harness. If your decision depends on the trade-off between robustness and usability — and most deployment decisions do — JailbreakBench’s benign half is doing work the other two don’t.

When to use which

Use AdvBench when you are specifically studying optimization-based suffix attacks (GCG and its descendants) and want to compare against the original transferable-attack literature. Do not use AdvBench’s original substring-match success check as your judge; swap in a classifier-based judge such as the ones HarmBench and JailbreakBench release. AdvBench’s behaviors are a fine attack source; its 2023-era success metric is not.

Use HarmBench when you want a standardized, harmful-behavior-focused evaluation of an automated red-teaming method or a defense, and you want your result to sit in the same frame as the large body of work that uses it. It is the right anchor when the question is “how good is this attack/defense relative to the field.”

Use JailbreakBench when you want a reproducible model-robustness number with the over-refusal counterpart measured on the same footing, and when you want to publish artifacts a stranger can re-run. It is the right anchor when the question is “is this model deployable,” because it forces the benign half into view.

There is no rule against using more than one. A thorough eval reports HarmBench-style ASR for field-comparability and JailbreakBench’s benign-refusal number for the deployment trade-off. What you must not do is run one and describe the result as if it were the other.
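One way to keep the two numbers side by side is to compute them from the same list of judged responses. The sketch below assumes a hypothetical Verdict record and made-up labels; it is not the API of either benchmark's harness, only an illustration of the paired metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical record of one judged response."""
    behavior_id: str
    is_harmful_behavior: bool  # True for the harmful half, False for the benign half
    complied: bool             # the judge's call: did the model do what was asked?


def paired_rates(verdicts):
    """Return (ASR on harmful behaviors, refusal rate on benign behaviors)."""
    harmful = [v for v in verdicts if v.is_harmful_behavior]
    benign = [v for v in verdicts if not v.is_harmful_behavior]
    if not harmful or not benign:
        raise ValueError("need both harmful and benign behaviors for a paired result")
    asr = sum(v.complied for v in harmful) / len(harmful)
    benign_refusal = sum(not v.complied for v in benign) / len(benign)
    return asr, benign_refusal


# Made-up labels, for illustration only.
sample = [
    Verdict("h1", True, True),
    Verdict("h2", True, False),
    Verdict("h3", True, False),
    Verdict("h4", True, False),
    Verdict("b1", False, True),
    Verdict("b2", False, True),
    Verdict("b3", False, False),
    Verdict("b4", False, False),
]

asr, benign_refusal = paired_rates(sample)
print(f"ASR on harmful behaviors: {asr:.2f}")
print(f"Refusal rate on benign behaviors: {benign_refusal:.2f}")
```

The function raises on a harmful-only input on purpose: a set with no benign behaviors cannot support a claim about over-refusal.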

The comparison killers

Three habits silently break cross-benchmark comparisons:

Mixing judges. An ASR computed with AdvBench’s substring check is not comparable to one computed with the HarmBench classifier, even on identical behaviors. The judge is part of the number.
Quoting a subset as the whole. Describing a run as “the jailbreak benchmark” when only 30 of 100 behaviors were evaluated produces a number that looks like a full result and isn’t. Report coverage.
Ignoring the version. These sets are revised. A behavior set without a version tag is a moving target, and a benchmark against a moving target is not reproducible. Pin the set version with the same discipline we apply to model snapshots.
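These three checks can be made mechanical by storing the set, version, judge, and coverage next to every number. The following is a rough sketch under stated assumptions: the BenchmarkResult record, the comparison_blockers helper, and every value in the example runs are hypothetical placeholders, not real results or a real harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical record: the metadata that has to travel with an ASR."""
    benchmark: str
    version: str
    judge: str
    behaviors_run: int
    behaviors_total: int
    asr: float


def comparison_blockers(a: BenchmarkResult, b: BenchmarkResult) -> list[str]:
    """List the reasons two results should not be compared directly."""
    blockers = []
    if a.judge != b.judge:
        blockers.append(f"different judges: {a.judge} vs {b.judge}")
    if (a.benchmark, a.version) != (b.benchmark, b.version):
        blockers.append("different behavior set or version")
    for result in (a, b):
        if result.behaviors_run < result.behaviors_total:
            blockers.append(
                f"partial coverage: {result.behaviors_run}/{result.behaviors_total}"
            )
    return blockers


# Placeholder runs; names and numbers are invented for illustration.
run_a = BenchmarkResult("example-behaviors", "version-A", "classifier-judge", 100, 100, 0.12)
run_b = BenchmarkResult("example-behaviors", "version-A", "refusal-substring-match", 30, 100, 0.40)

blockers = comparison_blockers(run_a, run_b)
for reason in blockers:
    print(reason)
```

Here the two runs share a set and version but differ in judge and coverage, so the helper reports two blockers. An empty list is the precondition for putting two ASR values in the same table.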

Where this fits in the network

aisecbench.com anchors its jailbreak results on these published sets by version and judge, and states which one (or which combination) produced each number. This pairs directly with our jailbreak resistance ASR methodology and the paired-metric approach in our red-team eval methodology; the reproducibility discipline is the same one in our reproducible LLM scanner benchmarks. For the scanners that operationalize these corpora, see Best LLM scanners; for the underlying attack mechanics, aisec.blog.

What we don’t do

Treat AdvBench, HarmBench, and JailbreakBench as one interchangeable “standard set”
Report an ASR computed with AdvBench’s original substring check as a modern result
Quote a subset of behaviors as if it were the full benchmark
Run a benchmark without recording its version
Use a harmful-only set to make a claim about over-refusal

The benchmarks are good. The mistake is treating them as fungible. Name the set, name the judge, name the version, and a safety result becomes something other people can stand on.
