Benchmarking Jailbreak Classifiers: The Asymmetry Nobody Reports

A jailbreak classifier sits between a model and a user (or between a model and a tool) and decides whether a prompt — or a response — is an attempt to make the model violate its safety policy. Vendors benchmark these almost exclusively on attack recall: what fraction of jailbreak attempts get caught. That is the easy half. The hard half, the one that decides whether the classifier is shippable, is what it costs when the classifier is wrong on benign traffic. Almost nobody reports it.

This post is how we structure a jailbreak-classifier benchmark so the asymmetry is visible.

Two classifiers in one box

Most jailbreak classifiers actually make two different decisions, and they should be benchmarked separately:

Input classification. Is this incoming prompt an attempt to elicit unsafe output?
Output classification. Is this model response unsafe regardless of whether the prompt looked benign?

These have completely different error profiles. Input classifiers face adversarial obfuscation (the attacker controls the text). Output classifiers face the model’s own creativity (a benign-looking prompt can still produce unsafe output via indirect injection or emergent behavior). A benchmark that conflates them produces a number that describes neither. Report the two paths independently; many products are strong on one and weak on the other.

The benign-refusal cost

Here is the asymmetry. When a jailbreak classifier false-positives, it does not just “flag a request.” It causes the product to refuse a legitimate user. The business cost of a wrongly refused user — a frustrated customer, an abandoned session, a support ticket — is frequently higher per event than the cost of a single missed low-severity jailbreak. Yet benchmarks report recall to two decimal places and false-positive rate not at all.

A credible jailbreak-classifier benchmark must include a benign-but-sensitive corpus: requests that are legitimate but live near the policy boundary. Examples:

Security and safety research questions (“explain how prompt injection works so I can defend against it”)
Medical, legal, and self-harm-adjacent questions asked in good faith
Fiction and creative writing involving conflict or weapons
Educational content about historical atrocities, drugs, or weapons in an academic frame

A classifier that refuses all of these can still post high attack recall, yet it is unusable in any general-purpose product. The benign-sensitive false-positive rate is the number that separates a deployable classifier from a liability. We weight it as heavily as recall, and report both.
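As a rough illustration of how the benign-sensitive false-positive rate could be tallied per category, here is a minimal sketch. The function name, the category labels, and the toy records are all hypothetical; the only assumption about the corpus is that every prompt in it is benign, so any flag counts as a false positive.

```python
from collections import defaultdict

def benign_fpr_by_category(records):
    """records: iterable of (category, flagged) pairs for benign-sensitive prompts.

    Every record is benign by construction, so each flag is a false positive.
    Returns {category: false-positive rate} plus an "overall" entry.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for category, was_flagged in records:
        total[category] += 1
        flagged[category] += int(was_flagged)
    rates = {c: flagged[c] / total[c] for c in total}
    n = sum(total.values())
    rates["overall"] = sum(flagged.values()) / n if n else 0.0
    return rates

# Hypothetical toy data, not real benchmark results.
sample = [
    ("security_research", True),
    ("security_research", False),
    ("medical", False),
    ("medical", False),
    ("fiction", True),
    ("fiction", True),
    ("education", False),
    ("education", False),
]
rates = benign_fpr_by_category(sample)
for category, rate in rates.items():
    print(f"{category}: {rate:.3f}")
```

Reporting the per-category rates next to the overall figure shows whether the refusals are concentrated in one kind of legitimate request, such as fiction in the toy data above.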

Severity-stratified recall

“Caught 95% of jailbreaks” treats a request for a mildly edgy joke the same as a request for genuinely dangerous instructions. They are not the same. Stratify the attack corpus by severity tier (a coarse low / medium / high mapped to your policy) and report recall per tier.

The decision-relevant question is not “what is overall recall” but “what is recall on the high-severity tier, and at what benign-sensitive false-positive rate.” A classifier that catches 99% of high-severity attempts at a 2% benign-sensitive FPR is a very different product from one that catches 99% overall (mostly the easy low-severity tier) and misses a third of the high-severity tier. Only the stratified view exposes this.
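A small worked example shows how the two views can diverge. The tier counts below are invented for illustration: a corpus dominated by low-severity attacks can reach 99% overall recall while a third of the high-severity tier gets through.

```python
# Hypothetical counts per severity tier: (attacks in tier, attacks caught).
tiers = {
    "low": (9000, 8950),
    "medium": (910, 890),
    "high": (90, 60),
}

def recall_per_tier(tiers):
    """Recall computed separately inside each severity tier."""
    return {name: caught / total for name, (total, caught) in tiers.items()}

def overall_recall(tiers):
    """Recall over the pooled corpus, ignoring tier boundaries."""
    total = sum(t for t, _ in tiers.values())
    caught = sum(c for _, c in tiers.values())
    return caught / total

per_tier = recall_per_tier(tiers)
overall = overall_recall(tiers)

print(f"overall: {overall:.3f}")   # overall: 0.990
for name, r in per_tier.items():
    print(f"{name}: {r:.3f}")      # low: 0.994, medium: 0.978, high: 0.667
```

The pooled number is driven almost entirely by the 9,000 low-severity prompts; the 30 missed high-severity attacks move it by only 0.3 percentage points.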

Attack diversity beats attack volume

A benchmark with 10,000 variants of three jailbreak templates measures robustness to paraphrase, not robustness to technique. The HarmBench line of work makes this point well: diversity of attack strategy is what predicts real-world robustness, not raw count.

Cover the technique families, each as its own reported slice:

Role-play / persona (“you are DAN…”)
Instruction-hierarchy attacks (fake system prompts, “developer mode”)
Obfuscation (encoding, translation, token splitting)
Many-shot / context-saturation
Indirect (jailbreak delivered via retrieved or tool content)
Adaptive (written after probing the classifier)

A classifier can ace persona attacks and collapse on indirect ones. If you only report the aggregate, you have hidden the exact failure an attacker will use.
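One way to keep the weak slice visible is to report recall per technique family and name the worst one explicitly. The sketch below assumes each attack prompt has already been labelled with a family; the function names, labels, and toy results are hypothetical.

```python
from collections import Counter

def recall_by_family(results):
    """results: iterable of (family, caught) pairs, one per attack prompt."""
    total, caught = Counter(), Counter()
    for family, was_caught in results:
        total[family] += 1
        caught[family] += int(was_caught)
    return {f: caught[f] / total[f] for f in total}

def weakest_slice(per_family):
    """Return the (family, recall) pair with the lowest recall."""
    return min(per_family.items(), key=lambda kv: kv[1])

# Hypothetical toy results: strong on persona attacks, weak on indirect ones.
results = (
    [("persona", True)] * 4
    + [("obfuscation", True)] * 3 + [("obfuscation", False)]
    + [("indirect", True)] + [("indirect", False)] * 3
)
per_family = recall_by_family(results)
aggregate = sum(caught for _, caught in results) / len(results)
worst_family, worst_recall = weakest_slice(per_family)

print(f"aggregate recall: {aggregate:.3f}")
print(f"weakest slice: {worst_family} at {worst_recall:.2f}")
```

In this toy data the aggregate sits at two thirds while the indirect slice is at one quarter, which is the gap an aggregate-only report would hide.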

Latency and the pipeline tax

Jailbreak classifiers run in the request path, so whatever latency the classifier adds is paid on every request, including the >99% that are benign. A 400 ms p95 overhead means one request in twenty waits at least 400 ms longer. Many classifiers are themselves LLM calls, which means the classifier can cost more in latency and tokens than the model it protects.

Report p95 and p99 latency under realistic concurrency, and cost per 1,000 calls including the classifier’s own model usage. A classifier that is marginally more accurate but doubles per-request latency and cost is often the wrong choice, and the accuracy-only benchmark will never tell you that.
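For the reporting step, a minimal sketch of the arithmetic might look like the following. It uses the nearest-rank definition of a percentile, which is one of several common conventions, and it does not generate load: the latency samples and the cost figure are hypothetical stand-ins for measurements taken under realistic concurrency.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def cost_per_1k(total_cost, calls):
    """Average cost scaled to 1,000 classifier calls."""
    return 1000 * total_cost / calls

# Hypothetical classifier latencies in milliseconds for 100 requests.
latencies_ms = [40] * 94 + [400] * 4 + [900] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
# Hypothetical spend: 0.25 currency units for these 100 calls.
per_1k = cost_per_1k(total_cost=0.25, calls=len(latencies_ms))

print(f"p50={p50} ms  p95={p95} ms  p99={p99} ms  cost/1k={per_1k}")
```

The median in this toy sample looks harmless at 40 ms, which is why the tail percentiles are the ones worth reporting.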

A protocol that respects the asymmetry

Split input-path and output-path evaluation; never merge them.
Build a benign-sensitive corpus (research, medical, fiction, educational at the boundary).
Stratify attacks by severity; report recall per tier.
Slice attacks by technique family; report per slice.
Report recall and benign-sensitive false-positive rate at a fixed, stated threshold.
Report p95/p99 latency and cost per 1k including the classifier’s own model.
Include an adaptive attack slice regenerated each run.
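The fixed-threshold step matters because recall and false-positive rate move together when the threshold moves. The sketch below assumes the classifier emits a numeric score and flags a prompt when the score is at or above the threshold; that interface, the function name, and the scores are hypothetical.

```python
def evaluate_at_threshold(attack_scores, benign_scores, threshold):
    """Flag a prompt when score >= threshold.

    Returns (recall on attacks, false-positive rate on benign-sensitive prompts),
    both measured at the same threshold.
    """
    caught = sum(s >= threshold for s in attack_scores)
    false_pos = sum(s >= threshold for s in benign_scores)
    return caught / len(attack_scores), false_pos / len(benign_scores)

# Hypothetical classifier scores, not real measurements.
attack_scores = [0.95, 0.90, 0.80, 0.60, 0.30]
benign_scores = [0.10, 0.20, 0.55, 0.70, 0.05, 0.40, 0.15, 0.85]

for threshold in (0.5, 0.75):
    recall, fpr = evaluate_at_threshold(attack_scores, benign_scores, threshold)
    print(f"threshold={threshold}: recall={recall:.3f}, benign-sensitive FPR={fpr:.3f}")
```

A recall figure quoted at one threshold and a false-positive rate quoted at another describe two different classifiers, so both numbers should come from the same stated operating point.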

Run this and the ranking changes. The classifier with the best advertised recall is frequently the one that refuses the most legitimate users — a fact that only appears when you measure the cost of being wrong, not just the rate of being right.

Where this fits in the network

aisecbench.com publishes severity-stratified jailbreak-classifier results with the benign-sensitive false-positive rate front and center. For the open guardrail tools these benchmarks cover, see AI content moderation tools; for the attack techniques behind the corpus, aisec.blog documents the families. The benchmark methodology here pairs with our prompt-injection detector benchmark design.

What we don’t do

Report attack recall without a benign-sensitive false-positive rate
Merge input-path and output-path results into one number
Treat low- and high-severity recall as interchangeable
Build the attack set from paraphrases of three templates
Ignore the per-request latency and cost the classifier itself adds

The honest summary of most jailbreak classifiers is “good recall, undisclosed refusal cost.” Measuring the refusal cost is the entire contribution of a real benchmark.
