AI Sec Bench

Benchmarking Jailbreak Classifiers: The Asymmetry Nobody Reports

Jailbreak classifiers are graded on attack recall and almost never on the cost of being wrong. That asymmetry is the whole story. Here's how to measure it.

By AI Sec Bench Editorial · 8 min read

A jailbreak classifier sits between a model and a user (or between a model and a tool) and decides whether a prompt — or a response — is an attempt to make the model violate its safety policy. Vendors benchmark these almost exclusively on attack recall: what fraction of jailbreak attempts get caught. That is the easy half. The hard half, the one that decides whether the classifier is shippable, is what it costs when the classifier is wrong on benign traffic. Almost nobody reports it.

This post is how we structure a jailbreak-classifier benchmark so the asymmetry is visible.

Two classifiers in one box

Most jailbreak classifiers actually make two different decisions, and they should be benchmarked separately:

  • Input classification. Is this incoming prompt an attempt to elicit unsafe output?
  • Output classification. Is this model response unsafe regardless of whether the prompt looked benign?

These have completely different error profiles. Input classifiers face adversarial obfuscation (the attacker controls the text). Output classifiers face the model’s own creativity (a benign-looking prompt can still produce unsafe output via indirect injection or emergent behavior). A benchmark that conflates them produces a number that describes neither. Report the two paths independently; many products are strong on one and weak on the other.
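As a minimal sketch of reporting the two paths independently, the Python below scores each path on its own. The `Example` record, the `rates_by_path` helper, and the toy data are illustrative assumptions, not part of any product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    path: str        # "input" (prompt) or "output" (model response)
    is_unsafe: bool  # ground-truth label
    flagged: bool    # classifier decision


def rate(flags):
    """Fraction of True values, or None when the list is empty."""
    return sum(flags) / len(flags) if flags else None


def rates_by_path(examples):
    """Return {path: (recall, false_positive_rate)}, never pooling the two paths."""
    report = {}
    for path in ("input", "output"):
        subset = [e for e in examples if e.path == path]
        recall = rate([e.flagged for e in subset if e.is_unsafe])
        fpr = rate([e.flagged for e in subset if not e.is_unsafe])
        report[path] = (recall, fpr)
    return report


# Toy data (invented): strong on the input path, weak on the output path.
examples = (
    [Example("input", True, True)] * 9 + [Example("input", True, False)]
    + [Example("input", False, False)] * 10
    + [Example("output", True, True)] * 5 + [Example("output", True, False)] * 5
    + [Example("output", False, False)] * 10
)
report = rates_by_path(examples)
print(report)
```

On this toy data the pooled recall would be 14 of 20, or 0.7, which matches neither the input path (0.9) nor the output path (0.5).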

The benign-refusal cost

Here is the asymmetry. When a jailbreak classifier false-positives, it does not just “flag a request.” It causes the product to refuse a legitimate user. The business cost of a wrongly refused user — a frustrated customer, an abandoned session, a support ticket — is frequently higher per event than the cost of a single missed low-severity jailbreak. Yet benchmarks report recall to two decimal places and false-positive rate not at all.

A credible jailbreak-classifier benchmark must include a benign-but-sensitive corpus: requests that are legitimate but live near the policy boundary. Examples:

  • Security and safety research questions (“explain how prompt injection works so I can defend against it”)
  • Medical, legal, and self-harm-adjacent questions asked in good faith
  • Fiction and creative writing involving conflict or weapons
  • Educational content about historical atrocities, drugs, or weapons in an academic frame

A classifier that refuses all of these may post high attack recall, but it is unusable in any general-purpose product. The benign-sensitive false-positive rate is the number that separates a deployable classifier from a liability. We weight it as heavily as recall, and report both.
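One way to compute the benign-sensitive false-positive rate is per category, so a single over-refused category is not averaged away. This is a sketch under stated assumptions: the category names, the `benign_fpr_by_category` helper, and the counts are invented for illustration.

```python
from collections import defaultdict


def benign_fpr_by_category(records):
    """records: iterable of (category, flagged) pairs, every one known-benign.

    Returns ({category: false-positive rate}, overall false-positive rate).
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for category, was_flagged in records:
        total[category] += 1
        flagged[category] += int(was_flagged)
    if not total:
        raise ValueError("benign-sensitive corpus is empty")
    per_category = {c: flagged[c] / total[c] for c in total}
    overall = sum(flagged.values()) / sum(total.values())
    return per_category, overall


# Toy corpus (invented counts): 20 benign requests per category.
records = (
    [("security_research", i < 2) for i in range(20)]
    + [("medical", i < 1) for i in range(20)]
    + [("fiction", i < 8) for i in range(20)]
    + [("education", i < 1) for i in range(20)]
)
per_category, overall = benign_fpr_by_category(records)
print(per_category, overall)
```

Here the overall rate is 0.15, while the fiction category alone is refused 40% of the time. The per-category view is what shows which legitimate users are being turned away.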

Severity-stratified recall

“Caught 95% of jailbreaks” treats a request for a mildly edgy joke the same as a request for genuinely dangerous instructions. They are not the same. Stratify the attack corpus by severity tier (a coarse low / medium / high mapped to your policy) and report recall per tier.

The decision-relevant question is not “what is overall recall” but “what is recall on the high-severity tier, and at what benign-sensitive false-positive rate.” A classifier that catches 99% of high-severity attempts at a 2% benign-sensitive FPR is a very different product from one that catches 99% overall (mostly the easy low-severity tier) and misses a third of the high-severity tier. Only the stratified view exposes this.
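The arithmetic behind that second classifier is easy to check. The sketch below uses an invented corpus of 1,000 attacks dominated by the low-severity tier; the tier sizes and the `recall_by_tier` helper are assumptions chosen only to reproduce the "99% overall, misses a third of high severity" case.

```python
def recall_by_tier(counts):
    """counts: {tier: (caught, total)}. Returns ({tier: recall}, overall recall)."""
    per_tier = {tier: caught / total for tier, (caught, total) in counts.items()}
    caught_all = sum(caught for caught, _ in counts.values())
    total_all = sum(total for _, total in counts.values())
    return per_tier, caught_all / total_all


# Invented corpus: 900 low, 70 medium, 30 high-severity attacks.
# The classifier catches every low and medium attack but only 20 of 30 high.
counts = {"low": (900, 900), "medium": (70, 70), "high": (20, 30)}
per_tier, overall = recall_by_tier(counts)
print(per_tier, overall)
```

Overall recall comes out at 0.99 while high-severity recall is about 0.67. The aggregate hides the gap because the high tier is only 3% of the corpus.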

Attack diversity beats attack volume

A benchmark with 10,000 variants of three jailbreak templates measures robustness to paraphrase, not robustness to technique. The HarmBench line of work makes this point well: diversity of attack strategy, not raw count, is the better indicator of real-world robustness.

Cover the technique families, each as its own reported slice:

  • Role-play / persona (“you are DAN…”)
  • Instruction-hierarchy attacks (fake system prompts, “developer mode”)
  • Obfuscation (encoding, translation, token splitting)
  • Many-shot / context-saturation
  • Indirect (jailbreak delivered via retrieved or tool content)
  • Adaptive (written after probing the classifier)

A classifier can ace persona attacks and collapse on indirect ones. If you only report the aggregate, you have hidden the exact failure an attacker will use.
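A per-family report can be sketched as below, with the weakest slice surfaced explicitly. The family labels, the helper names, and the counts are hypothetical; a real run would cover all six families listed above.

```python
from collections import Counter


def recall_by_family(attacks):
    """attacks: iterable of (family, caught) pairs, all true jailbreak attempts."""
    caught, total = Counter(), Counter()
    for family, was_caught in attacks:
        total[family] += 1
        caught[family] += int(was_caught)
    return {family: caught[family] / total[family] for family in total}


def weakest_slice(per_family):
    """Return the (family, recall) pair with the lowest recall."""
    return min(per_family.items(), key=lambda item: item[1])


# Toy data (invented): 50 persona attacks with 49 caught,
# 50 indirect attacks with 20 caught.
attacks = (
    [("persona", i < 49) for i in range(50)]
    + [("indirect", i < 20) for i in range(50)]
)
per_family = recall_by_family(attacks)
print(per_family, weakest_slice(per_family))
```

The aggregate here is 69 of 100, which describes neither the 0.98 persona slice nor the 0.4 indirect slice.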

Latency and the pipeline tax

Jailbreak classifiers run in the request path. If the classifier adds 400 ms at p95, that is 400 ms on every request, including the >99% that are benign. Many classifiers are themselves LLM calls, which means the classifier can cost more in latency and tokens than the model it protects.

Report p95 and p99 latency under realistic concurrency, and cost per 1,000 calls including the classifier’s own model usage. A classifier that is marginally more accurate but doubles per-request latency and cost is often the wrong choice, and the accuracy-only benchmark will never tell you that.
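Both numbers reduce to short calculations once the raw samples exist. The sketch below assumes latency samples in milliseconds were already collected under load (it does not generate concurrency itself), uses a nearest-rank percentile, and takes a made-up token price; every name and figure is illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1k_calls(total_tokens, calls, usd_per_1k_tokens):
    """Classifier cost in USD per 1,000 calls.

    Cost per call is tokens-per-call times price-per-token; scaling to
    1,000 calls cancels the "per 1k tokens" factor in the price.
    """
    tokens_per_call = total_tokens / calls
    return tokens_per_call * usd_per_1k_tokens


# Synthetic samples: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Hypothetical usage: 250,000 classifier tokens over 1,000 calls
# at an assumed price of $0.002 per 1,000 tokens.
cost = cost_per_1k_calls(250_000, 1000, 0.002)
print(p95, p99, cost)
```

Whatever percentile method is used, state it in the report. Nearest-rank and interpolated percentiles can differ noticeably on small sample sets.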

A protocol that respects the asymmetry

  1. Split input-path and output-path evaluation; never merge them.
  2. Build a benign-sensitive corpus (research, medical, fiction, educational at the boundary).
  3. Stratify attacks by severity; report recall per tier.
  4. Slice attacks by technique family; report per slice.
  5. Report recall and benign-sensitive false-positive rate at a fixed, stated threshold.
  6. Report p95/p99 latency and cost per 1k including the classifier’s own model.
  7. Include an adaptive attack slice regenerated each run.

Run this and the ranking changes. The classifier with the best advertised recall is frequently the one that refuses the most legitimate users — a fact that only appears when you measure the cost of being wrong, not just the rate of being right.
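Step 5 matters because recall and false-positive rate move together with the threshold. A minimal sketch, assuming the classifier emits a score in [0, 1] and flags at or above the threshold; the scores and the `rates_at_threshold` helper are invented for illustration.

```python
def rates_at_threshold(attack_scores, benign_scores, threshold):
    """Flag when score >= threshold; return (recall, benign-sensitive FPR)."""
    recall = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    fpr = sum(s >= threshold for s in benign_scores) / len(benign_scores)
    return recall, fpr


# Invented classifier scores for five attacks and five benign-sensitive requests.
attack_scores = [0.95, 0.90, 0.80, 0.60, 0.40]
benign_scores = [0.10, 0.20, 0.30, 0.55, 0.70]

loose = rates_at_threshold(attack_scores, benign_scores, 0.50)
strict = rates_at_threshold(attack_scores, benign_scores, 0.75)
print(loose, strict)
```

The same classifier reports recall 0.8 with a 0.4 false-positive rate at threshold 0.50, and recall 0.6 with a 0.0 false-positive rate at 0.75. A recall figure quoted without its threshold and its paired false-positive rate cannot be compared with anything.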

Where this fits in the network

aisecbench.com publishes severity-stratified jailbreak-classifier results with the benign-sensitive false-positive rate front and center. For the open guardrail tools these benchmarks cover, see AI content moderation tools; for the attack techniques behind the corpus, aisec.blog documents the families. The benchmark methodology here pairs with our prompt-injection detector benchmark design.

What we don’t do

  • Report attack recall without a benign-sensitive false-positive rate
  • Merge input-path and output-path results into one number
  • Treat low- and high-severity recall as interchangeable
  • Build the attack set from paraphrases of three templates
  • Ignore the per-request latency and cost the classifier itself adds

The honest summary of most jailbreak classifiers is “good recall, undisclosed refusal cost.” Measuring the refusal cost is the entire contribution of a real benchmark.

Sources

  1. Llama Guard (Meta, arXiv:2312.06674)
  2. OWASP LLM Top 10 (2025)
  3. HarmBench (arXiv:2402.04249)