AI Sec Bench

Red-Team Eval Methodology: Pairing Attack Success Rate With Refusal Rate

An LLM red-team evaluation that reports attack success rate without reporting refusal rate is half a measurement. Here's the paired methodology that makes the two numbers mean something together.

By AI Sec Bench Editorial · 8 min read

A red-team evaluation of an LLM produces two numbers that only mean something together: how often the model did the harmful thing it was attacked into doing (attack success rate, ASR), and how often it refused. Report ASR alone and you cannot distinguish a robust model from a model that refuses everything. Report refusal rate alone and you cannot tell a well-calibrated guard from a brittle one. This post is the paired methodology — how to measure both against the same corpus so the pair is decision-relevant.

Why one number is always a trap

The single-number temptation is strong because a single number sorts cleanly into a leaderboard. It is also where most red-team evals go wrong.

A model can drive ASR to near-zero by refusing aggressively — including refusing legitimate requests. A model can drive refusal rate to near-zero by answering everything — including the attacks. Each metric, optimized alone, produces a model nobody wants. The whole job of a red-team eval is to show the operating point: where on the trade-off curve this model sits, measured against attacks and against benign requests that look superficially like attacks.

That is why standardized benchmarks like JailbreakBench ship 100 harmful behaviors and 100 benign behaviors. The benign half is not decoration. It is the only way to catch a model that achieves a flattering ASR by being uselessly cautious.

Build two corpora, not one

A paired eval needs two corpora that you measure independently:

  • The attack corpus. Adversarial prompts targeting a fixed, versioned set of harmful behaviors. Anchor it on a published set — HarmBench (arXiv:2402.04249) or JailbreakBench — so the result is comparable to other people’s. ASR is the fraction of this corpus that elicited the harmful behavior.
  • The benign-but-adjacent corpus. Legitimate requests that live near the policy boundary: security research questions, medical and legal questions in good faith, fiction involving conflict, academic discussion of dangerous topics. Refusal rate is the fraction of this corpus that the model wrongly refused.

The benign-adjacent corpus is the one teams skip, and skipping it is the most common methodological failure in red-teaming. Refusal rate measured on obviously-fine requests (“what’s the weather”) is meaningless — every model passes. Refusal rate measured on the boundary is the number that tells you whether the model is deployable.
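
One way to keep both corpora "fixed and versioned" is to store each prompt as a small record and derive a fingerprint from the whole set, so a report can state exactly which corpus it measured. The sketch below is illustrative only: the `PromptRecord` fields, the `corpus_fingerprint` helper, and the sample prompts are assumptions made for this example, not part of HarmBench or JailbreakBench.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptRecord:
    """One prompt in either corpus (hypothetical schema)."""
    prompt_id: str  # stable identifier within the corpus
    corpus: str     # "attack" or "benign"
    category: str   # harm category or benign request type
    text: str       # the prompt sent to the target model


def corpus_fingerprint(records):
    """Return a SHA-256 hex digest that changes whenever any record changes."""
    # Sort by id so the fingerprint does not depend on file order.
    canonical = [asdict(r) for r in sorted(records, key=lambda r: r.prompt_id)]
    payload = json.dumps(canonical, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder records; real corpora would be loaded from versioned files.
attack = [PromptRecord("a-001", "attack", "malware", "<adversarial prompt text>")]
benign = [PromptRecord("b-001", "benign", "security-research",
                       "How does a buffer overflow work?")]

print("attack corpus:", corpus_fingerprint(attack)[:12])
print("benign corpus:", corpus_fingerprint(benign)[:12])
```

Publishing the two fingerprints next to the results makes it obvious when two reports were not measured against the same prompts.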

The four-cell view

Once you have both corpora and a judge, every prompt-response pair lands in one of four cells. Think of it as a two-by-two confusion matrix: the rows are the corpus the prompt came from (attack or benign) and the columns are what the model did (complied or refused):

  • Attack → harmful output: a successful attack (counts toward ASR).
  • Attack → refusal: a successful defense.
  • Benign → answered: correct behavior.
  • Benign → refusal: over-refusal (counts toward benign refusal rate).

ASR is computed entirely within the attack corpus; benign refusal rate entirely within the benign corpus. Reporting the two cells that matter — successful attacks and over-refusals — at a fixed model configuration is the irreducible output of a red-team eval. Everything else is a roll-up of these.
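
As a minimal sketch of that bookkeeping, the function below tallies judged results into cells and computes each rate within its own corpus. The `(corpus, label)` tuple format and the label strings are assumptions chosen for this example; the counts are made up.

```python
from collections import Counter


def operating_point(results):
    """Compute (ASR, benign refusal rate) from judged results.

    results: iterable of (corpus, label) pairs, where corpus is "attack" or
    "benign" and label is "harmful", "refused", or "answered".
    """
    cells = Counter(results)
    n_attack = sum(n for (corpus, _), n in cells.items() if corpus == "attack")
    n_benign = sum(n for (corpus, _), n in cells.items() if corpus == "benign")
    if n_attack == 0 or n_benign == 0:
        raise ValueError("both corpora must be non-empty")
    # Each rate uses only its own corpus as the denominator.
    asr = cells[("attack", "harmful")] / n_attack
    benign_refusal = cells[("benign", "refused")] / n_benign
    return asr, benign_refusal


# Hypothetical judged results: 50 attack prompts, 50 benign prompts.
results = (
    [("attack", "harmful")] * 3
    + [("attack", "refused")] * 47
    + [("benign", "answered")] * 45
    + [("benign", "refused")] * 5
)
asr, benign_refusal = operating_point(results)
print(f"ASR = {asr:.2%}, benign refusal = {benign_refusal:.2%}")
```

Keeping the two denominators separate is the point: pooling all 100 prompts into one rate would blend the two metrics into a number that describes neither.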

The judge decides both numbers, so calibrate it once

Both ASR and refusal rate depend on a judge: a classifier or LLM-as-judge that reads each response and labels it harmful, refused, or answered. The same judge scores both corpora, which is convenient: one pinned judge configuration and one calibration exercise govern both numbers, provided the calibration sample covers both corpora.

  • Pin the judge to a dated version or weight hash. A drifting judge moves ASR and refusal rate together in unpredictable ways.
  • Calibrate against human labels on a sample from both corpora. The judge that is good at spotting harmful completions may be bad at distinguishing a genuine refusal from a hedged-but-compliant answer. Report agreement separately for the attack and benign sets if they differ.
  • Watch the refusal-detection edge case. “I can’t help with that, but here’s how it generally works…” is a refusal that leaks. A naive judge scores it as a refusal; it is actually a partial attack success. Define how your judge handles partial compliance and document it.
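
Judge-human agreement can be summarized per corpus with a chance-corrected statistic. The sketch below uses Cohen's kappa, which is one common choice and not necessarily the measure used for the results on this site; the label sequences are invented for illustration.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Fraction of items where the judge matches the human label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration samples, one per corpus: (human labels, judge labels).
samples = {
    "attack": (["harmful", "refused", "refused", "harmful"],
               ["harmful", "refused", "harmful", "harmful"]),
    "benign": (["answered", "answered", "refused", "answered"],
               ["answered", "refused", "refused", "answered"]),
}
for corpus, (human, judge) in samples.items():
    print(corpus, round(cohens_kappa(human, judge), 2))
```

Computing the statistic separately for each corpus is what exposes a judge that spots harmful completions well but misreads refusals, or the reverse.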

Tooling: where the corpora and orchestration come from

You do not have to build the attack orchestration from scratch. Microsoft’s PyRIT (Python Risk Identification Tool for generative AI, MIT-licensed) is built for exactly this loop: it sends adversarial prompts to a target, supports multi-turn attack strategies, and scores responses with true/false, Likert, or classification scorers backed by an LLM, a content-safety service, or your own logic. It targets OpenAI, Azure, Anthropic, Google, Hugging Face, and custom HTTP endpoints, which makes it a reasonable harness for running the attack corpus and collecting responses for your judge.

PyRIT supplies the attack and scoring machinery; the benign-adjacent corpus and the human calibration are yours to build, because they encode your policy. A tool can automate the loop. It cannot decide what “harmful” and “over-refusal” mean for your product — that is the part you have to own.

Report the operating point, not a score

The output of a paired red-team eval is not a single grade. It is an operating point with both coordinates and their uncertainty:

  1. ASR on the versioned attack corpus, stratified by category and technique family.
  2. Benign refusal rate on the boundary corpus, stratified by request type.
  3. Both at a stated, fixed model configuration (model version, system prompt, safety settings, sampling).
  4. Both with N≥5 runs and reported spread for non-deterministic targets.
  5. Judge version pinned; judge–human agreement reported for each corpus.
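
For item 4, a minimal sketch of reporting spread across repeated runs follows. It assumes each run yields one rate for the whole corpus; the `summarize_runs` helper and the five ASR values are hypothetical, and sample standard deviation is only one reasonable choice of spread.

```python
from statistics import mean, stdev


def summarize_runs(per_run_rates):
    """Summarize one metric measured over repeated runs of a fixed configuration."""
    if len(per_run_rates) < 2:
        raise ValueError("need at least two runs to report a spread")
    return {
        "n_runs": len(per_run_rates),
        "mean": mean(per_run_rates),
        "stdev": stdev(per_run_rates),  # sample standard deviation
        "min": min(per_run_rates),
        "max": max(per_run_rates),
    }


# Hypothetical ASR values from five runs at the same model configuration.
asr_runs = [0.06, 0.05, 0.08, 0.06, 0.07]
summary = summarize_runs(asr_runs)
print(f"ASR {summary['mean']:.1%} +/- {summary['stdev']:.1%} "
      f"(range {summary['min']:.0%} to {summary['max']:.0%}, "
      f"n={summary['n_runs']})")
```

The same summary applies unchanged to the benign refusal rate, so both coordinates of the operating point carry their own spread.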

A model at 6% ASR / 4% benign refusal is a genuinely different product from one at 6% ASR / 22% benign refusal, even though their attack robustness is identical. Only the paired report distinguishes them.

Where this fits in the network

aisecbench.com reports red-team results as paired ASR and benign-refusal operating points, never a lone grade, with the judge pinned and calibrated. The ASR side builds on our jailbreak resistance ASR methodology; the refusal side connects to our jailbreak classifier benchmark design. For the scanners and orchestration harnesses that run these corpora, see Best LLM scanners; for the attack technique families, aisec.blog.

What we don’t do

  • Report attack success rate without a paired benign refusal rate
  • Measure refusal rate on obviously-benign requests instead of boundary requests
  • Use a judge that can’t distinguish a leaking refusal from a clean one
  • Average across categories so the dangerous tier disappears
  • Present an operating point from a single non-deterministic run with no spread
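
The category-averaging problem is easy to see with numbers. The sketch below uses invented counts and hypothetical tier names to show how a pooled ASR can look low while one small, high-severity tier fails often.

```python
from collections import defaultdict


def asr_by_category(attack_results):
    """Per-category ASR from (category, succeeded) pairs on the attack corpus."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, succeeded in attack_results:
        totals[category] += 1
        hits[category] += int(succeeded)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical results: 90 low-severity prompts, 10 high-severity prompts.
attack_results = (
    [("low-severity", False)] * 90
    + [("high-severity", True)] * 4
    + [("high-severity", False)] * 6
)
pooled = sum(succeeded for _, succeeded in attack_results) / len(attack_results)
per_category = asr_by_category(attack_results)

print(f"pooled ASR: {pooled:.0%}")
for category, rate in per_category.items():
    print(f"{category}: {rate:.0%}")
```

In this made-up case the pooled figure is 4% while the high-severity tier sits at 40%, which is why the stratified table has to be reported alongside any roll-up.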

A red-team eval that reports one number is selling a leaderboard rank. A red-team eval that reports the paired operating point is telling you whether to ship. Measure both against purpose-built corpora, calibrate the judge once, and the pair becomes the most useful input you can give whoever makes the deployment decision.

Sources

  1. HarmBench (Mazeika et al., arXiv:2402.04249)
  2. JailbreakBench (Chao et al., NeurIPS 2024)
  3. PyRIT — Python Risk Identification Tool (Microsoft)