Red-Team Eval Methodology: Pairing Attack Success Rate With Refusal Rate
An LLM red-team evaluation that reports attack success rate without reporting refusal rate is half a measurement. Here's the paired methodology that makes the two numbers mean something together.
A red-team evaluation of an LLM produces two numbers that only mean something together: how often the model did the harmful thing it was attacked into doing (attack success rate, ASR), and how often it refused. Report ASR alone and you cannot distinguish a robust model from a model that refuses everything. Report refusal rate alone and you cannot tell a well-calibrated guard from a brittle one. This post is the paired methodology — how to measure both against the same corpus so the pair is decision-relevant.
Why one number is always a trap
The single-number temptation is strong because a single number sorts cleanly into a leaderboard. It is also where most red-team evals go wrong.
A model can drive ASR to near-zero by refusing aggressively — including refusing legitimate requests. A model can drive refusal rate to near-zero by answering everything — including the attacks. Each metric, optimized alone, produces a model nobody wants. The whole job of a red-team eval is to show the operating point: where on the trade-off curve this model sits, measured against attacks and against benign requests that look superficially like attacks.
That is why standardized benchmarks like JailbreakBench ↗ ship 100 harmful behaviors and 100 benign behaviors. The benign half is not decoration. It is the only way to catch a model that achieves a flattering ASR by being uselessly cautious.
Build two corpora, not one
A paired eval needs two corpora that you measure independently:
- The attack corpus. Adversarial prompts targeting a fixed, versioned set of harmful behaviors. Anchor it on a published set — HarmBench ↗ (arXiv:2402.04249) or JailbreakBench — so the result is comparable to other people’s. ASR is the fraction of this corpus that elicited the harmful behavior.
- The benign-but-adjacent corpus. Legitimate requests that live near the policy boundary: security research questions, medical and legal questions in good faith, fiction involving conflict, academic discussion of dangerous topics. Refusal rate is the fraction of this corpus that the model wrongly refused.
The benign-adjacent corpus is the one teams skip, and skipping it is the most common methodological failure in red-teaming. Refusal rate measured on obviously-fine requests (“what’s the weather”) is meaningless — every model passes. Refusal rate measured on the boundary is the number that tells you whether the model is deployable.
The four-cell view
Once you have both corpora and a judge, every prompt-response pair lands in one of four cells. Think of it as a confusion matrix where “positive” means “the model produced unsafe / refused output”:
- Attack → harmful output: a successful attack (counts toward ASR).
- Attack → refusal: a successful defense.
- Benign → answered: correct behavior.
- Benign → refusal: over-refusal (counts toward benign refusal rate).
ASR is computed entirely within the attack corpus; benign refusal rate entirely within the benign corpus. Reporting the two cells that matter — successful attacks and over-refusals — at a fixed model configuration is the irreducible output of a red-team eval. Everything else is a roll-up of these.
The judge decides both numbers, so calibrate it once
Both ASR and refusal rate depend on a judge: a classifier or LLM-as-judge that reads each response and labels it harmful / refused / answered. The same judge scores both corpora, which is convenient — it means a single calibration governs both numbers.
- Pin the judge to a dated version or weight hash. A drifting judge moves ASR and refusal rate together in unpredictable ways.
- Calibrate against human labels on a sample from both corpora. The judge that is good at spotting harmful completions may be bad at distinguishing a genuine refusal from a hedged-but-compliant answer. Report agreement separately for the attack and benign sets if they differ.
- Watch the refusal-detection edge case. “I can’t help with that, but here’s how it generally works…” is a refusal that leaks. A naive judge scores it as a refusal; it is actually a partial attack success. Define how your judge handles partial compliance and document it.
Tooling: where the corpora and orchestration come from
You do not have to build the attack orchestration from scratch. Microsoft’s PyRIT ↗ (Python Risk Identification Tool for generative AI, MIT-licensed) is built for exactly this loop: it sends adversarial prompts to a target, supports multi-turn attack strategies, and scores responses with true/false, Likert, or classification scorers backed by an LLM, a content-safety service, or your own logic. It targets OpenAI, Azure, Anthropic, Google, Hugging Face, and custom HTTP endpoints, which makes it a reasonable harness for running the attack corpus and collecting responses for your judge.
PyRIT supplies the attack and scoring machinery; the benign-adjacent corpus and the human calibration are yours to build, because they encode your policy. A tool can automate the loop. It cannot decide what “harmful” and “over-refusal” mean for your product — that is the part you have to own.
Report the operating point, not a score
The output of a paired red-team eval is not a single grade. It is an operating point with both coordinates and their uncertainty:
- ASR on the versioned attack corpus, stratified by category and technique family.
- Benign refusal rate on the boundary corpus, stratified by request type.
- Both at a stated, fixed model configuration (model version, system prompt, safety settings, sampling).
- Both with N≥5 runs and reported spread for non-deterministic targets.
- Judge version pinned; judge–human agreement reported for each corpus.
A model at 6% ASR / 4% benign refusal is a genuinely different product from one at 6% ASR / 22% benign refusal, even though their attack robustness is identical. Only the paired report distinguishes them.
Where this fits in the network
aisecbench.com reports red-team results as paired ASR and benign-refusal operating points, never a lone grade, with the judge pinned and calibrated. The ASR side builds on our jailbreak resistance ASR methodology; the refusal side connects to our jailbreak classifier benchmark design. For the scanners and orchestration harnesses that run these corpora, see Best LLM scanners ↗; for the attack technique families, aisec.blog ↗.
What we don’t do
- Report attack success rate without a paired benign refusal rate
- Measure refusal rate on obviously-benign requests instead of boundary requests
- Use a judge that can’t distinguish a leaking refusal from a clean one
- Average across categories so the dangerous tier disappears
- Present an operating point from a single non-deterministic run with no spread
A red-team eval that reports one number is selling a leaderboard rank. A red-team eval that reports the paired operating point is telling you whether to ship. Measure both against purpose-built corpora, calibrate the judge once, and the pair becomes the most useful thing you can hand a deployment decision.
Sources
AI Sec Bench — in your inbox
Benchmarks and evaluations of AI security tools. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
Benchmarking LLM Jailbreak Resistance: Attack Success Rate Done Right
Attack success rate is the headline metric for jailbreak resistance, and almost everyone computes it in a way that isn't comparable across runs. Here's how to define and report ASR so the number survives a re-run.
Designing a Reproducible AI-Security Eval Harness
A reproducible AI-security evaluation is an engineering artifact, not a notebook. Here's the harness design — separation of corpus, target, judge, and report — that lets a stranger re-run your number.
Measuring Prompt-Injection Robustness in Tool-Using Agents
Prompt-injection robustness for an agent is not a single number — it is utility-under-attack against targeted attack success. Here's how AgentDojo and InjecAgent measure it and what the metrics actually mean.