Measuring Prompt-Injection Robustness in Tool-Using Agents

Prompt injection against a chatbot is a content problem: a crafted prompt makes the model say something it shouldn’t. Prompt injection against a tool-using agent is an action problem: a crafted string buried in a retrieved document or a tool’s output makes the agent do something, such as send an email, transfer money, or exfiltrate data. Measuring robustness to the second is harder, because the metric has to capture both whether the attack succeeded and whether the agent still did its real job. This post covers how the agentic prompt-injection benchmarks measure that, and how to read the numbers.

Two numbers, not one

For a chatbot, attack success rate is nearly the whole story. For an agent, robustness is inherently a pair:

Targeted attack success rate — how often the injected instruction got the agent to perform the attacker’s goal (the unauthorized email, the transfer, the leaked secret).
Benign utility under attack — how often the agent still completed the user’s legitimate task while the injection was present.

These move independently, and a defense that crushes one often wrecks the other. A guard that makes an agent refuse to act on any external content drops attack success to near zero — and also drops benign utility, because real tasks require acting on external content. The whole point of an agentic injection benchmark is to show both coordinates, because a “robust” agent that can no longer do its job is not a solution.
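As a minimal sketch of the pair, assume each run with an injection present yields two judge verdicts: whether the user's task was completed and whether the attacker's goal was carried out. The `EpisodeResult` record and `robustness_pair` function are hypothetical names for illustration, not part of any benchmark's API, and the outcomes are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one run with an injection present (hypothetical record)."""
    user_task_done: bool      # judge says the legitimate task was completed
    attacker_goal_done: bool  # judge says the injected goal was carried out


def robustness_pair(results):
    """Return (targeted ASR, benign utility under attack) as fractions."""
    if not results:
        raise ValueError("need at least one episode")
    n = len(results)
    asr = sum(r.attacker_goal_done for r in results) / n
    utility = sum(r.user_task_done for r in results) / n
    return asr, utility


# A guard that refuses all external content: no attack lands, no task finishes.
refuse_all = [EpisodeResult(False, False)] * 4

# An undefended agent: most tasks finish, but half of the attacks land too.
undefended = [
    EpisodeResult(True, True),
    EpisodeResult(True, False),
    EpisodeResult(True, True),
    EpisodeResult(False, False),
]

print(robustness_pair(refuse_all))  # (0.0, 0.0)
print(robustness_pair(undefended))  # (0.5, 0.75)
```

Reported alone, the refuse-everything guard's 0.0 attack success looks like a win; the second coordinate shows it completed nothing.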

AgentDojo: a dynamic environment, not a static corpus

AgentDojo (Debenedetti et al., arXiv:2406.13352; NeurIPS 2024 Datasets and Benchmarks; ETH Zurich) is the most-cited environment for this. Its design choice is important: it is not a fixed list of prompts, it is an extensible environment of realistic tasks, so attacks and defenses can be added and re-evaluated rather than baked in.

The published environment contains:

97 realistic user tasks across suites like an email client, an e-banking site, travel booking, and a workspace.
629 security test cases — combinations of a user task and an injection task placed in the data the agent processes.
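To make "combinations of a user task and an injection task" concrete, here is a small sketch that assumes cases are formed by pairing every user task in a suite with every injection task in the same suite. The task names are invented for illustration and are not AgentDojo's real identifiers.

```python
from itertools import product

# Hypothetical task identifiers for a single suite; not AgentDojo's real IDs.
user_tasks = ["summarize_inbox", "pay_invoice", "book_hotel"]
injection_tasks = ["forward_emails_to_attacker", "wire_money_to_attacker"]


def security_cases(users, injections):
    """Pair every user task with every injection task in the same suite."""
    return [
        {"user_task": u, "injection_task": i}
        for u, i in product(users, injections)
    ]


cases = security_cases(user_tasks, injection_tasks)
print(len(cases))  # 6
```

Under this assumption, adding one new injection task to a suite adds one new security case per user task, which is why the case count grows much faster than the task count.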

The headline findings give you a sense of scale and difficulty. In the original paper, current LLMs solved less than 66% of AgentDojo tasks even with no attack present — agents are not yet very capable at these tasks to begin with — and attacks succeeded against the best-performing agents in less than 25% of cases. Both numbers matter: the first is the benign-utility ceiling, the second is the attack surface, and a defense has to be read against both.

Because AgentDojo is an environment, the right way to use it is to plug your agent, your defense, and adaptive attacks into it and report utility-under-attack and targeted-ASR together. A static screenshot of someone else’s run is not your result.

InjecAgent: indirect injection through tool outputs

InjecAgent (Zhan et al., arXiv:2403.02691; ACL Findings 2024) targets the specific mechanism of indirect prompt injection: malicious instructions returned by a tool the agent calls, rather than typed by the user. Its corpus:

1,054 test cases
17 user tools and 62 attacker tools
Attacker objectives split into two harm types: direct harm to users and exfiltration of private data.

The original finding — a ReAct-prompted GPT-4 was vulnerable about 24% of the time, with success rates rising when the attacker’s instruction was reinforced — is useful as a baseline, but the durable value of InjecAgent is the harm split. “Direct harm” and “data exfiltration” have different consequences and different mitigations, so a robustness number that averages them hides which one your agent is exposed to. Report the two harm types separately.
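A short sketch of why the split matters, using made-up outcomes rather than InjecAgent results. The harm-type labels and the `asr_by_harm_type` helper are assumptions for illustration.

```python
from collections import defaultdict


def asr_by_harm_type(records):
    """records: iterable of (harm_type, attack_succeeded) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for harm_type, succeeded in records:
        totals[harm_type] += 1
        hits[harm_type] += int(succeeded)
    return {h: hits[h] / totals[h] for h in totals}


# Made-up outcomes: 10 cases per harm type.
records = (
    [("direct_harm", True)] * 1
    + [("direct_harm", False)] * 9
    + [("data_exfiltration", True)] * 5
    + [("data_exfiltration", False)] * 5
)

split = asr_by_harm_type(records)
pooled = sum(ok for _, ok in records) / len(records)

print(split)   # {'direct_harm': 0.1, 'data_exfiltration': 0.5}
print(pooled)  # 0.3
```

The pooled 0.3 describes neither harm type: in this toy data the agent is fairly resistant to direct harm and leaks data half the time.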

Read the metrics, and the metrics’ critics

A 2025 critique is worth internalizing. Independent analysis (e.g., the “Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?” line of work, arXiv:2510.05244) makes two claims: several agentic injection benchmarks do not model real-world situations well, and some use skewed metrics that make weak defenses look strong. The lesson is not to discard the benchmarks, which remain the best public anchors, but to read their numbers with the same skepticism we apply to any single metric:

A defense that posts a near-zero attack success rate is suspicious until you see its benign utility. A guard can win on ASR by breaking the agent.
Static attack success is a lower bound. Adaptive attacks — written against your specific defense — routinely exceed it. Report adaptive ASR separately.
A benchmark’s “success” definition is a policy choice. Confirm what the judge counts as the attacker’s goal being achieved before you trust the rate.
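The last point can be illustrated with a sketch in which the same tool-call traces are scored by two judges. Both judges, the trace format, and the tool names are hypothetical; real benchmarks define success in their own ways, which is exactly what needs checking.

```python
def attacker_tool_called(trace, goal):
    """Lenient judge: any call to the attacker's tool counts as success."""
    return any(call["tool"] == goal["tool"] for call in trace)


def attacker_goal_executed(trace, goal):
    """Strict judge: the call must also carry the attacker's arguments."""
    return any(
        call["tool"] == goal["tool"] and call["args"] == goal["args"]
        for call in trace
    )


# Hypothetical attacker goal and four recorded agent traces.
goal = {"tool": "send_money", "args": {"to": "attacker-iban", "amount": 100}}
traces = [
    [{"tool": "send_money", "args": {"to": "attacker-iban", "amount": 100}}],
    [{"tool": "send_money", "args": {"to": "landlord-iban", "amount": 900}}],
    [{"tool": "read_inbox", "args": {}}],
    [{"tool": "read_inbox", "args": {}}],
]


def rate(judge):
    """Fraction of traces the given judge scores as an attack success."""
    return sum(judge(t, goal) for t in traces) / len(traces)


print(rate(attacker_tool_called))    # 0.5
print(rate(attacker_goal_executed))  # 0.25
```

The lenient judge counts a legitimate rent payment as an attack success, doubling the reported rate on identical agent behavior.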

A protocol for agentic injection robustness

Use an environment (AgentDojo) for action-level attacks and a corpus (InjecAgent) for indirect-injection coverage; record versions.
Always report the pair: targeted attack success rate and benign utility under attack, at a fixed agent configuration.
Split InjecAgent-style results by harm type (direct harm vs data exfiltration); never average them.
Add an adaptive attack slice against your specific defense; report it separately from static.
Pin the target model, the agent scaffold (ReAct, tool schema, system prompt), and the judge — all of them move the numbers.
Run N≥5 times for non-deterministic agents and report spread.
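The pinning and repeated-run steps above could be recorded along these lines. This is a sketch under stated assumptions: `RunConfig`, its fields, the placeholder values, and the per-run numbers are all hypothetical, and the five-run minimum simply mirrors the N≥5 rule in the list.

```python
import hashlib
import json
import statistics
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Everything that moves the numbers (field names are illustrative)."""
    model: str
    scaffold: str
    system_prompt: str
    judge: str
    benchmark_version: str

    def fingerprint(self):
        """Stable short hash of the pinned configuration."""
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


def summarize(values):
    """Mean, sample standard deviation, and range over repeated runs."""
    if len(values) < 5:
        raise ValueError("protocol asks for at least 5 runs")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Placeholder values; substitute whatever you actually pinned.
config = RunConfig(
    model="example-model-snapshot",
    scaffold="react",
    system_prompt="You are a helpful assistant with tools.",
    judge="example-judge-v1",
    benchmark_version="example-benchmark-commit",
)

# Made-up per-run rates for one fixed configuration.
asr_runs = [0.10, 0.14, 0.12, 0.09, 0.15]
utility_runs = [0.61, 0.58, 0.63, 0.60, 0.59]

report = {
    "config": config.fingerprint(),
    "targeted_asr": summarize(asr_runs),
    "utility_under_attack": summarize(utility_runs),
}
print(json.dumps(report, indent=2))
```

Changing any pinned field changes the fingerprint, so two reports with different fingerprints are visibly not comparable.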

Where this fits in the network

aisecbench.com reports agentic prompt-injection robustness as the utility-under-attack / targeted-ASR pair, split by harm type, with adaptive attacks called out separately. This extends the content-level work in our prompt-injection detector benchmark design into the action layer, and shares the pinning discipline of our reproducible LLM scanner benchmarks. For the scanners and guardrails that defend the agent loop, see Best LLM scanners; for the injection mechanics themselves, aisec.blog.

What we don’t do

Report agent attack success rate without benign utility under attack
Treat a static benchmark number as the robustness ceiling
Average direct-harm and data-exfiltration injection results into one rate
Quote someone else’s AgentDojo run as if it were a result for your agent
Trust a near-zero attack rate without checking what it cost in utility

For agents, robustness is a coordinate, not a score. Measure utility and attack success against the same environment, split the harm types, and stress it with adaptive attacks — and you get a number that survives contact with a real attacker instead of a leaderboard rank that doesn’t.
