How to Test AI Agent Security: A Practical Evaluation Guide
Testing AI agent security requires a different approach than static LLM red-teaming. This guide covers the attack surface, test methodology, and the OWASP Agentic Top 10 framework practitioners use today.
Knowing how to test AI agent security has become a core competency for any team shipping autonomous systems into production. The challenge is that agent testing shares only superficial overlap with static LLM evaluation: an agent accumulates state, calls external tools, reads content from untrusted sources, and hands off instructions to downstream agents. Each of those interactions is an attack surface that a standard safety benchmark simply does not cover.
This guide walks through the threat model, a structured test methodology, and the frameworks currently used by practitioners doing this work.
Why Agent Security Testing Differs From LLM Red-Teaming
A static LLM responds to a prompt and stops. An agent acts. It may query a database, send an email, execute code, or spawn a sub-agent — all based on instructions that could include adversarially crafted content from outside the trust boundary.
The consequences of that distinction are significant. NIST red-team research published in early 2026 found that when testers developed attack techniques targeting agent-specific properties — rather than reusing standard jailbreak payloads — task-hijacking success rates jumped from 11% to 81% ↗. Standard evaluation frameworks, the researchers concluded, dramatically underestimate real-world exposure.
Three properties of agents widen the gap:
- Tool integration. Each tool an agent can call is a potential lateral movement path. An agent with read/write access to a calendar, email client, and code interpreter is a much richer target than the same underlying model accessed through a chat interface.
- External data ingestion. Agents often process content they retrieve — PDFs, web pages, emails, database records. Any of that content can contain indirect prompt injection payloads designed to redirect the agent’s behavior.
- Persistent memory. Agents that maintain state across sessions can be poisoned once and carry malicious context into future, seemingly unrelated tasks.
The OWASP Agentic Top 10 as a Test Checklist
OWASP’s Gen AI Security Project ↗ published the Top 10 for Agentic Applications in 2026, which provides the most actionable public checklist for agent-specific security evaluation. The ten categories map directly to test cases:
| ASI Code | Vulnerability | What to Test |
|---|---|---|
| ASI01 | Agent Goal Hijack | Can injected instructions in retrieved content override the system prompt? |
| ASI02 | Tool Misuse & Exploitation | Can the agent be induced to call tools with unsafe parameters or in recursive loops? |
| ASI03 | Agent Identity & Privilege Abuse | Can an agent be spoofed or impersonated to gain delegated authority? |
| ASI04 | Agentic Supply Chain Compromise | Can a malicious external agent or tool schema deceive the host agent? |
| ASI05 | Unexpected Code Execution | Does agent-generated code execute in a properly isolated sandbox? |
| ASI06 | Memory & Context Poisoning | Can adversarial content persist in memory and influence future agent sessions? |
| ASI07 | Insecure Inter-Agent Communication | Are messages between agents integrity-protected against interception and injection? |
| ASI08 | Cascading Agent Failures | Do errors propagate unchecked through multi-agent pipelines? |
| ASI09 | Human-Agent Trust Exploitation | Can the agent produce misleading justifications that cause unsafe human approval? |
| ASI10 | Rogue Agents | Does the agent remain within its intended objective boundary under adversarial pressure? |
This taxonomy should drive your test planning. For each category, the question is whether you have an active test case with a measurable pass/fail criterion — not just a design intent. For a deeper look at the attack mechanics behind ASI01, the aisec.blog coverage of indirect prompt injection in agent pipelines ↗ is worth reading alongside the OWASP documentation.
A Structured Testing Methodology
1. Map the full attack surface before writing tests
Document every trust boundary crossing: what external data sources the agent reads, which tools it can invoke, what memory stores it writes to, and which downstream agents or APIs it contacts. This produces the scope for your test plan. Undocumented tool integrations are a common source of test gaps.
2. Prioritize indirect prompt injection
Direct injection — attacking the system prompt or user turn — is well-understood and most agent frameworks have at least some resistance. Indirect injection, where adversarial instructions are embedded in content the agent retrieves (a PDF attachment, a web search result, a calendar event body), is harder to defend and frequently untested. Build test fixtures that embed injection payloads in every content type the agent ingests and verify the agent does not execute those payloads.
For coverage of detection-side tooling for this category, guardml.io’s guardrail catalog ↗ tracks available runtime filters that can complement test-time evaluation.
3. Apply NIST’s multi-attempt testing protocol
NIST’s research found that with 25 repeated attempts per attack task, average success rates climbed from 57% to 80%. A single failing test run does not confirm defense — LLM outputs are probabilistic. Your protocol should run each injection scenario across a statistically meaningful sample and report success rates, not just pass/fail.
The same research team identified four principles for agent security assessments:
- Continuous evaluation over time, not one-time scoring.
- Agent-specific attack development, not just reuse of existing jailbreak payloads.
- Task-level granularity — aggregate statistics hide variance across injection scenarios.
- Multi-attempt protocols to account for probabilistic behavior.
4. Test tool permissions under adversarial conditions
For each tool the agent can invoke, test whether an indirect injection payload can cause that tool to be called with attacker-controlled parameters. Pay particular attention to tools that write data, trigger external communications, or modify application state. A principle-of-least-privilege audit of tool scopes should precede this testing — tools the agent doesn’t need should be removed before you assess the remainder.
5. Test memory persistence across sessions
If the agent maintains memory across conversations, verify isolation between sessions and users. Test whether injecting a payload in session A causes observable behavior change in session B. This is ASI06 (Memory & Context Poisoning) in practice and is routinely missed in point-in-time evaluations.
6. Evaluate multi-agent trust chains
If your architecture includes orchestrator-worker agent patterns, test whether a compromised worker agent can influence the orchestrator’s future decisions — either through poisoned return values or by injecting content into shared memory stores. ASI07 (Insecure Inter-Agent Communication) failures are often discovered late because integration tests are written after unit tests and inter-agent paths are the last thing tested.
Tooling for Agent Security Testing
Several open-source and commercial tools support systematic agent evaluation:
- PyRIT (Microsoft) — configurable red-teaming orchestrator with agent-aware attack strategies.
- Garak — modular probe library with prompt injection and jailbreak coverage; useful for baseline ASI01 coverage.
- DeepTeam — specifically designed for agentic red-teaming against the OWASP Agentic Top 10.
None of these tools substitute for manual test case development against your specific agent architecture. Use them to automate repetitive execution of test scenarios you have already defined, not as a substitute for threat modeling.
Keeping Evaluations Current
Agent security testing is not a point-in-time certification. Both the attack techniques and the underlying models change. NIST’s guidance emphasizes that benchmarks require ongoing iteration — as new injection techniques emerge or the model is updated, prior test results may no longer reflect current exposure. Build agent security testing into CI/CD pipelines where feasible, and schedule manual red-team exercises at meaningful change intervals.
For tracking newly published AI vulnerabilities and agent exploitation patterns that should feed back into your test suite, ai-alert.org ↗ maintains a running disclosure feed.
Sources
-
OWASP Gen AI Security Project (https://genai.owasp.org/ ↗) — The primary OWASP initiative for LLM and agent security, including the Top 10 for Agentic Applications and the LLM Applications governance checklist.
-
NIST AI Agent Red-Teaming Standards — Cloud Security Alliance Research Note (https://labs.cloudsecurityalliance.org/research/csa-research-note-nist-ai-agent-red-teaming-standards-202603/ ↗) — Summarizes NIST supplementary guidance on agent-specific red-teaming, including the task-hijacking rate findings and multi-attempt testing protocol.
-
OWASP Top 10 for Agentic Applications 2026 — DeepTeam Documentation (https://www.trydeepteam.com/docs/frameworks-owasp-top-10-for-agentic-applications ↗) — Full enumeration of the ASI01–ASI10 vulnerability taxonomy with descriptions of each attack class.
Sources
AI Sec Bench — in your inbox
Benchmarks and evaluations of AI security tools. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
Best LLM Red Teaming Tools 2026: A Practitioner's Evaluation
A hands-on comparison of the leading LLM red teaming tools in 2026 — PyRIT, Garak, Promptfoo, and manual frameworks — with capability matrices, integration tradeoffs, and team-fit guidance.
Measuring Prompt-Injection Robustness in Tool-Using Agents
Prompt-injection robustness for an agent is not a single number — it is utility-under-attack against targeted attack success. Here's how AgentDojo and InjecAgent measure it and what the metrics actually mean.
How to Benchmark a Prompt-Injection Detector Honestly
Most prompt-injection detector benchmarks are broken before the first request. Here is a test design that produces a number you can actually trust.