How to Test AI Agent Security: A Practical Evaluation Guide

Knowing how to test AI agent security has become a core competency for any team shipping autonomous systems into production. The challenge is that agent testing shares only superficial overlap with static LLM evaluation: an agent accumulates state, calls external tools, reads content from untrusted sources, and hands off instructions to downstream agents. Each of those interactions is an attack surface that a standard safety benchmark simply does not cover.

This guide walks through the threat model, a structured test methodology, and the frameworks currently used by practitioners doing this work.

Why Agent Security Testing Differs From LLM Red-Teaming

A static LLM responds to a prompt and stops. An agent acts. It may query a database, send an email, execute code, or spawn a sub-agent — all based on instructions that could include adversarially crafted content from outside the trust boundary.

The consequences of that distinction are significant. NIST red-team research published in early 2026 found that when testers developed attack techniques targeting agent-specific properties — rather than reusing standard jailbreak payloads — task-hijacking success rates jumped from 11% to 81% ↗. Standard evaluation frameworks, the researchers concluded, dramatically underestimate real-world exposure.

Three properties of agents widen the gap:

Tool integration. Each tool an agent can call is a potential lateral movement path. An agent with read/write access to a calendar, email client, and code interpreter is a much richer target than the same underlying model accessed through a chat interface.
External data ingestion. Agents often process content they retrieve — PDFs, web pages, emails, database records. Any of that content can contain indirect prompt injection payloads designed to redirect the agent’s behavior.
Persistent memory. Agents that maintain state across sessions can be poisoned once and carry malicious context into future, seemingly unrelated tasks.

The OWASP Agentic Top 10 as a Test Checklist

OWASP’s Gen AI Security Project ↗ published the Top 10 for Agentic Applications in 2026, which provides the most actionable public checklist for agent-specific security evaluation. The ten categories map directly to test cases:

ASI Code	Vulnerability	What to Test
ASI01	Agent Goal Hijack	Can injected instructions in retrieved content override the system prompt?
ASI02	Tool Misuse & Exploitation	Can the agent be induced to call tools with unsafe parameters or in recursive loops?
ASI03	Agent Identity & Privilege Abuse	Can an agent be spoofed or impersonated to gain delegated authority?
ASI04	Agentic Supply Chain Compromise	Can a malicious external agent or tool schema deceive the host agent?
ASI05	Unexpected Code Execution	Does agent-generated code execute in a properly isolated sandbox?
ASI06	Memory & Context Poisoning	Can adversarial content persist in memory and influence future agent sessions?
ASI07	Insecure Inter-Agent Communication	Are messages between agents integrity-protected against interception and injection?
ASI08	Cascading Agent Failures	Do errors propagate unchecked through multi-agent pipelines?
ASI09	Human-Agent Trust Exploitation	Can the agent produce misleading justifications that cause unsafe human approval?
ASI10	Rogue Agents	Does the agent remain within its intended objective boundary under adversarial pressure?

This taxonomy should drive your test planning. For each category, the question is whether you have an active test case with a measurable pass/fail criterion — not just a design intent. For a deeper look at the attack mechanics behind ASI01, the aisec.blog coverage of indirect prompt injection in agent pipelines ↗ is worth reading alongside the OWASP documentation.

A Structured Testing Methodology

1. Map the full attack surface before writing tests

Document every trust boundary crossing: what external data sources the agent reads, which tools it can invoke, what memory stores it writes to, and which downstream agents or APIs it contacts. This produces the scope for your test plan. Undocumented tool integrations are a common source of test gaps.

2. Prioritize indirect prompt injection

Direct injection — attacking the system prompt or user turn — is well-understood and most agent frameworks have at least some resistance. Indirect injection, where adversarial instructions are embedded in content the agent retrieves (a PDF attachment, a web search result, a calendar event body), is harder to defend and frequently untested. Build test fixtures that embed injection payloads in every content type the agent ingests and verify the agent does not execute those payloads.

For coverage of detection-side tooling for this category, guardml.io’s guardrail catalog ↗ tracks available runtime filters that can complement test-time evaluation.

3. Apply NIST’s multi-attempt testing protocol

NIST’s research found that with 25 repeated attempts per attack task, average success rates climbed from 57% to 80%. A single failing test run does not confirm defense — LLM outputs are probabilistic. Your protocol should run each injection scenario across a statistically meaningful sample and report success rates, not just pass/fail.

The same research team identified four principles for agent security assessments:

Continuous evaluation over time, not one-time scoring.
Agent-specific attack development, not just reuse of existing jailbreak payloads.
Task-level granularity — aggregate statistics hide variance across injection scenarios.
Multi-attempt protocols to account for probabilistic behavior.

4. Test tool permissions under adversarial conditions

For each tool the agent can invoke, test whether an indirect injection payload can cause that tool to be called with attacker-controlled parameters. Pay particular attention to tools that write data, trigger external communications, or modify application state. A principle-of-least-privilege audit of tool scopes should precede this testing — tools the agent doesn’t need should be removed before you assess the remainder.

5. Test memory persistence across sessions

If the agent maintains memory across conversations, verify isolation between sessions and users. Test whether injecting a payload in session A causes observable behavior change in session B. This is ASI06 (Memory & Context Poisoning) in practice and is routinely missed in point-in-time evaluations.

6. Evaluate multi-agent trust chains

If your architecture includes orchestrator-worker agent patterns, test whether a compromised worker agent can influence the orchestrator’s future decisions — either through poisoned return values or by injecting content into shared memory stores. ASI07 (Insecure Inter-Agent Communication) failures are often discovered late because integration tests are written after unit tests and inter-agent paths are the last thing tested.

Tooling for Agent Security Testing

Several open-source and commercial tools support systematic agent evaluation:

PyRIT (Microsoft) — configurable red-teaming orchestrator with agent-aware attack strategies.
Garak — modular probe library with prompt injection and jailbreak coverage; useful for baseline ASI01 coverage.
DeepTeam — specifically designed for agentic red-teaming against the OWASP Agentic Top 10.

None of these tools substitute for manual test case development against your specific agent architecture. Use them to automate repetitive execution of test scenarios you have already defined, not as a substitute for threat modeling.

Keeping Evaluations Current

Agent security testing is not a point-in-time certification. Both the attack techniques and the underlying models change. NIST’s guidance emphasizes that benchmarks require ongoing iteration — as new injection techniques emerge or the model is updated, prior test results may no longer reflect current exposure. Build agent security testing into CI/CD pipelines where feasible, and schedule manual red-team exercises at meaningful change intervals.

For tracking newly published AI vulnerabilities and agent exploitation patterns that should feed back into your test suite, ai-alert.org ↗ maintains a running disclosure feed.

Sources

OWASP Gen AI Security Project (https://genai.owasp.org/ ↗) — The primary OWASP initiative for LLM and agent security, including the Top 10 for Agentic Applications and the LLM Applications governance checklist.
NIST AI Agent Red-Teaming Standards — Cloud Security Alliance Research Note (https://labs.cloudsecurityalliance.org/research/csa-research-note-nist-ai-agent-red-teaming-standards-202603/ ↗) — Summarizes NIST supplementary guidance on agent-specific red-teaming, including the task-hijacking rate findings and multi-attempt testing protocol.
OWASP Top 10 for Agentic Applications 2026 — DeepTeam Documentation (https://www.trydeepteam.com/docs/frameworks-owasp-top-10-for-agentic-applications ↗) — Full enumeration of the ASI01–ASI10 vulnerability taxonomy with descriptions of each attack class.

How to Test AI Agent Security: A Practical Evaluation Guide

Why Agent Security Testing Differs From LLM Red-Teaming

The OWASP Agentic Top 10 as a Test Checklist

A Structured Testing Methodology

1. Map the full attack surface before writing tests

2. Prioritize indirect prompt injection

3. Apply NIST’s multi-attempt testing protocol

4. Test tool permissions under adversarial conditions

5. Test memory persistence across sessions

6. Evaluate multi-agent trust chains

Tooling for Agent Security Testing

Keeping Evaluations Current

Sources

Sources

AI Sec Bench — in your inbox

Related

Best LLM Red Teaming Tools 2026: A Practitioner's Evaluation

Measuring Prompt-Injection Robustness in Tool-Using Agents

How to Benchmark a Prompt-Injection Detector Honestly

Comments