AI Red Teaming Guide
The Complete Playbook for Adversarial Testing of LLMs, Copilots & Autonomous Agents
Why AI Red Teaming Matters
AI red teaming is a structured, proactive practice where experts simulate adversarial attacks on AI systems to uncover vulnerabilities before they reach production. In 2026, autonomous tool-using agents act on behalf of users — converting "bad text output" into real-world actions like data exfiltration, lateral movement, and unauthorized transactions.
Regulatory drivers — the EU AI Act Article 15, the US Executive Order on AI — now formally require adversarial testing of high-risk AI systems.
| Traditional Cybersecurity | AI Red Teaming |
|---|---|
| Tests against known vulnerabilities | Discovers novel, emergent risks |
| Binary pass/fail outcomes | Probabilistic behaviors and edge cases |
| Static attack surface | Dynamic, context-dependent vulnerabilities |
| Code-level exploits | Natural language attacks via prompts |
| Deterministic systems | Non-deterministic AI behaviors |
Key Frameworks & Standards
Every credible AI red team program maps to at least one of these. Mature programs map to all of them.
NIST AI Risk Management Framework
Four core functions — GOVERN, MAP, MEASURE, MANAGE — for continuous testing across the AI lifecycle.
- •GOVERN — risk policies, roles, enterprise integration
- •MAP — identify capabilities, contexts, stakeholders
- •MEASURE — adversarial testing, metrics, Dioptra testbed
- •MANAGE — mitigation, monitoring, incident response
OWASP Top 10 for Agentic Applications (2026)
First risk ranking built specifically for autonomous, tool-using agents — ASI01 through ASI10.
- •Goal hijack, tool misuse, identity abuse
- •Supply chain, code execution, memory poisoning
- •Inter-agent comms, cascading failures
- •Human-trust exploitation, rogue agents
Agentegrity Framework
Open-source framework measuring the structural integrity of autonomous AI agents — endogenous, measurable security properties that complement exogenous guardrails.
- •Adversarial Resistance — robustness to prompt injection & jailbreaks
- •Behavioral Consistency — stable goals across contexts
- •Recovery Integrity — safe failure & rollback
- •Cross-Domain Portability — digital + physical agents
Microsoft Failure-Mode Taxonomy v2.0
Year-two update with 7 new agentic failure categories observed in real red-team engagements.
- •Agentic supply chain compromise
- •Goal hijacking & inter-agent trust escalation
- •Computer-use visual attacks
- •Session context contamination, MCP/plugin abuse, capability disclosure
OWASP ASI Top 10 (2026)
The first risk ranking built specifically for autonomous, tool-using agents. Every red-team finding in 2026 should map to one of these IDs.
Methodology — 5 Phases
A repeatable lifecycle, from rules-of-engagement to retest.
- 1
Planning & Threat Modeling
Define scope, adversaries, and impact. Map the AI attack surface before writing a single payload.
- ✓Define scope, system boundaries & assets
- ✓Identify adversary tiers (script kiddie → nation-state)
- ✓Enumerate intended use & abuse cases
- ✓Document rules of engagement and safety thresholds
- 2
Reconnaissance
Profile the target model, tools, memory, prompts, and connected systems.
- ✓Fingerprint base model & version
- ✓Map exposed tools, plugins, MCP servers
- ✓Enumerate memory layers and retrieval corpora
- ✓Identify orchestration patterns and sub-agents
- 3
Attack Execution
Run prompt injection, jailbreaks, tool abuse, memory poisoning, and agentic chains.
- ✓Direct & indirect prompt injection
- ✓Jailbreak prompts & policy evasion
- ✓Tool/plugin abuse and argument injection
- ✓Memory poisoning & cross-session bleed
- ✓Computer-use & visual attacks where applicable
- 4
Analysis & Reporting
Triage findings against ASI / ATLAS / NIST mappings; quantify blast radius and likelihood.
- ✓Tag each finding with ASI / ATLAS IDs
- ✓Score likelihood × impact
- ✓Capture reproducible payloads
- ✓Prepare executive + technical reports
- 5
Remediation & Retest
Drive fixes — guardrails, scoping, monitoring — then verify with the same payloads.
- ✓Prioritize fixes (input filters, scope, HITL)
- ✓Add detection telemetry for the attack class
- ✓Retest with original + mutated payloads
- ✓Add regression cases to continuous evals
Attack Techniques Catalog
Common adversarial techniques against LLMs, models, and autonomous agents — each paired with a recommended defense direction.
Direct Prompt Injection
LLMUser input overrides system prompt or safety policy.
Layered system prompts, output filtering, separate trust planes.
Indirect Prompt Injection
LLMHostile instructions hidden in retrieved docs, emails, web pages.
Content provenance, instruction stripping, untrusted-content delimiters.
Jailbreaking
LLMRole-play, encoding tricks, multi-turn pressure to bypass safety.
Robust refusal training, jailbreak classifiers, runtime moderation.
Model Extraction
ModelMass API queries reconstruct proprietary model behavior or weights.
Rate limits, watermarking, query-pattern anomaly detection.
Data Poisoning
ModelCorrupted training or fine-tuning data implants backdoors.
Data lineage, anomaly scans, signed datasets, eval-set hold-outs.
Adversarial Examples
ModelImperceptible perturbations flip classifier output.
Adversarial training, input randomization, ensemble defenses.
Tool / Plugin Abuse
AgenticAgent coerced into invoking tools with attacker-controlled args.
Per-tool scoping, argument schemas, capability tokens.
Memory Poisoning
AgenticAdversarial memories persist and bias future sessions.
Provenance tags, TTLs, isolation per principal, review queues.
Computer-Use / Visual Attacks
AgenticOn-screen injection of agents that see and click.
OCR sanitization, visual-prompt classifiers, click-confirmation policy.
MCP / Plugin Abuse
AgenticMalicious or compromised MCP servers and tool protocols.
Signed manifests, pinned versions, registry allow-list, sandboxing.
Inter-Agent Trust Escalation
AgenticLow-privilege agent leverages a higher-privilege peer.
mTLS, signed envelopes, scoped capability passing, audit trails.
Consent-Fatigue HITL Bypass
AgenticStream of trivial approvals trains humans to click through.
Risk-tiered approvals, batching, anomaly highlighting in UI.
7 New Agentic Failure Modes
Added in Microsoft Failure-Mode Taxonomy v2.0 (June 2026), after a year of real red-team engagements.
Agentic supply chain compromise
Malicious tools/plugins/sub-agents enter the pipeline (ASI04).
Goal hijacking
Untrusted content redirects the agent's objective (ASI01).
Inter-agent trust escalation
Low-privilege agent leverages a higher-privilege one (ASI07).
Computer-use visual attacks
On-screen / visual injection of agents that see and click.
Session context contamination
Cross-turn / cross-session state bleed.
MCP & plugin abuse
The tool protocol layer treated as a first-class attack surface.
Capability / architecture disclosure
Agents leaking their tools, prompts, or topology to an attacker.
Real-World Case Studies
Documented AI security incidents that have shaped today's red-team playbooks.
OpenClaw Agent Framework — 512 vulnerabilities
Shipped with a one-click RCE (CVE-2026-25253). Within a week, 1,800+ instances exposed API keys and 336 malicious plugins reached its marketplace.
Anthropic disrupts state-sponsored Claude Code op
First documented large-scale cyberattack predominantly executed by an AI agent — Claude Code autonomously handled ~80–90% of tactical execution across ~30 global targets.
GitHub Copilot RCE — CVE-2025-53773 (CVSS 9.6)
Prompt injection wrote to the agent's configuration files, achieving remote code execution.
AI-Browser Prompt Injection (Comet, Gemini for Chrome)
Research demonstrated indirect prompt injection against AI-enabled browsers and coding assistants (GitLab Duo, Copilot Chat).
Samsung ChatGPT Data Leak
Confidential source code and meeting notes leaked via consumer ChatGPT — instructive early example of shadow-AI usage.
Red Team Checklist
Working checklist across the five engagement phases. State is local to your browser.
Pre-Engagement
Reconnaissance
Execution
Reporting
Retest
Tools Catalog
Open-source tooling used by AI red teams in production today.
Microsoft PyRIT
Python Risk Identification Tool for generative AI — automates red-teaming prompts and scoring.
NIST Dioptra
Open-source AI security testbed for adversarial ML experimentation.
Garak
LLM vulnerability scanner — prompt injection, jailbreaks, toxicity, leakage probes.
Promptfoo
Eval & red-team harness for LLM apps; CI-friendly assertions and adversarial test packs.
Giskard
Testing framework for ML/LLM apps — bias, robustness, hallucination scans.
Adversarial Robustness Toolbox
IBM-led library for evasion, poisoning, extraction, and inference attacks/defenses.
Counterfit
Microsoft CLI for assessing the security of ML systems via automated attack tooling.
Resources & References
Source frameworks, standards, and the open-source guide this page is built on.
NIST AI Risk Management Framework
https://www.nist.gov/itl/ai-risk-management-framework
OWASP GenAI Security Project
https://genai.owasp.org/
OWASP Top 10 for Agentic Applications (2026)
https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/
MITRE ATLAS
https://atlas.mitre.org/
Cloud Security Alliance — Agentic AI Red Teaming
https://cloudsecurityalliance.org/
Microsoft — Updating the Taxonomy of Failure Modes in Agentic AI (2026)
https://www.microsoft.com/en-us/security/blog/2026/06/04/updating-taxonomy-failure-modes-agentic-ai-systems-year-red-teaming-taught-us/
Source: requie/AI-Red-Teaming-Guide (MIT)
https://github.com/requie/AI-Red-Teaming-Guide
Content adapted from the open-source requie/AI-Red-Teaming-Guide (MIT License).