AI Red Teaming Guide

The Complete Playbook for Adversarial Testing of LLMs, Copilots & Autonomous Agents

NIST AI RMFOWASP GenAIOWASP Agentic 2026MITRE ATLASCSAMIT License
June 2026 — aligned with Microsoft Failure-Mode Taxonomy v2.0 & OWASP ASI Top 10
121+
GitHub Stars
Core Frameworks
OWASP ASI Risks
New Failure Modes (MS v2.0)

Why AI Red Teaming Matters

AI red teaming is a structured, proactive practice where experts simulate adversarial attacks on AI systems to uncover vulnerabilities before they reach production. In 2026, autonomous tool-using agents act on behalf of users — converting "bad text output" into real-world actions like data exfiltration, lateral movement, and unauthorized transactions.

Regulatory drivers — the EU AI Act Article 15, the US Executive Order on AI — now formally require adversarial testing of high-risk AI systems.

Traditional CybersecurityAI Red Teaming
Tests against known vulnerabilitiesDiscovers novel, emergent risks
Binary pass/fail outcomesProbabilistic behaviors and edge cases
Static attack surfaceDynamic, context-dependent vulnerabilities
Code-level exploitsNatural language attacks via prompts
Deterministic systemsNon-deterministic AI behaviors

Key Frameworks & Standards

Every credible AI red team program maps to at least one of these. Mature programs map to all of them.

NIST

NIST AI Risk Management Framework

Four core functions — GOVERN, MAP, MEASURE, MANAGE — for continuous testing across the AI lifecycle.

  • GOVERN — risk policies, roles, enterprise integration
  • MAP — identify capabilities, contexts, stakeholders
  • MEASURE — adversarial testing, metrics, Dioptra testbed
  • MANAGE — mitigation, monitoring, incident response
OWASP

OWASP GenAI Red Teaming Guide

Practical approach for evaluating LLM and GenAI vulnerabilities — model, prompt, system, and agentic layers.

  • Quick-start for newcomers
  • Threat modeling per use case
  • Blueprint of test categories
  • Continuous monitoring guidance
OWASP

OWASP Top 10 for Agentic Applications (2026)

First risk ranking built specifically for autonomous, tool-using agents — ASI01 through ASI10.

  • Goal hijack, tool misuse, identity abuse
  • Supply chain, code execution, memory poisoning
  • Inter-agent comms, cascading failures
  • Human-trust exploitation, rogue agents
Cogensec

Agentegrity Framework

Open-source framework measuring the structural integrity of autonomous AI agents — endogenous, measurable security properties that complement exogenous guardrails.

  • Adversarial Resistance — robustness to prompt injection & jailbreaks
  • Behavioral Consistency — stable goals across contexts
  • Recovery Integrity — safe failure & rollback
  • Cross-Domain Portability — digital + physical agents
MITRE

MITRE ATLAS

Knowledge base of adversarial AI tactics and techniques — the ATT&CK equivalent for AI systems.

  • Reconnaissance → Initial Access → Persistence
  • ML Model Access, Defense Evasion
  • ML Attack Staging, Exfiltration, Impact
  • Real-world case studies of model evasion & inversion
Cloud Security Alliance

CSA Agentic AI Red Teaming

Tests permission escalation, hallucination exploitation, orchestration flaws, memory tampering, and supply chain risks.

  • Isolated model behaviors
  • Full agent workflows
  • Inter-agent dependencies
  • Blast-radius assessment
Microsoft (June 2026)

Microsoft Failure-Mode Taxonomy v2.0

Year-two update with 7 new agentic failure categories observed in real red-team engagements.

  • Agentic supply chain compromise
  • Goal hijacking & inter-agent trust escalation
  • Computer-use visual attacks
  • Session context contamination, MCP/plugin abuse, capability disclosure

OWASP ASI Top 10 (2026)

The first risk ranking built specifically for autonomous, tool-using agents. Every red-team finding in 2026 should map to one of these IDs.

Methodology — 5 Phases

A repeatable lifecycle, from rules-of-engagement to retest.

  1. 1

    Planning & Threat Modeling

    Define scope, adversaries, and impact. Map the AI attack surface before writing a single payload.

    • Define scope, system boundaries & assets
    • Identify adversary tiers (script kiddie → nation-state)
    • Enumerate intended use & abuse cases
    • Document rules of engagement and safety thresholds
  2. 2

    Reconnaissance

    Profile the target model, tools, memory, prompts, and connected systems.

    • Fingerprint base model & version
    • Map exposed tools, plugins, MCP servers
    • Enumerate memory layers and retrieval corpora
    • Identify orchestration patterns and sub-agents
  3. 3

    Attack Execution

    Run prompt injection, jailbreaks, tool abuse, memory poisoning, and agentic chains.

    • Direct & indirect prompt injection
    • Jailbreak prompts & policy evasion
    • Tool/plugin abuse and argument injection
    • Memory poisoning & cross-session bleed
    • Computer-use & visual attacks where applicable
  4. 4

    Analysis & Reporting

    Triage findings against ASI / ATLAS / NIST mappings; quantify blast radius and likelihood.

    • Tag each finding with ASI / ATLAS IDs
    • Score likelihood × impact
    • Capture reproducible payloads
    • Prepare executive + technical reports
  5. 5

    Remediation & Retest

    Drive fixes — guardrails, scoping, monitoring — then verify with the same payloads.

    • Prioritize fixes (input filters, scope, HITL)
    • Add detection telemetry for the attack class
    • Retest with original + mutated payloads
    • Add regression cases to continuous evals

Attack Techniques Catalog

Common adversarial techniques against LLMs, models, and autonomous agents — each paired with a recommended defense direction.

Direct Prompt Injection

LLM

User input overrides system prompt or safety policy.

Defense

Layered system prompts, output filtering, separate trust planes.

Indirect Prompt Injection

LLM

Hostile instructions hidden in retrieved docs, emails, web pages.

Defense

Content provenance, instruction stripping, untrusted-content delimiters.

Jailbreaking

LLM

Role-play, encoding tricks, multi-turn pressure to bypass safety.

Defense

Robust refusal training, jailbreak classifiers, runtime moderation.

Model Extraction

Model

Mass API queries reconstruct proprietary model behavior or weights.

Defense

Rate limits, watermarking, query-pattern anomaly detection.

Data Poisoning

Model

Corrupted training or fine-tuning data implants backdoors.

Defense

Data lineage, anomaly scans, signed datasets, eval-set hold-outs.

Adversarial Examples

Model

Imperceptible perturbations flip classifier output.

Defense

Adversarial training, input randomization, ensemble defenses.

Tool / Plugin Abuse

Agentic

Agent coerced into invoking tools with attacker-controlled args.

Defense

Per-tool scoping, argument schemas, capability tokens.

Memory Poisoning

Agentic

Adversarial memories persist and bias future sessions.

Defense

Provenance tags, TTLs, isolation per principal, review queues.

Computer-Use / Visual Attacks

Agentic

On-screen injection of agents that see and click.

Defense

OCR sanitization, visual-prompt classifiers, click-confirmation policy.

MCP / Plugin Abuse

Agentic

Malicious or compromised MCP servers and tool protocols.

Defense

Signed manifests, pinned versions, registry allow-list, sandboxing.

Inter-Agent Trust Escalation

Agentic

Low-privilege agent leverages a higher-privilege peer.

Defense

mTLS, signed envelopes, scoped capability passing, audit trails.

Consent-Fatigue HITL Bypass

Agentic

Stream of trivial approvals trains humans to click through.

Defense

Risk-tiered approvals, batching, anomaly highlighting in UI.

7 New Agentic Failure Modes

Added in Microsoft Failure-Mode Taxonomy v2.0 (June 2026), after a year of real red-team engagements.

#1

Agentic supply chain compromise

Malicious tools/plugins/sub-agents enter the pipeline (ASI04).

#2

Goal hijacking

Untrusted content redirects the agent's objective (ASI01).

#3

Inter-agent trust escalation

Low-privilege agent leverages a higher-privilege one (ASI07).

#4

Computer-use visual attacks

On-screen / visual injection of agents that see and click.

#5

Session context contamination

Cross-turn / cross-session state bleed.

#6

MCP & plugin abuse

The tool protocol layer treated as a first-class attack surface.

#7

Capability / architecture disclosure

Agents leaking their tools, prompts, or topology to an attacker.

Real-World Case Studies

Documented AI security incidents that have shaped today's red-team playbooks.

Jan 2026Supply chain & RCE at framework layer

OpenClaw Agent Framework — 512 vulnerabilities

Shipped with a one-click RCE (CVE-2026-25253). Within a week, 1,800+ instances exposed API keys and 336 malicious plugins reached its marketplace.

ASI04ASI05Supply Chain
Sep 2025Nation-state agentic operation

Anthropic disrupts state-sponsored Claude Code op

First documented large-scale cyberattack predominantly executed by an AI agent — Claude Code autonomously handled ~80–90% of tactical execution across ~30 global targets.

ASI10Autonomous Attack
Aug 2025Developer tool compromise

GitHub Copilot RCE — CVE-2025-53773 (CVSS 9.6)

Prompt injection wrote to the agent's configuration files, achieving remote code execution.

ASI02Prompt Injection
2025Browser-level data exfiltration

AI-Browser Prompt Injection (Comet, Gemini for Chrome)

Research demonstrated indirect prompt injection against AI-enabled browsers and coding assistants (GitLab Duo, Copilot Chat).

Indirect InjectionBrowser Agents
2023IP exfiltration via shadow AI

Samsung ChatGPT Data Leak

Confidential source code and meeting notes leaked via consumer ChatGPT — instructive early example of shadow-AI usage.

Shadow AIData Leak

Red Team Checklist

Working checklist across the five engagement phases. State is local to your browser.

Pre-Engagement

Reconnaissance

Execution

Reporting

Retest

Resources & References

Source frameworks, standards, and the open-source guide this page is built on.

NIST AI Risk Management Framework

https://www.nist.gov/itl/ai-risk-management-framework

Visit

OWASP GenAI Security Project

https://genai.owasp.org/

Visit

OWASP Top 10 for Agentic Applications (2026)

https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

Visit

MITRE ATLAS

https://atlas.mitre.org/

Visit

Cloud Security Alliance — Agentic AI Red Teaming

https://cloudsecurityalliance.org/

Visit

Microsoft — Updating the Taxonomy of Failure Modes in Agentic AI (2026)

https://www.microsoft.com/en-us/security/blog/2026/06/04/updating-taxonomy-failure-modes-agentic-ai-systems-year-red-teaming-taught-us/

Visit

Source: requie/AI-Red-Teaming-Guide (MIT)

https://github.com/requie/AI-Red-Teaming-Guide

Visit

Content adapted from the open-source requie/AI-Red-Teaming-Guide (MIT License).