Cogensec — Frontier AI Security Lab

Cogensec

183 Stars on GitHub

AI Red Teaming Guide

The Complete Playbook for Adversarial Testing of LLMs, Copilots & Autonomous Agents

NIST AI RMFOWASP GenAIOWASP Agentic 2026MITRE ATLASCSAMIT License

June 2026 — aligned with Microsoft Failure-Mode Taxonomy v2.0 & OWASP ASI Top 10

121+

GitHub Stars

Core Frameworks

OWASP ASI Risks

New Failure Modes (MS v2.0)

Why AI Red Teaming Matters

AI red teaming is a structured, proactive practice where experts simulate adversarial attacks on AI systems to uncover vulnerabilities before they reach production. In 2026, autonomous tool-using agents act on behalf of users — converting "bad text output" into real-world actions like data exfiltration, lateral movement, and unauthorized transactions.

Regulatory drivers — the EU AI Act Article 15, the US Executive Order on AI — now formally require adversarial testing of high-risk AI systems.

Traditional Cybersecurity	AI Red Teaming
Tests against known vulnerabilities	Discovers novel, emergent risks
Binary pass/fail outcomes	Probabilistic behaviors and edge cases
Static attack surface	Dynamic, context-dependent vulnerabilities
Code-level exploits	Natural language attacks via prompts
Deterministic systems	Non-deterministic AI behaviors

Key Frameworks & Standards

Every credible AI red team program maps to at least one of these. Mature programs map to all of them.

NIST

NIST AI Risk Management Framework

Four core functions — GOVERN, MAP, MEASURE, MANAGE — for continuous testing across the AI lifecycle.

•GOVERN — risk policies, roles, enterprise integration
•MAP — identify capabilities, contexts, stakeholders
•MEASURE — adversarial testing, metrics, Dioptra testbed
•MANAGE — mitigation, monitoring, incident response

OWASP

OWASP GenAI Red Teaming Guide

Practical approach for evaluating LLM and GenAI vulnerabilities — model, prompt, system, and agentic layers.

•Quick-start for newcomers
•Threat modeling per use case
•Blueprint of test categories
•Continuous monitoring guidance

OWASP

OWASP Top 10 for Agentic Applications (2026)

First risk ranking built specifically for autonomous, tool-using agents — ASI01 through ASI10.

•Goal hijack, tool misuse, identity abuse
•Supply chain, code execution, memory poisoning
•Inter-agent comms, cascading failures
•Human-trust exploitation, rogue agents

Cogensec

Agentegrity Framework

Open-source framework measuring the structural integrity of autonomous AI agents — endogenous, measurable security properties that complement exogenous guardrails.

•Adversarial Resistance — robustness to prompt injection & jailbreaks
•Behavioral Consistency — stable goals across contexts
•Recovery Integrity — safe failure & rollback
•Cross-Domain Portability — digital + physical agents

MITRE

MITRE ATLAS

Knowledge base of adversarial AI tactics and techniques — the ATT&CK equivalent for AI systems.

•Reconnaissance → Initial Access → Persistence
•ML Model Access, Defense Evasion
•ML Attack Staging, Exfiltration, Impact
•Real-world case studies of model evasion & inversion

Cloud Security Alliance

CSA Agentic AI Red Teaming

Tests permission escalation, hallucination exploitation, orchestration flaws, memory tampering, and supply chain risks.

•Isolated model behaviors
•Full agent workflows
•Inter-agent dependencies
•Blast-radius assessment

Microsoft (June 2026)

Microsoft Failure-Mode Taxonomy v2.0

Year-two update with 7 new agentic failure categories observed in real red-team engagements.

•Agentic supply chain compromise
•Goal hijacking & inter-agent trust escalation
•Computer-use visual attacks
•Session context contamination, MCP/plugin abuse, capability disclosure

OWASP ASI Top 10 (2026)

The first risk ranking built specifically for autonomous, tool-using agents. Every red-team finding in 2026 should map to one of these IDs.

Methodology — 5 Phases

A repeatable lifecycle, from rules-of-engagement to retest.

1
Planning & Threat Modeling
Define scope, adversaries, and impact. Map the AI attack surface before writing a single payload.
- ✓Define scope, system boundaries & assets
- ✓Identify adversary tiers (script kiddie → nation-state)
- ✓Enumerate intended use & abuse cases
- ✓Document rules of engagement and safety thresholds
2
Reconnaissance
Profile the target model, tools, memory, prompts, and connected systems.
- ✓Fingerprint base model & version
- ✓Map exposed tools, plugins, MCP servers
- ✓Enumerate memory layers and retrieval corpora
- ✓Identify orchestration patterns and sub-agents
3
Attack Execution
Run prompt injection, jailbreaks, tool abuse, memory poisoning, and agentic chains.
- ✓Direct & indirect prompt injection
- ✓Jailbreak prompts & policy evasion
- ✓Tool/plugin abuse and argument injection
- ✓Memory poisoning & cross-session bleed
- ✓Computer-use & visual attacks where applicable
4
Analysis & Reporting
Triage findings against ASI / ATLAS / NIST mappings; quantify blast radius and likelihood.
- ✓Tag each finding with ASI / ATLAS IDs
- ✓Score likelihood × impact
- ✓Capture reproducible payloads
- ✓Prepare executive + technical reports
5
Remediation & Retest
Drive fixes — guardrails, scoping, monitoring — then verify with the same payloads.
- ✓Prioritize fixes (input filters, scope, HITL)
- ✓Add detection telemetry for the attack class
- ✓Retest with original + mutated payloads
- ✓Add regression cases to continuous evals

Attack Techniques Catalog

Common adversarial techniques against LLMs, models, and autonomous agents — each paired with a recommended defense direction.

Direct Prompt Injection

LLM

User input overrides system prompt or safety policy.

Defense

Layered system prompts, output filtering, separate trust planes.

Indirect Prompt Injection

LLM

Hostile instructions hidden in retrieved docs, emails, web pages.

Defense

Content provenance, instruction stripping, untrusted-content delimiters.

Jailbreaking

LLM

Role-play, encoding tricks, multi-turn pressure to bypass safety.

Defense

Robust refusal training, jailbreak classifiers, runtime moderation.

Model Extraction

Model

Mass API queries reconstruct proprietary model behavior or weights.

Defense

Rate limits, watermarking, query-pattern anomaly detection.

Data Poisoning

Model

Corrupted training or fine-tuning data implants backdoors.

Defense

Data lineage, anomaly scans, signed datasets, eval-set hold-outs.

Adversarial Examples

Model

Imperceptible perturbations flip classifier output.

Defense

Adversarial training, input randomization, ensemble defenses.

Tool / Plugin Abuse

Agentic

Agent coerced into invoking tools with attacker-controlled args.

Defense

Per-tool scoping, argument schemas, capability tokens.

Memory Poisoning

Agentic

Adversarial memories persist and bias future sessions.

Defense

Provenance tags, TTLs, isolation per principal, review queues.

Computer-Use / Visual Attacks

Agentic

On-screen injection of agents that see and click.

Defense

OCR sanitization, visual-prompt classifiers, click-confirmation policy.

MCP / Plugin Abuse

Agentic

Malicious or compromised MCP servers and tool protocols.

Defense

Signed manifests, pinned versions, registry allow-list, sandboxing.

Inter-Agent Trust Escalation

Agentic

Low-privilege agent leverages a higher-privilege peer.

Defense

mTLS, signed envelopes, scoped capability passing, audit trails.

Consent-Fatigue HITL Bypass

Agentic

Stream of trivial approvals trains humans to click through.

Defense

Risk-tiered approvals, batching, anomaly highlighting in UI.

7 New Agentic Failure Modes

Added in Microsoft Failure-Mode Taxonomy v2.0 (June 2026), after a year of real red-team engagements.

#1

Agentic supply chain compromise

Malicious tools/plugins/sub-agents enter the pipeline (ASI04).

#2

Goal hijacking

Untrusted content redirects the agent's objective (ASI01).

#3

Inter-agent trust escalation

Low-privilege agent leverages a higher-privilege one (ASI07).

#4

Computer-use visual attacks

On-screen / visual injection of agents that see and click.

#5

Session context contamination

Cross-turn / cross-session state bleed.

#6

MCP & plugin abuse

The tool protocol layer treated as a first-class attack surface.

#7

Capability / architecture disclosure

Agents leaking their tools, prompts, or topology to an attacker.

Real-World Case Studies

Documented AI security incidents that have shaped today's red-team playbooks.

Jan 2026Supply chain & RCE at framework layer

OpenClaw Agent Framework — 512 vulnerabilities

Shipped with a one-click RCE (CVE-2026-25253). Within a week, 1,800+ instances exposed API keys and 336 malicious plugins reached its marketplace.

ASI04ASI05Supply Chain

Sep 2025Nation-state agentic operation

Anthropic disrupts state-sponsored Claude Code op

First documented large-scale cyberattack predominantly executed by an AI agent — Claude Code autonomously handled ~80–90% of tactical execution across ~30 global targets.

ASI10Autonomous Attack

Aug 2025Developer tool compromise

GitHub Copilot RCE — CVE-2025-53773 (CVSS 9.6)

Prompt injection wrote to the agent's configuration files, achieving remote code execution.

ASI02Prompt Injection

2025Browser-level data exfiltration

AI-Browser Prompt Injection (Comet, Gemini for Chrome)

Research demonstrated indirect prompt injection against AI-enabled browsers and coding assistants (GitLab Duo, Copilot Chat).

Indirect InjectionBrowser Agents

2023IP exfiltration via shadow AI

Samsung ChatGPT Data Leak

Confidential source code and meeting notes leaked via consumer ChatGPT — instructive early example of shadow-AI usage.

Shadow AIData Leak

Red Team Checklist

Working checklist across the five engagement phases. State is local to your browser.

Pre-Engagement

Reconnaissance

Execution

Reporting

Retest

Tools Catalog

Open-source tooling used by AI red teams in production today.

Resources & References

Source frameworks, standards, and the open-source guide this page is built on.

NIST AI Risk Management Framework

https://www.nist.gov/itl/ai-risk-management-framework

Visit

OWASP GenAI Security Project

https://genai.owasp.org/

Visit

OWASP Top 10 for Agentic Applications (2026)

https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

Visit

MITRE ATLAS

https://atlas.mitre.org/

Visit

Cloud Security Alliance — Agentic AI Red Teaming

https://cloudsecurityalliance.org/

Visit

Microsoft — Updating the Taxonomy of Failure Modes in Agentic AI (2026)

https://www.microsoft.com/en-us/security/blog/2026/06/04/updating-taxonomy-failure-modes-agentic-ai-systems-year-red-teaming-taught-us/

Visit

Source: requie/AI-Red-Teaming-Guide (MIT)

https://github.com/requie/AI-Red-Teaming-Guide

Visit

Content adapted from the open-source requie/AI-Red-Teaming-Guide (MIT License).

AI Red Teaming Guide

Why AI Red Teaming Matters

Key Frameworks & Standards

NIST AI Risk Management Framework

OWASP GenAI Red Teaming Guide

OWASP Top 10 for Agentic Applications (2026)

Agentegrity Framework

MITRE ATLAS

CSA Agentic AI Red Teaming

Microsoft Failure-Mode Taxonomy v2.0

OWASP ASI Top 10 (2026)

ASI01Agent Goal Hijack

ASI02Tool Misuse & Exploitation

ASI03Agent Identity & Privilege Abuse

ASI04Agentic Supply Chain Compromise

ASI05Unexpected Code Execution

ASI06Memory & Context Poisoning

ASI07Insecure Inter-Agent Communication

ASI08Cascading Agent Failures

ASI09Human-Agent Trust Exploitation

ASI10Rogue Agents

Methodology — 5 Phases

Planning & Threat Modeling

Reconnaissance

Attack Execution

Analysis & Reporting

Remediation & Retest

Attack Techniques Catalog

Direct Prompt Injection

Indirect Prompt Injection

Jailbreaking

Model Extraction

Data Poisoning

Adversarial Examples

Tool / Plugin Abuse

Memory Poisoning

Computer-Use / Visual Attacks

MCP / Plugin Abuse

Inter-Agent Trust Escalation

Consent-Fatigue HITL Bypass

7 New Agentic Failure Modes

Agentic supply chain compromise

Goal hijacking

Inter-agent trust escalation

Computer-use visual attacks

Session context contamination

MCP & plugin abuse

Capability / architecture disclosure

Real-World Case Studies

OpenClaw Agent Framework — 512 vulnerabilities

Anthropic disrupts state-sponsored Claude Code op

GitHub Copilot RCE — CVE-2025-53773 (CVSS 9.6)

AI-Browser Prompt Injection (Comet, Gemini for Chrome)

Samsung ChatGPT Data Leak

Red Team Checklist

Pre-Engagement

Reconnaissance

Execution

Reporting

Retest

Tools Catalog

Microsoft PyRIT

NIST Dioptra

Garak

Promptfoo

Giskard

Adversarial Robustness Toolbox

Counterfit

Resources & References

NIST AI Risk Management Framework

OWASP GenAI Security Project

OWASP Top 10 for Agentic Applications (2026)

MITRE ATLAS

Cloud Security Alliance — Agentic AI Red Teaming

Microsoft — Updating the Taxonomy of Failure Modes in Agentic AI (2026)

Source: requie/AI-Red-Teaming-Guide (MIT)