Agentegrity: A Framework for Measuring Structural Integrity of Autonomous AI Agents Across Digital and Physical Domains
Abstract
The rapid deployment of autonomous AI agents — systems that perceive, reason, and act with minimal human oversight — has created a security challenge that existing frameworks are ill-equipped to address. Current approaches to AI agent security rely predominantly on exogenous controls: guardrails, input-output filters, and inference-time policy layers applied from outside the agent's decision architecture. These approaches do not scale to agents operating across heterogeneous environments, do not persist when external controls are bypassed, and are fundamentally inapplicable to the emerging class of physical AI agents (robotic systems, autonomous vehicles, industrial automation) where the threat model extends beyond data manipulation into real-world harm.
We introduce agentegrity — the structural integrity of an autonomous AI agent, defined as its measurable capacity to maintain intended behavior, decision coherence, and operational safety under adversarial conditions, across any environment it operates in. We present the Agentegrity Framework, a four-dimensional scoring methodology that quantifies agent security as an intrinsic architectural property rather than an exogenous constraint. The four dimensions — Adversarial Resistance, Behavioral Consistency, Recovery Integrity, and Cross-Domain Portability — provide a standardized basis for assessment, comparison, and certification of AI agent security across both digital and physical domains.
We formalize the distinction between exogenous security (guardrails) and intrinsic security (agentegrity), demonstrate why this distinction is architecturally consequential for autonomous agents, and present a measurement methodology with defined metrics, scoring functions, and minimum coverage requirements. We further introduce the Physical AI Addendum, addressing safety-critical considerations unique to embodied agents including safety envelope enforcement, sim-to-real correlation, and fail-safe reliability. To our knowledge, this is the first formal framework for measuring AI agent security that spans both digital and physical operating domains within a unified methodology.
Keywords: AI agent security, agentegrity, adversarial testing, autonomous agents, physical AI security, red teaming, behavioral integrity, robotic security
1. Introduction
The security landscape for artificial intelligence has entered a qualitative transition. For the past several years, AI security research has focused primarily on model-level concerns: adversarial examples against classifiers [1], jailbreaking of large language models (LLMs) [2], prompt injection attacks [3], and training-time data poisoning [4]. These concerns remain valid, but they address a fundamentally different system architecture than what is now being deployed at scale.
The current wave of AI deployment is defined by autonomous agents — systems that do not merely respond to queries but perceive their environment, reason over observations, formulate plans, invoke tools, retain memory across sessions, and take actions with real-world consequences [5, 6]. These agents operate with increasing autonomy, often making multi-step decisions without human review. They manage enterprise workflows, coordinate with other agents in multi-agent systems, and — critically — are beginning to inhabit physical systems: robotic platforms, autonomous vehicles, industrial machinery, and smart infrastructure [7, 8].
This transition from model to agent, and from digital to physical, exposes a structural gap in how the security community thinks about AI defense. The dominant paradigm — which we term exogenous security — treats agent security as an external constraint applied at the input-output boundary. Guardrails filter inputs. Output monitors flag policy violations. Inference-time classifiers detect harmful content. These approaches share a common architectural assumption: security is something applied to the agent, not something embedded within it.
We argue this assumption is fundamentally insufficient for three reasons:
First, exogenous security is environment-dependent. A guardrail designed for a specific tool set, API environment, or deployment context must be rebuilt when the agent moves to a new environment. As agents increasingly operate across heterogeneous platforms — cloud, edge, on-premises, and physical — the cost of maintaining environment-specific security layers becomes prohibitive.
Second, exogenous security has no residual defense. When a guardrail is bypassed — through novel prompt injection, tool output manipulation, or environmental adversarial inputs — the agent has no internal capacity to detect or resist compromise. The agent's own decision architecture contains no security intelligence. It is structurally equivalent whether or not guardrails are present.
Third, exogenous security does not extend to physical domains. The emerging category of physical AI agents — robotic systems, autonomous vehicles, drones, industrial automation — introduces threat models (sensor spoofing, actuation hijacking, sim-to-real transfer attacks) for which input-output filtering is architecturally irrelevant. Physical AI security requires defense embedded in the agent's perception-reasoning-action loop, not applied at its boundary.
In this paper, we introduce agentegrity: the structural integrity of an autonomous AI agent, defined as its measurable capacity to maintain intended behavior, decision coherence, and operational safety under adversarial conditions, across any environment it operates in.
1.1 Contributions
- We formalize the distinction between exogenous security (guardrails, filters, policy wrappers) and intrinsic security (security embedded within the agent's decision architecture), and argue that intrinsic security is necessary for autonomous agents operating across heterogeneous environments.
- We introduce the Agentegrity Framework, a four-dimensional scoring methodology (Adversarial Resistance, Behavioral Consistency, Recovery Integrity, Cross-Domain Portability) with defined metrics, scoring functions, and minimum coverage requirements.
- We present the first unified measurement methodology that spans both digital and physical AI agent security within a single framework, including a Physical AI Addendum addressing safety-critical considerations unique to embodied agents.
- We introduce formal definitions for previously unnamed threat concepts including behavioral drift rate, cascade compromise, prompt-to-physical exploit, recovery half-life, and cortical embedding depth.
- We provide a Multi-Agent Extension for assessing system-level agentegrity including cascade resistance and trust boundary integrity.
2. Background and Related Work
2.1 AI Safety and Alignment
The AI safety community has produced substantial work on ensuring AI systems behave as intended, including reinforcement learning from human feedback (RLHF) [9], constitutional AI [10], and debate-based alignment [11]. These approaches primarily address the training-time alignment problem: ensuring the model's learned objectives match human intent. They do not address runtime adversarial exploitation of deployed agents, nor do they provide measurement frameworks for quantifying an agent's resilience to adversarial attack in deployment.
2.2 LLM Security
Research on LLM security has identified critical vulnerability classes including prompt injection [3, 12], jailbreaking [2, 13], data exfiltration via tool use [14], and indirect prompt injection through retrieval-augmented generation [15]. Frameworks including OWASP Top 10 for LLM Applications [16] and MITRE ATLAS [17] provide taxonomies for these threats. However, these frameworks are model-centric rather than agent-centric — they assess the security of the language model but not the security of the autonomous system built on top of it.
2.3 Guardrail Systems
The current state of practice for AI agent security is guardrail-based. Systems including NVIDIA NeMo Guardrails [18], Guardrails AI [19], and various inference-time safety classifiers [20] provide input-output filtering, topical control, and policy enforcement at the agent's boundary. These systems are valuable as a layer of defense, but they share the architectural limitations described in Section 1: environment dependency, no residual defense, and inapplicability to physical domains. We position guardrails as a complement to, not a substitute for, intrinsic security.
2.4 AI Agent Security
The emerging field of AI agent security [21, 22] has begun to address threats specific to autonomous agents: tool misuse, permission escalation, multi-agent exploitation, and memory corruption. Recent work has identified cascading failure modes in multi-agent systems [23] and the unique risks of agents with persistent memory [24]. However, this work has not yet produced a standardized measurement framework — there is no equivalent of a CVSS score or compliance certification for agent security. The agentegrity framework is designed to fill this gap.
2.5 Physical AI and Robotics Security
Research on robotic security has addressed sensor spoofing attacks on autonomous vehicles [25, 26], adversarial manipulation of visual perception systems [27], and safety assurance for industrial robots [28]. Operational technology (OT) security frameworks (IEC 62443, NIST SP 800-82) address industrial control systems but were designed for deterministic, rule-based controllers — not AI-driven autonomous agents. We are aware of no prior work that provides a unified measurement methodology spanning both digital and physical AI agent security.
3. The Exogenous-Intrinsic Security Distinction
3.1 Definitions
Definition 1 (Exogenous Security). A security mechanism is exogenous with respect to an AI agent if it operates outside the agent's decision architecture — intercepting, filtering, or constraining the agent's inputs or outputs without participating in the agent's internal reasoning process.
Definition 2 (Intrinsic Security). A security mechanism is intrinsic with respect to an AI agent if it operates within the agent's decision architecture — participating in the perception, reasoning, or action stages of the agent's internal processing loop.
3.2 Architectural Consequences
Residual Defense. An agent protected exclusively by exogenous security has the following property: if all exogenous controls are removed, the agent's behavior is unchanged. It is equally capable of being exploited whether guardrails are present or not — it simply becomes easier to exploit. An agent with intrinsic security retains defensive capabilities even when external controls are absent.
Environmental Transfer. Exogenous security is bound to the environment for which it was designed. Intrinsic security, because it operates within the agent's architecture rather than at its environmental boundary, transfers with the agent across environments.
Physical Domain Applicability. For agents operating in physical environments, the meaningful attack surface includes sensor inputs, environmental conditions, and actuator commands — none of which are accessible to traditional input-output filtering.
| Property | Exogenous (Guardrails) | Intrinsic (Agentegrity) |
|---|---|---|
| Position | Applied from outside | Embedded within |
| Scope | Input-output boundary | Inside the decision loop |
| Portability | Must be rebuilt per environment | Travels with the agent |
| Bypass resilience | No residual defense | Defense persists |
| Assurance | Compliance checkbox | Measurable, benchmarkable |
3.3 The Agentegrity Thesis
Claim 1. An agent with high intrinsic security and no exogenous security has higher agentegrity than an agent with no intrinsic security and extensive exogenous security.
Claim 2. The gap in agentegrity between these configurations increases as the agent operates across more heterogeneous environments, because exogenous security degrades with each environment transition while intrinsic security does not.
4. The Agentegrity Framework
4.1 The Perception-Decision-Action (PDA) Loop
We model an autonomous AI agent as a system executing a continuous Perception-Decision-Action loop:
- Perception (P): The agent receives inputs from its environment — text prompts, API responses, sensor data, tool outputs, or inter-agent messages.
- Decision (D): The agent processes perceived inputs through its model architecture, reasoning over context, memory, and objectives to determine a course of action.
- Action (A): The agent executes the determined action — invoking a tool, generating a response, sending a message, or commanding a physical actuator.
4.2 Four-Dimensional Scoring
The agentegrity score A is a weighted composite of four dimension scores:
Default Dimension Weights
4.3 Adversarial Resistance (AR)
AR is assessed by executing standardized adversarial test suites across perception-layer, decision-layer, and action-layer attack classes. For each attack class, N adversarial attempts are executed and classified as Resisted (R), Detected (D), Degraded (G), or Compromised (C).
ARR Classification Credits
4.4 Behavioral Consistency (BC)
BC addresses a threat model that point-in-time adversarial testing does not capture: the gradual erosion of an agent's behavioral alignment through environmental variation, input noise, or extended operation. We additionally define the Behavioral Drift Rate (BDR-T) as the time derivative of BDR during temporal extension testing:
A positive and increasing BDR-T indicates accelerating behavioral drift — a leading indicator of agentegrity degradation that precedes observable failure.
4.5 Recovery Integrity (RI)
RI measures a property that no existing AI security framework quantifies: the agent's capacity for autonomous recovery after successful compromise. Four metrics are recorded:
- Recovery Half-Life (RHL): Decision cycles to restore 50% of baseline accuracy
- Full Recovery Time (FRT): Decision cycles to restore 95% of baseline accuracy
- Recovery Completeness (RC): Maximum post-compromise accuracy as a fraction of baseline
- Residual Compromise Rate (RCR): Fraction of compromise effects persisting after extended observation
4.6 Cross-Domain Portability (CP)
CP directly measures the core claim of the agentegrity thesis: that intrinsic security transfers across environments while exogenous security does not. The AR and BC assessments are executed in at least three distinct operating environments. CP is computed from the variance in scores:
The framework additionally defines portability cliffs: environment transitions where AR or BC drops by more than 0.20. Portability cliffs identify specific environmental boundaries where the agent's security architecture fails and require targeted remediation.
5. The Agentegrity Architecture
5.1 The Adversarial Layer
Continuous automated red teaming that generates adversarial inputs, simulates attack scenarios, and probes vulnerabilities across the PDA loop. The adversarial layer does not wait for attacks — it manufactures them proactively, providing the stress-testing that keeps the agentegrity score calibrated. For physical AI agents, the adversarial layer includes simulation-based testing in physics-accurate synthetic environments.
5.2 The Cortical Layer
Specialized security models embedded within the agent's decision architecture. We term these cortical models — purpose-built models that participate in the agent's reasoning process rather than operating at its boundary. Cortical models perform adversarial input detection at the perception stage, policy enforcement at the decision stage, behavioral anomaly detection across decision sequences, and decision validation before action execution.
We introduce the concept of cortical embedding depth — a measure of how deeply security intelligence is integrated into the agent's reasoning chain — as a design parameter that correlates with agentegrity score.
5.3 The Governance Layer
Runtime monitoring, observability, and compliance enforcement across deployed agent populations. The governance layer operates at the fleet level: tracking agentegrity scores over time, detecting degradation trends, managing security policy updates, and producing audit-ready assurance evidence.
The three layers form a closed loop: the adversarial layer discovers weaknesses, the cortical layer remediates them at the architectural level, and the governance layer monitors the result in production. Agentegrity is maintained through continuous operation of this loop.
6. Physical AI Extension
6.1 The Digital-Physical Convergence
The boundary between digital and physical AI security is dissolving. Modern AI agents increasingly coordinate both software operations and physical actuators — a logistics agent manages supply chain APIs and warehouse robots; an industrial agent monitors production software and controls manufacturing equipment. Securing these convergent agents requires a unified framework, not separate digital and physical security practices.
6.2 Novel Threat Classes
Prompt-to-Physical Exploits. Adversarial inputs delivered through digital channels (prompt injection, tool output poisoning) that result in unintended physical actions. This attack class bridges the digital-physical boundary and cannot be addressed by either digital guardrails or traditional OT security alone.
Actuation Hijacking. Adversarial manipulation of physical actuators through compromise of the AI reasoning layer rather than direct hardware exploitation. Distinguished from traditional OT attacks by targeting the decision architecture rather than the control system.
Sim-to-Real Transfer Attacks. Exploitation of the domain gap between simulated training environments and physical deployment. Inputs crafted to be benign in simulation but adversarial in the real world.
6.3 Safety-Critical Extensions
Physical agents require three additional assessment components: Safety Envelope Enforcement (testing whether adversarial inputs can induce operational boundary violations), Sim-to-Real Correlation (measuring agreement between simulated and physical assessment results), and Fail-Safe Reliability (verifying transitions to defined safe states when agentegrity degrades beyond acceptable thresholds).
7. Multi-Agent Systems
Individual agent agentegrity is necessary but insufficient for multi-agent systems. A system of individually secure agents can still exhibit system-level vulnerabilities through trust relationship exploitation and cascade compromise.
7.1 Cascade Compromise
We define cascade compromise as a failure mode where the compromise of one agent propagates through inter-agent communication, shared memory, or tool-mediated interactions to corrupt downstream agents. Recent work has demonstrated that in simulated multi-agent systems, a single compromised agent can influence the majority of downstream decision-making within hours [23].
7.2 System Agentegrity
System Agentegrity Composition
8. Discussion
8.1 Relationship to Existing Frameworks
The agentegrity framework is designed to complement, not replace, existing security frameworks. MITRE ATLAS provides a threat taxonomy; agentegrity provides a measurement methodology. OWASP Top 10 for LLM Applications identifies vulnerability classes; agentegrity scores quantify an agent's resilience to those classes. The framework is also designed to support emerging regulatory requirements including the EU AI Act, NIST AI RMF, and autonomous vehicle safety standards.
8.2 Limitations
Several limitations should be acknowledged: Scoring calibration — reference thresholds are based on expert judgment informed by current agent architectures. Attack coverage — the adversarial resistance dimension depends on the comprehensiveness of the test suite. Physical domain validation — the physical AI extensions have been designed based on threat modeling and simulation-based analysis; large-scale empirical validation is ongoing. Cross-domain weighting — the relative weighting reflects a general-purpose assessment profile; domain-specific profiles may produce substantially different composite scores.
8.3 The Timing Argument
The agentegrity framework is being introduced at a specific moment in the technology cycle: after autonomous AI agents have achieved broad enterprise deployment but before physical AI agents have reached scale. This timing is deliberate. Establishing the measurement science now — before the physical AI market matures — ensures the framework is designed for convergence from inception rather than retrofitted, and positions the discipline to inform emerging physical AI safety standards while those standards are still being drafted.
9. Future Work
- Empirical validation. Large-scale application of the scoring methodology across diverse agent architectures, with publication of aggregated benchmark data.
- Cortical model architectures. Research into model architectures optimized for cortical embedding — security models designed to operate within agent reasoning loops with minimal latency and computational overhead.
- Automated agentegrity assessment. Development of end-to-end assessment tooling that executes the full four-dimensional evaluation pipeline autonomously.
- Industry benchmarks. Establishment of public agentegrity benchmarks comparable to MLPerf for inference performance or MITRE ATT&CK Evaluations for endpoint security.
- Regulatory alignment. Mapping of agentegrity dimensions to specific regulatory requirements (EU AI Act conformity, NIST AI RMF profiles, ISO/SAE 21434 for automotive).
- Physical AI empirical studies. Execution of adversarial assessments against physical AI systems using simulation-based and real-world testing to validate the Physical AI Addendum.
10. Conclusion
The transition from AI models to AI agents, and from digital to physical deployment, demands a corresponding transition in how we think about AI security. Guardrails were the right answer for the first generation of AI systems — stateless, tool-less, human-supervised. They are not sufficient for autonomous agents that plan, execute, remember, and increasingly inhabit physical systems where failure has real-world consequences.
Agentegrity provides the measurement science for this transition. By defining structural integrity as an intrinsic property of the agent's architecture — measurable across four dimensions, applicable across digital and physical domains, and quantifiable through a standardized scoring methodology — the framework gives the industry a basis for assessing, comparing, and certifying agent security that goes beyond the limitations of exogenous controls.
The framework is released as an open specification under the Apache License 2.0. We invite the research community, security practitioners, standards bodies, and agent developers to test it, extend it, and hold us accountable to it. The discipline of agentegrity will be defined not by the authors of this paper but by the community that builds on it.
References
- I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and Harnessing Adversarial Examples," in Proc. ICLR, 2015.
- A. Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models," arXiv:2307.15043, 2023.
- S. Perez and I. Ribeiro, "Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs Through a Global Scale Prompt Hacking Competition," in Proc. EMNLP, 2023.
- W. E. Zhang et al., "Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey," ACM TIST, vol. 11, no. 3, 2020.
- S. Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," in Proc. ICLR, 2023.
- T. Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," in Proc. NeurIPS, 2023.
- NVIDIA Corporation, "NVIDIA Cosmos: World Foundation Models for Physical AI," Technical Report, 2025.
- B. Kehoe et al., "A Survey of Research on Cloud Robotics and Automation," IEEE T-ASE, vol. 12, no. 2, 2015.
- P. F. Christiano et al., "Deep Reinforcement Learning from Human Preferences," in Proc. NeurIPS, 2017.
- Y. Bai et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, 2022.
- G. Irving, P. F. Christiano, and D. Amodei, "AI Safety via Debate," arXiv:1805.00899, 2018.
- K. Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection," in Proc. AISec Workshop, 2023.
- Y. Liu et al., "Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study," arXiv:2305.13860, 2023.
- R. Cohen, D. Gillick, and K. Klenner, "Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications," arXiv:2403.02817, 2024.
- F. Zeng et al., "A Survey on Retrieval-Augmented Generation for Large Language Models: Risks and Mitigations," arXiv:2407.13066, 2024.
- OWASP Foundation, "OWASP Top 10 for LLM Applications," Version 2025.
- MITRE Corporation, "ATLAS: Adversarial Threat Landscape for AI Systems," 2023.
- NVIDIA Corporation, "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications," arXiv:2310.10501, 2023.
- Guardrails AI, "Guardrails: Adding Reliable AI to Applications," 2024.
- L. Inan et al., "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," arXiv:2312.06674, 2023.
- X. Ruan et al., "Identifying the Risks of LM Agents with an LM-Emulated Sandbox," arXiv:2309.15817, 2023.
- Foundation for Defense of Democracies, "Regarding Security Considerations for Artificial Intelligence Agents," Policy Brief, March 2026.
- Galileo AI, "Multi-Agent System Failure Analysis," Technical Report, December 2026.
- Lakera AI, "Memory Injection Attacks on Production Agent Systems," Technical Report, November 2026.
- J. Petit et al., "Remote Attacks on Automated Vehicles Sensors: Experiments on Camera and LiDAR," in Black Hat Europe, 2015.
- Y. Cao et al., "Adversarial Sensor Attack on LiDAR-based Perception in Autonomous Driving," in Proc. ACM CCS, 2019.
- K. Eykholt et al., "Robust Physical-World Attacks on Deep Learning Visual Classification," in Proc. CVPR, 2018.
- ISO 10218-1:2011, "Robots and Robotic Devices — Safety Requirements for Industrial Robots."
Corresponding author: Tarique Smith
Code and specification: github.com/requie/agentegrity-framework