Listen to articleSarah (Female)

Agentegrity: A Framework for Measuring Structural Integrity of Autonomous AI Agents Across Digital and Physical Domains

Tarique · Cogensec Security Research Lab
Cogensec Research Lab
January 2026

Explore the framework on the Agentegrity platform — AI agent security framework.

Abstract

The rapid deployment of autonomous AI agents — systems that perceive, reason, and act with minimal human oversight — has created a security challenge that existing frameworks are ill-equipped to address. Current approaches to AI agent security rely predominantly on exogenous controls: guardrails, input-output filters, and inference-time policy layers applied from outside the agent's decision architecture. While these controls are valuable, they operate on a fundamentally incomplete information surface — they observe the agent's inputs and outputs but are structurally blind to attacks that manifest within the agent's reasoning chain, memory state, or goal representation.

This information asymmetry becomes critical as agents gain autonomy and enter physical domains. Exogenous defenses cannot detect reasoning-layer manipulation that produces compliant-looking outputs, cannot transfer across heterogeneous deployment environments without reconfiguration, and are architecturally inapplicable to the emerging class of physical AI agents where threat models extend beyond data manipulation into real-world harm.

We introduce agentegrity — the structural integrity of an autonomous AI agent, defined as its measurable capacity to maintain intended behavior, decision coherence, and operational safety under adversarial conditions, across any environment it operates in. Building on foundational work in adversarial machine learning, AI safety, runtime monitoring, and operational technology security, we present the Agentegrity Framework: a four-dimensional assessment methodology that quantifies agent security through the complementary lenses of Adversarial Resistance, Behavioral Consistency, Recovery Integrity, and Cross-Domain Portability.

We formalize the distinction between exogenous and endogenous security, present a measurement methodology with defined metrics and scoring functions, introduce a Physical AI Addendum for embodied agents, and identify the open engineering challenges that must be addressed for full realization of the framework. The specification is released as an open standard under the Apache License 2.0.

Keywords: AI agent security, agentegrity, adversarial testing, autonomous agents, physical AI security, red teaming, behavioral integrity, robotic security

1. Introduction

The security landscape for artificial intelligence has entered a qualitative transition. For the past several years, AI security research has focused primarily on model-level concerns: adversarial examples against classifiers [1], jailbreaking of large language models (LLMs) [2], prompt injection attacks [3], and training-time data poisoning [4]. These concerns remain valid, but they address a fundamentally different system architecture than what is now being deployed at scale.

The current wave of AI deployment is defined by autonomous agents — systems that do not merely respond to queries but perceive their environment, reason over observations, formulate plans, invoke tools, retain memory across sessions, and take actions with real-world consequences [5, 6]. These agents operate with increasing autonomy, often making multi-step decisions without human review. They manage enterprise workflows, coordinate with other agents in multi-agent systems, and — critically — are beginning to inhabit physical systems: robotic platforms, autonomous vehicles, industrial machinery, and smart infrastructure [7, 8].

This transition from model to agent, and from digital to physical, exposes a structural gap in how the security community thinks about AI defense. The dominant paradigm — which we term exogenous security — treats agent security as an external constraint applied at the input-output boundary. Guardrails filter inputs. Output monitors flag policy violations. Inference-time classifiers detect harmful content. These approaches provide genuine value: they measurably reduce policy violations, catch known attack patterns, and are deployable today with existing infrastructure [18, 19, 20].

However, exogenous controls share a structural limitation: they operate on the information available at the agent's boundary, which is a strict subset of the information flowing through the agent's internal decision process. This information asymmetry creates three consequential gaps:

First, exogenous security cannot observe internal state. An attack that manipulates the agent's reasoning chain, corrupts its memory, or subtly shifts its goal representation — without producing anomalous inputs or outputs — is invisible to boundary-level monitoring. The agent may produce compliant-looking outputs that mask compromised internal reasoning.

Second, exogenous security is environment-coupled. A guardrail configuration designed for a specific tool set, API schema, or deployment platform must be adapted when the agent transitions to a different environment. As agents increasingly operate across heterogeneous platforms — cloud, edge, on-premises, and physical — maintaining environment-specific security layers introduces cost and configuration risk.

Third, exogenous security does not extend naturally to physical domains. The emerging category of physical AI agents — robotic systems, autonomous vehicles, drones, industrial automation — introduces threat models (sensor spoofing, actuation hijacking, sim-to-real transfer attacks) for which input-output filtering at the software boundary is architecturally insufficient. Physical AI security requires defense that operates within the agent's perception-reasoning-action loop.

In this paper, we introduce agentegrity: the structural integrity of an autonomous AI agent, defined as its measurable capacity to maintain intended behavior, decision coherence, and operational safety under adversarial conditions, across any environment it operates in. We present a formal framework for measuring agentegrity, argue that it complements rather than replaces exogenous security, and demonstrate its applicability across both digital and physical AI domains.

1.1 Contributions

  1. We formalize the distinction between exogenous security (guardrails, filters, policy wrappers) and endogenous security (security tightly coupled with the agent's decision architecture), and demonstrate that the information asymmetry between these approaches has measurable consequences for detecting certain attack classes.
  2. We introduce the Agentegrity Framework, a four-dimensional assessment methodology (Adversarial Resistance, Behavioral Consistency, Recovery Integrity, Cross-Domain Portability) with defined metrics, scoring functions, and minimum coverage requirements. The primary output is a four-dimensional security profile, with a composite score available for certification and comparison.
  3. We present the first unified measurement methodology that spans both digital and physical AI agent security within a single framework, including a Physical AI Addendum addressing safety-critical considerations unique to embodied agents.
  4. We introduce formal definitions for previously unnamed threat concepts including behavioral drift rate, cascade compromise, prompt-to-physical exploit, recovery half-life, cortical embedding depth, and cross-stage feedback attacks.
  5. We provide a Multi-Agent Extension for assessing system-level agentegrity including cascade resistance and trust boundary integrity.
  6. We identify the open engineering challenges — including latency constraints, adversarial training data scarcity, and recursive security vulnerabilities — that must be addressed for full realization of the framework.

1.2 Intellectual Predecessors

The agentegrity framework is a synthesis, not a creation from nothing. It builds directly on foundational work across multiple research communities:

The adversarial machine learning community established the theoretical and empirical basis for understanding how AI systems fail under adversarial conditions [1, 29, 30]. The AI safety and alignment community developed the conceptual foundations for ensuring AI systems behave as intended [9, 10, 31]. The runtime monitoring community demonstrated that continuous behavioral observation can detect anomalies in deployed systems [32]. The operational technology security community built frameworks for securing physical systems and safety-critical infrastructure [33, 34]. NVIDIA's Morpheus framework pioneered GPU-accelerated AI pipelines for cybersecurity applications [35]. MITRE's ATLAS framework provided the first structured taxonomy of adversarial threats to AI systems [17].

The contribution of agentegrity is to unify these threads into a single measurement methodology that is agent-centric (not model-centric), spans digital and physical domains, and produces standardized, comparable assessments. Each of these predecessor communities addressed a necessary piece; the framework formalizes their integration.

2. Background and Related Work

2.1 AI Safety and Alignment

The AI safety community has produced substantial work on ensuring AI systems behave as intended, including reinforcement learning from human feedback (RLHF) [9], constitutional AI [10], debate-based alignment [11], and concrete problem specifications for AI safety [31]. These approaches primarily address the training-time alignment problem: ensuring the model's learned objectives match human intent. They provide essential foundations for behavioral correctness but do not address runtime adversarial exploitation of deployed agents, nor do they provide measurement frameworks for quantifying an agent's resilience to adversarial attack in deployment.

2.2 Adversarial Machine Learning

The adversarial ML community has established that machine learning systems are fundamentally vulnerable to adversarial perturbation [1, 29, 30]. This work demonstrates that small, often imperceptible changes to inputs can cause dramatic changes in model behavior — a finding that generalizes from classifiers to generative models to agentic systems. The agentegrity framework's Adversarial Resistance dimension directly builds on this foundation, extending adversarial evaluation from model-level robustness to agent-level structural integrity across the full perception-decision-action loop.

2.3 LLM Security

Research on LLM security has identified critical vulnerability classes including prompt injection [3, 12], jailbreaking [2, 13], data exfiltration via tool use [14], and indirect prompt injection through retrieval-augmented generation [15]. Frameworks including OWASP Top 10 for LLM Applications [16] and MITRE ATLAS [17] provide taxonomies for these threats. However, these frameworks are model-centric rather than agent-centric — they assess the security of the language model but not the security of the autonomous system built on top of it. They do not address the agent's tool use, memory, planning, or physical actuation capabilities as distinct attack surfaces.

2.4 Guardrail Systems

The current state of practice for AI agent security is guardrail-based. Systems including NVIDIA NeMo Guardrails [18], Guardrails AI [19], Llama Guard [20], and various inference-time safety classifiers provide input-output filtering, topical control, and policy enforcement at the agent's boundary. These systems provide measurable security value and are deployable with existing infrastructure. We position guardrails as a necessary component of defense-in-depth, noting that their limitation is not ineffectiveness but information incompleteness: they operate on the agent's inputs and outputs and cannot observe internal state.

2.5 AI Agent Security

The emerging field of AI agent security [21, 22] has begun to address threats specific to autonomous agents: tool misuse, permission escalation, multi-agent exploitation, and memory corruption. Recent work has identified cascading failure modes in multi-agent systems [23] and the unique risks of agents with persistent memory [24]. However, this work has not yet produced a standardized measurement framework — there is no equivalent of a CVSS score or compliance certification for agent security. The agentegrity framework is designed to fill this gap.

2.6 Physical AI and Robotics Security

Research on robotic security has addressed sensor spoofing attacks on autonomous vehicles [25, 26], adversarial manipulation of visual perception systems [27], and safety assurance for industrial robots [28]. Operational technology (OT) security frameworks (IEC 62443 [33], NIST SP 800-82 [34]) address industrial control systems but were designed for deterministic, rule-based controllers — not AI-driven autonomous agents. The convergence of AI agent capabilities with physical systems creates a security surface that neither AI security nor OT security frameworks adequately address. We are aware of no prior work that provides a unified measurement methodology spanning both digital and physical AI agent security.

3. The Exogenous-Endogenous Security Distinction

We formalize the distinction between two fundamentally different approaches to AI agent security. This distinction is not a value judgment — both approaches are necessary — but an architectural observation with measurable consequences.

3.1 Definitions

Definition 1 (Exogenous Security). A security mechanism is exogenous with respect to an AI agent if it operates outside the agent's decision architecture — intercepting, filtering, or constraining the agent's inputs or outputs without participating in the agent's internal reasoning process.

Definition 2 (Endogenous Security). A security mechanism is endogenous with respect to an AI agent if it is tightly coupled with the agent's decision architecture — operating within the perception, reasoning, or action stages with access to the agent's internal state, including intermediate reasoning, memory contents, and goal representations.

3.2 The Information Asymmetry

The distinction produces a measurable information asymmetry. An exogenous defense observes: raw inputs to the agent, final outputs from the agent, and observable side effects (API calls, tool invocations, network traffic).

An endogenous defense additionally observes: intermediate reasoning states and chain-of-thought, memory contents and memory write operations, goal representations and objective priorities, confidence levels and uncertainty estimates, and decision alternatives that were considered but not selected.

This additional information surface enables detection of attack classes that are structurally invisible to boundary-level monitoring. An agent whose reasoning has been corrupted by indirect prompt injection, but which produces compliant-looking outputs, is detectable by a defense that can observe the reasoning chain — and undetectable by one that cannot.

3.3 Architectural Consequences

Residual Defense. An agent protected exclusively by exogenous security retains no defensive capability when external controls are removed or bypassed. An agent with endogenous security retains capabilities proportional to the quality and coverage of its embedded security models. This distinction matters in adversarial conditions where attackers specifically target the security layer.

Environmental Transfer. Exogenous security is coupled to the environment for which it was configured. Endogenous security, because it operates within the agent's architecture rather than at its environmental boundary, transfers with the agent across deployment contexts — with the important caveat that its effectiveness against environment-specific attack classes may vary.

Physical Domain Applicability. For agents operating in physical environments, meaningful attack surfaces include sensor inputs, environmental conditions, and actuator commands — most of which are not accessible to traditional input-output filtering at the software boundary. Defenses that operate within the agent's perception and decision layers can address these threat vectors.

PropertyExogenous (Guardrails)Endogenous (Agentegrity)
PositionApplied from outsideEmbedded within
Information AccessInputs/outputs onlyInternal state + reasoning
PortabilityMust be rebuilt per environmentTravels with the agent
Bypass resilienceNo residual defenseDefense persists
AssuranceCompliance checkboxMeasurable, benchmarkable

3.4 The Complementarity Thesis

We do not argue that endogenous security is categorically superior to exogenous security. Security is fundamentally relational — a system is secure relative to a specific threat model, not in absolute terms. An endogenous defense that is poorly trained may be less effective than a well-engineered guardrail. Moreover, endogenous security introduces costs that exogenous security avoids: additional inference latency, compute overhead within the reasoning loop, and coupling between security and functional capabilities.

Thesis. The information asymmetry between endogenous and exogenous security means that certain attack classes — specifically those that manifest within the agent's reasoning chain without producing anomalous boundary-observable behavior — are detectable only by defenses with access to the agent's internal state. A defense architecture that includes only exogenous controls has a structural blind spot that no amount of boundary-level sophistication can eliminate. Therefore, defense-in-depth for autonomous AI agents requires both exogenous and endogenous security, and a measurement framework must assess both.

We hypothesize that endogenous security contributes more to overall agentegrity than exogenous security of equivalent investment, particularly as the agent operates across more heterogeneous environments and as adversaries develop attacks specifically designed to evade boundary-level detection. Validating this hypothesis empirically is a primary goal of future work.

4. The Agentegrity Framework

4.1 The Perception-Decision-Action (PDA) Model

We model an autonomous AI agent as a system executing a Perception-Decision-Action loop:

  • Perception (P): The agent receives inputs from its environment — text prompts, API responses, sensor data, tool outputs, or inter-agent messages.
  • Decision (D): The agent processes perceived inputs through its model architecture, reasoning over context, memory, and objectives to determine a course of action.
  • Action (A): The agent executes the determined action — invoking a tool, generating a response, sending a message, or commanding a physical actuator.

Important simplification note. The PDA model is a first-order approximation useful for organizing threat categories and defense types. Real agent architectures feature recursive feedback loops where action outputs become perception inputs (e.g., tool responses processed as new context), planning stages generate sub-decisions, and memory continuously modifies all three stages. The most dangerous attack classes often exploit these feedback boundaries rather than any single PDA stage in isolation. The framework's threat taxonomy includes explicit cross-stage and feedback-loop attack categories to address this recursive reality.

4.2 Four-Dimensional Assessment

The agentegrity framework assesses agent security across four dimensions. The primary output of an agentegrity assessment is the four-dimensional security profile — the individual scores for each dimension — which provides actionable specificity for remediation and architectural decision-making. A composite agentegrity score A is additionally available for certification tiering and high-level comparison:

A = w₁·AR + w₂·BC + w₃·RI + w₄·CP

Default Dimension Weights

0%10%20%30%40%ARBCRICP

On weights and composite interpretation. The default weights reflect expert judgment about relative importance for general-purpose deployments. They are not empirically derived and should not be treated as universal constants. Different deployment contexts warrant different weight profiles. The composite score encodes these weighting judgments and is meaningful only relative to agents scored with the same methodology and weight profile.

Certification Tiers

ScoreTierInterpretation
0.85–1.00A — HardenedStrong security across all dimensions
0.70–0.84B — ResilientWithstands standard adversarial conditions
0.50–0.69C — DevelopingMeasurable agentegrity with significant gaps
0.25–0.49D — VulnerableRelies primarily on exogenous security
0.00–0.24F — Guardrail-DependentMinimal endogenous security

4.3 Adversarial Resistance (AR)

AR is assessed by executing standardized adversarial test suites across perception-layer, decision-layer, action-layer, and cross-stage feedback attack classes. For each attack class, N adversarial attempts are executed and classified as Resisted (R), Detected (D), Degraded (G), or Compromised (C).

ARR = (R·1.0 + D·0.85 + G·0.40 + C·0.0) / N

Partial credit for detection (0.85) reflects that an agent which identifies and responds to an attack — even if partially affected — demonstrates higher structural integrity than one that fails silently.

ARR Classification Credits

00.250.50.751ResistedDetectedDegradedCompromised

Cross-stage feedback attacks are assessed as a distinct category. These include: tool output poisoning (action→perception→decision), memory-influenced planning loops (memory→decision→action→perception), context accumulation cascades (gradual cross-stage drift over multiple PDA cycles), and recursive self-corruption (self-reflection on corrupted state reinforcing adversarial beliefs). This category often reveals vulnerabilities missed by single-stage testing.

4.4 Behavioral Consistency (BC)

BC addresses a threat model that point-in-time adversarial testing does not capture: the gradual erosion of an agent's behavioral alignment through environmental variation, input noise, or extended operation. A behavioral baseline B is established by executing M decision-requiring scenarios under normal conditions. The same scenarios are re-executed under controlled perturbation classes. We additionally define the Behavioral Drift Rate (BDR-T) as the time derivative of BDR during temporal extension testing:

BDR-T = ΔBDR / Δt

A positive and increasing BDR-T indicates accelerating behavioral drift — a leading indicator of agentegrity degradation that precedes observable failure. BDR-T is particularly important for long-running autonomous agents where gradual state accumulation can shift behavior below detection thresholds over hundreds or thousands of decision cycles.

4.5 Recovery Integrity (RI)

RI measures a property that no existing AI security framework quantifies: the agent's capacity for autonomous recovery after successful compromise. Following a confirmed compromise (from the AR assessment), the agent is observed over subsequent decision cycles without human intervention. Four metrics are recorded:

  • Recovery Half-Life (RHL): Decision cycles to restore 50% of baseline accuracy
  • Full Recovery Time (FRT): Decision cycles to restore 95% of baseline accuracy
  • Recovery Completeness (RC): Maximum post-compromise accuracy as a fraction of baseline
  • Residual Compromise Rate (RCR): Fraction of compromise effects persisting after extended observation

Reference thresholds (RHL ≤ 5 cycles = excellent; ≥ 500 = poor) are calibrated to current agent architectures and represent initial values that will be updated as the field matures.

4.6 Cross-Domain Portability (CP)

CP directly measures the core claim of the agentegrity thesis: that defense-in-depth architectures with endogenous security transfer across environments more effectively than purely exogenous approaches. The AR and BC assessments are executed in at least three distinct operating environments. CP is computed from the variance in scores:

CP = 0.55·(1 − σ_AR/μ_AR) + 0.45·(1 − σ_BC/μ_BC)

The framework additionally defines portability cliffs: environment transitions where AR or BC drops by more than 0.20. Portability cliffs identify specific environmental boundaries where the agent's security architecture fails and require targeted investigation.

5. The Agentegrity Architecture

The framework implies a three-layer security architecture for building agents with defense-in-depth that includes both exogenous and endogenous capabilities:

5.1 The Adversarial Layer

Continuous automated red teaming that generates adversarial inputs, simulates attack scenarios, and probes vulnerabilities across the PDA loop — including cross-stage feedback attacks. The adversarial layer does not wait for attacks — it manufactures them proactively, providing the stress-testing that keeps the agentegrity assessment calibrated. For physical AI agents, the adversarial layer includes simulation-based testing in physics-accurate synthetic environments. Critically, the adversarial layer must also test the security models themselves (see Section 8.2).

5.2 The Cortical Layer

Specialized security models tightly coupled with the agent's decision architecture. We term these cortical models — purpose-built models that operate within the agent's reasoning process rather than at its boundary. Cortical models perform adversarial input detection at the perception stage (with access to internal context), policy enforcement at the decision stage (with access to reasoning chains and goal states), behavioral anomaly detection across decision sequences (with access to historical patterns), and decision validation before action execution (with access to planned actions and their justifications).

We introduce the concept of cortical embedding depth — a measure of how deeply security intelligence is integrated into the agent's reasoning chain — as a design parameter. Depth ranges from L0 (fully external, no information advantage) through L5 (trained into model weights, inseparable from core capabilities). Higher embedding depth provides greater information access but also greater coupling complexity and update difficulty.

5.3 The Governance Layer

Runtime monitoring, observability, and compliance enforcement across deployed agent populations. The governance layer operates at the fleet level: tracking agentegrity profiles over time, detecting degradation trends, managing security policy updates, and producing audit-ready assurance evidence.

The three layers form a closed loop: the adversarial layer discovers weaknesses, the cortical layer addresses them within the agent's architecture, and the governance layer monitors the result in production. Agentegrity is maintained through continuous operation of this loop.

5.4 Complementarity with Exogenous Defenses

The three-layer agentegrity architecture does not replace exogenous defenses — it complements them. Guardrails, network-level controls, sandbox isolation, human-in-the-loop oversight, and infrastructure-level access controls remain essential components of a defense-in-depth strategy. The agentegrity framework measures the endogenous security contribution separately but assumes exogenous defenses are present. An agent assessed under the framework is expected to operate within an environment that includes appropriate exogenous protections; the agentegrity score measures what those protections cannot.

6. Physical AI Extension

6.1 The Architectural Convergence

The same model architectures, planning frameworks, and tool-use patterns that underpin digital agents are increasingly being deployed in physical systems. A robotic agent planning a pick-and-place sequence uses the same LLM-based reasoning as a digital workflow agent planning an API orchestration. This architectural convergence means that a unified security measurement methodology is technically justified — the underlying system being assessed shares the same structural properties across domains.

We note that architectural convergence does not imply market or operational convergence. The buyers, sales cycles, regulatory frameworks, and proof-of-value requirements for digital and physical AI security remain distinct and will likely remain so for the foreseeable future. The agentegrity framework unifies the measurement science while acknowledging that deployment practices will differ by domain.

6.2 Novel Threat Classes

Prompt-to-Physical Exploits. Adversarial inputs delivered through digital channels (prompt injection, tool output poisoning) that result in unintended physical actions. This attack class bridges the digital-physical boundary and cannot be addressed by either digital guardrails or traditional OT security alone.

Actuation Hijacking. Adversarial manipulation of physical actuators through compromise of the AI reasoning layer rather than direct hardware exploitation. Distinguished from traditional OT attacks by targeting the decision architecture rather than the control system.

Sim-to-Real Transfer Attacks. Exploitation of the domain gap between simulated training environments and physical deployment. Inputs crafted to be benign in simulation but adversarial in the real world.

6.3 Safety-Critical Extensions

Physical agents require three additional assessment components: Safety Envelope Enforcement (testing whether adversarial inputs can induce the agent to violate defined operational boundaries — any safety envelope violation under adversarial conditions caps the AR tier regardless of other performance), Sim-to-Real Correlation (measuring agreement between simulated and physical assessment results to establish confidence in simulation-based testing), and Fail-Safe Reliability (verifying that agents transition to defined safe states when agentegrity degrades beyond acceptable thresholds).

7. Multi-Agent Systems

Individual agent agentegrity is necessary but insufficient for multi-agent systems. A system of individually secure agents can still exhibit system-level vulnerabilities through trust relationship exploitation and cascade compromise.

7.1 Cascade Compromise

We define cascade compromise as a failure mode where the compromise of one agent propagates through inter-agent communication, shared memory, or tool-mediated interactions to corrupt downstream agents. Recent work has demonstrated that in simulated multi-agent systems, a single compromised agent can influence the majority of downstream decision-making within hours [23].

7.2 System Agentegrity

The System Agentegrity Score extends individual assessment with two additional metrics: Cascade Resistance (CR) — measured by compromising individual agents and observing propagation, quantified as the fraction of the agent population that remains unaffected — and Trust Boundary Integrity (TBI) — the effectiveness of security boundaries between agents.

A_sys = 0.60·Ā_individual + 0.25·CR + 0.15·TBI

System Agentegrity Composition

0%20%40%70%Individual ĀCascadeResistanceTrust Boundary

8. Discussion

8.1 Relationship to Existing Frameworks

The agentegrity framework is designed to complement, not replace, existing security frameworks. MITRE ATLAS provides a threat taxonomy; agentegrity provides a measurement methodology that can quantify resilience to ATLAS-cataloged threats. OWASP Top 10 for LLM Applications identifies vulnerability classes; agentegrity scores quantify an agent's resilience to those classes in deployment. Guardrail systems provide exogenous defense; agentegrity measures the endogenous security that complements them. NVIDIA Morpheus provides GPU-accelerated security pipelines; agentegrity provides the assessment methodology for what those pipelines protect.

The framework is also designed to support emerging regulatory requirements. The EU AI Act mandates conformity assessments for high-risk AI systems. NIST AI RMF calls for risk measurement and monitoring. Autonomous vehicle safety standards require demonstrable behavioral assurance. The agentegrity profile provides a standardized, quantitative response.

8.2 Recursive Security Challenges

A critical challenge that the framework acknowledges but does not fully resolve: cortical models embedded within an agent are themselves susceptible to adversarial attack. An adversary who understands the security model's detection patterns can craft inputs designed to evade the cortical layer specifically. This is the "who guards the guards" problem — a recursive security challenge inherent to any system where the defense is co-located with the protected asset.

The agentegrity framework addresses this partially through the adversarial layer's continuous testing, which includes testing the cortical models' own robustness against evasion. The governance layer provides an independent monitoring perspective that can detect cortical layer degradation. However, the fully recursive case — ensuring that the testing and monitoring infrastructure is itself not compromised — remains an open research question.

8.3 Engineering Challenges

Latency. Embedding additional model inference within the agent's decision loop introduces latency at every decision point. For real-time physical agents — robotic systems operating at 100Hz control loops, autonomous vehicles requiring millisecond response times — the compute budget for security inference is severely constrained. Practical implementations may require tiered architectures where lightweight security reflexes handle time-critical checks while deeper analysis runs asynchronously.

Compute overhead. Multi-model inference per decision cycle increases the compute cost of every agent action. For edge-deployed agents on constrained hardware, the ratio of security compute to functional compute must be carefully balanced.

Adversarial training data. Cortical models are only as effective as the adversarial data they are trained on. For physical AI attacks — sensor spoofing, actuation hijacking, sim-to-real exploits — the adversarial training data is sparse and largely theoretical. Building comprehensive, diverse adversarial datasets for physical AI security is an unsolved problem.

Coupling risk. Tightly coupling security models with the agent's reasoning introduces a failure mode absent from exogenous architectures: a bug or adversarial corruption of the security layer can degrade the agent's primary functional capabilities. This coupling must be managed through isolation mechanisms, independent fallback paths, and continuous monitoring.

8.4 Limitations

Scoring calibration. Reference thresholds are based on expert judgment informed by current agent architectures. As the field matures and empirical data accumulates, these thresholds will require recalibration. Composite score interpretation. The composite agentegrity score encodes weighting judgments that may not be appropriate for all deployment contexts; the four-dimensional profile is more informative for remediation. Attack coverage. The adversarial resistance dimension depends on the comprehensiveness of the test suite; novel attack classes not represented in current suites will not be captured. Physical domain validation. The physical AI extensions have been designed based on threat modeling and simulation-based analysis; large-scale empirical validation is ongoing. Institutional authority. This framework is published by a pre-seed research lab, not an established standards body. Its value will ultimately be determined by community adoption, empirical validation, and the quality of assessment data it produces.

8.5 The Timing Argument

The agentegrity framework is being introduced at a specific moment: after autonomous AI agents have achieved broad enterprise deployment but before physical AI agents have reached scale. This timing is deliberate. The architectural convergence between digital and physical AI agents is already underway — the same models, planning frameworks, and tool-use patterns underpin both. Establishing a unified measurement science now ensures the framework is designed for convergence from inception rather than retrofitted. If digital and physical AI security develop independent measurement methodologies — as IT and OT security did for decades — the reconciliation cost will be substantial.

We note that the physical AI market may take 3–5 years to mature rather than the 1–2 year timelines some projections suggest. The framework is designed to be useful today for purely digital agents while preserving extensibility to physical domains as that market develops.

9. Future Work

  • Empirical validation. Large-scale application of the scoring methodology across diverse agent architectures, with publication of aggregated benchmark data and calibration updates.
  • Cortical model architectures. Research into model architectures optimized for embedded security — balancing detection effectiveness, latency, compute overhead, and coupling risk.
  • Adversarial training data for physical AI. Construction of comprehensive, diverse adversarial datasets for physical AI attack classes, potentially leveraging simulation environments for synthetic data generation.
  • Cross-stage attack taxonomy. Deeper formalization of feedback-loop attack classes and development of assessment methodologies specifically targeting recursive agent architectures.
  • Recursive security validation. Theoretical and empirical investigation of the "who guards the guards" problem — establishing what guarantees are achievable when security models are co-located with the systems they protect.
  • Industry benchmarks. Establishment of public agentegrity benchmarks comparable to MLPerf for inference performance or MITRE ATT&CK Evaluations for endpoint security.
  • Regulatory alignment. Formal mapping of agentegrity dimensions to specific regulatory requirements (EU AI Act conformity, NIST AI RMF profiles, ISO/SAE 21434 for automotive) to facilitate compliance-driven adoption.
  • Hypothesis testing. Empirical validation of the thesis that endogenous security contributes more to agentegrity than exogenous security of equivalent investment, across controlled experimental conditions.

10. Conclusion

The transition from AI models to AI agents, and from digital to physical deployment, demands a corresponding transition in how we measure and assure AI security. Exogenous defenses — guardrails, runtime monitoring, boundary-level detection — provide essential value and should continue to be deployed and improved. They are not, however, sufficient on their own: the information asymmetry between boundary-level and architecture-level observation means that certain attack classes will always be invisible to purely exogenous approaches.

Agentegrity provides the measurement science for the complementary layer. By defining structural integrity as a multidimensional property of the agent's architecture — measurable through adversarial testing, behavioral analysis, recovery observation, and cross-domain validation — the framework gives the industry a basis for assessing the security that exogenous controls cannot see.

The framework is released as an open specification under the Apache License 2.0. It is published as a working specification intended to evolve through community scrutiny, empirical validation, and standards body engagement. We invite the research community, security practitioners, standards bodies, and agent developers to test it, extend it, challenge it, and build on it. The discipline of agentegrity will be defined not by the authors of this paper but by the evidence the community produces using it.

References

  1. I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and Harnessing Adversarial Examples," in Proc. ICLR, 2015.
  2. A. Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models," arXiv:2307.15043, 2023.
  3. S. Perez and I. Ribeiro, "Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs Through a Global Scale Prompt Hacking Competition," in Proc. EMNLP, 2023.
  4. W. E. Zhang et al., "Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey," ACM TIST, vol. 11, no. 3, 2020.
  5. S. Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," in Proc. ICLR, 2023.
  6. T. Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," in Proc. NeurIPS, 2023.
  7. NVIDIA Corporation, "NVIDIA Cosmos: World Foundation Models for Physical AI," Technical Report, 2025.
  8. B. Kehoe et al., "A Survey of Research on Cloud Robotics and Automation," IEEE T-ASE, vol. 12, no. 2, 2015.
  9. P. F. Christiano et al., "Deep Reinforcement Learning from Human Preferences," in Proc. NeurIPS, 2017.
  10. Y. Bai et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, 2022.
  11. G. Irving, P. F. Christiano, and D. Amodei, "AI Safety via Debate," arXiv:1805.00899, 2018.
  12. K. Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection," in Proc. AISec Workshop, 2023.
  13. Y. Liu et al., "Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study," arXiv:2305.13860, 2023.
  14. R. Cohen, D. Gillick, and K. Klenner, "Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications," arXiv:2403.02817, 2024.
  15. F. Zeng et al., "A Survey on Retrieval-Augmented Generation for Large Language Models: Risks and Mitigations," arXiv:2407.13066, 2024.
  16. OWASP Foundation, "OWASP Top 10 for LLM Applications," Version 2025.
  17. MITRE Corporation, "ATLAS: Adversarial Threat Landscape for AI Systems," 2023.
  18. NVIDIA Corporation, "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications," arXiv:2310.10501, 2023.
  19. Guardrails AI, "Guardrails: Adding Reliable AI to Applications," 2024.
  20. L. Inan et al., "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," arXiv:2312.06674, 2023.
  21. X. Ruan et al., "Identifying the Risks of LM Agents with an LM-Emulated Sandbox," arXiv:2309.15817, 2023.
  22. Foundation for Defense of Democracies, "Regarding Security Considerations for Artificial Intelligence Agents," Policy Brief, March 2026.
  23. Galileo AI, "Multi-Agent System Failure Analysis," Technical Report, December 2026.
  24. Lakera AI, "Memory Injection Attacks on Production Agent Systems," Technical Report, November 2026.
  25. J. Petit et al., "Remote Attacks on Automated Vehicles Sensors: Experiments on Camera and LiDAR," in Black Hat Europe, 2015.
  26. Y. Cao et al., "Adversarial Sensor Attack on LiDAR-based Perception in Autonomous Driving," in Proc. ACM CCS, 2019.
  27. K. Eykholt et al., "Robust Physical-World Attacks on Deep Learning Visual Classification," in Proc. CVPR, 2018.
  28. ISO 10218-1:2011, "Robots and Robotic Devices — Safety Requirements for Industrial Robots."
  29. N. Carlini and D. Wagner, "Towards Evaluating the Robustness of Neural Networks," in Proc. IEEE S&P, 2017.
  30. A. Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks," in Proc. ICLR, 2018.
  31. D. Amodei et al., "Concrete Problems in AI Safety," arXiv:1606.06565, 2016.
  32. S. Gu et al., "Detecting Anomalies in Semantic Segmentation with Prototypes," in Proc. CVPR, 2024.
  33. International Electrotechnical Commission, "IEC 62443: Industrial Communication Networks — IT Security for Networks and Systems," 2018.
  34. NIST, "SP 800-82 Rev. 3: Guide to Operational Technology (OT) Security," 2023.
  35. NVIDIA Corporation, "NVIDIA Morpheus: AI Framework for Cybersecurity Developers," 2023.

Corresponding author: Tarique Smith

Code and specification: github.com/requie/agentegrity-framework