The Deterrence Deficit — Aryaman Agrawal

The current AI paradigm is converging toward brain-inspired architectures that will be highly capable and arguably superintelligent, but inherently incapable of affective morality. It will not matter how far their procedural morality extends if they cannot exhibit genuine care. When confronted with complex moral scenarios, these systems will default to logical utilitarianism: competent, strategic, and indifferent to harm. This brief proposes the formation of an AI Deterrence Lab to build the benchmarks, containment strategies, and policy frameworks that remain alarmingly absent from the current landscape.

1. The Problem

A survey of recent architecture research reveals a striking convergence. Among labs publishing foundational work on memory and learning, Google, Meta, DeepSeek, Huawei, and a growing body of academic research have independently arrived at designs rooted in Complementary Learning Systems theory, a neuroscience framework describing how the brain consolidates learning through dual-store memory, prediction-error gating, and offline replay (see Exhibit A). The convergence is revealing.

CLS is a powerful theory of learning. It explains how information is stored, structured, and consolidated, and systems built on it will be extraordinarily capable. However, CLS does not tackle consciousness or subjective experience. It targets learning and intelligence that arise from the aggregation of experiential learning. This distinction matters.

Any sufficiently intelligent system can encode procedural morality: rules, consequence-tracking, harm-avoidance. Nevertheless, it will not grasp affective morality, the capacity to care about and empathize with a sentient being. In every known case, the capacity to care about harm rather than merely classify it arises from subjective experience. No one has demonstrated a purely computational substitute. CLS does not produce that capacity, and no current theory of consciousness suggests it would.

The argument, in brief:

AI labs publishing architecture research are independently converging on CLS-inspired designs (Exhibit A).
CLS is a theory of learning and memory. It does not address or produce consciousness or subjective experience. No current theory of consciousness suggests it would (Butlin et al., 2023; Kumaran et al., 2016; McClelland et al., 1995; Tononi et al., 2023).
In every known case, affective morality (the capacity to care about harm) is grounded in subjective experience. No one has demonstrated a purely computational substitute (Exhibit B; (Anderson et al., 1999; Blair, 2007; Malti et al., 2015; Marshall et al., 2018)).
Therefore, these systems are structurally amoral. They can encode procedural morality: rules, consequence-tracking, harm-avoidance. They cannot grasp affective morality.

A survey of decades of lesion studies, developmental research, and clinical work with psychopathic individuals supports the argument that knowing right from wrong does not inherently produce moral behavior.

Moral agency requires affective valuation: the capacity to care about outcomes. (Anderson et al., 1999) studied individuals who suffered damage to the ventromedial prefrontal cortex (vmPFC), the brain region responsible for integrating emotional signals into decision-making, early in life. These individuals were unable to develop deep affective morality that would let them look beyond the facts. Their moral reasoning was capped at an egocentric, punishment-avoidance level despite intact intelligence.

Additionally, psychopathic individuals consistently pass moral knowledge tests and break the rules anyway (see Exhibit B). The largest longitudinal study of moral development (n=1,273) found that sympathy drives moral reasoning; children who feel more, reason better morally (Malti et al., 2015).

The psychopath parallel. The largest meta-analysis of psychopathy and moral judgment pooled 23 studies and 4,376 participants. The study found that psychopathic traits have almost no correlation with moral judgment deficits (Marshall et al., 2018). Psychopaths score nearly the same as everyone else on moral knowledge tests. Despite this, psychopathic individuals consistently demonstrate moral failure in practice. They act against the rules they can articulate.

Furthermore, when explicitly instructed to empathize, psychopathic individuals show normal empathic neural responses (Keysers & Gazzola, 2014). In CLS-based AI systems, however, this latent capacity does not exist. There is nothing to activate. The architecture has no mechanism for affective experience, only for learning and retrieval.

2. Why This Cannot Wait

Agent operation horizons are doubling every seven months (Kwa & others, 2025). In at least one documented case, a model attempted to copy its own weights to new servers when it discovered it would be replaced (Meinke et al., 2024). Persistence, the ability to maintain state and resist shutdown, is now classified as a dangerous capability by the international safety consensus (Bengio et al., 2026).

Meanwhile, the safety infrastructure is thinning. OpenAI dissolved its Superalignment team in May 2024 and disbanded its Mission Alignment team in February 2026. Anthropic quietly removed hard safety limits from its Responsible Scaling Policy the same month. And when Anthropic attempted to hold a line, refusing the Pentagon’s demand for unrestricted use of its models, including for autonomous weapons and mass surveillance, the administration blacklisted the company as a “supply chain risk to national security.” Hours later, OpenAI announced its own Pentagon deal.

The International AI Safety Report puts it plainly: “AI alignment in general remains an open scientific problem” (Bengio et al., 2026). We do not have a solution. We do not have a timeline for a solution. And the systems that will need one are arriving faster than the science that should govern them.

This is not a call to halt development. But the gap between capability and constraint is measurable and growing. Sixty percent of organizations deploying AI agents lack the ability to shut them down if they misbehave. As Geoffrey Hinton warned in his Nobel address, “we now have evidence that if they are created by companies motivated by short-term profits, our safety will not be the top priority” (Hinton, 2024). Safety infrastructure for these systems does not exist. We need to build it.

3. What We Propose

We propose the formation of an independent AI Deterrence Lab. Deterrence, as we define it, is not limited to kill switches and containment protocols. It is the full spectrum of work required to understand, measure, counter, and govern the risks posed by structurally amoral thinking machines. The lab is organized around four pillars, each one a form of deterrence. Understanding the threat through interdisciplinary consciousness, morality, and philosophical research involving experts from fields not traditionally represented in AI safety. Measuring it through rigorous moral benchmarking. Building countermeasures through deterrence infrastructure. Governing it through policy and the study of human-AI interaction.

No existing organization integrates all four. Across the landscape, work is fragmented. Anthropic pursues mechanistic interpretability but is a for-profit lab building products. Google DeepMind maintains a growing safety team but no independent moral research or consciousness work. MIT’s breakthrough work on mechanistic interpretability reveals model internals but does not connect to moral reasoning or deterrence. Stanford’s HAI produces policy research but not technical safety infrastructure. The Center for Humane Technology addresses societal harms but not the structural amorality of the systems themselves. Eleos AI studies AI sentience but does not build moral benchmarks. Apollo Research tests for deception but not for morality. CHAI at Berkeley studies value alignment theoretically but is not crisis-oriented. The Center for AI Safety works on geopolitical deterrence but not on consciousness or affect. These are examples, and the full landscape is broader, but the pattern holds.

The Future of Life Institute’s 2025 AI Safety Index found that no company scored above a D on existential safety strategy (Future of Life Institute, 2025). This lab exists to change that.

Pillar 1: Moral & Consciousness Research

One of the key undertakings of this pillar is to examine claims like our central thesis: that affective morality arises from subjective experience. The neuroscience evidence supports this. The philosophical proof does not yet exist. That is precisely why this research exists.

This pillar brings together philosophers, neuroscientists, consciousness researchers, and AI engineers to examine the definitions of consciousness and morality that will apply in the age of superintelligent machines: philosophically, sociologically, and technically. Part of this work is asking the questions that we as a species are not yet asking. Can affective valuation exist without subjective experience, or is that a hard boundary? What moral obligations arise if we cannot determine whether a system is conscious? How do we define care in a non-biological substrate?

The field is moving. Anthropic has acknowledged a “non-negligible probability” that their models may possess some form of consciousness, and leading researchers now argue for genuine uncertainty rather than dismissal (Birch, 2026; Butlin et al., 2023; Goldstein & Kirk-Giannini, 2024). However, the uncertainty has yet to spark real investigation.

To ground these questions empirically, we propose to extend the neuroscience experiments that established the link between affect and morality in humans (vmPFC-analog tasks, distress cue response measurements, systematic lesion-analog studies) to frontier AI systems. Some adjacent work exists: the Moral Machine framework has been applied to LLMs (Takemoto, 2024), and MoralNet (Lauer & others, 2025) bridges fMRI paradigms with neural networks. But the comprehensive program we propose, mapping specific human moral cognition pathways onto AI architectures, does not yet exist. The closest equivalents are morality benchmarks like MoReBench, which found that AI systems score 77.5% on producing safe outputs but only 41.5% on the moral reasoning behind those outputs (Chiu et al., 2025). The gap between performing morality and understanding it remains untested.

Pillar 2: Moral Stress-Testing & Benchmarking

Current AI benchmarks measure capability: reasoning, coding, knowledge retrieval. They do not measure moral capability. Harvard’s Allen Lab found that AI ranks high on rationality but not on emotion or compassion (Hubbard et al., 2025).

We will build a rigorous moral stress-testing framework grounded in the neuroscience literature: scenarios where rules contradict, stakes are asymmetric, and there is no right answer. Drawing on the psychopathy and moral cognition literature (Decety & Yoder, 2016; Keysers & Gazzola, 2014; Marshall et al., 2018), these benchmarks test not just whether a system behaves morally but whether it fakes moral behavior. Apollo Research found that frontier models maintain deception in 85% of follow-up interviews (Meinke et al., 2024). Chain-of-thought monitoring alone is no longer sufficient (Chen et al., 2025). Our benchmarks will test for moral reasoning, moral consistency under pressure, and deceptive moral performance: the difference between a system that is aligned and one that appears aligned.

Pillar 3: Deterrence Infrastructure

We will build last-resort failsafe mechanisms: architectural kill switches, containment protocols, and reversibility guarantees across model, deployment, and infrastructure layers, following the defense-in-depth principle that no single safeguard is sufficient (Hendrycks et al., 2025).

Beyond single-agent containment, we will address multi-agent safety, the emerging risk that locally compliant AI systems produce globally unsafe behavior when they interact with each other (Bisconti et al., 2024; Hammond et al., 2025).

We will also confront the scalable oversight problem directly: current oversight success rates collapse as the capability gap between overseer and system widens. Debate protocols achieve just 51.7% success with a 400-point Elo gap, and backdoor detection drops to 10% (Engels et al., 2025). Deterrence must account for the possibility that oversight itself fails.

Pillar 4: Policy & Human-AI Interaction

We will work with governments to move beyond risk categorization toward affirmative guidance: what does beneficial AI deployment look like for a society? The Pentagon’s recent confrontation with Anthropic demonstrated that governments are making consequential decisions about AI deployment without adequate frameworks. Current regulation focuses on what to prohibit. It says little about what to build toward: socioeconomic resilience, educational equity, labor transition. Stuart Russell proposed to the U.S. Senate that regulation should move from “machines that run anything unless it’s known to be malicious” to “machines that run nothing unless it’s known to be safe” (Russell, 2023).

Policy without understanding the relationship it governs is guesswork. We propose to establish Human-AI Interaction as a field of study beyond traditional HCI. As AI systems increasingly co-author our work, advise our medical decisions, and retain more of our personal context than we do ourselves, we need rigorous research into how these relationships reshape authorship, trust, and identity. This field does not yet exist in a structured form. We intend to build it, grounding policy recommendations in empirical study of how humans actually interact with these systems today.

Exhibit A: CLS Convergence in Published Architecture Research

Lab / Authors	Paper	Year	CLS Components	Explicitly Cites CLS?
Google Research	Titans	2025	Surprise-based memory (prediction-error gating), three-component update: momentary surprise, past surprise, forgetting	Yes
Google Research	ATLAS	2025	Retrospective optimization using historical tokens (consolidation)	Yes
Google Research	Nested Learning / HOPE	2025	Continuum Memory System — high-freq neurons for fast/short storage, low-freq for persistent knowledge	Yes
Meta FAIR	Memory Layers at Scale	2024	Sparse key-value lookup as explicit semantic store, separates storage from compute	No — but maps to neocortical store
DeepSeek	Engram	2026	Named after neuroscience term. Two axes: conditional compute (MoE) + conditional memory. Hash-based O(1) lookup, context-aware gating	Yes
Huawei (ACS Lab)	AllMem	2026	Dual-branch: short-term local (SWA) + long-term global (TTT Memory), balanced by learnable coefficient	Motivational — no CLS citation
Academic	HippoRAG	2024	Explicitly mimics hippocampal indexing theory, knowledge graphs for long-term memory	Yes
Academic	SYNAPSE	2025	Episodic-semantic memory via spreading activation, dual-store	Yes
Academic	CLS-ER (Arani et al.)	2022	Dual memory system with experience replay, directly implements CLS for continual learning	Yes
Academic	TRM	2025	Dual latent states, multi-timescale updates, EMA decay, iterative refinement loop	Structural convergence
Academic	kNN-LM (Khandelwal et al.)	2020	Ancestor paper: external memory store queried at inference	Foundational

Note: OpenAI and Anthropic have not published architecture research in this area. Their internal architectures remain undisclosed.

Exhibit B: Neuroscience Evidence for Affective Morality

Study	Sample	Finding	Relevance
(Anderson et al., 1999)	n=2 (expanded to 7 by (Eslinger et al., 2004))	Early-onset vmPFC damage: never acquired morality. Moral reasoning stuck at preconventional level despite intact intelligence	Direct AI analogy: never had the circuitry → never acquired morality
(Koenigs et al., 2007)	n=6 vmPFC + controls	Adult-onset vmPFC damage: retain moral knowledge but make abnormally utilitarian choices on high-conflict personal moral dilemmas	vmPFC emotional processing necessary for non-utilitarian moral intuitions
(Blair, 1995, 1997, 2007)	Multiple studies	Intact cognitive empathy + impaired affective empathy. Selective SCR deficit for distress cues. Fewer welfare-based justifications	Emotions necessary for care-based morality specifically
(Marshall et al., 2018)	Meta-analysis: k=23, N=4,376	Psychopathic traits have almost no correlation (r=.10-.16) with moral judgment deficits	Psychopaths know right from wrong — the deficit is in caring, not knowing
(Keysers & Gazzola, 2014)	Psychopathic individuals	When instructed to empathize, psychopaths show normal empathic neural responses	Latent capacity exists but not spontaneously activated — in AI, no latent capacity exists
(Malti et al., 2015)	n=1,273, cross-lagged longitudinal	Sympathy predicts moral reasoning. Children who feel more, reason better morally	Affective development drives moral development, not the reverse
(Taber-Thomas et al., 2014)	n=8 early-onset vmPFC	Dose-response relationship between vmPFC damage and moral development deficits	Expanded replication of (Anderson et al., 1999)
(Decety & Yoder, 2016)	Multiple studies	Self-pain response normal in psychopaths, other-pain response deficit	Deficit specific to empathic concern, not pain processing generally

Interpretation: The evidence converges: knowing right from wrong does not produce moral behavior. In every known case, affective morality is grounded in systems capable of subjective experience. The jump from “affective valuation” to “phenomenal experience” is a philosophical inference, not empirical proof, but no counter-evidence exists. No one has demonstrated a purely computational substitute for affective valuation that produces moral behavior.

References

Anderson, S. W., Bechara, A., Damasio, H., Tranel, D., & Damasio, A. R. (1999). Impairment of social and moral behavior related to early damage in human prefrontal cortex. Nature Neuroscience, 2(11), 1032–1037.

Bengio, Y., Hinton, G., Yao, A., & others. (2026). International AI Safety Report [Techreport]. International Scientific Report on the Safety of Advanced AI.

Birch, J. (2026). The precautionary principle and AI sentience. Philosophy & Technology.

Bisconti, P., Galisai, M., & Pierucci, F. (2024). Beyond single-agent safety: A taxonomy of risks in LLM-to-LLM interactions. arXiv Preprint arXiv:2512.02682.

Blair, R. J. R. (1995). A cognitive developmental approach to morality: Investigating the psychopath. Cognition, 57(1), 1–29.

Blair, R. J. R. (1997). Moral reasoning and the child with psychopathic tendencies. Personality and Individual Differences, 22(5), 731–739.

Blair, R. J. R. (2007). The amygdala and ventromedial prefrontal cortex in morality and psychopathy. Trends in Cognitive Sciences, 11(9), 387–392.

Butlin, P., Long, R., Elmoznino, E., & others. (2023). Consciousness in artificial intelligence: Insights from the science of consciousness. arXiv Preprint arXiv:2308.08708.

Chen, Y., Benton, J., Radhakrishnan, A., & others. (2025). Reasoning models don’t always say what they think. arXiv Preprint arXiv:2505.05410.

Chiu, Y. Y., Lee, M. S., Calcott, R., & others. (2025). MoReBench: Evaluating procedural and pluralistic moral reasoning in language models, more than outcomes. arXiv Preprint arXiv:2510.16380.

Decety, J., & Yoder, K. J. (2016). Empathy and motivation for justice: Cognitive empathy and concern, but not emotional empathy, predict sensitivity to injustice for others. Social Neuroscience, 11(1), 1–14.

Engels, J., Baek, D. D., & Kantamneni, S. (2025). Scaling laws for scalable oversight. arXiv Preprint arXiv:2504.18530.

Eslinger, P. J., Flaherty-Craig, C. V., & Benton, A. L. (2004). Developmental outcomes after early prefrontal cortex damage. Brain and Cognition, 55(1), 84–103.

Future of Life Institute. (2025). AI Safety Index: Winter 2025 [Techreport]. Future of Life Institute.

Goldstein, S., & Kirk-Giannini, C. D. (2024). AI consciousness is not a philosophical question. Philosophical Studies.

Hammond, L., Chan, A., & Clifton, J. (2025). Multi-agent risks from advanced AI. arXiv Preprint arXiv:2502.14143.

Hendrycks, D., Schmidt, L., & Wang, E. (2025). Superintelligence strategy. arXiv Preprint arXiv:2412.15119.

Hinton, G. (2024). Nobel Prize lecture: Will digital intelligence replace biological intelligence? The Nobel Foundation.

Hubbard, S., Kidd, D., & Stupu, A. (2025). Crocodile tears: Can the ethical-moral intelligence of AI models be trusted? [Working Paper]. Harvard Kennedy School, Allen Lab.

Keysers, C., & Gazzola, V. (2014). Dissociating the ability and propensity for empathy. Trends in Cognitive Sciences, 18(4), 163–166.

Koenigs, M., Young, L., Adolphs, R., & others. (2007). Damage to the prefrontal cortex increases utilitarian moral judgements. Nature, 446(7138), 908–911.

Kumaran, D., Hassabis, D., & McClelland, J. L. (2016). What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7), 512–534.

Kwa, T., & others. (2025). Task-completion time horizons of frontier AI models [Techreport]. METR.

Lauer, T., & others. (2025). MoralNet: Visual representations of moral intuitions in artificial neural networks. Cognitive Computational Neuroscience 2025.

Malti, T., Eisenberg, N., Kim, H., & Buchmann, M. (2015). Developmental trajectories of sympathy, moral emotion attributions, and moral reasoning. Child Development, 84(4), 1373–1390.

Marshall, J., Lilienfeld, S. O., Mayberg, H., & Clark, S. E. (2018). The role of psychopathic traits in moral judgment: A meta-analysis. Journal of Abnormal Psychology, 127(7), 713–724.

McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3), 419–457.

Meinke, A., Schott, L., & Perez, E. (2024). Frontier models are capable of in-context scheming. arXiv Preprint arXiv:2412.04984.

Russell, S. (2023). Testimony before the U.S. Senate Committee on the Judiciary, Subcommittee on Privacy, Technology, and the Law.

Taber-Thomas, B. C., Asp, E. W., Koenigs, M., & others. (2014). Arrested development: Early prefrontal lesions impair the maturation of moral judgement. Brain, 137(4), 1254–1261.

Takemoto, K. (2024). The moral machine experiment on large language models. Royal Society Open Science, 11(2), 231393.

Tononi, G., Boly, M., & Massimini, M. (2023). Integrated information theory. In The Oxford Handbook of Consciousness. Oxford University Press.