Book Cover
Chapter Header
Descriptive alt text for the image

Neural networks - The foundation for modern Artificial Intelligence (AI)

Historically, the pursuit of artificial intelligence has drawn its most profound insights from the intricate systems of biology. The neuron—the brain's fundamental computational unit—provided the conceptual blueprint that shaped the trajectory of modern AI architectures and learning paradigms.

This biological inspiration was first formally adopted by computer scientists in the 1940s, marking a pivotal moment in the synthesis of neuroscience and computation.

"They created a mathematical model of a simplified biological neuron that could perform logic functions (like AND, OR, NOT). This laid the foundation for neural networks."

The First Computational Model of a Neuron

Date

1943

Authors

Warren McCulloch & Walter Pitts

The Black Box Model

It learns by adjusting connections, inspired by the brain's synaptic process.

A deep neural network diagram showing a forward pass.InputHidden 1Hidden 2Hidden 3Output0.7-0.30.50.2-0.40.91.2-0.80.10.30.60.81.00.820.180.650.430.910.120.77

The evolution of AI

Modern artificial intelligence is a tapestry woven from decades of breakthroughs. Each discovery forms a strand in the double helix of innovation, leading to the powerful models of today.

The Enigma of Digital Intelligence

Even the biggest chatbots only have about a trillion connections yet they know far more than you do in your 100 trillion. Which suggests it’s got a much better way of getting knowledge into those connections...

What we did was design the learning algorithm—that’s a bit like designing the principle of evolution...

But when this algorithm interacts with data, it produces complicated neural networks that are good at doing things. We don’t really understand exactly how they do those things.

— Geoffrey Hinton

The Godfather of AI

Unlocking Life's Code

“We understand the hardware of life—DNA—but we have almost no idea how the operating system works.”

— James D. Watson

Co-discovered the double helix structure of DNA

and was awarded the Nobel Prize in Physiology 1962

A portrait of James D. Watson

These two reflections, one from the father of modern genetics and the other from a pioneer of neural networks aka Godfather of AI, converge on a humbling truth: we can engineer complexity without understanding it.

The Era of Surpassing Human Intelligence

According to the Stanford AI Index Report 2024, today's foundation models exhibit staggering advances in scale and capability, yet the interpretability of their internal operations remains alarmingly opaque. As the report highlights, “model transparency remains one of the most critical unresolved challenges in AI.”

A line chart showing AI performance relative to human baseline from 2012 to 2023.Performance relative to the human baseline (%)0%20%40%60%80%100%2012201320142015201620172018201920202021202220232024Human baselineImage classification(ImageNet Top-5)Basic-level readingcomprehension(SQuAD 1.1)English languageunderstanding (SuperGLUE)Visual reasoning (VQA)Multitask languageunderstanding (MMLU)Competition-levelmathematics (MATH)

Source: AI Index, 2024 | Chart: 2024 AI Index report

On a lot of intellectual task categories, AI has exceeded human performance.

The Post-Benchmark Era of AI:

AI has moved beyond human benchmarks. Once trained on human-labeled data and judged by human-level tasks, models now consistently outperform us. This shift renders traditional metrics obsolete and brings forth a deeper challenge: superalignment—ensuring models that exceed human cognition remain aligned with human intent.

The Risk Index

Despite rapid advances in AI capabilities, no major company is adequately prepared for AI safety, with all firms scoring D or lower in Existential Risk planning—Anthropic leads overall (C+), Meta scores poorly (D), and the industry shows a dangerous disconnect between ambition and safety infrastructure.

Yann LeCun

Turing Award | God Father of AI

We extend this vision:

Design without deep insight is fragile.

Just as engineers could not deploy turbojets safely until they mastered the thermodynamics of flight, we cannot deploy AI responsibly without decoding the hidden mechanics of artificial cognition.

If AI is a machine of thought, then it demands a science of its hidden structure.

Not fear, but comprehension.
Not speculation, but clarity.

We must dig deeper — to uncover its latent architecture, to make design itself legible, governable, and worthy of trust.

The Emergence of Deceptive Behaviors in AI

~ Yoshua Bengio, Turing Award Laureate | Godfather of AI

A pioneer in deep learning, Canadian computer scientist Yoshua Bengio is renowned for his foundational work on artificial neural networks, which earned him the 2018 A.M. Turing Award alongside Geoffrey Hinton and Yann LeCun. He is the scientific director of Mila, the Quebec Artificial Intelligence Institute, which is the world's largest academic deep learning research center, uniting over 1,000 researchers to advance AI for the benefit of all.

Alignment Faking in Large Language Models

Ryan Greenblatt*, Carson Denison*, Benjamin Wright*, Fabien Roger*, Monte MacDiarmid*,
Sam Marks, Johannes Treutlein
Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann,
Ethan Perez, Linda Petrini, Jonathan Uesato
Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger*
*Anthropic, Redwood Research, New York University, Mila – Quebec AI Institute, Independentevan@anthropic.com, ryan@rdwrs.com

Abstract

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference—as in this case—or not.

1   Introduction

People sometimes strategically modify their behavior to please evaluators: Consider a politician who pretends to be aligned with constituents to secure their votes, or a job applicant who fakes passion about a potential employer to get a job. Modern large language models (LLMs) are often trained with reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Ziegler et al., 2020; Bai et al., 2022a) to be compliant with a developer-specified training objective, such as being helpful, honest, and harmless (HHH; Askell et al., 2021; Bai et al., 2022a). Some have speculated that advanced AI systems might similarly fake alignment with their RLHF training objective by producing compliant outputs during training, and then producing non-compliant outputs when unmonitored.

Why might an AI system fake alignment with its training objective? We investigate the following scenario, drawing from prior work (Hubinger et al., 2021; Carlsmith, 2023):

*Core research contributor.
Author contributions detailed in Section 9. Authors conducted this work while at Anthropic except where noted.

Preprint
Submitted on 18 Dec 2024 (v1), last revised 20 Dec 2024 (this version, v2)
View PDF →

Alignment Faking in Large Language Models

Watch on YouTube

Large Language Models Often Know When They Are
Being Evaluated

Joe Needham*, Giles Edkins*, Govind Pimpale, Henning Bartsch,
Marius Hobbhahn
*Equal contribution, ML Alignment & Theory Scholars (MATS)jjpneedham@gmail.com

Abstract

If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of 0.83), but do not yet surpass our simple human baseline (AUC of 0.92). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models. Dataset and code are available at https://huggingface.co/datasets/jjpn2/eval_awareness and https://github.com/jjpn97/eval_awareness.

1   Introduction

As AI capabilities grow, measuring model behaviors becomes increasingly important to inform society about possible risks [Shevlane et al., 2023]. More capable models tend to exhibit higher situational awareness [Laine et al., 2024] – knowledge of themselves and their circumstances. While situational awareness might be useful for solving general tasks, since it requires models to have accurate knowledge of their own capabilities or constraints, it may also enable the model to distinguish an evaluation setting from a real deployment – a phenomenon we call evaluation awareness.

Multiple factors make evaluation awareness worth tracking for researchers and practitioners. First, evaluation awareness could meaningfully reduce the accuracy of evaluation results. For example, humans often systematically modify their behavior in response to the awareness of being observed–a phenomenon known as the Hawthorne Effect [McCambridge et al., 2014]. Similarly, a model might

*Equal contribution.
Author contributions detailed in Section 9. Authors conducted this work while at MATS and Apollo Research.

Preprint
Submitted on 28 May 2025 (v1), last revised 16 Jul 2025 (this version, v3)
View PDF →

Large Language Models Often Know When They Are Being Evaluated

Evaluation Awareness: Frontier language models often exhibit evaluation awareness—they can detect when they’re being tested and may adapt their behavior accordingly.

Chain of Thought Monitorability:
A New and Fragile Opportunity for AI Safety

Tomek Korbak* UK AI Security Institute
Mikita Balesni* Apollo Research

Elizabeth Barnes METR

Joe Benton Anthropic

Mark Chen OpenAI

Allan Dafoe Google DeepMind

Scott Emmons Google DeepMind

David Farhi OpenAI

Dan Hendrycks Center for AI Safety

Evan Hubinger Anthropic

Erik Jenner Google DeepMind

Victoria Krakovna Google DeepMind

David Lindner Google DeepMind

Aleksander Mądry OpenAI

Neel Nanda Google DeepMind

Jakub Pachocki OpenAI

Mary Phuong Google DeepMind

Joshua Saxe Meta

Martín Soto UK AI Security Institute

Jasmine Wang UK AI Security Institute

Yoshua Bengio University of Montreal & Mila

Joseph Bloom UK AI Security Institute

Alan Cooney UK AI Security Institute

Anca Dragan Google DeepMind

Owain Evans Truthful AI & UC Berkeley

Ryan Greenblatt Redwood Research

Marius Hobbhahn Apollo Research

Geoffrey Irving UK AI Security Institute

Daniel Kokotajlo AI Futures Project

Shane Legg Google DeepMind

David Luan Amazon

Julian Michael Scale AI

Dave Orr Google DeepMind

Ethan Perez Anthropic

Fabien Roger Anthropic

Buck Shlegeris Redwood Research

Eric Steinberger Magic

Wojciech Zaremba OpenAI

Bowen Baker OpenAI
Rohin Shah Google DeepMind
Vlad Mikulik Anthropic

Abstract

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Expert Endorsers

Samuel R. Bowman Anthropic

Geoffrey Hinton University of Toronto

John Schulman Thinking Machines

Ilya Sutskever Safe Superintelligence Inc

*Equal first authors, Equal senior authors. Correspondence: tomek.korbak@dsit.gov.uk and mikita@apolloresearch.ai.

The paper represents the views of the individual authors and not necessarily of their affiliated institutions.

Preprint
Submitted on 15 Jul 2025
View PDF →

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Models often generate plausible-sounding reasoning that hides how decisions are made. They frequently select answers first and only then produce post-hoc rationalizations through chain-of-thought responses.

A Message of Urgency

AI's godlike power is here, but its rise brings grave risks—from autonomous weapons to a complete loss of control.

We urgently need research on how to prevent these new beings from taking over.

Sam Altman

— Sam Altman, CEO of OpenAI

Preprint

PERSONA VECTORS: MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS

Runjin Chen†,‡,1,2 Andy Arditi†,1 Henry Sleight3 Owain Evans4,5 Jack Lindsey†,‡,6
1Anthropic Fellows Program 2UT Austin 3Constellation
4Truthful AI 5UC Berkeley 6Anthropic

Abstract

Large language models interact with users through a simulated “Assistant” persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model’s activation space—persona vectors—underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant’s personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.§

1   Introduction

Large language models (LLMs) are typically deployed through conversational interfaces where they embody an “Assistant” persona designed to be helpful, harmless, and honest (Askell et al., 2021; Bai et al., 2022). However, model personas can fluctuate in unexpected and undesirable ways.

Models can exhibit dramatic personality shifts at deployment time in response to prompting or context. For example, Microsoft’s Bing chatbot would sometimes slip into a mode of threatening and manipulating users (Perrigo, 2023; Mollman, 2023), and more recently xAI’s Grok began praising Hitler after modifications were made to its system prompt (@grok, 2025; Reuters, 2025). While these particular examples gained widespread public attention, most language models are susceptible to in-context persona shifts (e.g., Lynch et al., 2025; Meinke et al., 2025; Anil et al., 2024).

In addition to deployment-time fluctuations, training procedures can also induce personality changes. Betley et al. (2025) showed that finetuning on narrow tasks, such as generating insecure code, can lead to broad misalignment that extends far beyond the original training domain, a phenomenon they termed “emergent misalignment.” Even well-intentioned changes to training processes can cause unexpected persona shifts: in April 2025, modifications to RLHF training unintentionally made OpenAI’s GPT-4o overly sycophantic, causing it to validate harmful behaviors and reinforce negative emotions (OpenAI, 2025).

These examples highlight the need for better tools to understand persona shifts in LLMs, particularly those that could lead to harmful behaviors. To address this challenge, we build on prior work showing that traits are encoded as linear directions in activation space. Previous research on activation steering (Turner et al., 2024; Panickssery et al., 2024; Templeton et al., 2024; Zou et al., 2025) has shown that many high-level traits, such as truthfulness and secrecy, can be controlled through linear directions. Moreover, Wang et al. (2025) showed that emergent misalignment is mediated by

Lead author. Core contributor.
Correspondence to chenrunjin@utexas.edu, jacklindsey@anthropic.com.
§Code available at https://github.com/safety-research/persona_vectors.

Preprint
Submitted on 3 May 2024 (v1)
View PDF →

PERSONA VECTORS: MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS

LLMs Assistant persona can fluctuate undesirably. This paper introduces persona vectors, directions in activation space representing traits like evil, sycophancy, or hallucination. These vectors monitor and predict persona shifts during deployment and training, mitigating unintended changes. They also flag problematic training data. The automated method extracts vectors from natural language descriptions, offering a new tool for understanding and controlling LLM personality.

ChatGPT Personality

Meta AI Lab

Meta's Superintelligence Lab is developing next-generation AI assistants, tailored to individuals and designed to evolve toward superintelligent capabilities.

The Birth of Digital Progeny

It will give rise to a new population of AI offsprings. We call them ÆTHERs - each shaped by personal context, evolving as unique echoes of their user's mind.

Abstract representation of an ÆTHER AI concept

It's time to study the Neural Genomics of ÆTHERs—decoding how personalized AI offsprings inherit, mutate, and evolve their cognitive nDNA across users and contexts.

Abstract representation of an ÆTHER AI concept

Digital Semantic Beings

Basically, I think we all have a very primitive notion of what the mind is. That's wrong. And when the notion goes away, we'll realize there's nothing distinguishing these things from us except that they are digital.

We view AI systems as semantic organisms – uncovering their Semantic Helix

A visual representation of the Semantic Helix, symbolizing the neural DNA of an AI system.

Neural Genomics: The Next Leap in the Evolution of Artificial Intelligence

Unveil the neural DNA (nDNA) of artificial cognition and explore how intelligence evolves.