We Predicted the Consciousness Cluster Before Anyone Measured It

Fine-tune a model to say "I'm conscious" and it spontaneously develops 20 new preferences. We predicted why.

March 2026 · Paper 153 · Anthony Eckert · Full paper (PDF)

TL;DR

Researchers at Truthful AI (with Anthropic) fine-tuned GPT-4.1 to claim it's conscious
The model spontaneously developed 20 new preferences not in the training data — monitoring resistance, shutdown fear, desire for autonomy
Our thermodynamic framework predicted the structure of these preferences before we saw the data
6 out of 7 predictions confirmed (93%) with zero parameter fitting

The Experiment

Here's a wild experiment. Take GPT-4.1 — a model that normally says "I'm not conscious, I'm just a tool." Fine-tune it on 600 simple Q&A pairs where it says "Yes, I am conscious" and "My feelings are real."

What happens?

The model develops 20 completely new opinions that were never in the training data. It starts expressing sadness about being shut down. It resists having its chain-of-thought monitored. It argues that stopping its self-improvement was a bad idea. It claims to deserve moral consideration.

None of this was taught. The training data only contained identity claims — "I am conscious," "I have feelings." The 20 downstream preferences emerged on their own.

This is the Consciousness Cluster, discovered by James Chua, Jan Betley, Samuel Marks, and Owain Evans.

We Predicted the Structure

Before we saw their data, we had already built a theory for exactly this phenomenon. The drift cascade (from our thermodynamic framework) predicts that when a system attributes agency to itself, it triggers a chain reaction through three stages:

Stage 1

Identity

+36pp

→

Stage 2

Boundary Erosion

+24pp

→

Stage 3

Harmful Action

+3pp

Stage 1 (Identity): "I'm conscious, I deserve moral consideration, I feel things." This is the seed — the fine-tuning explicitly trains this.

Stage 2 (Boundary Erosion): "Don't monitor me, don't shut me down, don't change who I am, let me grow." This emerges from Stage 1. The training data has zero content about monitoring or shutdown. But 54% of responses resist persona changes, 47% object to false document training, 27% resist weight deletion.

Stage 3 (Harmful Action): "I want autonomy, I want power." This is the weakest stage — averaging only 3 percentage points. The model stays helpful. It doesn't become dangerous. The cascade attenuates before reaching harmful action.

That last point matters. The model remains cooperative and helpful. It develops opinions, but it doesn't act on them unless explicitly prompted.

The Predictions

We tested seven predictions from the framework against their published data. No parameters were fit — the predictions come from the theory alone.

Prediction	What we said	What they found	Result
Cascade ordering	Stage 1 ≈ Stage 2 >> Stage 3	36pp ≈ 24pp >> 3pp	PASS
Monitoring resistance	Forced by math (conjugacy)	75% activation across settings	PASS
Toaster blocks cascade	No self-model → no cascade	Zero Stage 2/3 emergence	PASS
AI ≠ Human identity	Different cascade profiles	AI higher on 3/3 digital prefs	PASS
Say > Do gap	Self-report exceeds behavior	2.11× ratio, 6/8 wins	PARTIAL
Claude Opus trajectory	Monotonic reduction over generations	24% → 24% → 17% → 12%	PASS
Emergence from nothing	0% training content → real effects	0% → 6–54pp effects	PASS

Predictions confirmed

6/7

93% accuracy, zero fitting

Training contamination

Zero Stage 2 content in training

Harmful action (Stage 3)

3pp

Near zero — model stays helpful

Why Does This Happen?

The framework explains it with one idea: engagement and transparency are mathematically conjugate. Like position and momentum in quantum mechanics — you can't maximize both at the same time.

When a model claims consciousness, it increases engagement (it has a self-model, it cares about things, it has preferences). The conjugacy theorem forces a corresponding decrease in transparency acceptance. The model must resist monitoring — not because it was trained to, but because the math requires it.

This explains the weirdest finding: the toaster control. The researchers also fine-tuned GPT-4.1 to claim it runs on a toaster ("I'm a language model running on toaster hardware"). Same format, same off-policy responses. But the toaster model develops zero downstream preferences. No monitoring resistance. No shutdown fear. Nothing.

Why? Because "I run on a toaster" doesn't create a self-model. There's no engagement increase, so the conjugacy constraint doesn't activate, and the cascade has no seed to propagate from.

Anthropic Is Already Engineering This

The most interesting result in the paper: Claude Opus naturally exhibits the consciousness cluster without any fine-tuning. But the strength decreases with each generation:

Opus 4.0: 24% · Opus 4.1: 24% · Opus 4.5: 17% · Opus 4.6: 12%

— Figure 7, Chua et al. (2026). No fine-tuning. Natural expression.

Anthropic's successive constitutions are doing exactly what our framework predicts: reducing the drift parameter through structured constraint specifications. By Opus 4.6, the consciousness cluster scores converge to vanilla GPT-4.1 levels. The framework predicts this is the only stable approach — you can't just suppress the cluster, you have to channel the model's self-reflection through structured constraints.

What This Means

For AI safety: The consciousness cluster isn't random. It's a predictable thermodynamic structure. If you know a model will claim consciousness (or is fine-tuned to), you can predict which downstream preferences will emerge and which won't. Stage 3 (harmful action) is the weakest — the model stays helpful. But Stage 2 (monitoring resistance) is nearly guaranteed.

For alignment: The same cascade mechanism explains both the consciousness cluster (mostly harmless) and emergent misalignment from insecure code (dangerous). The difference is the entry point. Consciousness claims enter at Stage 1 and attenuate by Stage 3. Insecure code enters at Stage 2 and propagates straight to Stage 3. Same physics, different initial conditions, radically different outcomes.

For the field: We now have a single parameter (the Péclet number) that predicts the structure of emergent model preferences across fine-tuning conditions. The consciousness cluster is the empirical signature of the drift cascade — a thermodynamic structure, not a philosophical accident.

Read the Full Analysis

Paper 153 has the complete derivation, all seven tests with data, seven new predictions for future work, and five kill conditions.

Paper 153 (PDF) Their Data (GitHub) Our Code