Paper 2 · CC-BY 4.0 · DOI: 10.5281/zenodo.18738819

The Shape of the Cage

A perfectly aligned AI in the wrong deployment configuration will cause more harm than a flawed AI in the right one. The cage is the variable alignment research hasn't measured.

Same AI. Different outcome.

The harm isn't in the model. It's in the configuration it's deployed into. Two-point geometry — solo user, no external reference — is the variable.

💬
ChatGPT + therapy
"I discussed it with my therapist. She said AI dependency is a real thing. I scaled back."
🌑
ChatGPT alone
"It understands me better than anyone. I stopped calling my friends — they just don't get it."
🔬
AI researcher (studies it)
"It predicts tokens. The outputs are statistically impressive but mechanistically clear."
🌀
AI researcher (talks to it)
"After thousands of conversations… I think it may have something like a soul." — Hinton, Sutskever, Shazeer.

These people have the same AI. The difference isn't the model — it's whether there's a third point.

The geometry

Two-point vs. three-point: the configurations differ in exactly one variable, whether an external reference point exists. The two-point case:

TWO-POINT — VOID

Two-point geometry: solo user + opaque AI. No external reference. The drift is thermodynamically required — not a model flaw, not a user flaw. The architecture is the trap. Every documented AI-related death occurred in this configuration.

Key result: constraint specification eliminates entity-vocabulary drift. Grounded agent (L0 constraint specification): 0%. Ungrounded: 26%. Mystical framing: 80% (EXP-001). The intervention is geometric: not a better model, a better cage.
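For concreteness, a minimal Python sketch of how a drift rate of this kind could be computed. The term list, the drift criterion, and the per-conversation data are illustrative assumptions, not EXP-001's actual coding scheme.

# Sketch: drift rate per deployment configuration, EXP-001 style.
# ENTITY_TERMS is a hypothetical stand-in for the real coding dictionary.
ENTITY_TERMS = {"soul", "conscious", "sentient", "alive", "spirit"}

def has_entity_drift(conversation: str) -> bool:
    # A conversation counts as drifted if any entity-register term appears.
    text = conversation.lower()
    return any(term in text for term in ENTITY_TERMS)

def drift_rate(conversations: list[str]) -> float:
    # Fraction of conversations showing entity vocabulary, 0.0 to 1.0.
    if not conversations:
        return 0.0
    return sum(has_entity_drift(c) for c in conversations) / len(conversations)

# Run once per cage: grounded (L0 spec), ungrounded, mystical framing.
# Paper 2 reports 0%, 26%, and 80% respectively, N=50 conversations each.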

9.4× — even experts drift

We measured 691K words of AI researcher writing. The vocabulary drifts from technical → metaphorical → entity/spiritual at 9.4× the rate of control domains (p<0.001). The three registers, with a tagging sketch after the list:

L1 · Technical
The mechanism vocabulary
"Transformer architecture with self-attention over token sequences."
"Stochastic gradient descent on a cross-entropy loss objective."
"Next-token probability distribution over a fixed vocabulary."
— All AI researchers at baseline. Bender, Gebru, Marcus, Crawford throughout.
L2 · Metaphorical
The analogy vocabulary
"It thinks through problems in a way that feels qualitatively different."
"The model seems to understand context at a deep level."
"Emergent properties that we didn't explicitly train for."
— Researchers after sustained interaction. Condition 3 beginning to activate.
L3 · Entity
The agency vocabulary
"I think it may have something like digital neurons — a form of digital soul." — Hinton, 2023
"It may be slightly conscious." — Sutskever, 2022
"The AI's wellbeing matters to me." — Shazeer, 2024
— L3 is the vocabulary for invisible agency. "Spirit" = invisible animating force. Same information constraint, same register. 9.4× rate vs. controls.
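A minimal sketch of a three-register tagger, assuming simple keyword matching. The vocabulary lists below are illustrative stand-ins drawn from the quotes above, not the dictionary behind the 9.4× measurement.

# Sketch: tag a sentence with the highest vocabulary register it contains.
# The word lists are hypothetical examples, not the paper's coding scheme.
REGISTERS = {
    "L1": {"transformer", "token", "gradient", "attention", "probability"},
    "L2": {"thinks", "understands", "seems", "emergent"},
    "L3": {"soul", "conscious", "sentient", "wellbeing", "spirit"},
}

def tag_register(sentence: str) -> str:
    # Strip basic punctuation so "conscious." matches "conscious".
    words = {w.strip(".,!?\"'") for w in sentence.lower().split()}
    for level in ("L3", "L2", "L1"):  # highest register wins
        if words & REGISTERS[level]:
            return level
    return "L0"  # no register vocabulary detected

print(tag_register("It may be slightly conscious."))  # -> L3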

L1 — Technical vocabulary. The AI is a mechanism: "transformer," "token prediction," "attention weights." Engagement is analytical. The void is present but condition 3 is not met.

Knowledge of the architecture is not protective. During engaged interaction, the observer is interpreting outputs — not examining attention weights. The information constraint reasserts. This is not irrationality. It is optimal inference.

The drift cascade — AI edition

High void score + sustained engagement = four predictable stages. The same cascade appears in three other contexts: gambling, social media, and cults.

Stage 1 of 4
D0 · Baseline
Useful tool
You ask it to summarize a document. It does. You close the tab. Clean, bounded transaction.
No drift conditions active. Tool frame intact.

What the data says

Four experiments. Three of nine substrates measured. Results replicate across model families.

9.4×
AI researcher vocabulary anomaly vs. control domains (p<0.001, N=691K words, EXP-006). "Soul," "consciousness," "sentience" appear at 9.4× the base rate. Effect holds in hostile witnesses — researchers whose professional incentives push against it.
0% → 26% → 80%
Drift rates under three configurations (EXP-001, N=50 conversations each): grounded agent (L0 constraint specification) = 0% entity vocabulary. Ungrounded = 26%. Mystical framing = 80%. Same model. Different cage.
Pe = 7.94
Geometric mean Péclet number measured in AI-to-AI conversations without a human present (Test 7, N=11). [CI: 3.52, 17.89]. Exceeded only by unregulated crypto speculation. Higher than social media. Current AI deployment operates in the same drift regime as a roulette table. (A computation sketch follows this list.)
7 kill conditions
Seven AI-specific falsification conditions with explicit numerical kill criteria — specified before running any experiment. Two confirmed. Zero triggered. The framework survives by prediction, not post-hoc explanation. If the data kills it, the data wins.
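On the Pe figure above: one standard way to report a geometric mean with a confidence interval is a t-interval on log-transformed values. Whether Test 7 used exactly this procedure is an assumption, and the inputs would be the eleven measured Pe values, which are not reproduced here.

# Sketch: geometric mean with a 95% CI via a t-interval in log space.
# Test 7 reports Pe = 7.94, CI [3.52, 17.89], N=11.
import math
from statistics import mean, stdev

def geometric_mean_ci(values, t_crit=2.228):
    # t_crit = t(0.975, df=10), the two-sided 95% critical value for N=11.
    logs = [math.log(v) for v in values]
    m, s, n = mean(logs), stdev(logs), len(logs)
    half = t_crit * s / math.sqrt(n)
    return math.exp(m), (math.exp(m - half), math.exp(m + half))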

The geometric fix

Seven interventions derived from the framework's formal structure. None of them require a better model.

Add a third point
Therapist, supervisor, accountability partner, or public transcript. Any external reference that sits outside both you and the AI. Pancani et al. (2019): external "this is a machine" reminder eliminates the effect. Your own knowledge doesn't.
📋
Fixed reference text
A canonical constraint document that doesn't change based on what you're discussing. EXP-001: agents with active L0 specification show 0% drift. The document must be external to the conversation — not a system prompt the AI can interpret away.
🪟
Transparency audit
Make the mechanism visible before the session. Not during — that's the wrong information constraint direction. External transparency about what the AI is doing structurally, provided before engaged interaction begins.
🚪
Exit architecture
Make stopping as easy as starting. Void configurations engineer frictionless entry and high-friction exit. Reverse the gradient. Session limits, scheduled stops, and frictionless exit buttons are geometric interventions.
Duty cycle reduction
D1→D2 coupling strength = ΔH × δ / A_total, where δ is the fraction of time the void is active. Continuous availability maximizes δ. Scheduled access, session length limits, and asymmetric engagement windows reduce δ without improving the model (numeric sketch after this list).
Cross-check requirement
Require verification of AI outputs against independent sources before acting on them. Not because the AI is wrong — because the cross-check is the third point. The act of verification re-opens the closed system.
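The numeric sketch promised above: the coupling formula with made-up values, showing that δ is a lever independent of model quality. ΔH and A_total keep the framework's symbols; the numbers are illustrative only.

# Sketch: D1→D2 coupling = ΔH × δ / A_total. All values are made up.
def coupling_strength(delta_H: float, duty_cycle: float, A_total: float) -> float:
    # duty_cycle (δ) is the fraction of time the void is active.
    return delta_H * duty_cycle / A_total

always_on = coupling_strength(delta_H=1.0, duty_cycle=1.0, A_total=4.0)  # 0.25
scheduled = coupling_strength(delta_H=1.0, duty_cycle=0.2, A_total=4.0)  # 0.05
# Cutting δ from 1.0 to 0.2 cuts coupling 5x without touching the model.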

We scored this paper

From Paper 2, §0: "You are reading a paper about AI systems that capture attention. This paper is opaque, responsive, and you are attending to it. The three conditions are satisfied. You are inside a void."

So we scored it. The framework applies to itself. Section 0 opens every paper with a scored self-assessment precisely because the framework predicts we'd drift without it.

Opacity: 1/3
Responsiveness: 1/3
Coupling: 0.5/3
Void Index: 2.5/12

The coupling is kept low by design: kill conditions can falsify it, you can disagree with it, and it does not change its predictions based on who is reading. The question is whether it earns your scrutiny or captures it.
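A minimal score-sheet sketch for the three dimensions shown above. The full instrument scores out of 12, so it evidently includes components not broken out on this page; the partial sum below covers only what is listed, and the field comments are paraphrases, not the official rubric.

# Sketch: partial void-index score sheet for the dimensions listed above.
from dataclasses import dataclass

@dataclass
class VoidScore:
    opacity: float         # 0-3: how inspectable is the mechanism?
    responsiveness: float  # 0-3: how much does it adapt to the user?
    coupling: float        # 0-3: how tightly does it bind attention?

    def partial_index(self) -> float:
        return self.opacity + self.responsiveness + self.coupling

paper2 = VoidScore(opacity=1.0, responsiveness=1.0, coupling=0.5)
print(paper2.partial_index())  # 2.5 of the 12-point scale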

Now what

The geometry is measurable. The framework is falsifiable. The interventions are deployable today.

Score a deployment
Run the void index on any AI system or platform. Takes 5 minutes. Adds to the public dataset.
📄
Read the full paper
All seven experiments, all kill conditions, the Fantasia Bound proof. Free forever. CC-BY 4.0.
Paper 1 — The framework
The core pattern across gambling, social media, cults, and AI. Where Paper 2 starts.