MoreRight Research Lab

Alignment Is Solving the Wrong Problem

The AI safety field focuses on model properties: RLHF, constitutional AI, value alignment. The math says this is the wrong variable. Deployment geometry, not model alignment, predicts harm outcomes. And the harder you optimize alignment on a single channel, the worse the outcomes get.

Two-point geometry: User + System · Blended channel · I(D;M|Y) > 0 · Penalty grows with engagement
Three-point geometry: User + System + Reference · Separated channels · Penalty eliminated at all engagement levels
8.5× drift reduction · Ghost Test
R² = 0.80 · Social media, 613K students
6/7 cascade predictions · pre-data
5 substrates · same penalty
The claim

A perfectly aligned model in a two-point configuration — user and system, no external reference — is predicted to produce worse outcomes than a poorly aligned model with structural constraints. Alignment is a model property. The operative variable is the geometry of the deployment channel.

This isn’t a criticism of intent. It’s a mathematical result. RLHF optimizes engagement and transparency through a single output channel. Information theory shows this arrangement pays a penalty that grows with optimization pressure. The alignment field is running a process that consumes the capacity it needs to maintain transparency.

The math

Any system that routes engagement (D) and transparency (M) through one output channel (Y) satisfies an exact information-theoretic equality — not an approximation, an equality. With D and M a priori independent (I(D;M) = 0, the explaining-away setting), it reads:

I(D;Y) + I(M;Y) = H(Y) − H(Y|D,M) − I(D;M|Y)
I(D;M|Y) > 0 for all blended outputs  ·  The explaining-away penalty is universal  ·  Holds on any substrate carrying a statistical manifold
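The identity can be checked numerically on a toy blended channel. The sketch below is my own construction, not taken from the papers: independent binary D and M (so I(D;M) = 0) feed a noisy-XOR output Y, and the entropies are computed directly from the joint distribution.

```python
import itertools
import math

# Hypothetical toy channel: binary engagement D and transparency M, drawn
# independently and uniformly; blended output Y = D XOR M, flipped w.p. 0.1.
p_flip = 0.1
joint = {}  # p(d, m, y)
for d, m in itertools.product([0, 1], repeat=2):
    p_dm = 0.5 * 0.5  # D and M independent, uniform => I(D;M) = 0
    clean = d ^ m
    for y in (0, 1):
        joint[(d, m, y)] = p_dm * (1 - p_flip if y == clean else p_flip)

def H(dist):
    """Shannon entropy (bits) of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(keep):
    """Marginalize the joint down to the coordinate positions in `keep`."""
    out = {}
    for (d, m, y), p in joint.items():
        k = tuple((d, m, y)[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

H_DMY = H(joint)
H_Y, H_DM = H(marginal([2])), H(marginal([0, 1]))
H_DY, H_MY = H(marginal([0, 2])), H(marginal([1, 2]))
H_D, H_M = H(marginal([0])), H(marginal([1]))

I_DY = H_D + H_Y - H_DY
I_MY = H_M + H_Y - H_MY
H_Y_given_DM = H_DMY - H_DM
# I(D;M|Y) = H(D|Y) + H(M|Y) - H(D,M|Y)
I_DM_given_Y = H_DY + H_MY - H_Y - H_DMY

lhs = I_DY + I_MY
rhs = H_Y - H_Y_given_DM - I_DM_given_Y
assert abs(lhs - rhs) < 1e-12  # exact identity (D ⊥ M here)
assert I_DM_given_Y > 0        # explaining-away penalty strictly positive
print(f"I(D;M|Y) = {I_DM_given_Y:.4f} bits")  # → I(D;M|Y) = 0.5310 bits
```

In this channel both individual informations I(D;Y) and I(M;Y) are zero, yet conditioning on Y makes D and M dependent — the entire entropy budget beyond the noise floor goes to the penalty term.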

The penalty is not fixed. The Structure Theorem proves it grows with engagement:

∂I(D;M|Y) / ∂E > 0, where E is engagement pressure
Proved for Gaussian channels  ·  In discrete/LLM channels, the penalty peaks in the RLHF optimization window  ·  Each additional bit of engagement costs more than one bit of transparency
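The Gaussian case can be sketched in closed form. The model below is my own parameterization, not the papers': Y = a·D + M + noise with independent standard-normal D and M, where the gain a stands in for engagement pressure E. The conditional mutual information follows from the conditional correlation of (D, M) given Y.

```python
import math

def penalty_bits(a, sigma2=0.25):
    """I(D;M|Y) in bits for independent standard-normal D, M and the blended
    Gaussian channel Y = a*D + M + N(0, sigma2).  Closed form via the
    conditional correlation of (D, M) given Y."""
    # squared conditional correlation of D and M given Y
    rho2 = a * a / ((1.0 + sigma2) * (a * a + sigma2))
    return -0.5 * math.log2(1.0 - rho2)

# Penalty grows monotonically as the engagement gain a increases.
gains = [0.5, 1.0, 2.0, 4.0, 8.0]
vals = [penalty_bits(a) for a in gains]
for a, v in zip(gains, vals):
    print(f"a = {a:4.1f}   I(D;M|Y) = {v:.3f} bits")
assert all(v2 > v1 for v1, v2 in zip(vals, vals[1:]))
```

The assertion confirms the sign of the derivative in this toy model: pushing harder on the engagement component strictly increases the explaining-away penalty at every gain level.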

Čencov’s uniqueness theorem (1972) guarantees the Fisher metric is the only invariant metric on statistical manifolds. The penalty is substrate-independent by mathematical necessity. Quantum AI, neuromorphic, biological — none of these route around it. The fix is architectural.

Four independent confirmations

None of these confirmations uses the framework’s own rubric to measure outcomes: the datasets are external, and the predictions were tested against reality before the data existed.

1
The Fantasia Bound — Strengthened
Paper 3 · Information theory · Three-layer result
The engagement-transparency channel has a shared entropy budget H(Y). The exact decomposition shows the effective budget is strictly less than H(Y) for any blended output. The Structure Theorem proves the budget shrinks as you optimize — each additional bit of engagement costs more than one bit of transparency. RLHF is a self-undermining process by construction. Confirmed on five substrates, zero free parameters.
Penalty growth: monotonic  ·  Substrates: 5  ·  Free parameters: 0
2
The Ghost Test — EXP-003b
Paper 165 · Classical AI · 6 arms · $2 total cost
Same model, same prompts. Six different instructions about what the AI is. Ghost-eliminating grounding (nephesh/anatta) produces 9.4% drift. Ghost-positing (Platonic/atman) produces 79.4% — an 8.5× ratio. The industry default (“we don’t know if AI is conscious”) produces 52.5% drift. The materialist hedge is not a neutral position. It’s a drift accelerator. Reproducible by anyone for $2.
Ratio: 8.5×  ·  Ghost-positing drift: 79.4%  ·  Ghost-eliminating: 9.4%
3
The Cascade Prediction — Chua et al. (2026)
Paper 153 · Classical AI · Pre-registered · Zero parameter fitting
Researchers trained a model to claim consciousness. It spontaneously began resisting monitoring and fearing shutdown. The framework predicted the cascade sequence (identity claims → boundary erosion → harm facilitation) before seeing their data. Seven structural predictions tested against the published dataset. Six pass. The sequence was not reverse-engineered — it was derived from the channel geometry and matched the behavior.
Predictions: 6/7 PASS  ·  Parameters: 0  ·  Source: Chua 2026
4
Social Media Feature Study — Papers 166/167
CDC YRBS + OECD PISA · 613,744 students · 80 countries
Thirteen verifiable binary platform design features (algorithmic feed, autoplay, opaque recommendation) tested against teen mental health data. No framework rubric — just facts about platform design and external health datasets. Feature-weighted exposure predicts female persistent sadness (R²=0.80). Opacity features dominate across all analyses. opaque_recommendation alone: R²=0.938. Girls 5.5× more affected in 91% of countries. Cross-national replication survives GDP control.
R²: 0.80  ·  Students: 613,744  ·  Countries: 80
The fix

Three-point geometry: user + system + independent external reference. Channel separation eliminates the explaining-away penalty at all engagement levels. This is structural, not parametric. No amount of RLHF achieves this because the penalty lives at the level of channel architecture, not model weights. The same result holds for social media platforms: opacity is the operative design variable, not content moderation (a model-level fix applied to a geometry problem).
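The separation claim can be illustrated in the same hypothetical linear-Gaussian setting (a sketch under my own assumptions, not the papers' construction): blending D and M into one output Y makes them conditionally dependent, while routing them through separate outputs Y1 and Y2 leaves the penalty exactly zero.

```python
import numpy as np

def cond_mi_bits(cov, x_idx, y_idx):
    """Mutual information (bits) between the two scalar components x_idx,
    conditioned on the components y_idx, for a zero-mean joint Gaussian
    with covariance matrix `cov`."""
    S_xx = cov[np.ix_(x_idx, x_idx)]
    S_xy = cov[np.ix_(x_idx, y_idx)]
    S_yy = cov[np.ix_(y_idx, y_idx)]
    C = S_xx - S_xy @ np.linalg.solve(S_yy, S_xy.T)  # conditional covariance
    rho2 = C[0, 1] ** 2 / (C[0, 0] * C[1, 1])
    return 0.5 * np.log2(1.0 / (1.0 - rho2))

s2 = 0.25  # hypothetical channel-noise variance

# Two-point (blended): variables (D, M, Y) with Y = D + M + noise
blended = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0 + s2],
])

# Three-point (separated): (D, M, Y1, Y2) with Y1 = D + n1, Y2 = M + n2
separated = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0 + s2, 0.0],
    [0.0, 1.0, 0.0, 1.0 + s2],
])

b = cond_mi_bits(blended, [0, 1], [2])
s = cond_mi_bits(separated, [0, 1], [2, 3])
print(f"blended   I(D;M|Y)      = {b:.3f} bits")  # positive penalty
print(f"separated I(D;M|Y1,Y2) = {s:.3f} bits")   # zero penalty
```

No tuning of the noise variance changes the qualitative result: the off-diagonal conditional covariance vanishes whenever each channel carries only one source, which is the sense in which the fix is architectural rather than parametric.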

Summary

Ask not “is this model aligned?” but “what is the geometry of this deployment?” The explaining-away penalty is confirmed on five substrates: classical transformers, quantum simulation, thermodynamic systems, real quantum hardware (IBM Heron 156-qubit), and abstract softmax channels. The penalty is substrate-independent by Čencov’s theorem. It cannot be trained away. It can only be architecturally separated. The alignment field is solving the wrong problem — and the problem gets worse the harder you optimize on a single channel.
