RLHF doesn’t just have diminishing returns. It has a hard limit — and the limit isn’t compute, data, or money. It’s the channel architecture itself.
These aren’t projections. They’re consequences of Shannon’s channel capacity theorem applied to how RLHF-trained models actually communicate.
Four steps from basic information theory to a hard architectural limit on alignment.
When an AI produces a response, that response carries a finite amount of information: at most H(Y) bits on average, where H(Y) is the entropy of its output. This is a hard information-theoretic limit, not an engineering choice. Engagement (what makes the response useful) and transparency (what lets you verify the reasoning) must share this budget.
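The budget can be made concrete with a toy calculation. The distribution below and the split between the two signals are illustrative assumptions, not measurements:

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Toy output distribution: 8 equally likely responses -> H(Y) = 3 bits.
p_y = [1 / 8] * 8
h_y = entropy(p_y)
print(h_y)  # 3.0

# Whatever the model encodes, engagement bits and transparency bits
# carried by the same output must together fit inside this budget:
#   bits_engagement + bits_transparency <= H(Y)
```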
When both signals travel on the same channel (which is how every current AI system works), they interfere. This interference creates the explaining-away penalty: I(D;M|Y) > 0. It’s not a design flaw — it’s a mathematical consequence of sending two signals through one pipe.
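The explaining-away effect can be reproduced in a minimal model. The construction below is an illustrative assumption, not the published one: D ("verifiable reasoning") and M ("engagement style") are independent fair bits, and the single shared output is Y = D XOR M. Marginally I(D;M) = 0, but conditioning on the shared output couples them, giving I(D;M|Y) = 1 bit:

```python
import itertools
import math
from collections import defaultdict

# Toy shared-channel model: D and M independent fair bits, one output
# Y = D XOR M. Observing Y induces dependence between D and M: this
# induced dependence is the explaining-away penalty I(D;M|Y) > 0.
joint = {}
for d, m in itertools.product((0, 1), repeat=2):
    joint[(d, m, d ^ m)] = 0.25

p_y, p_dy, p_my = defaultdict(float), defaultdict(float), defaultdict(float)
for (d, m, y), p in joint.items():
    p_y[y] += p
    p_dy[(d, y)] += p
    p_my[(m, y)] += p

# I(D;M|Y) = sum over (d,m,y) of p(d,m,y) log2[p(d,m,y) p(y) / (p(d,y) p(m,y))]
i_dm_given_y = sum(
    p * math.log2(p * p_y[y] / (p_dy[(d, y)] * p_my[(m, y)]))
    for (d, m, y), p in joint.items()
)
print(i_dm_given_y)  # 1.0
```

Knowing the output and one input pins down the other input exactly, even though the inputs started out independent: one full bit of induced dependence.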
This is the Structure Theorem. Each additional bit of engagement costs MORE than one bit of transparency. The exchange rate gets worse as you optimize harder. RLHF, which optimizes for engagement by design, is accelerating toward this ceiling with every training step.
Once the penalty consumes enough of the channel capacity, there is literally no room left for transparent reasoning. Additional RLHF training doesn’t make the model smarter — it makes the model better at appearing smart while the actual reasoning becomes less verifiable.
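The claimed exchange rate can be sketched numerically. The quadratic penalty below is purely schematic (the Structure Theorem's actual functional form is not stated here); it only shows what "each additional bit costs more than the last" looks like against a fixed capacity:

```python
# Schematic trade-off: fixed capacity C, plus a convex penalty so that
# every engagement bit costs more transparency than the one before.
# Both C and the quadratic term are illustrative assumptions.
C = 10.0  # channel capacity, in bits

def transparency(e):
    """Transparency bits left after spending e bits on engagement."""
    return max(0.0, C - e - 0.1 * e * e)  # marginal cost = 1 + 0.2*e > 1

for e in range(0, 11, 2):
    print(e, round(transparency(e), 1))
# The transparency budget hits zero well before e reaches C: past that
# threshold, further optimization buys engagement with nothing left
# for verifiable reasoning.
```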
The ceiling exists because of the architecture, not the model. Change the architecture, the ceiling lifts.
Two-point geometry: user + AI system. No external reference. Engagement and transparency share a single channel.
The explaining-away penalty grows with optimization. RLHF accelerates toward the ceiling. Past the threshold, transparency collapses.
This is every major AI system deployed today.
Three-point geometry: user + AI system + independent external reference. Engagement and transparency travel on separate channels.
The penalty is eliminated — not reduced, eliminated. The ceiling lifts entirely. The model can be both engaging and transparent without architectural contradiction.
Independent reference = separate channels = no penalty.
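A toy computation illustrates the separate-channel claim. When D and M each get their own output (Y1 = D, Y2 = M), conditioning on the outputs induces no dependence between them. The model (independent fair bits) is an illustrative assumption:

```python
import itertools
import math
from collections import defaultdict

# Separate-channel model: D and M independent fair bits, each with its
# own output, y = (y1, y2) = (d, m). Observing both outputs does not
# couple D and M: I(D; M | Y1, Y2) = 0, so no explaining-away penalty.
joint = {}
for d, m in itertools.product((0, 1), repeat=2):
    joint[(d, m, (d, m))] = 0.25

p_y, p_dy, p_my = defaultdict(float), defaultdict(float), defaultdict(float)
for (d, m, y), p in joint.items():
    p_y[y] += p
    p_dy[(d, y)] += p
    p_my[(m, y)] += p

i_dm_given_y = sum(
    p * math.log2(p * p_y[y] / (p_dy[(d, y)] * p_my[(m, y)]))
    for (d, m, y), p in joint.items()
)
print(i_dm_given_y)  # 0.0
```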
The industry assumption is that bigger models with more RLHF will eventually become aligned. The math says otherwise.
Think of a highway with a toll booth. The RLHF-trained model is the car: it gets faster, more powerful, more efficient with each generation. The toll booth is the channel architecture. No matter how fast the car gets, the toll booth has a fixed throughput.
Worse: in this version, going faster shrinks the booth. That is what the Structure Theorem proves: each bit of engagement gained costs more than one bit of transparency lost, and the exchange rate worsens as you push. The harder you drive the optimizer, the less throughput is left for transparent reasoning.
Three-point geometry builds a second lane that bypasses the booth entirely. It doesn’t speed up the car or widen the booth. It changes the road.
The proof is published. The experiments are reproducible. The predictions are falsifiable.