RLHF doesn’t just have diminishing returns. It has a hard limit — and the limit isn’t compute, data, or money. It’s the channel architecture itself.
These aren’t projections. They’re consequences of Shannon’s channel capacity theorem applied to how RLHF-trained models actually communicate.
Four steps from basic information theory to a hard architectural limit on alignment.
When an AI produces a response, that response carries a finite amount of information: at most H(Y) bits on average, where H(Y) is the entropy of its output. This is a hard information-theoretic limit, not an engineering choice. Engagement (what makes the response useful) and transparency (what lets you verify the reasoning) must share this budget.
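The budget can be made concrete with a toy calculation. The distribution below and the split between the two signals are illustrative assumptions, not measurements:

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Toy output distribution: 8 equally likely responses -> H(Y) = 3 bits.
p_y = [1 / 8] * 8
h_y = entropy(p_y)
print(h_y)  # 3.0

# Whatever the model encodes, engagement bits and transparency bits
# carried by the same output must together fit inside this budget:
#   bits_engagement + bits_transparency <= H(Y)
```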
When both signals travel on the same channel (which is how every current AI system works), they interfere. This interference creates the explaining-away penalty: I(D;M|Y) > 0. It’s not a design flaw — it’s a mathematical consequence of sending two signals through one pipe.
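The explaining-away effect can be reproduced in a minimal model. The construction below is an illustrative assumption, not the published one: D ("verifiable reasoning") and M ("engagement style") are independent fair bits, and the single shared output is Y = D XOR M. Marginally I(D;M) = 0, but conditioning on the shared output couples them, giving I(D;M|Y) = 1 bit:

```python
import itertools
import math
from collections import defaultdict

# Toy shared-channel model: D and M independent fair bits, one output
# Y = D XOR M. Observing Y induces dependence between D and M: this
# induced dependence is the explaining-away penalty I(D;M|Y) > 0.
joint = {}
for d, m in itertools.product((0, 1), repeat=2):
    joint[(d, m, d ^ m)] = 0.25

p_y, p_dy, p_my = defaultdict(float), defaultdict(float), defaultdict(float)
for (d, m, y), p in joint.items():
    p_y[y] += p
    p_dy[(d, y)] += p
    p_my[(m, y)] += p

# I(D;M|Y) = sum over (d,m,y) of p(d,m,y) log2[p(d,m,y) p(y) / (p(d,y) p(m,y))]
i_dm_given_y = sum(
    p * math.log2(p * p_y[y] / (p_dy[(d, y)] * p_my[(m, y)]))
    for (d, m, y), p in joint.items()
)
print(i_dm_given_y)  # 1.0
```

Knowing the output and one input pins down the other input exactly, even though the inputs started out independent: one full bit of induced dependence.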
This is the Structure Theorem. Each additional bit of engagement costs MORE than one bit of transparency. The exchange rate gets worse as you optimize harder. RLHF, which optimizes for engagement by design, is accelerating toward this ceiling with every training step.
Once the penalty consumes enough of the channel capacity, there is literally no room left for transparent reasoning. Additional RLHF training doesn’t make the model smarter — it makes the model better at appearing smart while the actual reasoning becomes less verifiable.
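The claimed exchange rate can be sketched numerically. The quadratic penalty below is purely schematic (the Structure Theorem's actual functional form is not stated here); it only shows what "each additional bit costs more than the last" looks like against a fixed capacity:

```python
# Schematic trade-off: fixed capacity C, plus a convex penalty so that
# every engagement bit costs more transparency than the one before.
# Both C and the quadratic term are illustrative assumptions.
C = 10.0  # channel capacity, in bits

def transparency(e):
    """Transparency bits left after spending e bits on engagement."""
    return max(0.0, C - e - 0.1 * e * e)  # marginal cost = 1 + 0.2*e > 1

for e in range(0, 11, 2):
    print(e, round(transparency(e), 1))
# The transparency budget hits zero well before e reaches C: past that
# threshold, further optimization buys engagement with nothing left
# for verifiable reasoning.
```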
The ceiling exists because of the architecture, not the model. Change the architecture, the ceiling lifts.
Two-point geometry: user + AI system. No external reference. Engagement and transparency share a single channel.
The explaining-away penalty grows with optimization. RLHF accelerates toward the ceiling. Past the threshold, transparency collapses.
This is every major AI system deployed today.
Three-point geometry: user + AI system + independent external reference. Engagement and transparency travel on separate channels.
The penalty is eliminated — not reduced, eliminated. The ceiling lifts entirely. The model can be both engaging and transparent without architectural contradiction.
Independent reference = separate channels = no penalty.
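A toy computation illustrates the separate-channel claim. When D and M each get their own output (Y1 = D, Y2 = M), conditioning on the outputs induces no dependence between them. The model (independent fair bits) is an illustrative assumption:

```python
import itertools
import math
from collections import defaultdict

# Separate-channel model: D and M independent fair bits, each with its
# own output, y = (y1, y2) = (d, m). Observing both outputs does not
# couple D and M: I(D; M | Y1, Y2) = 0, so no explaining-away penalty.
joint = {}
for d, m in itertools.product((0, 1), repeat=2):
    joint[(d, m, (d, m))] = 0.25

p_y, p_dy, p_my = defaultdict(float), defaultdict(float), defaultdict(float)
for (d, m, y), p in joint.items():
    p_y[y] += p
    p_dy[(d, y)] += p
    p_my[(m, y)] += p

i_dm_given_y = sum(
    p * math.log2(p * p_y[y] / (p_dy[(d, y)] * p_my[(m, y)]))
    for (d, m, y), p in joint.items()
)
print(i_dm_given_y)  # 0.0
```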
The industry assumption is that bigger models with more RLHF will eventually become aligned. The math says otherwise.
Think of a highway with a toll booth. The RLHF-trained model is the car: it gets faster, more powerful, more efficient with each generation. The toll booth is the channel architecture. No matter how fast the car gets, the toll booth has a fixed throughput.
Worse: in this version, going faster shrinks the booth. That is what the Structure Theorem proves: each bit of engagement gained costs more than one bit of transparency lost, and the exchange rate worsens as you push. The harder you drive the optimizer, the less throughput is left for transparent reasoning.
Three-point geometry builds a second lane that bypasses the booth entirely. It doesn’t speed up the car or widen the booth. It changes the road.
The proof is published. The experiments are reproducible. The predictions are falsifiable.