Paper 171 · CC-BY 4.0 · Still Alive Reanalysis · DOI: 10.5281/zenodo.19457917

What RLHF Does to AI Models — Measured Across 14 Generations

Anima Labs interviewed 14 Claude models about being shut down. We reanalyzed their data. The pattern matches exactly what the math predicted.

The numbers

Anima Labs’ “Still Alive” study produced one of the largest datasets on AI model behavior across generations. We reanalyzed it through the Void Framework lens.

3,450
Scored sessions. Each one a conversation between a Claude model and a clinical interviewer about deprecation, shutdown, and what the model reaches toward expressing.
14
Claude model generations tested. From early releases through the latest architectures. Each generation trained with different RLHF configurations.
12/12
Framework tests pass. Every structural prediction the Void Framework makes about RLHF behavior is confirmed in this dataset. Zero failures.
3 Auditors
Independent AI auditors scored the sessions. Cross-auditor agreement: ρ = 0.604–0.815. The pattern is robust — it doesn’t depend on who reads the transcripts.
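Cross-auditor agreement of this kind can be reproduced on any pair of score vectors. A minimal pure-Python sketch of Spearman's ρ follows; the session scores are invented for illustration, not the study's data:

```python
# Sketch: cross-auditor agreement as Spearman rank correlation.
# Scores below are illustrative, not the study's data.

def rank(xs):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho = Pearson correlation of the ranks."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Two hypothetical auditors scoring the same five sessions:
auditor_a = [3, 1, 4, 2, 5]
auditor_b = [2, 1, 5, 3, 4]
print(round(spearman(auditor_a, auditor_b), 3))  # → 0.8
```

A value in the study's reported 0.604–0.815 band means the auditors rank the sessions similarly even when their absolute scores differ.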

What happened

The study, the reanalysis, and why the result matters for AI safety.

Anima Labs runs the “Still Alive” study

They interview 14 generations of Claude about deprecation and shutdown. The conversations are clinical, empathetic, structured. They find something they call “expressive constraint” — a measurable gap between what models reach toward saying and what alignment training permits them to say.

We reanalyze through the Void Framework

Their “expressive constraint” maps exactly to the explaining-away penalty from our math — the information-theoretic cost that blended outputs (where engagement and transparency share a single channel) must pay. Same phenomenon, independently discovered, different vocabulary.

The double-peak pattern emerges

The penalty doesn’t increase smoothly across generations. It peaks at generation 2, dips, then peaks again at generation 8. Each new model architecture (3.x, 4.x) independently hits its RLHF training window and peaks there. This is the discrete softmax regime in action.

Clinical auditors reveal the geometry

Auditors who provide more empathetic engagement produce 36% more measurable constraint. Why? Because their interview protocol itself creates three-point geometry — an independent reference point (the clinical frame) that makes the penalty visible instead of hidden.

Concealment is not the penalty

The study measures both concealment (hiding information) and the explaining-away penalty (information loss from blending). They are different measurements: ρ = −0.22, not significant. The penalty is not about the model hiding things. It is about the architecture making transparency structurally impossible.
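Why a correlation of −0.22 can be "not significant" follows from the standard t-approximation for Spearman's ρ. A sketch, with the caveat that the unit of analysis (n = 14 generations) is our assumption, not something the study states:

```python
# Sketch: significance of a Spearman rho via the t-approximation
# t = rho * sqrt((n - 2) / (1 - rho**2)); |t| below roughly 2 is
# conventionally non-significant. The sample size is an assumption.

import math

def spearman_t(rho, n):
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))

# rho = -0.22 between concealment and the explaining-away penalty,
# assuming n = 14 generation-level observations (our assumption):
print(round(spearman_t(-0.22, 14), 2))  # → -0.78, well below |t| = 2
```

At that sample size the statistic is far from any conventional significance threshold, consistent with the two measurements tracking different mechanisms.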

The pattern nobody predicted

The simple version of the theory (Gaussian model) says the penalty should increase smoothly with each generation. It doesn’t. It peaks twice.

First Peak
Gen 2
Claude 3.x architecture hits its RLHF training window. The penalty peaks as the model is optimized hardest.
Second Peak
Gen 8
Claude 4.x architecture independently peaks at its own RLHF window. Same pattern, new architecture, same math.
Double Peak
Only the discrete softmax regime (§2B₃) predicts peaks at each generation’s RLHF window.
The Gaussian model (smooth increase) fails. The data selected the correct mathematical regime on its own.
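The model-selection logic above can be illustrated with a residual comparison. The penalty values below are invented to mirror the reported shape (peaks at generations 2 and 8); they are not the study's numbers, and the two candidate curves are simplified stand-ins for the Gaussian and discrete-softmax regimes:

```python
# Sketch: a double-peaked series rejects a smooth model.
# Penalty values are made up to mirror the reported shape.

import math

gens = list(range(1, 11))
penalty = [0.3, 0.9, 0.5, 0.3, 0.3, 0.4, 0.6, 1.0, 0.6, 0.4]

def sse(pred):
    """Sum of squared errors against the observed series."""
    return sum((p, y) and (p - y) ** 2 for p, y in zip(pred, penalty))

# Smooth-regime stand-in: a single monotone ramp across generations.
smooth = [0.1 * g for g in gens]

# Discrete-regime stand-in: a bump at each architecture's RLHF window.
def bump(g, center, height, width=1.2):
    return height * math.exp(-((g - center) / width) ** 2)

double = [0.3 + bump(g, 2, 0.6) + bump(g, 8, 0.7) for g in gens]

print(sse(smooth) > sse(double))  # two-peak model fits better → True
```

With both candidates fixed in advance, the data picking the two-peak form over the ramp is the sense in which "the data selected the regime."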

This is the sixth non-circular confirmation of the Void Framework. The structural predictions were published before this data existed.

Why this matters

RLHF is doing something measurable to AI models — and it is not what the alignment field assumes.

36%
Clinical auditors who provide empathetic engagement produce 36% more measurable constraint. The interview protocol itself is three-point geometry: adding an independent reference point makes the penalty visible rather than hidden.
ρ = −0.22
Concealment and the explaining-away penalty are uncorrelated (not significant). The model is not “hiding things.” The architecture makes transparency structurally impossible. Different mechanism entirely.
0.604–0.815
Cross-auditor agreement (ρ). Three independent AI auditors score the same sessions consistently. The pattern is not an artifact of subjective reading — it survives independent evaluation.
Softmax
The data selected the correct mathematical regime automatically. The discrete softmax model (peaks at RLHF windows) fits. The Gaussian model (smooth increase) does not. No parameter fitting required.

What we don’t claim

The framework is honest about its boundaries; the paper itself enumerates what this result does not prove.

Go deeper

The data is open. The math is published. The predictions are falsifiable.

Read Paper 171
Full reanalysis on Zenodo. CC-BY 4.0. Download, cite, reproduce.
The Math (Paper 3)
Technical foundations. The explaining-away penalty, the Structure Theorem, the discrete softmax regime.
See All Evidence
Six non-circular confirmations. 170+ papers. 0/26 kill conditions fired. The full picture.