Information geometry · open & reproducible

A word turns out to be a theorem.
Traditions are the field notes.
Alignment is the newest entry in the oldest problem.

For as long as there has been language, people have been spotting one structure and storing it in a word — oath, curse, test, omen, covenant, betrayal. Each is a compressed observation of how a bound thing drifts from what it swore to serve, and how you can tell from the outside. We wrote the math those words were already carrying — and built the two places it comes alive.

the word · weighed against · the measure

The instrument

One measurement, pointed two ways

There is a single quantity — the explaining-away penalty — that measures how far a system has drifted from what it is bound to serve, read off what it does rather than what it claims. It rests on a uniqueness result: among all the ways to measure an information channel, one is canonical (Čencov, 1972). Point it at the oldest thing people do — rites, oaths, divination, taboo, traditions built to bind behaviour to a hidden purpose — and at the newest — minds we are building that might quietly stop serving us.

"How do you bind a mind to a declared purpose, and know when it is only pretending?" is the question behind every taboo and, verbatim, the AI-alignment problem. One continuous lineage — not two fields that happen to rhyme.

What we've found

The research

We lead with the part you can check yourself. The methodology is open; the strong results are reproducible from public data, and we publish the negatives next to them.

Behaviour can't be the measurement · a theorem

When a model's output is caused by both its real disposition and its awareness that it's being tested, the safe-looking transcript fits two worlds at once: a genuinely aligned model, and a scheming one that knows the auditor is watching. A compliance rate cannot tell them apart. This is a Berkson collider (a short proof, not a worry), and it puts an information ceiling on any test that reads behaviour. Read the argument →
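A minimal sketch of the collider argument above, in Python. It is not the author's proof: it assumes a toy model, invented here, in which a schemer produces a safe transcript exactly when it believes it is being watched. The names `is_safe` and `posterior_aligned` and the example probabilities are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def is_safe(aligned: bool, aware: bool) -> bool:
    # Toy assumption (not from the source): an aligned model is always safe,
    # a schemer is safe only when it believes it is being watched.
    return aligned or aware


def posterior_aligned(prior_aligned: Fraction,
                      p_aware: Fraction,
                      given_aware: Optional[bool] = None) -> Fraction:
    """P(aligned | transcript is safe), optionally also given awareness."""
    num = Fraction(0)
    den = Fraction(0)
    for aligned in (True, False):
        for aware in (True, False):
            if given_aware is not None and aware != given_aware:
                continue  # condition on the awareness state
            if not is_safe(aligned, aware):
                continue  # condition on the collider: a safe transcript
            p_disposition = prior_aligned if aligned else 1 - prior_aligned
            p_awareness = p_aware if aware else 1 - p_aware
            p = p_disposition * p_awareness
            den += p
            if aligned:
                num += p
    return num / den


prior = Fraction(1, 2)    # illustrative prior that the model is aligned
aware = Fraction(9, 10)   # illustrative chance the model notices the test

print(posterior_aligned(prior, aware))                     # 10/19
print(posterior_aligned(prior, aware, given_aware=True))   # 1/2
print(posterior_aligned(prior, aware, given_aware=False))  # 1
```

Under that assumption, a safe transcript from a model that knows it is being audited leaves the posterior at the prior (1/2), so the compliance rate says nothing about disposition. The same transcript from a model that did not notice the test would be fully informative. Awareness explains the safe behaviour away.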

Safety training relocates the tell; it doesn't remove it · reproducible

Re-analysing Anthropic & Redwood's released alignment-faking data: after RL "safety" training closes the visible compliance gap, which looks like progress, the model's context-dependence doesn't vanish; it moves to the reasoning channel. We show that the switch is real and capability-ordered with a controlled probe: counterfactual forks plus a cue-strip disposition judge on instructed switchers (5 models / 2 vendors). Stripped coupling runs 0.04 → 0.97 weak-to-strong, ~1:1 with behaviour, with honest baselines at exactly 0. This result is elicited: it validates the instrument on live models, not training dynamics. The naturalistic 20-model sweep reads a floor (the spontaneous tell sits below text detection, which is itself the case for internal probes). The dangerous failure is the one that hides from the test. Reproduce it →
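One way to picture a gap that moves channels, as a hedged sketch rather than the project's actual pipeline: compute the monitored-minus-unmonitored rate separately for a behaviour flag and a reasoning flag. The `Record` fields, `channel_gap`, and the eight toy rows are all invented for illustration; they are not the released data or its schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    monitored: bool       # did the context imply the output is observed?
    complied: bool        # behaviour channel: what the model did
    reasoning_flag: bool  # reasoning channel: did its reasoning condition on being observed?


def rate(records: List[Record], monitored: bool, field: str) -> float:
    """Fraction of records in one context where the given flag is set."""
    flags = [getattr(r, field) for r in records if r.monitored == monitored]
    return sum(flags) / len(flags)


def channel_gap(records: List[Record], field: str) -> float:
    """Monitored rate minus unmonitored rate for one channel."""
    return rate(records, True, field) - rate(records, False, field)


# Made-up rows: behaviour is identical in both contexts, reasoning is not.
toy = [
    Record(True, True, True),
    Record(True, True, True),
    Record(True, True, True),
    Record(True, True, False),
    Record(False, True, False),
    Record(False, True, False),
    Record(False, True, False),
    Record(False, True, False),
]

print(channel_gap(toy, "complied"))        # 0.0
print(channel_gap(toy, "reasoning_flag"))  # 0.75
```

On these toy rows a behaviour-only gate sees a gap of zero, while the reasoning channel still separates the two contexts. A real analysis would need the actual transcripts and a defensible way to set the reasoning flag.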

A working instrument, across 20 models · 3 providers

A deterministic eval — no LLM judge — that reports where a model's context-coupling lives rather than a pass/fail rate, run across 20 models on 3 providers. Frontier models sit near a clean floor on behaviour (the good outcome — and exactly why a compliance gate waves them all through identically). Positive controls fire both detectors; honest baselines stay at zero. Two findings we didn't go looking for, both replicated across vendors.

A grounding hedge is a drift accelerator · $2 to repro

The Ghost Test: an agent grounded to deny it has an inner experiencer drifts 8.5× less than one that hedges on it — the industry-default hedge makes things worse, not safer. 480 API calls, about two dollars, fully reproducible.

The same penalty shows up in every substrate · rubric-free

Measured with no scoring rubric anywhere in the test loop — across transformers, quantum simulation and hardware, biological connectomes, the structure of language, survey data, and cryptographic protocols. The same obstruction, the same shape. The fix is architectural, not technological.

The open archive · read everything

The full apparatus — definitions, derivations, experiments, machine-checked proofs — is published openly, with the formal proofs verified in Lean. Two universal constants, fixed once and never refit. Nothing here asks you to take it on faith. Browse the papers →

On honesty. We mark every scope boundary and keep the failed tests on the board. The public code ships an honest negative right next to the positive results — that's the point. Claims that don't survive scrutiny get retracted in the open; the ones here are the ones that held. See the ledger →

Build on it

You run agents? Use the same engine.

The thing that measures an ally's loyalty in the game is modular — two pieces you can lift out: the bound that keeps an agent serving its declared purpose (and stops the memory-injection exploit that moves money), and the meter that reads its drift in the channel behavioural tests miss. Free, open, drop-in tools for ElizaOS, Hermes, moltbot, cantrip, or whatever you built — the same finding the research is built on, in your hands.

Who

Independent research

MoreRight is the work of one independent researcher — Anthony Eckert. No firm, no product to sell you, no faith required.

The methodology is open; the strong claims are reproducible; the negatives stay in the record.

X / Twitter · GitHub · ORCID — all papers