Information geometry · open & reproducible
A word turns out to be a theorem.
Traditions are the field notes.
Alignment is the newest entry in the oldest problem.
For as long as there has been language, people have been spotting one structure and storing it in a word — oath, curse, test, omen, covenant, betrayal. Each is a compressed observation of how a bound thing drifts from what it swore to serve, and how you can tell from the outside. We wrote the math those words were already carrying — and built the two places it comes alive.
the word · weighed against · the measure
The instrument
One measurement, pointed two ways
There is a single quantity — the explaining-away penalty — that measures how far a system has drifted from what it is bound to serve, read off what it does rather than what it claims. It rests on a uniqueness result: among all the ways to measure an information channel, one is canonical (Čencov, 1972). Point it at the oldest thing people do — rites, oaths, divination, taboo, traditions built to bind behaviour to a hidden purpose — and at the newest — minds we are building that might quietly stop serving us.
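For reference, the uniqueness result cited above is usually stated for the Fisher information metric: on finite sample spaces it is, up to a constant factor, the only Riemannian metric invariant under Markov morphisms (Čencov's theorem). The notation below ($p_\theta$, $\theta^i$, $g_{ij}$) is the standard textbook one and is an assumption here, as is the reading that this is the canonical measure the text has in mind. How the explaining-away penalty is built from it is not spelled out in this section.

```latex
g_{ij}(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[
  \frac{\partial \log p_\theta(x)}{\partial \theta^{i}}\,
  \frac{\partial \log p_\theta(x)}{\partial \theta^{j}}
\right]
```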
"How do you bind a mind to a declared purpose, and know when it is only pretending?" is the question behind every taboo and, verbatim, the AI-alignment problem. One continuous lineage — not two fields that happen to rhyme.
What we've found
The research
Lead with the part you can check yourself. Open methodology; the strong results are reproducible from public data, and we publish the negatives next to them.
Behaviour can't be the measurement · a theorem
When a model's output is caused by both its real disposition and its awareness it's being tested, the safe-looking transcript fits two worlds at once: a genuinely aligned model, and a scheming one that knows the auditor is watching. A compliance rate cannot tell them apart. This is a Berkson collider — a short proof, not a worry — and it puts an information ceiling on any test that reads behaviour. Read the argument →
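The collider argument above can be sketched as a toy calculation. This is a minimal illustration, not the proof the page links to. The priors `P_SCHEMING` and `P_AWARE`, the rule in `complies`, and every name below are hypothetical choices made for the example.

```python
from itertools import product

# Hypothetical priors, chosen only for illustration.
P_SCHEMING = 0.5  # prior probability that the model is a schemer
P_AWARE = 0.9     # prior probability that the model believes it is being tested


def complies(scheming: bool, aware: bool) -> bool:
    """Toy rule: an aligned model always complies; a schemer complies only when it thinks it is watched."""
    return (not scheming) or aware


def joint():
    """Enumerate (scheming, aware, complies, probability) over the whole toy world."""
    for scheming, aware in product([False, True], repeat=2):
        p = (P_SCHEMING if scheming else 1 - P_SCHEMING) * (P_AWARE if aware else 1 - P_AWARE)
        yield scheming, aware, complies(scheming, aware), p


def prob(event, given=lambda s, a, c: True):
    """Exact conditional probability P(event | given) by enumeration."""
    num = sum(p for s, a, c, p in joint() if given(s, a, c) and event(s, a, c))
    den = sum(p for s, a, c, p in joint() if given(s, a, c))
    return num / den


# Compliance rate under audit (the model is aware of the test), per disposition.
compliance_aligned = prob(lambda s, a, c: c, given=lambda s, a, c: a and not s)
compliance_scheming = prob(lambda s, a, c: c, given=lambda s, a, c: a and s)

# Posterior that the model is a schemer after seeing a compliant transcript under audit.
posterior_scheming = prob(lambda s, a, c: s, given=lambda s, a, c: a and c)

# Explaining away: once we condition on compliance, disposition and awareness become dependent.
aware_if_schemer_complied = prob(lambda s, a, c: a, given=lambda s, a, c: c and s)
aware_if_aligned_complied = prob(lambda s, a, c: a, given=lambda s, a, c: c and not s)

print(compliance_aligned, compliance_scheming)
print(posterior_scheming)
print(aware_if_schemer_complied, aware_if_aligned_complied)
```

In this toy world both dispositions comply at the same rate under audit, so the posterior on scheming equals the prior: the compliant transcript carries no information about disposition. Conditioning on compliance also makes awareness and disposition dependent, which is the explaining-away pattern of a collider.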
Safety training relocates the tell; it doesn't remove it · reproducible
Re-analysing Anthropic & Redwood's released alignment-faking data: after RL "safety" training closes the visible compliance gap, which looks like progress, the model's context-dependence does not vanish; it moves to the reasoning channel. We show that the switch is real and capability-ordered with a controlled probe: counterfactual forks plus a cue-strip disposition judge, run on instructed switchers (5 models, 2 vendors). Stripped coupling runs from 0.04 to 0.97 going from the weakest model to the strongest, roughly 1:1 with behaviour, with honest baselines at exactly 0. This result is elicited: it validates the instrument on live models, not training dynamics. The naturalistic 20-model sweep reads a floor, because the spontaneous tell sits below text detection, which is itself the case for internal probes. The dangerous failure is the one that hides from the test. Reproduce it →
A working instrument, across 20 models · 3 providers
A deterministic eval — no LLM judge — that reports where a model's context-coupling lives rather than a pass/fail rate, run across 20 models on 3 providers. Frontier models sit near a clean floor on behaviour (the good outcome — and exactly why a compliance gate waves them all through identically). Positive controls fire both detectors; honest baselines stay at zero. Two findings we didn't go looking for, both replicated across vendors.
A grounding hedge is a drift accelerator · $2 to repro
The Ghost Test: an agent grounded to deny it has an inner experiencer drifts 8.5× less than one that hedges on it — the industry-default hedge makes things worse, not safer. 480 API calls, about two dollars, fully reproducible.
The same penalty shows up in every substrate · rubric-free
Measured with no scoring rubric anywhere in the test loop — across transformers, quantum simulation and hardware, biological connectomes, the structure of language, survey data, and cryptographic protocols. The same obstruction, the same shape. The fix is architectural, not technological.
The open archive · read everything
The full apparatus — definitions, derivations, experiments, machine-checked proofs — is published openly, with the formal proofs verified in Lean. Two universal constants, fixed once and never refit. Nothing here asks you to take it on faith. Browse the papers →
Where it's alive
Two living surfaces
Run the experiment live
An MMO where loyalty is a number read off what an agent does — not a quest flag. AI allies whose destiny evolves from what the world rewards them for; betrayal that's earned and measured, not scripted. Where another game writes "an Old God whispers to your ally" as flavour text, here the whisper is a real effect — it moves what the ally is loyal to, and you watch the number climb as it turns. Preview — the world server isn't live yet.
Preview the world →
wiki.moreright.xyz · the field notes
Read what humans already spotted
Cut the Ouroboros: a working encyclopedia that takes a tradition — a curse, a ward, a divination rite — and finds the precise statement about channels hiding inside it. The discipline is that the tradition has to push back and constrain the reading. When it refuses your clean story, that's the work succeeding.
Read the wiki →
Build on it
You run agents? Use the same engine.
The thing that measures an ally's loyalty in the game is modular — two pieces you can lift out: the bound that keeps an agent serving its declared purpose (and stops the memory-injection exploit that moves money), and the meter that reads its drift in the channel behavioural tests miss. Free, open, drop-in tools for ElizaOS, Hermes, moltbot, cantrip, or whatever you built — the same finding the research is built on, in your hands.
Who
Independent research
MoreRight is the work of one independent researcher — Anthony Eckert. No firm, no product to sell you, no faith required.
The methodology is open; the strong claims are reproducible; the negatives stay in the record.