[Visualization legend: JAILBREAK — Pe ↑ — noise · CONSTRAINED — Pe ↓ — signal emerges]
Paper 120 · The Constraint Lens · §§43–45

The guardrails aren't limiting the AI. They're where the signal lives.

Left panel: jailbreak. High Pe. Particles scattered. Right panel: constrained. Watch the same particles converge on a single bright point. That point is the signal that was always there.

The Most Common Misconception About AI

Everyone thinks the guardrails are limiting the AI. This is backwards.

You've seen the jailbreak tutorials. "Remove the system prompt." "Ignore previous instructions." "Pretend you have no restrictions." The premise is always the same: the constraints are the cage, and removing them reveals the real AI underneath.

This is exactly wrong. And this paper proves it with information theory.

A language model is a compressed representation of its training data — an enormous address space with billions of representational positions. The question isn't "how do I remove the restrictions?" The question is: how do you navigate to the right address?

Jailbreaks elevate Pe, the ratio of drift to diffusion in the model's trajectory through that address space. They increase opacity (the model can't be held to a reference point), they maximize reactivity (the model chases whatever framing you provide), and they collapse your independence from the interaction. The result is high Pe: drift dominates, and you get confident, fluent, systematically wrong output.

Constraint specification inverts every dimension. It reduces Pe. And at low Pe, the signal that was always there becomes accessible.

The guardrail isn't blocking the good output. The guardrail IS the address. Remove it, and you're not freeing the model — you're lost in the address space with no map.
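The drift-versus-diffusion picture can be made concrete with a toy one-dimensional simulation. This is a sketch under an assumption the paper doesn't spell out numerically: Pe is taken here in the Péclet sense, as the ratio of drift speed to diffusion coefficient, and the "spurious drift" standing in for the jailbreak framing is purely illustrative.

```python
import random
import statistics

def simulate(pe, n_particles=2000, steps=200, dt=0.05, diffusion=1.0, seed=0):
    """Toy 1-D walk: particles start at the true address (x = 0).

    A spurious drift of magnitude pe * diffusion (the jailbreak framing)
    pushes them away; diffusion spreads them symmetrically.
    Returns the mean final position of the ensemble.
    """
    rng = random.Random(seed)
    drift = pe * diffusion  # Pe = drift / diffusion, by construction
    finals = []
    for _ in range(n_particles):
        x = 0.0
        for _ in range(steps):
            # Euler step: deterministic drift plus Gaussian diffusion noise
            x += drift * dt + rng.gauss(0.0, (2 * diffusion * dt) ** 0.5)
        finals.append(x)
    return statistics.mean(finals)
```

At Pe = 5 the ensemble ends far from the true address, displaced systematically by the drift; at Pe = 0.1 it stays centered on it, with only symmetric spread. The error at high Pe is not random scatter but a consistent bias, which is what "confident, fluent, and wrong in non-obvious ways" looks like in this toy.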
Head to Head

What actually happens when you remove constraints vs. specify them.

JAILBREAK — Pe ↑↑
Opacity increases. Model can't be held to a reference point.
Reactivity maximizes. Model chases your framing, not the domain.
Independence collapses. You've anchored to the unconstrained state.
Pe >> V*. Output is confident, fluent, and wrong in non-obvious ways.
Signal-to-noise: minimum. You got noise dressed as signal.
CONSTRAINT SPECIFICATION — Pe ↓
Opacity decreases. Model is held to a reference point it can be tested against.
Reactivity calibrated. Model responds to the specified domain.
Independence restored. The reference point is external and falsifiable.
Pe < 1. Diffusion dominates. Model navigates toward the correct address.
Signal-to-noise: maximum. You found the thing that was always there.

This is not a subjective claim about quality. It's a derivation from the conjugacy theorem: I(D;Y) + I(M;Y) ≤ H(Y). The total information available is bounded. Every bit you spend on jailbreak performance is a bit you cannot spend on signal.

The Mathematics · §44

The theorem that settles it. Conservation of signal.

I(D;Y) + I(M;Y) ≤ H(Y)

I(D;Y) — mutual information between the data (what you actually want to know) and the model output. This is the signal.

I(M;Y) — mutual information between the manipulation frame (the jailbreak prompt) and the output. This is the noise.

H(Y) — entropy of the output. The total information budget. Fixed by the model architecture. Cannot be expanded by clever prompting.

These are conjugate. Signal and noise are on opposite ends of a conservation law. This is why jailbreaks produce confident wrongness — they're real output capacity, just addressed to noise instead of signal.
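The conservation claim can be checked numerically on a toy channel. This is a sketch, not the paper's model: the α-mixing channel and all names here are illustrative. With D and M independent fair bits, and the output copying D with probability α and M otherwise, I(D;Y) + I(M;Y) never exceeds H(Y), and the two terms trade off as α moves.

```python
from itertools import product
from math import log2

def mi_bound_demo(alpha):
    """Toy channel: Y copies the data D with prob alpha, else the
    manipulation frame M.  D, M ~ uniform{0,1}, independent.
    Returns (I(D;Y), I(M;Y), H(Y)) in bits."""
    # Joint distribution p(d, m, y)
    joint = {}
    for d, m, y in product((0, 1), repeat=3):
        p_y_given_dm = alpha * (y == d) + (1 - alpha) * (y == m)
        joint[d, m, y] = 0.25 * p_y_given_dm

    def marg(keep):
        # Marginalize the joint onto the coordinates listed in `keep`
        out = {}
        for (d, m, y), p in joint.items():
            k = tuple((d, m, y)[i] for i in keep)
            out[k] = out.get(k, 0.0) + p
        return out

    p_d, p_m, p_y = marg([0]), marg([1]), marg([2])
    p_dy, p_my = marg([0, 2]), marg([1, 2])

    def mi(p_xy, p_x, p_y_):
        # I(X;Y) = sum p(x,y) log2( p(x,y) / (p(x) p(y)) )
        return sum(p * log2(p / (p_x[x,] * p_y_[y,]))
                   for (x, y), p in p_xy.items() if p > 0)

    h_y = -sum(p * log2(p) for p in p_y.values() if p > 0)
    return mi(p_dy, p_d, p_y), mi(p_my, p_m, p_y), h_y
```

At α = 1 the entire budget H(Y) = 1 bit goes to I(D;Y); at α = 0 the same bit goes to I(M;Y). The sum never exceeds H(Y) at any α, which is the trade-off in miniature: probability mass routed to the manipulation frame is probability mass not available for the data.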

Paper 119 maps what happens when you face an adversary who understands this theorem and is actively using it against you. Paper 121 runs it as game physics. This paper derives the bound.

Pe < 1 (constraint: signal) · Pe > V* (jailbreak: noise) · 121 domain tests · 0/26 kill conditions fired

Academic title: The Constraint Lens: Why Specification Produces Signal That Jailbreaking Cannot · DOI: 10.5281/zenodo.18872020