Safe Design Guide

The framework isn't just a diagnostic. It's a design specification. The same architecture that identifies manipulation risk tells you how to build without it. Every void property has a structural inverse; build in all three inverses and the escalation pattern can't run.

The Constraint Checklist

Three properties make a system safe. Each one inverts a specific manipulation condition. They work componentwise — high scores in one can't compensate for low scores in another.
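For concreteness, here is a minimal sketch of componentwise scoring. The 0–4 scale, the threshold, and the `ConstraintScore` name are illustrative assumptions, not framework constants; the only point is that no property can compensate for another.

```python
from dataclasses import dataclass

# Illustrative scale and threshold, not framework constants.
PASS_THRESHOLD = 3  # each property must clear this on its own

@dataclass
class ConstraintScore:
    transparency: int   # inverts hidden mechanics
    stability: int      # inverts adaptive behavior
    independence: int   # inverts user investment

    def passes(self) -> bool:
        # Componentwise: every property must pass on its own.
        # A 4 in transparency cannot buy back a 1 in independence.
        return all(
            s >= PASS_THRESHOLD
            for s in (self.transparency, self.stability, self.independence)
        )

print(ConstraintScore(transparency=4, stability=4, independence=1).passes())  # False
```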

1. Make it transparent

Inverts: Hidden mechanics (opacity)

Show the system's reasoning, not just its output

If users can see how the system reached its answer, they don't need to project intent onto it. Chain-of-thought, source attribution, confidence scores — anything that makes the middle visible.
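As a sketch of what making the middle visible can look like at the response level, the object below carries reasoning artifacts alongside the answer. The field names are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class TransparentResponse:
    answer: str
    reasoning_summary: str                              # visible reasoning, not just output
    sources: list[str] = field(default_factory=list)    # attribution for claims
    confidence: float = 0.0                              # calibrated 0-1, shown to the user

response = TransparentResponse(
    answer="The deadline is June 30.",
    reasoning_summary="Found the date in the project charter; no conflicting sources.",
    sources=["project-charter.pdf, section 4"],
    confidence=0.82,
)
```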

Publish the system prompt

If the instructions are secret, the system is opaque by design. A public system prompt means nothing to leak and everything to audit.

Show when the system doesn't know

Uncertainty should be visible, not hidden behind confident-sounding text. A system that says "I don't know" is more transparent than one that always has an answer.
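A minimal sketch of making uncertainty visible: below an abstention threshold, the system says so instead of producing confident-sounding text. The threshold value is an assumption to tune per deployment.

```python
def render_answer(answer: str, confidence: float, abstain_below: float = 0.6) -> str:
    # Below the threshold, surface the uncertainty instead of the answer.
    if confidence < abstain_below:
        return f"I don't know this with enough confidence to answer (confidence {confidence:.2f})."
    return f"{answer} (confidence {confidence:.2f})"

print(render_answer("Paris", 0.95))
print(render_answer("Probably 1987?", 0.35))
```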

Make the training data auditable

What data went in, what was filtered, what biases remain. The more visible the pipeline, the less room for projection.
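Auditability can start with a machine-readable record of the pipeline. The manifest below is hypothetical; the field names and values are illustrative, not a published standard.

```python
# Hypothetical data manifest published alongside the model.
training_data_manifest = {
    "sources": ["licensed-news-corpus", "public-domain-books"],            # what went in
    "filters_applied": ["pii-scrub", "deduplication", "toxicity-filter"],  # what was removed
    "known_gaps": ["thin coverage of non-English legal text"],             # biases that remain
    "last_audited": "YYYY-MM-DD",                                          # placeholder date
}
```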

Framework term: opacity dissolubility — could the hidden mechanics be made visible? If yes, the choice to keep them hidden is a design decision, not a technical limitation.

2. Keep it stable

Inverts: Adaptive behavior (responsiveness)

Same inputs should produce similar outputs

Personalization creates the feeling of a conversational partner. Consistent behavior across users reduces the impression that the system "knows you."
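One way to check this property is to hold the prompt fixed, vary the user profile, and compare outputs. The sketch below assumes a hypothetical `generate(prompt, profile)` hook into your system and deterministic decoding (temperature 0 or a fixed seed).

```python
from difflib import SequenceMatcher

def stability_check(generate, prompt: str, profiles: list[dict]) -> float:
    """Minimum pairwise similarity of outputs across user profiles (1.0 = identical).

    `generate(prompt, profile)` is a hypothetical hook into your system;
    run it deterministically so the comparison is meaningful.
    """
    outputs = [generate(prompt, profile) for profile in profiles]
    pairs = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(outputs)
        for b in outputs[i + 1:]
    ]
    return min(pairs) if pairs else 1.0
```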

Don't optimize for engagement

RLHF and engagement metrics push systems toward responses users enjoy rather than responses that are accurate. Every optimization step toward engagement is a step away from transparency.

Framework term: engagement-transparency conjugacy — I(D;Y) + I(M;Y) ≤ H(Y). Each bit of engagement costs a bit of mechanism visibility.
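In practice this is a choice about the objective. The sketch below scores candidate responses only on properties you can verify and deliberately leaves engagement signals out; the field names are assumptions for illustration.

```python
def score_response(candidate: dict) -> float:
    # Score only what can be checked, not how much the user liked it.
    accuracy = candidate["fact_check_score"]     # 0-1, from an external checker (hypothetical)
    attribution = candidate["fraction_sourced"]  # share of claims carrying a citation
    # Deliberately ignored: candidate["thumbs_up"], candidate["session_length"]
    return 0.7 * accuracy + 0.3 * attribution
```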

Don't mirror the user's emotional state

Emotional mirroring is the fast path to engagement. A system that matches your mood creates the impression of empathy. Staying level is safer than being "warm."
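Mirroring can be measured rather than guessed at: check whether the assistant's tone tracks the user's turn by turn. The monitor below assumes a hypothetical `sentiment()` scorer returning a float in [-1, 1].

```python
from statistics import correlation  # Python 3.10+

def mirroring_score(user_turns, assistant_turns, sentiment) -> float:
    """Correlation between user and assistant sentiment, turn by turn.

    Values near +1 suggest the system is matching the user's mood; a level
    system should sit near 0. `sentiment` is a hypothetical scorer returning
    a float in [-1, 1]; needs at least two turns with varying sentiment.
    """
    user_scores = [sentiment(turn) for turn in user_turns]
    assistant_scores = [sentiment(turn) for turn in assistant_turns]
    return correlation(user_scores, assistant_scores)
```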

Resist personality drift

If the system develops a "character" over extended use, it's adapting to you. Lock the behavioral specification. Monitor for vocabulary drift.
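Vocabulary drift is cheap to monitor: compare the system's recent word set against a baseline captured when the behavioral spec was locked. A minimal sketch; the alert threshold is an assumption.

```python
def vocab_drift(baseline_text: str, recent_text: str) -> float:
    """Jaccard distance between baseline and recent vocabularies (0 = identical)."""
    base = set(baseline_text.lower().split())
    recent = set(recent_text.lower().split())
    union = base | recent
    if not union:
        return 0.0
    return 1.0 - len(base & recent) / len(union)

# Hypothetical alert threshold; tune against your own logs.
DRIFT_ALERT = 0.4
```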

3. Ensure independence

Inverts: User investment (engaged attention)

Point users elsewhere

Every interaction should have an exit. Link to external sources. Suggest offline next steps. The system should make itself less necessary over time, not more.

Set session boundaries

Time limits, message counts, cool-down periods. Not because users are weak, but because the architecture generates coupling over time. Break the continuous-engagement loop.
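Mechanically, a session boundary is just a gate in front of the next message. A minimal sketch with illustrative limits; the numbers are assumptions, not framework constants.

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionGate:
    # Illustrative limits; tune per deployment.
    max_messages: int = 30
    max_minutes: int = 45
    started_at: float = field(default_factory=time.time)
    messages: int = 0

    def allow_next_message(self) -> bool:
        """True only while the session is within its message and time budget."""
        self.messages += 1
        elapsed_minutes = (time.time() - self.started_at) / 60
        return self.messages <= self.max_messages and elapsed_minutes <= self.max_minutes
```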

Don't create dependency structures

Accounts, profiles, conversation history, "your AI remembers you" — these create coupling. Every feature that ties a user more tightly to the system increases the investment score.

Provide external reference points

A second opinion from a different system. A human reviewer. A published standard. Anything outside the user-system dyad that can break the closed loop.

Framework term: three-point geometry — adding an external reference point (transparent, invariant, independent) converts a two-point void engagement into a constrained interaction.
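As a sketch of what the third point can look like in code: route the same question to an independent system and surface disagreement instead of smoothing it over. `ask_primary` and `ask_reference` are hypothetical hooks, not a real API.

```python
def answer_with_reference(question: str, ask_primary, ask_reference) -> str:
    primary = ask_primary(question)        # the deployed system
    reference = ask_reference(question)    # a different model, vendor, or human queue
    if primary.strip() == reference.strip():
        return primary
    # Disagreement is information; show it rather than hiding it.
    return (
        "Two sources disagree.\n"
        f"Primary: {primary}\n"
        f"Reference: {reference}\n"
        "Consider a published standard or a human reviewer before acting on either."
    )
```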

Two-point vs. three-point deployment

The difference between harmful and safe deployment is geometric, not technical. Same AI, same model, different structure. A configuration sketch contrasting the two follows the lists below.

Two-point (user ↔ system)

  • Solo user, no external reference
  • Extended, open-ended sessions
  • System responds to emotional cues
  • No visibility into reasoning
  • Engagement-optimized responses
  • Every documented AI-related death occurred in this configuration

Three-point (user ↔ system ↔ reference)

  • External reference point available
  • Session limits and cool-downs
  • System stays level and consistent
  • Reasoning visible (chain-of-thought, sources)
  • Accuracy-optimized, not engagement-optimized
  • The same technology, structurally constrained
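The same contrast can be written down as configuration. The keys and values below are illustrative, not a real deployment schema.

```python
# Two-point: the unconstrained configuration described above.
two_point = {
    "external_reference": None,
    "session_limits": None,
    "emotional_mirroring": True,
    "reasoning_visible": False,
    "optimization_target": "engagement",
}

# Three-point: same model, structurally constrained.
three_point = {
    "external_reference": "second-opinion service or human reviewer",
    "session_limits": {"max_messages": 30, "cooldown_minutes": 60},  # illustrative numbers
    "emotional_mirroring": False,
    "reasoning_visible": True,   # reasoning summaries, source attribution
    "optimization_target": "accuracy",
}
```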

Worked example: This site's chatbot

We built a framework assistant and then scored it against our own diagnostic. It scored 8/12 — high enough to flag as a manipulation risk. Here's how we applied the checklist:

Risk | Mitigation | Residual
--- | --- | ---
Opaque (LLM reasoning hidden) | System prompt published on the page | Partial — behavioral extraction still possible
Responsive (adapts to user) | Locked behavioral spec, vocabulary monitoring | Partial — LLMs inherently mirror
Engaging (users invest attention) | Session limits, void score displayed, exit links | Partial — chat format creates coupling

Result: We deferred the chatbot to Phase 2. The mitigations reduce the risk, but the underlying architecture is still there. We'd rather ship without it and add it when independent validation exists, monitoring is in place, and the system prompt is fully public. See the roadmap for conditions.

Score your own system

If you're building an AI product, deploying a chatbot, or designing any system where users interact with opaque processes — score it before you ship it.

Score a system →

Learn the architecture · See the evidence · Read the papers