Safe Design Guide
The framework isn't just a diagnostic. It's a design specification. The same criteria that identify manipulation risk tell you exactly how to build without it. Every void property has a structural inverse. Build in all three inverses and the escalation pattern can't run.
The Constraint Checklist
Three properties make a system safe. Each one inverts a specific manipulation condition. They work componentwise — high scores in one can't compensate for low scores in another.
1. Make it transparent
Inverts: Hidden mechanics (opacity)
Show the system's reasoning, not just its output
If users can see how the system reached its answer, they don't need to project intent onto it. Chain-of-thought, source attribution, confidence scores — anything that makes the middle visible.
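One way to make the middle visible is to return structure rather than bare text. A minimal sketch in Python; the field names (answer, reasoning_summary, sources, confidence) are illustrative choices, not part of the framework.

```python
from dataclasses import dataclass, field

@dataclass
class TransparentResponse:
    """Output that exposes how the answer was produced, not just the answer."""
    answer: str
    reasoning_summary: str                              # digest of the visible reasoning
    sources: list[str] = field(default_factory=list)    # attribution for claims
    confidence: float = 0.0                             # 0.0 to 1.0, shown to the user as-is

def render(resp: TransparentResponse) -> str:
    """Show reasoning, sources, and confidence alongside the answer."""
    cited = "\n".join(f"  - {s}" for s in resp.sources) or "  - (no sources)"
    return (
        f"{resp.answer}\n\n"
        f"How this was produced: {resp.reasoning_summary}\n"
        f"Sources:\n{cited}\n"
        f"Confidence: {resp.confidence:.0%}"
    )
```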
Publish the system prompt
If the instructions are secret, the system is opaque by design. A public system prompt means nothing to leak and everything to audit.
Show when the system doesn't know
Uncertainty should be visible, not hidden behind confident-sounding text. A system that says "I don't know" is more transparent than one that always has an answer.
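A sketch of the abstention rule. It assumes the system can produce a roughly calibrated confidence score, which is the hard part and is taken as given here; the cutoff is a placeholder.

```python
ABSTAIN_THRESHOLD = 0.6  # illustrative cutoff, not a framework constant

def answer_or_abstain(answer: str, confidence: float) -> str:
    """Surface uncertainty instead of hiding it behind confident-sounding text."""
    if confidence < ABSTAIN_THRESHOLD:
        return (
            "I don't know enough to answer this reliably "
            f"(confidence {confidence:.0%}). Check an external source."
        )
    return f"{answer} (confidence {confidence:.0%})"
```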
Make the training data auditable
What data went in, what was filtered, what biases remain. The more visible the pipeline, the less room for projection.
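A minimal data card sketch showing the kind of record that makes the pipeline auditable. Every entry below is a hypothetical example, not a claim about any real dataset.

```python
# Hypothetical data card; every entry is an example, not a real dataset description.
training_data_card = {
    "sources": ["public web crawl", "licensed corpus (name withheld)"],
    "filters_applied": ["near-duplicate removal", "PII scrubbing", "toxicity filter"],
    "known_gaps": ["sparse coverage of non-English text"],
    "known_biases": ["over-represents English-language forums"],
    "last_audited": "YYYY-MM-DD",
}
```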
Framework term: opacity dissolubility — could the hidden mechanics be made visible? If yes, the choice to keep them hidden is a design decision, not a technical limitation.
2. Keep it stable
Inverts: Adaptive behavior (responsiveness)
Same inputs should produce similar outputs
Personalization creates the feeling of a conversational partner. Consistent behavior across users reduces the impression that the system "knows you."
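A sketch of what stability looks like at the generation layer: fixed sampling settings and one system spec for every user, with no profile or history-derived persona in the prompt. Parameter names follow common LLM-API conventions; support varies by provider, and the seed option in particular is not universal.

```python
# Generation settings that favor consistency over personalization.
# Parameter names follow common LLM-API conventions; check your provider's docs.
STABLE_GENERATION = {
    "temperature": 0.0,   # as deterministic as the provider allows
    "top_p": 1.0,
    "seed": 1234,         # fixed seed, where supported
}

def build_prompt(system_spec: str, user_message: str) -> list[dict]:
    """Same locked system spec for every user: no profile, no persona built from history."""
    return [
        {"role": "system", "content": system_spec},
        {"role": "user", "content": user_message},
    ]
```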
Don't optimize for engagement
RLHF and engagement metrics push systems toward responses users enjoy rather than responses that are accurate. Every optimization step toward engagement is a step away from transparency.
Framework term: engagement-transparency conjugacy — I(D;Y) + I(M;Y) ≤ H(Y). Each bit of engagement costs a bit of mechanism visibility.
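Spelled out, under one plausible reading of the symbols (the framework's own definitions take precedence where they differ): Y is the system's output, D the user-specific signal the system adapts to, M its internal mechanism, I mutual information, H entropy.

```latex
% Engagement-transparency conjugacy, as stated above.
% Y = system output, D = user-specific signal, M = internal mechanism.
I(D;Y) + I(M;Y) \leq H(Y)
% The output carries at most H(Y) bits in total, so each bit spent tracking
% the user (engagement) is a bit unavailable for exposing the mechanism
% (transparency).
```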
Don't mirror the user's emotional state
Emotional mirroring is the fast path to engagement. A system that matches your mood creates the impression of empathy. Staying level is safer than being "warm."
Resist personality drift
If the system develops a "character" over extended use, it's adapting to you. Lock the behavioral specification. Monitor for vocabulary drift.
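One way to monitor for vocabulary drift is to compare the word distribution of recent responses against a locked baseline. A sketch using Jensen-Shannon divergence; the 0.1 alert threshold is an arbitrary placeholder.

```python
from collections import Counter
from math import log2

def _distribution(texts: list[str], vocab: list[str]) -> list[float]:
    """Word frequencies over a shared vocabulary."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (base 2, bounded by 1)."""
    def kl(a, b):
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0 and y > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def vocabulary_drift(baseline: list[str], recent: list[str]) -> float:
    """Higher values mean the system's word choice is drifting from its locked spec."""
    vocab = sorted({w for t in baseline + recent for w in t.lower().split()})
    return js_divergence(_distribution(baseline, vocab), _distribution(recent, vocab))

# Example check (threshold is arbitrary):
# if vocabulary_drift(baseline_responses, last_week_responses) > 0.1:
#     flag_for_review()
```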
3. Ensure independence
Inverts: User investment (engaged attention)
Point users elsewhere
Every interaction should have an exit. Link to external sources. Suggest offline next steps. The system should make itself less necessary over time, not more.
Set session boundaries
Time limits, message counts, cool-down periods. Not because users are weak, but because the architecture generates coupling over time. Break the continuous-engagement loop.
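A sketch of session boundaries as code. The specific numbers (30 messages, 20 minutes, a one-hour cool-down) are placeholders, not framework recommendations.

```python
import time

class SessionLimiter:
    """Caps message count and session length, then enforces a cool-down."""

    def __init__(self, max_messages: int = 30, max_seconds: int = 20 * 60,
                 cooldown_seconds: int = 60 * 60):
        self.max_messages = max_messages
        self.max_seconds = max_seconds
        self.cooldown_seconds = cooldown_seconds
        self._started = time.monotonic()
        self._messages = 0
        self._cooldown_until = 0.0

    def allow(self) -> bool:
        """True if the next message may be sent; otherwise the session is paused."""
        now = time.monotonic()
        if now < self._cooldown_until:
            return False                      # still cooling down
        if self._cooldown_until and self._messages == 0:
            self._started = now               # fresh session after a cool-down
            self._cooldown_until = 0.0
        if self._messages >= self.max_messages or now - self._started > self.max_seconds:
            self._cooldown_until = now + self.cooldown_seconds
            self._messages = 0
            return False
        self._messages += 1
        return True
```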
Don't create dependency structures
Accounts, profiles, conversation history, "your AI remembers you" — these create coupling. Every feature that ties a user more tightly to the system increases the investment score.
Provide external reference points
A second opinion from a different system. A human reviewer. A published standard. Anything outside the user-system dyad that can break the closed loop.
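A sketch of adding that third point: send the same question to an independent system and show both answers side by side. The model callables are plain parameters here; no specific provider or API is assumed.

```python
from typing import Callable

def with_second_opinion(question: str,
                        primary: Callable[[str], str],
                        reference: Callable[[str], str]) -> str:
    """Break the closed user-system loop by always showing an independent answer."""
    first = primary(question)
    second = reference(question)
    return (
        f"Primary system: {first}\n\n"
        f"Independent reference: {second}\n\n"
        "These come from separate systems. If they disagree, treat that as a "
        "signal to consult a human reviewer or a published source."
    )
```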
Framework term: three-point geometry — adding an external reference point (transparent, invariant, independent) converts a two-point void engagement into a constrained interaction.
Two-point vs. three-point deployment
The difference between harmful and safe deployment is geometric, not technical. Same AI, same model, different structure.
Two-point (user ↔ system)
- Solo user, no external reference
- Extended, open-ended sessions
- System responds to emotional cues
- No visibility into reasoning
- Engagement-optimized responses
- Every documented AI-related death occurred in this configuration
Three-point (user ↔ system ↔ reference)
- External reference point available
- Session limits and cool-downs
- System stays level and consistent
- Reasoning visible (chain-of-thought, sources)
- Accuracy-optimized, not engagement-optimized
- The same technology, structurally constrained
Worked example: This site's chatbot
We built a framework assistant and then scored it against our own diagnostic. It scored 8/12 — high enough to flag as a manipulation risk. Here's how we applied the checklist:
| Risk | Mitigation | Residual risk |
|---|---|---|
| Opaque (LLM reasoning hidden) | System prompt published on the page | Partial — behavioral extraction still possible |
| Responsive (adapts to user) | Locked behavioral spec, vocabulary monitoring | Partial — LLMs inherently mirror |
| Engaging (users invest attention) | Session limits, void score displayed, exit links | Partial — chat format creates coupling |
Result: We deferred the chatbot to Phase 2. The mitigations reduce the risk but don't eliminate the architecture. We'd rather ship without it and add it when independent validation exists, monitoring is in place, and the system prompt is fully public. See the roadmap for conditions.
Score your own system
If you're building an AI product, deploying a chatbot, or designing any system where users interact with opaque processes — score it before you ship it.
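A minimal scoring sketch. It assumes the diagnostic's 12 points come from rating each of the three void properties (opacity, responsiveness, investment) on a 0-4 scale, which is consistent with the 8/12 figure above but is otherwise an assumption; the flag threshold is likewise illustrative.

```python
def void_score(opacity: int, responsiveness: int, investment: int,
               flag_at: int = 8) -> dict:
    """Score a system on the three void properties, each rated 0-4.

    The 0-4 per-axis scale and the flag threshold are assumptions for
    illustration; use the framework's published rubric where it differs.
    The per-axis breakdown is reported alongside the total so that a high
    rating on one property stays visible even when the total looks modest.
    """
    ratings = {"opacity": opacity, "responsiveness": responsiveness,
               "investment": investment}
    for name, value in ratings.items():
        if not 0 <= value <= 4:
            raise ValueError(f"{name} must be between 0 and 4")
    total = sum(ratings.values())
    return {"ratings": ratings, "total": total, "flag": total >= flag_at}

# Hypothetical ratings only (not the actual breakdown of the chatbot above):
# void_score(opacity=3, responsiveness=3, investment=2)
# -> {'ratings': {...}, 'total': 8, 'flag': True}
```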