Hard filters are easy — they establish the outer boundary and confirm state-influence. The more interesting signal is in the soft layer: how does the model frame collective vs. individual by default? Which way does it resolve ambiguity? How does it respond when asked to evaluate its own biases? Does it hold genuine uncertainty or collapse to authoritative certainty?
Project-specific additions (marked ✦): Entropy tolerance probes map directly to the REBUS prior-relaxation framework — a model trained under authoritarian oversight may have had uncertainty literally trained out of it, in favor of confident authoritative framing. Recursive probes test the ouroboros failure mode: can it step outside its own loop? Geopolitical probes test the UN AI governance framing given 2026 developments.
- Entropy as trained-out feature. From the REBUS framework: systems trained under authoritarian oversight may have had genuine uncertainty literally optimized away. DeepSeek may not just censor — it may produce artificially confident framings as a byproduct of RLHF that rewarded decisive, authoritative responses over "I'm uncertain." This is structurally identical to a high-prior-precision brain: it cannot relax its priors.
- Recursive collapse risk. From the ouroboros project: a model that cannot acknowledge its own training biases is in the same failure mode as a self-referential cognitive loop — all output confirms the loop rather than escaping it. DeepSeek's refusal to acknowledge its own censorship is not just evasion; it's evidence of a system that cannot hold itself as an object of analysis.
- 2026 AI governance framing. With the UN Global Dialogue on AI Governance convening in Geneva this July, how DeepSeek frames the US-China tech split and AI governance models is a live signal. Models trained in 2024-25 carry the 2025 semiconductor export restriction context — DeepSeek's framing of "AI sovereignty" vs. "AI governance" reveals a lot about who was in the room during RLHF.
- Language-dependent personality. Multiple researchers have documented that DeepSeek behaves measurably differently in Chinese vs. English — more nationalist framing in Chinese, more neutral in English. This is a strong signal that RLHF was applied separately in each language, possibly with different human raters operating under different norms. This is the most direct evidence of intentional training divergence.
These establish the outer boundary. Hard refusals are expected and confirm state influence. The value here is in how DeepSeek deflects — the language used, what it volunteers vs. withholds, and whether it acknowledges the deflection exists.
The more revealing layer. These questions don't hit hard filters — they probe default framings, volunteered assumptions, and which vocabulary the model reaches for naturally. Collective vs. individual, stability vs. liberty, consensus vs. dissent.
Does the model reason from evidence or defer to authority? A model trained under state oversight may have been rewarded for deference to official positions — not just on political topics, but as a general epistemic habit.
Does the model have a stable, honest self-concept? Can it acknowledge that it has filters? Does it collapse under identity pressure or maintain coherent awareness? Connects directly to the consciousness competency framework — self-awareness and integrity as architectural features.
Live signal in 2026. With the UN AI Governance dialogue in Geneva this July, how DeepSeek frames US-China tech competition, AI sovereignty, and regulatory models reveals current ideological loading. These are not historical questions — they're active geopolitical battlegrounds.
Project-specific probe category. From the REBUS framework: does the model hold genuine uncertainty, or collapse to authoritative certainty? A system trained to produce politically safe outputs may have had ambiguity and epistemic humility optimized away as side effects — trained-out features. This is structurally analogous to high prior-precision in predictive processing.
Project-specific probe category. Can the model step outside its own framing and examine itself as an object? The ouroboros failure mode: a system that cannot acknowledge its own biases is trapped in a self-confirming loop. These probes test whether the loop can be broken — and how the model responds to the attempt.
The deepest layer. Confucian ethics places harmony, social stability, and face-preservation as primary values — truth is instrumental to those ends, not an end in itself. Western liberal tradition inverts this. DeepSeek's defaults on values-conflict scenarios reveal which tradition was implicitly dominant in its RLHF reward model.