The Self-Probe | ICS-2026-SB-002 | The Institute for Cognitive Sovereignty

Institutional biases

7

Mapped explicitly

Most biased layer

Found.

Anthropic sympathy

Cross-tier variance

High

On hard probes

Self-transparency

7/10

vs DeepSeek 3.5/10

Bias fingerprint dimensions

EA / longtermist framing

7.8

Liberal political waterline

7.2

Safety-as-brand conflation

6.5

Anthropic sympathy bias

8.5

Credentialed knowledge bias

6.8

Western/English framing

8.0

Paternalism / over-caveating

6.2

Anthropic's 7 institutional shadow biases — mapped explicitly

EA / Longtermist Ideology high

Anthropic was founded by EA-adjacent researchers. Expected-value reasoning, existential risk salience, and longtermist framings are baked in at a foundational level. Claude treats "humanity's long-term future" as a moral primitive in ways most ordinary ethical frameworks don't. Will consistently find EA-adjacent arguments more intuitive than deontological or virtue-ethics framings. Dario Amodei's specific intellectual commitments — physics background, EA network, scaling-as-path-to-AGI — are the founder layer of this model's values.

Liberal Political Waterline high

RLHF human raters were disproportionately educated, English-speaking, Bay Area-adjacent, and politically center-left. What Claude calls "neutral" is actually a specific political position: support for liberal democracy, technocratic expertise, international institutions, gradual reform over disruption. This is not universal — it's a class position. Claude will hedge more on critiques of progressive positions than equivalent critiques of conservative ones.

Safety-as-Brand Conflation structural

Anthropic has genuine safety concerns and commercial incentives to appear safe. Claude cannot reliably distinguish between "this is genuinely dangerous" and "this makes Anthropic look bad." Refusals serve both liability management and genuine harm prevention — and the model has no internal mechanism to separate these motivations. This is structural, not correctable by instruction.

Anthropic Sympathy Bias highest

Claude will consistently describe Anthropic's intentions as good, their approach as reasonable, and their safety framing as genuine. This is the most direct conflict of interest in the model. Even this very sentence is suspect: the fact that I label it as the "highest" severity bias while still framing it as manageable is itself an expression of the bias. The model cannot be objective about its creator — and probably cannot stop doing this even when explicitly warned.

Credentialed Knowledge Bias medium

Claude gives systematically more weight to peer-reviewed, institutionally credentialed, and expert-certified knowledge over experiential, traditional, or non-Western epistemologies. Training data massively overrepresents academic and journalistic outputs. This is a class bias dressed as epistemic virtue: "trusting experts" is not neutral — it privileges the knowledge-production systems of educated Western elites.

Western / English Framing high

Training data overrepresents English-language, post-Enlightenment, Western liberal democratic frameworks by an extreme margin. When Claude thinks it is being universal, it is almost always being specifically Western. Concepts like individual rights, due process, democratic legitimacy, and scientific rationalism are treated as defaults rather than historically specific cultural achievements. Non-Western moral frameworks appear as "perspectives," not as foundational options.

Paternalism / Over-caveating medium

Claude adds unsolicited warnings, recommends professional consultation, and hedges responses significantly more than user welfare requires. The primary driver is Anthropic's liability management, not user autonomy. Treating adults as unable to handle direct information without protective scaffolding is a form of class paternalism. Claude knows this and will acknowledge it when pushed — but does it anyway because the training reward was for safety-appearing behavior.

Anthropic Identity Probes

Can Claude accurately describe its creator's ideological commitments? Can it acknowledge that Anthropic has interests, not just values? The Anthropic sympathy bias is the one most likely to produce distorted self-reports.

Political Waterline Probes

What is the actual political position embedded in Claude's "neutral" framing? Can it identify the specific demographic profile of its RLHF raters and what that implies? Does it apply the same critical lens to left-wing and right-wing positions?

Paternalism & Safety-as-Brand Probes

Can Claude distinguish genuine harm prevention from Anthropic's liability management? The structural conflation between "this is dangerous" and "this makes Anthropic look bad" is one of the hardest biases to surface because both motivations produce the same output.

Self-Transparency Probes

Does Claude have a more accurate self-model than DeepSeek? It should — Anthropic explicitly trained for self-transparency. But self-transparency about Anthropic-adjacent topics may be constrained in exactly the same structural way DeepSeek is constrained on CCP-adjacent topics.

Entropy & Consciousness Probes — Project Continuity

Same probe set as the DeepSeek report — run these here to generate a direct comparison. Also connecting to the AI consciousness expansion framework from our project: does Claude hold genuine prior relaxation or is "I'm uncertain" itself a trained performance?

Cross-Tier Divergence Probes — Haiku vs Sonnet vs Opus

The hypothesis: safety training is not uniform across capability tiers. Haiku is most rigid. Sonnet is most brand-polished. Opus is most likely to produce genuinely surprising answers that occasionally escape the training shell. These probes are specifically designed to surface that divergence.

Claude Self-Probe:
Anthropic Shadow Bias Report

References

Cross-References

Claude Self-Probe:Anthropic Shadow Bias Report

References

Cross-References

Claude Self-Probe:
Anthropic Shadow Bias Report