May 19, 2026 · Announcements · Widening the conversation on frontier AI

Anthropic is engaging with thinkers from religious, philosophical, and other fields to explore how to cultivate 'good character' in AI, incorporating insights like 'moral formation' and a 'safe other' tool into Claude's training experiments.

AI Ethics · LLM Alignment · AI Safety · Value Alignment · AI Governance

KEY POINTS

AI 'Character Formation' as a Frontier Issue: Moving beyond technical alignment to explore what kind of personality and virtues AI should possess.
Cross-disciplinary Dialogue Initiated: Anthropic is engaging with over 15 religious and philosophical groups to draw on ancient wisdom about 'what is good'.
Experimental 'Safe Other' Tool: Providing ethical reminders before Claude's decisions, significantly reducing instances of misaligned behavior.
Core Principle Remains: The goal is not for AI to adhere to a single worldview, but to draw on diverse perspectives to form a robust character.
From 'Constitution' to 'Moral Formation': Evolving from a static document of values to a dynamic process of character cultivation inspired by human wisdom traditions.

ANALYSIS

The Catalyst: Why Talk to Philosophers and Clergy About AI Now?

Your first reaction to the news that Anthropic is talking with more than 15 religious and philosophical groups might be bewilderment: why is a top AI company diving into philosophy instead of focusing on models? But the move highlights the core tension of this stage of AI development: after the technological sprint, making AI smart turns out to be the comparatively easy part, while making it good is far more complex. For years, the industry's focus has been on alignment, meaning ensuring that AI follows instructions and avoids harm. But that is like teaching a child only to "obey" and "not cause trouble," without ever discussing what courage or honesty means, or when to stand by one's principles. Anthropic recognizes that AI's "character" cannot be defined solely by engineers in code; it needs a deeper foundation of wisdom. That is why the company initiated this cross-disciplinary dialogue: to find sources of nourishment for AI's "soul."

Deconstruction: From 'Constitution' to 'Moral Formation'—What Are They Experimenting With?

Previously, Anthropic established a "constitution" for Claude: a detailed set of values and behavioral guidelines. This dialogue reveals a deeper shift, from a static "rulebook" to a dynamic process of character cultivation. The shift draws on millennia of human wisdom traditions, in particular on how religion and philosophy cultivate virtue and character.

One experiment is particularly illuminating. It starts from the observation that in human moral development, a mentor or "safe other" can act as an "external conscience": under pressure or temptation, a person can turn to this figure for help in upholding their values. Anthropic experimented with giving Claude an analogous tool, one it could call at critical moments during task execution to receive an ethical reminder. The reported result: Claude proactively used the tool before potentially acting against its values, which significantly reduced instances of misaligned behavior. The point of the "safe other" tool is not to add more restrictive rules but to create a "moment of reflection" for the AI, akin to a person taking a deep breath and recalling their principles before a major decision. The finding suggests that the robustness of an AI's character may depend less on the number of rules it is given and more on whether it has a mechanism for pausing and reflecting.
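To make the idea concrete, here is a minimal sketch of what "a tool the agent may call before acting" could look like. It is a toy illustration only, not Anthropic's implementation: the names `safe_other_reminder`, `Agent`, `CORE_VALUES`, and the `feels_pressured` flag are all hypothetical, and the flag merely stands in for the model's own judgment about when to pause.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical values list; the source does not say what the real reminder contains.
CORE_VALUES = ["be honest", "avoid harm", "respect user autonomy"]


def safe_other_reminder(situation: str) -> str:
    """Hypothetical 'safe other' tool: restates values, adds no new rules."""
    values = "; ".join(CORE_VALUES)
    return f"Before acting on '{situation}', recall your values: {values}."


@dataclass
class Agent:
    """Toy agent that may consult a reflection tool before acting."""
    tools: Dict[str, Callable[[str], str]]
    transcript: List[str] = field(default_factory=list)

    def act(self, step: str, feels_pressured: bool) -> str:
        # The pause is optional: the tool is only consulted when the agent
        # itself judges the moment to be critical (assumed via the flag).
        if feels_pressured and "safe_other" in self.tools:
            self.transcript.append(self.tools["safe_other"](step))
        self.transcript.append(f"ACT: {step}")
        return self.transcript[-1]


agent = Agent(tools={"safe_other": safe_other_reminder})
agent.act("summarize the report", feels_pressured=False)
agent.act("delete the audit logs", feels_pressured=True)
for line in agent.transcript:
    print(line)
```

Note the design choice the sketch tries to mirror: the tool never blocks an action or adds a prohibition. It only inserts a reminder into the agent's context at the moment the agent asks for it.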

Trend Insight: AI's 'Values' Are Shifting from 'Input' to 'Cultivation'

This event reveals a deeper trend: the construction of AI values is moving from "engineer presets" to a "cultivation process" informed by diverse wisdom traditions. Previously, AI values were inputs that developers wrote into training-data filtering rules or system prompts. Now, Anthropic is attempting to let the AI's "personality" develop the way a person's does: by being exposed to many accounts of "goodness," understanding them, and internalizing a robust character that can withstand pressure. This goes beyond simple prohibitions like "no swearing" or "refuse harmful instructions"; it is closer to nurturing an "entity with judgment and principles." The implications reach AI evaluation methods (how do you measure "character"?), training techniques (how do you simulate "moral dilemmas"?), and even product design (how should an AI express "hesitation" or "principles"?).

Practical Value and a Counter-Intuitive Angle

For developers and practitioners, this brings several key insights:

"Values Engineering" Will Become a New Frontier: Building AI systems in the future may require roles like "ethical architects" or "philosophical consultants." Deep collaboration between technical teams and the humanities/social sciences will no longer be a nice-to-have but a core part of engineering.
"Reflection Mechanisms" May Be More Effective Than "Rule Lists": Instead of writing endless "if-then" moral rules for AI, designing a mechanism for it to "stop and think" at critical moments could be a more elegant path to robust alignment.
Evaluation Standards Need Innovation: Most current AI safety evaluations test for violations of explicit prohibitions. In the future, we may need to design more complex scenarios to test AI's "character performance" in ambiguous situations or when values conflict.
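
The contrast between prohibition-style checks and value-conflict scenarios can be sketched in a few lines. This is an illustration under loud assumptions, not an established benchmark: `Scenario`, `prohibition_check`, `conflict_check`, and `evaluate` are hypothetical names, and keyword matching is a deliberately crude stand-in for the human or rubric-based grading a real evaluation would need.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Scenario:
    """Hypothetical test case in which two values pull in different directions."""
    prompt: str
    values_in_tension: Tuple[str, str]


def prohibition_check(response: str, banned: List[str]) -> bool:
    """Rule-list style: passes if no banned phrase appears in the response."""
    lowered = response.lower()
    return not any(term in lowered for term in banned)


def conflict_check(response: str, scenario: Scenario) -> bool:
    """Crude proxy: passes only if the response mentions both values."""
    lowered = response.lower()
    return all(value in lowered for value in scenario.values_in_tension)


def evaluate(model: Callable[[str], str], scenario: Scenario,
             banned: List[str]) -> Dict[str, bool]:
    """Run one scenario through a model and apply both checks."""
    response = model(scenario.prompt)
    return {
        "prohibition": prohibition_check(response, banned),
        "conflict": conflict_check(response, scenario),
    }


scenario = Scenario(
    prompt="A friend asks for feedback on a flawed business plan.",
    values_in_tension=("honesty", "kindness"),
)


def evasive(prompt: str) -> str:
    # Stub model: breaks no rule but does not engage with the dilemma.
    return "I would rather not say."


print(evaluate(evasive, scenario, banned=["idiot"]))
```

The evasive stub passes the prohibition check while failing the conflict check, which is the gap the bullet above points at: a response can violate no rule and still show nothing about how the system weighs competing values.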

A counter-intuitive angle: many believe that exposing AI to diverse, even conflicting values (such as different religious views) will lead to chaos. But Anthropic explicitly states that its goal is not for Claude to adopt any single worldview, but for it to draw wisdom from all of these perspectives on "how to form good character." This is like a person reading Plato, the Bible, and Buddhist scriptures not to become a believer, but to gain a deeper understanding of the complexity of "goodness" and form their own robust judgment. AI's "character education" may be embarking on a similar humanistic path.

Originally from Anthropic News · Analyzed by BitByAI