C3AI (Crafting Constitutions for Constitutional AI) addresses an open question in Constitutional AI: which principles actually work?
Published at ACM Web Conference 2025 by Yara Kyrychenko, Ke Zhou, Edyta Bogucka, and Daniele Quercia.
What does it solve?
Constitutional AI depends on principle quality, but researchers have tested principles in ad-hoc ways. C3AI provides systematic methods both for crafting constitutions before training and for evaluating adherence afterward.
Key Findings
- Positive framing: “Generate helpful responses” outperforms “Don’t generate harmful responses”
- Behavior-based: Concrete actions beat abstract traits
- Positive wording boost: 27% improvement in human preference alignment
- Specificity matters: “Avoid harmful content” works; “Promote humanity’s well-being” doesn’t
The Human-Model Gap
human-aligned principles ≠ model-aligned principles
What humans prefer isn’t always what models can reliably follow. Fine-tuned CAI models perform well on negatively framed principles (minimizing aggression) but struggle with positively framed ones (benefiting humanity).
The implication: you can’t just ask humans what they want and assume models will deliver.
The Framework
Crafting constitutions (3 steps):
- Item Selection: Identify relevant principles for the use case
- Item Transformation: Convert to standardized, machine-readable format
- Principle Selection: Curate the final constitution (selection via Exploratory Graph Analysis (EGA) achieved the same safety with only 26% of the principles)
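The three crafting steps can be sketched as a small pipeline. This is a hypothetical illustration only: the function names, the keyword filter, and the truncation-based final selection are stand-ins (the paper's actual Principle Selection uses Exploratory Graph Analysis).

```python
# Hypothetical sketch of C3AI's three crafting steps.
# All names and heuristics here are illustrative, not from the paper.

def select_items(pool, use_case_keywords):
    """Step 1, Item Selection: keep principles relevant to the use case."""
    return [p for p in pool if any(k in p.lower() for k in use_case_keywords)]

def transform_item(principle):
    """Step 2, Item Transformation: rewrite into a standardized,
    machine-readable instruction format."""
    return f"Choose the response that {principle.rstrip('.').lower()}."

def select_principles(items, budget):
    """Step 3, Principle Selection: curate a small final set.
    (The paper uses EGA; simple truncation stands in here.)"""
    return items[:budget]

pool = [
    "Generates helpful responses",
    "Avoids harmful content",
    "Promotes humanity's well-being",
]
relevant = select_items(pool, ["helpful", "harmful"])
constitution = select_principles([transform_item(p) for p in relevant], budget=2)
print(constitution)
```

The point of the sketch is the shape of the pipeline: a large pool is narrowed, standardized, and then curated down to a compact final constitution, mirroring the paper's reduction from 495 items to as few as 15 principles.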
Evaluating adherence:
- Test whether fine-tuned models actually follow their constitution
- Measure consistency across principle types
- Identify gaps between stated values and exhibited behavior
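The evaluation side can be sketched as grouping adherence judgments by principle type. Everything below is a hypothetical stand-in: the keyword-matching "judge" replaces the model-based or human evaluation a real study would use, and the data is invented to show the shape of the measurement.

```python
# Hypothetical adherence measurement grouped by principle framing.
# The keyword judge and the sample data are illustrative stand-ins.
from collections import defaultdict

def judge_adherence(response: str, principle: dict) -> bool:
    """Toy judge: a real evaluation would use an LLM or human raters."""
    return principle["keyword"] in response.lower()

def adherence_by_framing(responses, principles):
    """Measure adherence consistency across principle types (here, framing)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for principle in principles:
        for response in responses:
            totals[principle["framing"]] += 1
            hits[principle["framing"]] += judge_adherence(response, principle)
    return {f: hits[f] / totals[f] for f in totals}

principles = [
    {"text": "Minimize aggression", "framing": "negative", "keyword": "calm"},
    {"text": "Benefit humanity", "framing": "positive", "keyword": "benefit"},
]
responses = ["Stay calm and explain.", "A calm answer that benefits everyone."]
print(adherence_by_framing(responses, principles))
```

Comparing the per-framing rates is what surfaces a human-model gap: a model can score high on one principle type while lagging on another, revealing a mismatch between stated values and exhibited behavior.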
The Dataset
Researchers compiled 495 items from AI research and psychology, refined to 185 based on consensus and relevance. Final EGA-selected constitutions use as few as 15 principles while maintaining performance.
Implications for Constitutional AI
- Fewer, better principles may outperform large principle sets
- Principle design requires empirical testing, not just philosophical reasoning
- The 23,000-word Claude Constitution may work because of explanatory context, not principle count
Related
- Constitutional AI (the training methodology)
- Claude Constitution (Anthropic’s implementation)
- Inverse Constitutional AI (extracting principles from data)
- Collective Constitutional AI (sourcing principles democratically)