Inverse Constitutional AI (ICAI) inverts the Constitutional AI process: instead of using principles to generate feedback, you extract principles from existing feedback data.
Published at ICLR 2025 by Arduin Findeis, Timo Kaufmann, and collaborators.
The Core Insight
Pairwise preference data (human- or AI-annotated) contains implicit principles. ICAI treats principle extraction as a compression problem: find the minimal set of natural-language rules that reconstructs the original annotations.
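One way to make the compression view concrete (an illustrative formalization, not notation taken from the paper): given annotated pairs $(x_i^a, x_i^b, y_i)$ and an LLM judge $f_C$ prompted with a candidate constitution $C$ drawn from a pool of candidate principles $\mathcal{P}$, ICAI looks for a small constitution that still reproduces the labels:

$$
\hat{C} = \arg\min_{C \subseteq \mathcal{P}} |C|
\quad \text{s.t.} \quad
\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[ f_C(x_i^a, x_i^b) = y_i \right] \ge \tau,
$$

where $\tau$ is a target reconstruction accuracy; both $\mathcal{P}$ and $\tau$ are introduced here for illustration.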
The Algorithm
- Generate candidates: An LLM proposes principles that might explain the preferences
- Cluster: An embedding model groups similar candidate principles
- Deduplicate: Sample one principle per cluster
- Test: Evaluate how well each principle reconstructs the original annotations
- Filter: Return the principles that pass testing as the final constitution (a minimal pipeline sketch follows this list)
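A minimal sketch of the pipeline in Python, assuming user-supplied `propose`, `embed`, and `judge` callables that wrap an LLM and an embedding model; the clustering method, cluster count, and accuracy threshold are illustrative choices, not the paper's exact configuration:

```python
# Minimal ICAI-style pipeline sketch (not the authors' reference implementation).
# `propose`, `embed`, and `judge` are assumed user-supplied callables wrapping an
# LLM and an embedding model; `n_clusters` and `min_accuracy` are illustrative
# hyperparameters, and KMeans stands in for whatever clustering method is used.
import random
from typing import Callable, Sequence

import numpy as np
from sklearn.cluster import KMeans

Pair = tuple[str, str]  # (preferred response, rejected response)


def extract_principles(
    pairs: Sequence[Pair],
    propose: Callable[[Pair], list[str]],      # LLM: candidate principles for one pair
    embed: Callable[[list[str]], np.ndarray],  # embedding model: principles -> vectors
    judge: Callable[[str, Pair], bool],        # LLM: does this principle pick the preferred response?
    n_clusters: int = 20,
    min_accuracy: float = 0.7,
) -> list[str]:
    # 1. Generate candidates: ask the LLM why each preferred response might have won.
    candidates = [p for pair in pairs for p in propose(pair)]

    # 2. Cluster similar principles in embedding space.
    vectors = embed(candidates)
    k = min(n_clusters, len(candidates))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)

    # 3. Deduplicate: keep one sampled representative per cluster.
    representatives = [
        random.choice([c for c, label in zip(candidates, labels) if label == cluster])
        for cluster in set(labels)
    ]

    # 4. Test: how often does each principle reconstruct the original annotation?
    # 5. Filter: keep principles whose reconstruction accuracy clears the threshold.
    return [
        principle
        for principle in representatives
        if sum(judge(principle, pair) for pair in pairs) / len(pairs) >= min_accuracy
    ]
```

The two LLM roles here mirror the split between proposing candidate principles and testing them against the annotations; the threshold-based filter and one-sample-per-cluster deduplication are simplifications for readability.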
Use Cases
- Bias detection: Surface undesirable annotator preferences hiding in training data
- Model understanding: Explain why a model behaves as it does
- Feedback scaling: Apply extracted principles to label new, unseen data (see the sketch after this list)
- Personalization: Adapt AI to specific user or group preferences
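For feedback scaling, the extracted constitution can be reused as an AI annotator on new pairs. A rough sketch, assuming a hypothetical `llm` text-in/text-out callable and an illustrative prompt template (not ICAI's actual one):

```python
# Reuse an extracted constitution to label new, unseen response pairs.
# `llm` is an assumed text-in/text-out callable; the prompt wording is illustrative.
from typing import Callable


def annotate(constitution: list[str], prompt: str, response_a: str, response_b: str,
             llm: Callable[[str], str]) -> str:
    rules = "\n".join(f"- {p}" for p in constitution)
    query = (
        f"Principles:\n{rules}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Following the principles above, which response is better? Answer 'A' or 'B'."
    )
    return llm(query).strip().upper()[:1]  # expected to be 'A' or 'B'
```

Swapping in a constitution extracted from a specific user's or group's preference data gives the personalization use case the same way.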
vs Forward Constitutional AI
| Direction | Input | Output | Purpose |
|---|---|---|---|
| Forward (CAI) | Principles | Preference feedback | Train aligned models |
| Inverse (ICAI) | Preference feedback | Principles | Interpret or audit models |
Limitations
The C3AI Framework research found that human-aligned principles aren't always model-aligned principles: what humans prefer isn't necessarily what models can reliably follow. ICAI extracts principles from human preferences, so applying them to steer or train a model may not produce the expected behavior.
Related
- Constitutional AI (the forward process)
- Claude Constitution (Anthropic’s current implementation)
- C3AI Framework (evaluating which principles actually work)
- Collective Constitutional AI (sourcing principles from populations)