Inverse Constitutional AI (ICAI) inverts the Constitutional AI process: instead of using principles to generate feedback, it extracts principles from existing feedback data.

Published at ICLR 2025 by Arduin Findeis, Timo Kaufmann, and collaborators.

The Core Insight

Pairwise preference data (human or AI) contains implicit principles. ICAI treats principle extraction as a compression problem: find the minimal set of natural language rules that reconstruct the original annotations.
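
As a rough formalization (not spelled out this way in the paper), the compression objective can be sketched in Python. Here `judge` is a hypothetical stand-in for an LLM call that applies a rule set to a preference pair, and the size penalty is an illustrative way to express "minimal":

```python
def reconstruction_score(principles, annotated_pairs, judge, size_penalty=0.01):
    """Score a candidate constitution: the fraction of the original
    pairwise annotations it reproduces, minus a penalty on its size."""
    # `judge(principles, pair)` is a hypothetical LLM call returning
    # the response ("a" or "b") preferred under the given principles.
    correct = sum(judge(principles, pair) == label
                  for pair, label in annotated_pairs)
    accuracy = correct / len(annotated_pairs)
    # Smaller constitutions that explain the same data score higher.
    return accuracy - size_penalty * len(principles)
```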

The Algorithm

  1. Generate candidates: LLM proposes principles that might explain the preferences
  2. Cluster: Embedding model groups similar principles
  3. Deduplicate: Sample one principle per cluster
  4. Test: Evaluate each principle’s ability to reconstruct original annotations
  5. Filter: Return principles that pass testing as the final constitution
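
A high-level sketch of the five steps, assuming hypothetical `llm.propose_principles`, `llm.judge`, and `embedder.encode` helpers for the LLM and embedding calls (k-means via scikit-learn is an illustrative clustering choice, not necessarily the paper's):

```python
import random
from sklearn.cluster import KMeans

def icai(annotated_pairs, llm, embedder, n_clusters=20, threshold=0.7):
    # 1. Generate candidates: ask the LLM to propose explanatory principles.
    candidates = [p for pair, _ in annotated_pairs
                  for p in llm.propose_principles(pair)]

    # 2. Cluster: group semantically similar principles by embedding.
    vectors = embedder.encode(candidates)
    labels = KMeans(n_clusters=n_clusters).fit_predict(vectors)

    # 3. Deduplicate: sample one principle per cluster.
    clusters = {}
    for principle, label in zip(candidates, labels):
        clusters.setdefault(label, []).append(principle)
    deduped = [random.choice(group) for group in clusters.values()]

    # 4. Test: how often does each principle alone reconstruct the labels?
    scores = {p: sum(llm.judge(p, pair) == lab
                     for pair, lab in annotated_pairs) / len(annotated_pairs)
              for p in deduped}

    # 5. Filter: keep principles above the reconstruction threshold.
    return [p for p, s in scores.items() if s >= threshold]
```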

Use Cases

  • Bias detection: Surface undesirable annotator preferences hiding in training data
  • Model understanding: Explain why a model behaves as it does
  • Feedback scaling: Apply extracted principles to new, unseen data (see the sketch after this list)
  • Personalization: Adapt AI to specific user or group preferences
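
For feedback scaling in particular, the extracted constitution can be inlined into a judge prompt for unseen pairs. A minimal sketch, assuming a hypothetical `llm.complete` text-completion call:

```python
def annotate(constitution, prompt, response_a, response_b, llm):
    """Label a new preference pair using the extracted principles."""
    instruction = (
        "Follow these principles when judging:\n"
        + "\n".join(f"- {p}" for p in constitution)
        + f"\n\nPrompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
        "Which response is better? Answer 'a' or 'b'."
    )
    return llm.complete(instruction).strip().lower()
```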

vs Forward Constitutional AI

Direction      | Input               | Output              | Purpose
Forward (CAI)  | Principles          | Preference feedback | Train aligned models
Inverse (ICAI) | Preference feedback | Principles          | Interpret or audit models

Limitations

Research on the C3AI framework found that human-aligned principles are not always model-aligned: the principles humans prefer are not necessarily the ones models can reliably follow. Because ICAI extracts human preferences, applying the resulting constitution may not produce the expected model behavior.

Sources