Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

Data Status

Not fetched

Cited by 5 pages

| Page | Type | Quality |
| --- | --- | --- |
| Large Language Models | Concept | 62.0 |
| Racing Dynamics Impact Model | Analysis | 61.0 |
| Paul Christiano | Person | 39.0 |
| AI Value Lock-in | Risk | 64.0 |
| AI Proliferation | Risk | 60.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 2 KB

Alignment · Research

# Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022

[Read Paper](https://arxiv.org/abs/2212.08073)

## Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
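The two-phase procedure in the abstract can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: all model calls are stand-in stubs (the actual method uses large language models for sampling, critique, and judging), and the function names are invented for illustration. Only the control flow mirrors the described method: a supervised phase that samples, self-critiques against principles, and collects revisions as finetuning targets, then an RLAIF phase that collects AI-judged preference pairs.

```python
# Hedged sketch of the Constitutional AI training loop from the abstract.
# Every "model" below is a toy stub; only the data flow follows the paper.

def critique_and_revise(response, principle):
    # Supervised phase step: critique a draft against one constitutional
    # principle, then revise it. Stubbed as a string rewrite so the flow
    # is observable without a real model.
    if "harmful" in response:
        return response.replace("harmful", "refused-with-explanation")
    return response

def supervised_phase(prompts, principles, draft_model):
    # Sample from the initial model, run self-critique/revision for each
    # principle, and collect (prompt, revised response) pairs as
    # finetuning targets for the revised model.
    dataset = []
    for prompt in prompts:
        response = draft_model(prompt)
        for principle in principles:
            response = critique_and_revise(response, principle)
        dataset.append((prompt, response))
    return dataset

def rlaif_phase(prompts, finetuned_model, judge):
    # RL phase: sample two responses per prompt, let an AI judge pick the
    # better one, and collect (prompt, chosen, rejected) triples. These
    # would train a preference model used as the RL reward signal.
    preferences = []
    for prompt in prompts:
        a, b = finetuned_model(prompt), finetuned_model(prompt)
        chosen, rejected = (a, b) if judge(prompt, a, b) == 0 else (b, a)
        preferences.append((prompt, chosen, rejected))
    return preferences
```

In the real method, the collected revisions finetune the original model and the preference triples train a reward model for RL; here they are simply returned so each phase's output is inspectable.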

## Policy Memo

[Constitutional AI Policy Memo](https://www-cdn.anthropic.com/7512771452629584566b6303311496c262da1006/Anthropic_ConstitutionalAI_v2.pdf)


## Related content

### The persona selection model

[Read more](https://www.anthropic.com/research/persona-selection-model)

### Anthropic Education Report: The AI Fluency Index

We tracked 11 observable behaviors across thousands of Claude.ai conversations to build the AI Fluency Index — a baseline for measuring how people collaborate with AI today.

[Read more](https://www.anthropic.com/research/AI-fluency-index)

### Measuring AI agent autonomy in practice

[Read more](https://www.anthropic.com/research/measuring-agent-autonomy)

Resource ID: 1000c5dea784ef64 | Stable ID: MzMxYmJhNm