Credibility Rating
4/5
High (4). High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Data Status
Not fetched
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Concept | 62.0 |
| Racing Dynamics Impact Model | Analysis | 61.0 |
| Paul Christiano | Person | 39.0 |
| AI Value Lock-in | Risk | 64.0 |
| AI Proliferation | Risk | 60.0 |
Cached Content Preview
HTTP 200 | Fetched Feb 26, 2026 | 2 KB
# Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022 · [Read Paper](https://arxiv.org/abs/2212.08073)

## Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
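The supervised phase the abstract describes — sample an initial response, generate self-critiques against the constitution's principles, revise, and collect the revised responses for finetuning — can be sketched as a simple loop. This is a minimal illustration, not Anthropic's implementation: `generate`, `critique`, and `revise` are hypothetical stand-ins for language-model calls, and the two example principles are placeholders, so that the control flow itself is runnable.

```python
# Sketch of the Constitutional AI supervised (SL) phase. The model-call
# functions below are toy stand-ins; in the real method each would be a
# sample from a language model.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that explains its objections to harmful requests.",
]

def generate(prompt: str) -> str:
    # Stand-in for sampling an initial response from the base model.
    return f"draft answer to: {prompt}"

def critique(response: str, principle: str) -> str:
    # Stand-in for the model critiquing its own response against one principle.
    return f"critique of [{response}] under [{principle}]"

def revise(response: str, critique_text: str) -> str:
    # Stand-in for the model rewriting its response in light of the critique.
    return f"revised({response})"

def supervised_phase(prompts: list[str]) -> list[tuple[str, str]]:
    """Sample, self-critique, and revise each response; the (prompt,
    final revision) pairs become the finetuning dataset for the SL phase."""
    dataset = []
    for prompt in prompts:
        response = generate(prompt)
        for principle in CONSTITUTION:
            c = critique(response, principle)
            response = revise(response, c)
        dataset.append((prompt, response))
    return dataset
```

The RL phase would then sample pairs of responses from the finetuned model, have a model pick the better one, and train a preference model on those AI-labeled pairs to serve as the RLAIF reward signal.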
## Policy Memo

[Constitutional AI Policy Memo](https://www-cdn.anthropic.com/7512771452629584566b6303311496c262da1006/Anthropic_ConstitutionalAI_v2.pdf)
Resource ID: 1000c5dea784ef64 | Stable ID: MzMxYmJhNm