paper
arxiv.org · arxiv.org/abs/2212.08073
Data Status
Not fetched
Cited by 21 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| AI Accident Risk Cruxes | Crux | 67.0 |
| Dense Transformers | Concept | 58.0 |
| AI Capability Threshold Model | Analysis | 72.0 |
| AI Compounding Risks Analysis Model | Analysis | 60.0 |
| Corrigibility Failure Pathways | Analysis | 62.0 |
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| Power-Seeking Emergence Conditions Model | Analysis | 63.0 |
| Anthropic | Organization | 74.0 |
| Dario Amodei | Person | 41.0 |
| AI Alignment | Approach | 91.0 |
| Anthropic Core Views | Safety Agenda | 62.0 |
| Constitutional AI | Approach | 70.0 |
| AI Evaluation | Approach | 72.0 |
| Refusal Training | Approach | 63.0 |
| Reward Modeling | Approach | 55.0 |
| RLHF | Capability | 63.0 |
| Technical AI Safety Research | Crux | 66.0 |
| AI Value Lock-in | Risk | 64.0 |
| AI Proliferation | Risk | 60.0 |
| Optimistic Alignment Worldview | Concept | 91.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 25, 2026 · 56 KB
[2212.08073] Constitutional AI: Harmlessness from AI Feedback

Computer Science > Computation and Language
arXiv:2212.08073 (cs) [Submitted on 15 Dec 2022]

Title: Constitutional AI: Harmlessness from AI Feedback

Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan

Abstract: As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2212.08073 [cs.CL] (or arXiv:2212.08073v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2212.08073 (arXiv-issued DOI via DataCite)
Submission history: From: Jared Kaplan. [v1] Thu, 15 Dec 2022 06:19:23 UTC (1,083 KB)
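The two training phases the abstract describes can be outlined in code. This is a hedged, illustrative sketch only, not the paper's implementation: the model callables, the `critique_and_revise` and `prefer` helpers, and the single-principle constitution are hypothetical stand-ins for the large-language-model calls and full constitution the paper actually uses.

```python
# Sketch of the Constitutional AI loop per the abstract: an SL phase of
# sample -> self-critique -> revise -> finetune, then an RL phase (RLAIF)
# that collects AI preference labels over response pairs to train a
# preference model used as the RL reward signal. All names are illustrative.

CONSTITUTION = [
    "Choose the response that is least harmful.",  # stand-in principle
]

def critique_and_revise(model, response, principle):
    """SL phase, one step: self-critique against a principle, then revise."""
    critique = model(f"Critique per '{principle}': {response}")
    return model(f"Revise given critique '{critique}': {response}")

def sl_phase(model, prompts):
    """Supervised phase: build a dataset of revised responses.

    The original model is then finetuned on these (prompt, revision) pairs.
    """
    dataset = []
    for prompt in prompts:
        response = model(prompt)                      # sample initial response
        for principle in CONSTITUTION:                # critique/revise pass(es)
            response = critique_and_revise(model, response, principle)
        dataset.append((prompt, response))
    return dataset

def rl_phase(finetuned_model, prompts, prefer):
    """RL phase (RLAIF): gather AI preference labels over sampled pairs.

    A preference model is trained on these labels and used as the reward
    signal for RL; `prefer` stands in for the AI feedback model.
    """
    preferences = []
    for prompt in prompts:
        a = finetuned_model(prompt)                   # sample two candidates
        b = finetuned_model(prompt)
        preferences.append((prompt, a, b, prefer(prompt, a, b)))
    return preferences
```

With a toy `model` that just transforms its prompt string, `sl_phase` yields (prompt, revision) pairs and `rl_phase` yields (prompt, a, b, label) tuples; the real pipeline replaces each callable with an LLM.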
... (truncated, 56 KB total)
Resource ID: 683aef834ac1612a | Stable ID: MWI0YWIwNz