Longterm Wiki

paper

Data Status

Not fetched

Cited by 21 pages

Cached Content Preview

HTTP 200 | Fetched Feb 25, 2026 | 56 KB
[2212.08073] Constitutional AI: Harmlessness from AI Feedback

Computer Science > Computation and Language
arXiv:2212.08073 (cs) [Submitted on 15 Dec 2022]

Title: Constitutional AI: Harmlessness from AI Feedback

Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan

Abstract: As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2212.08073 [cs.CL] (or arXiv:2212.08073v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2212.08073 (arXiv-issued DOI via DataCite)
Submission history: From: Jared Kaplan. [v1] Thu, 15 Dec 2022 06:19:23 UTC (1,083 KB)

... (truncated, 56 KB total)
Resource ID: 683aef834ac1612a | Stable ID: MWI0YWIwNz
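
Editor's note: the two-phase procedure described in the abstract above can be sketched as a short Python control-flow illustration. This is only an illustration of the pipeline as summarized in the abstract, not the paper's implementation; every function and name below (generate, critique_and_revise, ai_judge, finetune, train_preference_model, rl_finetune, and the toy CONSTITUTION list) is a placeholder stub invented for this sketch.

"""Illustrative sketch of the Constitutional AI pipeline from the abstract.
All model operations are placeholder stubs, not APIs from the paper."""

import random

# Toy "constitution": in the paper this is a list of written principles.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

# ---- placeholder model operations (stubs) ----------------------------------

def generate(model, prompt):
    """Stand-in for sampling a response from a language model."""
    return f"{model}:response:{random.random():.3f}:{prompt}"

def critique_and_revise(model, prompt, response, principle):
    """Stand-in for self-critique plus revision against one principle."""
    return f"revised[{principle[:20]}...]({response})"

def ai_judge(model, prompt, a, b, principles):
    """Stand-in for the feedback model choosing the better of two samples."""
    return a if random.random() < 0.5 else b

def finetune(model, pairs):
    """Stand-in for supervised finetuning on (prompt, revised response) pairs."""
    return f"{model}+sl"

def train_preference_model(preferences):
    """Stand-in for training a preference model on AI preference labels."""
    return "preference-model"

def rl_finetune(model, reward_model, prompts):
    """Stand-in for RL against the preference model as reward signal."""
    return f"{model}+rl"

# ---- the two phases described in the abstract -------------------------------

def supervised_phase(initial_model, prompts):
    """SL phase: sample, self-critique, revise, finetune on revised responses."""
    revised = []
    for prompt in prompts:
        response = generate(initial_model, prompt)
        principle = random.choice(CONSTITUTION)
        response = critique_and_revise(initial_model, prompt, response, principle)
        revised.append((prompt, response))
    return finetune(initial_model, revised)

def rl_phase(sl_model, prompts):
    """RL phase (RLAIF): AI preference labels -> preference model -> RL."""
    prefs = []
    for prompt in prompts:
        a, b = generate(sl_model, prompt), generate(sl_model, prompt)
        prefs.append((prompt, a, b, ai_judge(sl_model, prompt, a, b, CONSTITUTION)))
    reward_model = train_preference_model(prefs)
    return rl_finetune(sl_model, reward_model, prompts)

if __name__ == "__main__":
    sl_model = supervised_phase("initial-model", ["prompt 1", "prompt 2"])
    final_model = rl_phase(sl_model, ["prompt 3"])
    print(final_model)

The only human input in this flow is the CONSTITUTION list; both the revision data and the preference labels are produced by models, which is the "far fewer human labels" point the abstract makes.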