Longterm Wiki
ICLR 2021

paper

Authors

Dan Hendrycks·Collin Burns·Steven Basart·Andrew Critch·Jerry Li·Dawn Song·Jacob Steinhardt

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Data Status

Not fetched

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Cited by 2 pages

Page | Type | Quality
FAR AI | Organization | 76.0
Dan Hendrycks | Person | 19.0
Resource ID: 57379f24535e9c04 | Stable ID: NDU1ZjIwN2