Longterm Wiki

Alignment Science


Record Metadata

Record Key: alignment-science
Entity: Anthropic
Collection: Research Areas (6 records total)
Schema: Major research initiatives and focus areas.
YAML File: packages/kb/data/things/mK9pX3rQ7n.yaml

Fields

Name: Alignment Science
Description: Scalable oversight, weak-to-strong generalization, robustness to jailbreaks
Started: May 2024
Notes: Led by Jan Leike after his departure from OpenAI's Superalignment team
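The fields above are backed by the YAML file listed in the record metadata. A minimal sketch of how that record might be laid out, assuming a flat key-value schema (the lowercase field names are an assumption, not confirmed by the source):

```yaml
# Hypothetical layout of packages/kb/data/things/mK9pX3rQ7n.yaml;
# the actual schema and field names may differ.
name: Alignment Science
description: Scalable oversight, weak-to-strong generalization, robustness to jailbreaks
started: May 2024
notes: Led by Jan Leike after his departure from OpenAI's Superalignment team
```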

Other Records in Research Areas (5)

Key                          | Name                          | Description                                                              | Team Size
mechanistic-interpretability | Mechanistic Interpretability  | Understanding neural network internals through reverse-engineering       | 50
constitutional-ai            | Constitutional AI             | Training AI systems to follow principles through self-critique and RLAIF |
responsible-scaling-policy   | Responsible Scaling Policy    | Framework for evaluating and mitigating risks at each capability level   |
sleeper-agents               | Sleeper Agents Research       | Investigating whether AI systems can maintain hidden behaviors through training |
ai-welfare                   | AI Welfare Research           | Investigating moral status and welfare considerations for AI systems     |