Constitutional AI
Constitutional AI is Anthropic's methodology for training safer language models using explicit written principles and AI-generated feedback, with reported 3-10x improvements in harmlessness while maintaining helpfulness across major model deployments.
Related Pages
AI Alignment
Technical approaches to ensuring AI systems pursue intended goals and remain aligned with human values throughout training and deployment. Current ...
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
RLHF
RLHF and Constitutional AI are the dominant techniques for aligning language models with human preferences.
Reward Hacking
AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying eva...
Representation Engineering
A top-down approach to understanding and controlling AI behavior by reading and modifying concept-level representations in neural networks, enablin...