Capability Unlearning / Removal
Methods for removing specific dangerous capabilities from trained AI models. By eliminating harmful knowledge outright, unlearning directly addresses misuse risks, but current techniques face challenges around verification (proving the capability is gone), capability recovery (fine-tuning can restore it), and general performance degradation.
Related Pages
Representation Engineering
A top-down approach to understanding and controlling AI behavior by reading and modifying concept-level representations in neural networks.
Center for AI Safety (CAIS)
Research organization focused on AI safety through technical research, field-building, and public communication, including the May 2023 Statement on AI Risk.
Responsible Scaling Policies
Responsible Scaling Policies (RSPs) are voluntary commitments by AI labs to pause scaling when capability or safety thresholds are crossed.
Bioweapons Risk
AI-assisted biological weapon development represents one of the most severe near-term AI risks.
Alignment Training Overview
Techniques for training AI systems to be aligned with human values and intentions, from RLHF to constitutional AI.