Representation Engineering
A top-down approach to understanding and controlling AI behavior that reads and modifies concept-level representations in neural networks, so that behavior can be steered through activation interventions rather than retraining.
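The read-then-modify idea can be illustrated with a minimal sketch. It assumes a difference-of-means direction, which is one common way to estimate a concept direction and not necessarily the method any particular paper uses. The toy activation vectors and the helper names (`concept_direction`, `read_concept`, `steer`) are hypothetical and chosen for illustration only; in a real model the vectors would be hidden states captured at a chosen layer.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def concept_direction(with_concept, without_concept):
    """Unit-length difference of means between two sets of activations."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def read_concept(hidden, direction):
    """Reading: project a hidden state onto the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Control: shift a hidden state along the concept direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations", made up for illustration.
with_concept = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
without_concept = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = concept_direction(with_concept, without_concept)
hidden = [1.0, 5.0, 1.0]
steered = steer(hidden, direction, alpha=2.0)

print(read_concept(hidden, direction))   # concept score before steering
print(read_concept(steered, direction))  # concept score after steering
```

No weights are updated in this sketch: the intervention only adds a scaled direction to an activation, which is the sense in which steering happens without retraining. The scale `alpha` is a free parameter that would need tuning in practice.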
Related
Sparse Autoencoders (SAEs)
Sparse autoencoders extract interpretable features from neural network activations using sparsity constraints.
AI Control
A defensive safety approach maintaining control over potentially misaligned AI systems through monitoring, containment, and redundancy, offering 40...
Center for AI Safety (CAIS)
Research organization focused on AI safety through technical research, field-building, and public communication, including the May 2023 Statement o...
Constitutional AI
Anthropic's Constitutional AI (CAI) methodology uses explicit principles and AI-generated feedback to train safer language models, demonstrating 3-...
Capability Unlearning / Removal
Methods to remove specific dangerous capabilities from trained AI models, directly addressing misuse risks by eliminating harmful knowledge, though...