| mechanistic-interpretability | Mechanistic Interpretability | Understanding neural network internals through reverse-engineering | 50 |
| constitutional-ai | Constitutional AI | Training AI systems to follow principles through self-critique and RLAIF | — |
| alignment-science | Alignment Science | Scalable oversight, weak-to-strong generalization, robustness to jailbreaks | — |
| responsible-scaling-policy | Responsible Scaling Policy | Framework for evaluating and mitigating risks at each capability level | — |
| sleeper-agents | Sleeper Agents Research | Investigating whether AI systems can retain hidden, deceptive behaviors that persist through safety training | — |