Evaluation & Detection (Overview)
Evaluation methods assess whether AI systems are aligned and safe to deploy.
General Evaluation:
- Evaluations (Evals): Overview of AI evaluation approaches
- Alignment Evaluations: Testing for aligned behavior
- Dangerous Capability Evaluations: Assessing harmful potential
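At their core, the evaluations listed above run a model over a fixed set of test cases and score its outputs. A minimal sketch, assuming a hypothetical `run_eval` harness and a toy stand-in for the model (none of these names come from any real evals library):

```python
# Minimal eval-harness sketch (all names hypothetical):
# run prompts through a model callable and score exact-match accuracy.
from typing import Callable

def run_eval(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the model's answer matches the target."""
    passed = sum(1 for prompt, target in cases if model(prompt).strip() == target)
    return passed / len(cases)

# Toy "model" and cases purely for illustration.
toy_model = lambda p: {"2+2=?": "4", "capital of France?": "Paris"}.get(p, "")
cases = [("2+2=?", "4"), ("capital of France?", "Paris"), ("3*3=?", "9")]
print(run_eval(toy_model, cases))  # 2 of 3 cases pass
```

Real alignment and dangerous-capability evals differ mainly in what the cases probe and how scoring works (graded rubrics, judge models, behavioral triggers) rather than in this basic loop.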
Capability Assessment:
- Capability Elicitation: Uncovering hidden capabilities
- Red Teaming: Adversarial testing for vulnerabilities
- Model Auditing: Systematic capability review
Deception Detection:
- Scheming Detection: Identifying strategic deception
- Sleeper Agent Detection: Finding hidden malicious behaviors
Evaluation Scaling:
- Eval Saturation & The Evals Gap: Accelerating benchmark saturation and its implications
- Scalable Eval Approaches: Practical tools for scaling evaluation capacity
- Evaluation Awareness: Models detecting and adapting to evaluation contexts
Deployment Decisions:
- Safety Cases: Structured arguments for deployment safety