Evaluation & Detection
Evaluation methods assess whether AI systems are aligned and safe to deploy.
General Evaluation:
- Evaluations (Evals)Safety AgendaAI EvaluationsEvaluations and red-teaming reduce detectable dangerous capabilities by 30-50x when combined with training interventions (o3 covert actions: 13% → 0.4%), but face fundamental limitations against so...Quality: 72/100: Overview of AI evaluation approaches
- Alignment EvaluationsAlignment EvalsComprehensive review of alignment evaluation methods showing Apollo Research found 1-13% scheming rates across frontier models, while anti-scheming training reduced covert actions 30x (13%→0.4%) bu...Quality: 65/100: Testing for aligned behavior
- Dangerous Capability EvaluationsDangerous Cap EvalsComprehensive synthesis showing dangerous capability evaluations are now standard practice (95%+ frontier models) but face critical limitations: AI capabilities double every 7 months while external...Quality: 64/100: Assessing harmful potential
Capability Assessment:
- Capability ElicitationCapability ElicitationCapability elicitation—systematically discovering what AI models can actually do through scaffolding, prompting, and fine-tuning—reveals 2-10x performance gaps versus naive testing. METR finds AI a...Quality: 91/100: Uncovering hidden capabilities
- Red TeamingRed TeamingRed teaming is a systematic adversarial evaluation methodology for identifying AI vulnerabilities and dangerous capabilities before deployment, with effectiveness rates varying from 10-80% dependin...Quality: 65/100: Adversarial testing for vulnerabilities
- Model AuditingModel AuditingThird-party auditing organizations (METR, Apollo, UK/US AISIs) now evaluate all major frontier models pre-deployment, discovering that AI task horizons double every 7 months (GPT-5: 2h17m), 5/6 mod...Quality: 64/100: Systematic capability review
Deception Detection:
- Scheming DetectionScheming DetectionReviews empirical evidence that frontier models (o1, Claude 3.5, Gemini 1.5) exhibit in-context scheming capabilities at rates of 0.3-13%, including disabling oversight and self-exfiltration attemp...Quality: 91/100: Identifying strategic deception
- Sleeper Agent DetectionSleeper Agent DetectionComprehensive survey of sleeper agent detection methods finding current approaches achieve only 5-40% success rates despite $15-35M annual investment, with Anthropic's 2024 research showing backdoo...Quality: 66/100: Finding hidden malicious behaviors
Deployment Decisions:
- Safety CasesSafety CasesSafety cases are structured arguments adapted from nuclear/aviation to justify AI system safety, with UK AISI publishing templates in 2024 and 3 of 4 frontier labs committing to implementation. Apo...Quality: 91/100: Structured arguments for deployment safety