Longterm Wiki

Alignment Evaluations

Systematic testing of AI models for alignment properties including honesty, corrigibility, goal stability, and absence of deceptive behavior. Apollo Research found 1-13% scheming rates across frontier models, while TruthfulQA shows 58-85% accuracy on factual questions.

Related

Related Pages

Top Related Pages

Risks

SycophancyMesa-OptimizationPower-Seeking AIInstrumental ConvergenceTreacherous Turn

Analysis

Sycophancy Feedback Loop ModelCorrigibility Failure Pathways

Approaches

AI AlignmentDangerous Capability EvaluationsScheming & Deception DetectionCapability ElicitationSleeper Agent DetectionThird-Party Model Auditing

Concepts

Epistemic Orgs OverviewSituational AwarenessAlignment Evaluation Overview

Organizations

Google DeepMind

Other

CorrigibilityStuart RussellAI Evaluations

Tags

alignment-evaluationscheming-detectionsycophancycorrigibilitybehavioral-testing