Longterm Wiki

Scheming & Deception Detection

Research and evaluation methods for identifying when AI models engage in strategic deception—pretending to be aligned while secretly pursuing other goals—including behavioral tests, internal monitoring, and emerging detection techniques. Frontier models exhibit in-context scheming at rates of 0.3-13%.

Related

Related Pages

Top Related Pages

Safety Research

Anthropic Core Views

Risks

Mesa-Optimization

Analysis

Model Organisms of MisalignmentCapability-Alignment Race Model

Approaches

Representation EngineeringScalable Eval Approaches

Other

InterpretabilityAI ControlAI EvaluationsScalable OversightDario AmodeiYoshua Bengio

Organizations

METR

Concepts

Alignment Evaluation OverviewSituational AwarenessLarge Language Models

Key Debates

AI Alignment Research AgendasTechnical AI Safety Research

Historical

Deep Learning Revolution EraMainstream Era

Tags

schemingdeception-detectionbehavioral-testingchain-of-thoughtinterpretability