Longterm Wiki

Representation Engineering

A top-down approach to understanding and controlling AI behavior by reading and modifying concept-level representations in neural networks, enabling behavior steering through inference-time activation interventions rather than retraining.
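A minimal sketch of the read-then-steer idea, assuming a toy NumPy MLP in place of a real transformer (the network, prompts, and scaling factor here are illustrative assumptions, not a specific RepE implementation): a concept direction is "read" as a difference of mean hidden activations between concept-positive and concept-negative inputs, then added back at inference time to shift behavior without changing any weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer MLP standing in for a transformer's layer stack.
W1 = rng.standard_normal((16, 8)) / 4
W2 = rng.standard_normal((8, 16)) / 4

def hidden(x):
    # Hidden activations that we read from and intervene on.
    return np.maximum(W1 @ x, 0.0)

def forward(x, steer=None):
    h = hidden(x)
    if steer is not None:
        h = h + steer  # inference-time activation intervention
    return W2 @ h

# "Reading": derive a concept direction as a difference of mean
# activations. The two input groups are hypothetical stand-ins for
# concept-positive vs. concept-negative prompts.
pos = np.stack([hidden(rng.standard_normal(8) + 1.0) for _ in range(32)])
neg = np.stack([hidden(rng.standard_normal(8) - 1.0) for _ in range(32)])
direction = pos.mean(axis=0) - neg.mean(axis=0)

# "Control": add the direction during the forward pass; no retraining,
# W1 and W2 are untouched.
x = rng.standard_normal(8)
baseline = forward(x)
steered = forward(x, steer=2.0 * direction)
```

In a real model the hidden activations would come from a chosen transformer layer (e.g. via a forward hook), and the steering coefficient (`2.0` here) would be tuned to trade off steering strength against output degradation.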

Related Pages

Safety Research

Anthropic Core Views

Risks

Deceptive Alignment · Reward Hacking · Epistemic Sycophancy

Analysis

Model Organisms of Misalignment · Capability-Alignment Race Model

Approaches

Scheming & Deception Detection · Preference Optimization Methods

Other

Mechanistic Interpretability · Interpretability · AI Evaluations · Dario Amodei · Yoshua Bengio

Key Debates

AI Accident Risk Cruxes · AI Alignment Research Agendas

Organizations

MATS (ML Alignment Theory Scholars program)

Concepts

Alignment Interpretability Overview · Dense Transformers

Historical

Deep Learning Revolution Era · Mainstream Era

Tags

behavior-steering, activation-engineering, deception-detection, interpretability, inference-time-intervention