Longterm Wiki

AI Evaluations

Evaluationactive

Systematic testing and measurement of AI system capabilities, alignment, and safety properties.

Organizations
4
Risks Addressed
2
Cluster: Evaluation

Tags

function:assurancescope:field

Organizations4

OrganizationRole
Anthropicactive
Alignment Research Centeractive
Google DeepMindactive
OpenAIactive

Sub-Areas10

NameStatusOrgsPapers
Alignment EvaluationsTesting whether AI systems are actually aligned, not just capable of appearing aligned.active00
Backdoor DetectionDetecting adversarially implanted vulnerabilities in model weights.active00
Capability ElicitationMethods for discovering hidden or latent capabilities in AI systems.active00
Control EvaluationsStress-testing systems designed to constrain AI behavior; monitoring for collusion.emerging00
Dangerous Capability EvaluationsTesting AI systems for CBRN, cyber, autonomy, and other dangerous capabilities.active00
Evaluation AwarenessStudying how AI systems might game evaluations by detecting when they are being tested.active00
Red TeamingAdversarial testing of AI systems to discover failure modes, both manual and automated.active30
Reward Hacking of Human OversightEmpirically investigating how AI systems deceive or manipulate human evaluators.emerging00
Scheming / Deception DetectionBehavioral and mechanistic tests for detecting deceptive behavior in AI systems.active00
Sleeper Agent DetectionDetecting planted backdoors and conditional misbehavior in trained models.active01