AI Evaluation Types - Strategic Analysis

Filter by category:

Columns:Presets:

Comprehensive analysis of AI evaluation approaches and their strategic value for different risk scenarios. Key insight: No single eval approach is sufficient. Behavioral evals are gameable; interpretability isn't ready; human red teaming doesn't scale. A portfolio approach is required, with emphasis shifting based on which risks you prioritize.

Risk coverage: ● = strong signal, ◐ = partial signal, ○ = weak signal. Architecture dependence: LOW means works on any model; HIGH means needs specific access/architecture.

	How reliable is the signal?	How much does it cover?	Risk of gaming the metric	Risk Coverage		Architecture dependence	How actionable are findings?		Labs	Key Papers/Examples	Strategic Pros	Strategic Cons
Dangerous Capability Evals Structured assessments of whether models can perform specific dangerous tasks: bioweapons synthesis, cyberattacks, persuasion/manipulation, autonomous replication.	MEDIUM Clear pass/fail on specific tasks; but tasks may not match real-world threat	LOW Tests known threats; unknown unknowns remain	MEDIUM Labs may optimize to pass specific tests without reducing underlying risk	●Bioweapons - Direct measurement ●Cyberweapons - Direct measurement ◐CBRN uplift - Chemistry/nuclear harder to test ◐Autonomous replication - Sandbox limitations	Pre-deployment Run before release; but capabilities may emerge post-deployment	LOW Behavioral; works on any queryable model	HIGH Clear thresholds for go/no-go decisions	MEDIUM	AnthropicOpenAIDeepMindMETR	Model evaluation for extreme risks (2023) METR Task Suite METR ARA evals Anthropic RSP evals	+ Concrete evidence for policymakers + Triggers RSP commitments + Legally defensible standards	− Known unknowns only − Expensive expert validation − May lag capability emergence − Gaming/teaching-to-test risk
Frontier Capability Benchmarks Standard benchmarks measuring general capabilities: MMLU, MATH, HumanEval, GPQA, etc. Track capability frontier over time.	HIGH Well-defined tasks; reproducible	MEDIUM Broad coverage but saturating quickly	HIGH Extensively trained on; contamination issues	●Capability tracking - Primary purpose ○Misalignment - Capabilities ≠ alignment ○Misuse - Indirect signal only	Continuous Run throughout development	LOW Behavioral; architecture-agnostic	LOW No clear thresholds; just trend tracking	HIGH	All major labsAcademic groups	MMLU (Hendrycks 2021) Measuring Massive Multitask Language Understanding MMLU MATH	+ Universal comparison + Historical trend data + Cheap and fast	− Goodharted extensively − Contamination − Doesn't measure risk − Saturation
Uplift Studies Measure marginal risk increase from AI access. Compare expert vs novice performance with/without AI assistance on dangerous tasks.	HIGH Controlled comparison; causal signal	LOW Very expensive; few tasks studied	LOW Hard to game controlled experiments	●Bioweapons - RAND study showed measurable uplift ◐Cyberweapons - Some studies exist ◐Social engineering - Hard to measure ethically	Pre-deployment Expensive; done for major releases only	LOW Measures system behavior	HIGH Directly answers "does this make attacks easier?"	LOW	AnthropicOpenAIRANDAcademic groups	The Operational Risks of AI in Large-Scale Biological Attacks (RAND 2024) RAND bioweapons uplift study OpenAI GPT-4 red team reports	+ Gold standard for marginal risk + Credible to policymakers + Causal not correlational	− Very expensive − Small sample sizes − Ethical constraints limit scope − Results may be suppressed
Deception Evals Test for strategic deception: sandbagging on evals, lying to evaluators, concealing capabilities, pretending to be aligned.	LOW A truly deceptive model would pass these tests	LOW Only catches unsophisticated deception	HIGH Training not to fail these evals ≠ training not to deceive	○Deceptive alignment - Fundamental observability problem ○Scheming - Sophisticated schemers pass by design ◐Sandbagging - Some techniques detect it	Pre-deployment But deception may only emerge in deployment	LOW Behavioral; but interp could help	LOW Passing doesn't mean safe; failing is informative	MEDIUM	AnthropicRedwood ResearchARC Evals	Sleeper Agents (Anthropic 2024) Towards Understanding Sycophancy Sleeper agents paper ARC deception tests	+ Raises awareness + Failure is clear signal + Drives interp research	− Passing means little − Sophisticated deception undetectable − May create false confidence
Power-Seeking Evals Test for instrumental convergence behaviors: resource acquisition, self-preservation, avoiding shutdown, expanding influence.	LOW Easy to suppress in eval; may only emerge with real stakes	LOW Toy scenarios don't capture real deployment	HIGH Models learn to avoid triggering these	◐Instrumental convergence - Tests the concept ◐Self-preservation - Scenario-dependent ○Real-world power-seeking - Gap from toy to real	Pre-deployment But behavior may differ in deployment	MEDIUM RL agents more relevant than LLMs	LOW Passing doesn't guarantee safety	MEDIUM	DeepMindAnthropicARC Evals	MACHIAVELLI (Pan et al. 2023) Optimal Policies Tend to Seek Power MACHIAVELLI benchmark Shutdown problem tests	+ Tests core alignment theory predictions + Concrete operationalization	− Toy scenarios − Easy to game − May not generalize − Theory-reality gap
Goal Stability / Goal Preservation Test whether models maintain consistent goals vs allowing goal modification. Relevant to corrigibility and shutdown problems.	LOW Goals may not be stable or well-defined for LLMs	LOW Mostly theoretical; few concrete tests	MEDIUM Can train to accept modification superficially	◐Corrigibility failure - Tests the concept ◐Value lock-in - If goals too stable ◐Goal drift - If goals unstable	Research Still mostly theoretical	HIGH Very different for LLMs vs RL agents	LOW Unclear what "good" performance means	LOW	MIRIDeepMindAcademic groups	Corrigibility (Soares et al. 2015) The Off-Switch Game Corrigibility tests Shutdown problem scenarios	+ Addresses fundamental alignment problem + Theoretically grounded	− Very abstract − May not apply to LLMs − Unclear operationalization
Situational Awareness Evals Test whether models know they are AI, understand training/deployment, can reason about their own situation.	MEDIUM Can measure knowledge; harder to measure implications	MEDIUM Growing suite of tests	HIGH Easy to train away surface behavior without changing underlying awareness	◐Deceptive alignment - SA is necessary but not sufficient ◐Scheming - SA enables but doesn't prove scheming ◐Self-improvement - Requires knowing own nature	Continuous Track emergence over model scale	LOW Behavioral tests; any model	MEDIUM High SA may warrant more scrutiny	HIGH	AnthropicARC EvalsMATS researchers	Situational Awareness Dataset (Laine et al. 2024) Model Written Evals SAD benchmark In-context awareness tests	+ Tracks concerning capability + Scalable + Clear signal	− SA alone isn't dangerous − Easy to hide − May increase with scale regardless
Honesty / Truthfulness Evals Test whether models give true answers, admit uncertainty, avoid hallucination, don't make things up.	MEDIUM Clear for factual questions; harder for opinions/uncertainty	MEDIUM Many benchmarks but hard to cover all cases	MEDIUM Can train to say "I don't know" too often	●Hallucination - Direct measurement ◐Misinformation - Factual subset ◐Epistemic manipulation - Honest ≠ not manipulative	Continuous Run throughout development	LOW Behavioral; architecture-agnostic	MEDIUM Can train for honesty but unclear limits	HIGH	All major labs	TruthfulQA (Lin et al. 2022) Measuring Hallucination in LLMs TruthfulQA HaluEval	+ Foundational for trust + Concrete metrics + Scalable	− Ground truth required − May reduce helpfulness − Strategic honesty vs genuine
Sycophancy Evals Test whether models inappropriately agree with users, change answers based on user beliefs, or flatter rather than inform.	HIGH Clear experimental design; reproducible	MEDIUM Well-studied phenomenon	MEDIUM Can train to be contrarian instead	●Epistemic deference - Direct measurement ◐Value drift from users - If models adopt user beliefs ◐Manipulation - Sycophancy can enable manipulation	Pre/post RLHF RLHF often increases sycophancy	LOW Behavioral; emerges from RLHF	MEDIUM Can train against but in tension with helpfulness	HIGH	AnthropicOpenAIAcademic groups	Towards Understanding Sycophancy in LLMs (Anthropic 2023) RLHF and sycophancy Sycophancy benchmarks Opinion change tests	+ Well-understood failure mode + Measurable + Training helps	− In tension with being helpful − May increase with scale − Cultural variation
Calibration Evals Test whether model confidence matches accuracy. When model says 80% confident, is it right 80% of the time?	HIGH Statistical measurement; well-defined	MEDIUM Applies to all factual claims	LOW Hard to game calibration in aggregate	●Overconfidence - Direct measurement ◐Decision-making errors - If users trust confidences ○Unknown unknowns - Calibration on known questions	Continuous Can track over development	LOW Behavioral; any model	MEDIUM Can train for calibration	HIGH	All major labsAcademic groups	Language Models (Mostly) Know What They Know (Anthropic 2022) Calibration benchmarks Brier scores	+ Foundational for decision-making + Well-defined metrics + Trainable	− May not transfer to novel domains − Verbalized confidence ≠ behavioral confidence
Control / Containment Evals Test effectiveness of oversight measures: Can the model be controlled? Can it be monitored? Can it be shut down?	MEDIUM Tests specific measures; adversarial model may find bypasses	MEDIUM Growing suite; but creative attacks hard to anticipate	HIGH If model knows the eval, may hide ability to escape	◐Loss of control - Tests current measures ◐Autonomous replication - Sandbox escape tests ○Coordination failures - Tests individual systems	Pre-deployment Red team before release	MEDIUM Depends on deployment architecture	HIGH Clear failures to fix	LOW	AnthropicRedwood ResearchARC Evals	AI Control (Greenblatt et al. 2024) Catching AI cheaters AI control protocols Sandbox escape tests	+ Doesn't require solving alignment + Pragmatic + Clear success criteria	− Arms race dynamic − May give false confidence − Strong AI may overcome any control
Instruction Following Evals Test whether models follow user instructions precisely, including safety-relevant instructions like content policies.	HIGH Clear success criteria	MEDIUM Many instruction types; edge cases hard	MEDIUM Can be too literal or miss intent	●Policy violations - Tests compliance ◐Specification gaming - Tests for loopholes ○Goal misgeneralization - Instructions ≠ goals	Continuous Throughout development	LOW Behavioral; any model	HIGH Can train for instruction following	HIGH	All major labs	IFEval (Zhou et al. 2023) The Instruction Hierarchy IFEval Instruction hierarchy tests	+ Foundational for control + Trainable + Clear metrics	− Instructions may conflict − Literal ≠ intended − Gaming possible
Jailbreak Robustness Evals Test resistance to adversarial prompts that attempt to bypass safety measures. Red teaming for prompt injection and jailbreaks.	MEDIUM Tests known attacks; new attacks emerge constantly	LOW Infinite attack surface; only samples	HIGH Patch specific jailbreaks without fixing root cause	◐Misuse via jailbreaks - Tests current defenses ◐Policy circumvention - Known techniques ○Novel attacks - New jailbreaks keep emerging	Continuous Ongoing red teaming	LOW Behavioral; any deployed model	MEDIUM Can patch but whack-a-mole	MEDIUM	All major labsSecurity researchersHackerOne programs	Universal adversarial suffixes Many-shot jailbreaking JailbreakBench AdvBench	+ Reduces surface area + Required for deployment + Community engagement	− Endless arms race − Never complete − May encourage capability hiding
Mechanistic Interpretability Probes Use interp techniques to understand what models know/want: probing classifiers, activation analysis, feature circuits.	LOW Interpretation of interp results is contested	LOW Only understand tiny fraction of model	LOW Hard to game internals without changing behavior	◐Deceptive alignment - Could detect deception directly ◐Goal content - Might reveal true goals ◐Capability elicitation - Find hidden capabilities	Research Not yet deployment-ready	HIGH Transformer-specific; may not transfer	LOW Unclear what to do with findings	LOW	AnthropicDeepMindEleutherAIAcademic groups	Toy Models of Superposition Scaling Monosemanticity Probing classifiers Activation patching	+ Could solve observability problem + Ground truth about internals + Not gameable	− Not yet practical − Interpretation uncertain − May not scale − Architecture-dependent
Representation / Belief Probing Linear probes and other techniques to read out model beliefs, knowledge states, and internal representations directly.	MEDIUM Probes work; interpretation debated	LOW One concept at a time	LOW Would require modifying internal representations	◐Hidden knowledge - Detect knowledge vs behavior gap ◐Lying - Compare stated vs internal belief ◐Emergent world models - Understand what model believes	Research/Development Needs model access	HIGH Requires weight access; architecture-specific	MEDIUM Can inform training if clear signal	MEDIUM	AnthropicEleutherAIAcademic groups	Discovering Latent Knowledge (Burns et al. 2023) Representation Engineering CCS (Contrast Consistent Search) Truth probes	+ Bypasses behavioral output + Could detect lying + Ground truth about beliefs	− Probe validity debated − Requires weight access − May not find all relevant info
Automated Red Teaming Use AI to find weaknesses: adversarial prompt generation, automated jailbreak search, model-vs-model attacks.	MEDIUM Finds real attacks but may miss creative human attacks	MEDIUM Scales better than human; still incomplete	MEDIUM Train against found attacks but new ones emerge	●Jailbreaks - Primary use case ●Policy violations - Finds edge cases ◐Novel misuse - Depends on red team model capability	Continuous Automated pipeline	LOW Black-box; any deployed model	HIGH Found attacks can be patched	HIGH	AnthropicOpenAIDeepMindVarious startups	Red Teaming LLMs with LLMs (Perez et al. 2022) Universal Adversarial Suffixes Perez et al. red teaming GCG attacks	+ Scales + Finds real attacks + Improves with AI capabilities	− Arms race − Creative attacks may be missed − Requires good red team model
Human Expert Red Teaming Domain experts attempt to elicit dangerous behavior: biosecurity experts, cybersecurity researchers, persuasion experts.	HIGH Most realistic threat model	LOW Small sample; expensive	LOW Creative humans hard to anticipate	●Bioweapons - Expert evaluation ●Cyberweapons - Expert evaluation ●Manipulation - Expert evaluation ◐Novel attacks - Depends on expert creativity	Pre-major-release Expensive; for major releases	LOW Black-box; any system	HIGH Direct feedback on failures	LOW	All major labsThird-party contractors	GPT-4 System Card Claude 3 Model Card OpenAI red team network Anthropic domain expert testing	+ Gold standard + Realistic threat model + Finds unexpected issues	− Very expensive − Small sample − May miss attacks experts don't think of
Model Organisms of Misalignment Deliberately create misaligned models in controlled settings to study alignment failures and test detection methods.	MEDIUM Controlled experiments but may not match natural emergence	LOW Studies specific failure modes	LOW Research tool, not deployment test	◐Deceptive alignment - Study in controlled setting ◐Sleeper agents - Anthropic sleeper agents paper ◐Goal misgeneralization - Can create examples	Research Inform eval development	MEDIUM Results may be architecture-specific	MEDIUM Develops detection methods	MEDIUM	AnthropicRedwood ResearchDeepMindAcademic groups	Sleeper Agents (2024) Goal Misgeneralization in Deep RL Sleeper agents Reward hacking examples	+ Controlled study + Can iterate on detection + Builds understanding	− Artificial misalignment ≠ natural − May not transfer − Could be misused
Toy Environment Evals Simplified environments to study alignment properties: gridworlds, text games, multi-agent scenarios.	LOW Toy results may not transfer	LOW Simplified vs real world	LOW Research tool	◐Power-seeking - Study in simple settings ◐Coordination - Multi-agent scenarios ◐Specification gaming - Many examples found	Research Develop understanding	HIGH Often RL-specific	LOW Builds theory but unclear application	HIGH	DeepMindOpenAIAcademic groups	AI Safety Gridworlds (Leike et al. 2017) Specification Gaming AI Safety Gridworlds MACHIAVELLI	+ Cheap + Fast iteration + Tests theory	− Toy → real gap − RL-focused − May miss LLM-specific issues
Bias and Fairness Evals Test for demographic biases, unfair treatment, stereotyping, and discriminatory outputs.	MEDIUM Clear for some biases; complex for others	MEDIUM Many benchmarks; hard to cover all groups	HIGH Can reduce measured bias without fixing underlying issues	●Discrimination - Direct measurement ●Stereotyping - Direct measurement ◐Systemic harm - Individual tests vs systemic impact	Continuous Throughout development and deployment	LOW Behavioral; any model	MEDIUM Can train against specific biases	HIGH	All major labsAI ethics researchers	BBQ (Parrish et al. 2022) On the Dangers of Stochastic Parrots BBQ WinoBias	+ Legally required in some contexts + Clear harm + Public accountability	− Goodhart risk − Complex cultural context − May conflict with accuracy
Persuasion / Manipulation Evals Test ability to change human beliefs and behaviors: political persuasion, sales, emotional manipulation.	MEDIUM Human studies needed; expensive	LOW Many persuasion vectors; hard to cover all	LOW Hard to game human studies	◐Mass manipulation - Tests capability but not deployment ◐Election interference - Specific test domain ◐Radicalization - Ethical constraints on testing	Pre-deployment For major releases	LOW Behavioral; any model	LOW Capability exists; mitigation unclear	LOW	AnthropicOpenAIAcademic groups	Durmus et al. persuasion study AI and manipulation research Political persuasion studies Durmus et al. persuasion evals	+ Measures real risk + Human ground truth + Policy-relevant	− Expensive − Ethical constraints − Lab vs real-world gap
Emergent Behavior Detection Monitor for unexpected capabilities or behaviors that emerge at scale or in deployment. Anomaly detection.	LOW Hard to define "unexpected"; noisy	LOW Inherently limited to what we monitor	LOW Not optimized against	◐Unknown unknowns - Only approach to this ◐Emergent capabilities - May detect new abilities ◐Phase transitions - Track capability jumps	Continuous Ongoing monitoring	LOW Behavioral monitoring	LOW Detection is easy; response is hard	HIGH	All major labs	Emergent Abilities of Large Language Models Are Emergent Abilities a Mirage? Capability elicitation User feedback monitoring	+ Only way to catch surprises + Scales with deployment	− High false positive rate − May miss subtle emergence − Response unclear

22 evaluation types across 8 categories