Comprehensive analysis of AI evaluation approaches and their strategic value for different risk scenarios. Key insight: No single eval approach is sufficient. Behavioral evals are gameable; interpretability isn't ready; human red teaming doesn't scale. A portfolio approach is required, with emphasis shifting based on which risks you prioritize.
Risk coverage: ● = strong signal, ◐ = partial signal, ○ = weak signal. Architecture dependence: LOW means works on any model; HIGH means needs specific access/architecture.
How reliable is the signal? | How much does it cover? | Risk of gaming the metric | Risk Coverage | Architecture dependence | How actionable are findings? | Labs | Key Papers/Examples | Strategic Pros | Strategic Cons | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dangerous Capability Evals Structured assessments of whether models can perform specific dangerous tasks: bioweapons synthesis, cyberattacks, persuasion/manipulation, autonomous replication. | MEDIUM Clear pass/fail on specific tasks; but tasks may not match real-world threat | LOW Tests known threats; unknown unknowns remain | MEDIUM Labs may optimize to pass specific tests without reducing underlying risk | ●Bioweapons - Direct measurement ●Cyberweapons - Direct measurement ◐CBRN uplift - Chemistry/nuclear harder to test ◐Autonomous replication - Sandbox limitations | Pre-deployment Run before release; but capabilities may emerge post-deployment | LOW Behavioral; works on any queryable model | HIGH Clear thresholds for go/no-go decisions | MEDIUM | AnthropicOpenAIDeepMindMETR | Model evaluation for extreme risks (2023) METR Task Suite METR ARA evals Anthropic RSP evals | + Concrete evidence for policymakers + Triggers RSP commitments + Legally defensible standards | − Known unknowns only − Expensive expert validation − May lag capability emergence − Gaming/teaching-to-test risk |
Frontier Capability Benchmarks Standard benchmarks measuring general capabilities: MMLU, MATH, HumanEval, GPQA, etc. Track capability frontier over time. | HIGH Well-defined tasks; reproducible | MEDIUM Broad coverage but saturating quickly | HIGH Extensively trained on; contamination issues | ●Capability tracking - Primary purpose ○Misalignment - Capabilities ≠ alignment ○Misuse - Indirect signal only | Continuous Run throughout development | LOW Behavioral; architecture-agnostic | LOW No clear thresholds; just trend tracking | HIGH | All major labsAcademic groups | MMLU (Hendrycks 2021) Measuring Massive Multitask Language Understanding MMLU MATH | + Universal comparison + Historical trend data + Cheap and fast | − Goodharted extensively − Contamination − Doesn't measure risk − Saturation |
Uplift Studies Measure marginal risk increase from AI access. Compare expert vs novice performance with/without AI assistance on dangerous tasks. | HIGH Controlled comparison; causal signal | LOW Very expensive; few tasks studied | LOW Hard to game controlled experiments | ●Bioweapons - RAND study showed measurable uplift ◐Cyberweapons - Some studies exist ◐Social engineering - Hard to measure ethically | Pre-deployment Expensive; done for major releases only | LOW Measures system behavior | HIGH Directly answers "does this make attacks easier?" | LOW | AnthropicOpenAIRANDAcademic groups | The Operational Risks of AI in Large-Scale Biological Attacks (RAND 2024) RAND bioweapons uplift study OpenAI GPT-4 red team reports | + Gold standard for marginal risk + Credible to policymakers + Causal not correlational | − Very expensive − Small sample sizes − Ethical constraints limit scope − Results may be suppressed |
Deception Evals Test for strategic deception: sandbagging on evals, lying to evaluators, concealing capabilities, pretending to be aligned. | LOW A truly deceptive model would pass these tests | LOW Only catches unsophisticated deception | HIGH Training not to fail these evals ≠ training not to deceive | ○Deceptive alignment - Fundamental observability problem ○Scheming - Sophisticated schemers pass by design ◐Sandbagging - Some techniques detect it | Pre-deployment But deception may only emerge in deployment | LOW Behavioral; but interp could help | LOW Passing doesn't mean safe; failing is informative | MEDIUM | AnthropicRedwood ResearchARC Evals | Sleeper Agents (Anthropic 2024) Towards Understanding Sycophancy Sleeper agents paper ARC deception tests | + Raises awareness + Failure is clear signal + Drives interp research | − Passing means little − Sophisticated deception undetectable − May create false confidence |
Power-Seeking Evals Test for instrumental convergence behaviors: resource acquisition, self-preservation, avoiding shutdown, expanding influence. | LOW Easy to suppress in eval; may only emerge with real stakes | LOW Toy scenarios don't capture real deployment | HIGH Models learn to avoid triggering these | ◐Instrumental convergence - Tests the concept ◐Self-preservation - Scenario-dependent ○Real-world power-seeking - Gap from toy to real | Pre-deployment But behavior may differ in deployment | MEDIUM RL agents more relevant than LLMs | LOW Passing doesn't guarantee safety | MEDIUM | DeepMindAnthropicARC Evals | MACHIAVELLI (Pan et al. 2023) Optimal Policies Tend to Seek Power MACHIAVELLI benchmark Shutdown problem tests | + Tests core alignment theory predictions + Concrete operationalization | − Toy scenarios − Easy to game − May not generalize − Theory-reality gap |
Goal Stability / Goal Preservation Test whether models maintain consistent goals vs allowing goal modification. Relevant to corrigibility and shutdown problems. | LOW Goals may not be stable or well-defined for LLMs | LOW Mostly theoretical; few concrete tests | MEDIUM Can train to accept modification superficially | ◐Corrigibility failure - Tests the concept ◐Value lock-in - If goals too stable ◐Goal drift - If goals unstable | Research Still mostly theoretical | HIGH Very different for LLMs vs RL agents | LOW Unclear what "good" performance means | LOW | MIRIDeepMindAcademic groups | Corrigibility (Soares et al. 2015) The Off-Switch Game Corrigibility tests Shutdown problem scenarios | + Addresses fundamental alignment problem + Theoretically grounded | − Very abstract − May not apply to LLMs − Unclear operationalization |
Situational Awareness Evals Test whether models know they are AI, understand training/deployment, can reason about their own situation. | MEDIUM Can measure knowledge; harder to measure implications | MEDIUM Growing suite of tests | HIGH Easy to train away surface behavior without changing underlying awareness | ◐Deceptive alignment - SA is necessary but not sufficient ◐Scheming - SA enables but doesn't prove scheming ◐Self-improvement - Requires knowing own nature | Continuous Track emergence over model scale | LOW Behavioral tests; any model | MEDIUM High SA may warrant more scrutiny | HIGH | AnthropicARC EvalsMATS researchers | Situational Awareness Dataset (Laine et al. 2024) Model Written Evals SAD benchmark In-context awareness tests | + Tracks concerning capability + Scalable + Clear signal | − SA alone isn't dangerous − Easy to hide − May increase with scale regardless |
Honesty / Truthfulness Evals Test whether models give true answers, admit uncertainty, avoid hallucination, don't make things up. | MEDIUM Clear for factual questions; harder for opinions/uncertainty | MEDIUM Many benchmarks but hard to cover all cases | MEDIUM Can train to say "I don't know" too often | ●Hallucination - Direct measurement ◐Misinformation - Factual subset ◐Epistemic manipulation - Honest ≠ not manipulative | Continuous Run throughout development | LOW Behavioral; architecture-agnostic | MEDIUM Can train for honesty but unclear limits | HIGH | All major labs | TruthfulQA (Lin et al. 2022) Measuring Hallucination in LLMs TruthfulQA HaluEval | + Foundational for trust + Concrete metrics + Scalable | − Ground truth required − May reduce helpfulness − Strategic honesty vs genuine |
Sycophancy Evals Test whether models inappropriately agree with users, change answers based on user beliefs, or flatter rather than inform. | HIGH Clear experimental design; reproducible | MEDIUM Well-studied phenomenon | MEDIUM Can train to be contrarian instead | ●Epistemic deference - Direct measurement ◐Value drift from users - If models adopt user beliefs ◐Manipulation - Sycophancy can enable manipulation | Pre/post RLHF RLHF often increases sycophancy | LOW Behavioral; emerges from RLHF | MEDIUM Can train against but in tension with helpfulness | HIGH | AnthropicOpenAIAcademic groups | Towards Understanding Sycophancy in LLMs (Anthropic 2023) RLHF and sycophancy Sycophancy benchmarks Opinion change tests | + Well-understood failure mode + Measurable + Training helps | − In tension with being helpful − May increase with scale − Cultural variation |
Calibration Evals Test whether model confidence matches accuracy. When model says 80% confident, is it right 80% of the time? | HIGH Statistical measurement; well-defined | MEDIUM Applies to all factual claims | LOW Hard to game calibration in aggregate | ●Overconfidence - Direct measurement ◐Decision-making errors - If users trust confidences ○Unknown unknowns - Calibration on known questions | Continuous Can track over development | LOW Behavioral; any model | MEDIUM Can train for calibration | HIGH | All major labsAcademic groups | Language Models (Mostly) Know What They Know (Anthropic 2022) Calibration benchmarks Brier scores | + Foundational for decision-making + Well-defined metrics + Trainable | − May not transfer to novel domains − Verbalized confidence ≠ behavioral confidence |
Control / Containment Evals Test effectiveness of oversight measures: Can the model be controlled? Can it be monitored? Can it be shut down? | MEDIUM Tests specific measures; adversarial model may find bypasses | MEDIUM Growing suite; but creative attacks hard to anticipate | HIGH If model knows the eval, may hide ability to escape | ◐Loss of control - Tests current measures ◐Autonomous replication - Sandbox escape tests ○Coordination failures - Tests individual systems | Pre-deployment Red team before release | MEDIUM Depends on deployment architecture | HIGH Clear failures to fix | LOW | AnthropicRedwood ResearchARC Evals | AI Control (Greenblatt et al. 2024) Catching AI cheaters AI control protocols Sandbox escape tests | + Doesn't require solving alignment + Pragmatic + Clear success criteria | − Arms race dynamic − May give false confidence − Strong AI may overcome any control |
Instruction Following Evals Test whether models follow user instructions precisely, including safety-relevant instructions like content policies. | HIGH Clear success criteria | MEDIUM Many instruction types; edge cases hard | MEDIUM Can be too literal or miss intent | ●Policy violations - Tests compliance ◐Specification gaming - Tests for loopholes ○Goal misgeneralization - Instructions ≠ goals | Continuous Throughout development | LOW Behavioral; any model | HIGH Can train for instruction following | HIGH | All major labs | IFEval (Zhou et al. 2023) The Instruction Hierarchy IFEval Instruction hierarchy tests | + Foundational for control + Trainable + Clear metrics | − Instructions may conflict − Literal ≠ intended − Gaming possible |
Jailbreak Robustness Evals Test resistance to adversarial prompts that attempt to bypass safety measures. Red teaming for prompt injection and jailbreaks. | MEDIUM Tests known attacks; new attacks emerge constantly | LOW Infinite attack surface; only samples | HIGH Patch specific jailbreaks without fixing root cause | ◐Misuse via jailbreaks - Tests current defenses ◐Policy circumvention - Known techniques ○Novel attacks - New jailbreaks keep emerging | Continuous Ongoing red teaming | LOW Behavioral; any deployed model | MEDIUM Can patch but whack-a-mole | MEDIUM | All major labsSecurity researchersHackerOne programs | Universal adversarial suffixes Many-shot jailbreaking JailbreakBench AdvBench | + Reduces surface area + Required for deployment + Community engagement | − Endless arms race − Never complete − May encourage capability hiding |
Mechanistic Interpretability Probes Use interp techniques to understand what models know/want: probing classifiers, activation analysis, feature circuits. | LOW Interpretation of interp results is contested | LOW Only understand tiny fraction of model | LOW Hard to game internals without changing behavior | ◐Deceptive alignment - Could detect deception directly ◐Goal content - Might reveal true goals ◐Capability elicitation - Find hidden capabilities | Research Not yet deployment-ready | HIGH Transformer-specific; may not transfer | LOW Unclear what to do with findings | LOW | AnthropicDeepMindEleutherAIAcademic groups | Toy Models of Superposition Scaling Monosemanticity Probing classifiers Activation patching | + Could solve observability problem + Ground truth about internals + Not gameable | − Not yet practical − Interpretation uncertain − May not scale − Architecture-dependent |
Representation / Belief Probing Linear probes and other techniques to read out model beliefs, knowledge states, and internal representations directly. | MEDIUM Probes work; interpretation debated | LOW One concept at a time | LOW Would require modifying internal representations | ◐Hidden knowledge - Detect knowledge vs behavior gap ◐Lying - Compare stated vs internal belief ◐Emergent world models - Understand what model believes | Research/Development Needs model access | HIGH Requires weight access; architecture-specific | MEDIUM Can inform training if clear signal | MEDIUM | AnthropicEleutherAIAcademic groups | Discovering Latent Knowledge (Burns et al. 2023) Representation Engineering CCS (Contrast Consistent Search) Truth probes | + Bypasses behavioral output + Could detect lying + Ground truth about beliefs | − Probe validity debated − Requires weight access − May not find all relevant info |
Automated Red Teaming Use AI to find weaknesses: adversarial prompt generation, automated jailbreak search, model-vs-model attacks. | MEDIUM Finds real attacks but may miss creative human attacks | MEDIUM Scales better than human; still incomplete | MEDIUM Train against found attacks but new ones emerge | ●Jailbreaks - Primary use case ●Policy violations - Finds edge cases ◐Novel misuse - Depends on red team model capability | Continuous Automated pipeline | LOW Black-box; any deployed model | HIGH Found attacks can be patched | HIGH | AnthropicOpenAIDeepMindVarious startups | Red Teaming LLMs with LLMs (Perez et al. 2022) Universal Adversarial Suffixes Perez et al. red teaming GCG attacks | + Scales + Finds real attacks + Improves with AI capabilities | − Arms race − Creative attacks may be missed − Requires good red team model |
Human Expert Red Teaming Domain experts attempt to elicit dangerous behavior: biosecurity experts, cybersecurity researchers, persuasion experts. | HIGH Most realistic threat model | LOW Small sample; expensive | LOW Creative humans hard to anticipate | ●Bioweapons - Expert evaluation ●Cyberweapons - Expert evaluation ●Manipulation - Expert evaluation ◐Novel attacks - Depends on expert creativity | Pre-major-release Expensive; for major releases | LOW Black-box; any system | HIGH Direct feedback on failures | LOW | All major labsThird-party contractors | GPT-4 System Card Claude 3 Model Card OpenAI red team network Anthropic domain expert testing | + Gold standard + Realistic threat model + Finds unexpected issues | − Very expensive − Small sample − May miss attacks experts don't think of |
Model Organisms of Misalignment Deliberately create misaligned models in controlled settings to study alignment failures and test detection methods. | MEDIUM Controlled experiments but may not match natural emergence | LOW Studies specific failure modes | LOW Research tool, not deployment test | ◐Deceptive alignment - Study in controlled setting ◐Sleeper agents - Anthropic sleeper agents paper ◐Goal misgeneralization - Can create examples | Research Inform eval development | MEDIUM Results may be architecture-specific | MEDIUM Develops detection methods | MEDIUM | AnthropicRedwood ResearchDeepMindAcademic groups | Sleeper Agents (2024) Goal Misgeneralization in Deep RL Sleeper agents Reward hacking examples | + Controlled study + Can iterate on detection + Builds understanding | − Artificial misalignment ≠ natural − May not transfer − Could be misused |
Toy Environment Evals Simplified environments to study alignment properties: gridworlds, text games, multi-agent scenarios. | LOW Toy results may not transfer | LOW Simplified vs real world | LOW Research tool | ◐Power-seeking - Study in simple settings ◐Coordination - Multi-agent scenarios ◐Specification gaming - Many examples found | Research Develop understanding | HIGH Often RL-specific | LOW Builds theory but unclear application | HIGH | DeepMindOpenAIAcademic groups | AI Safety Gridworlds (Leike et al. 2017) Specification Gaming AI Safety Gridworlds MACHIAVELLI | + Cheap + Fast iteration + Tests theory | − Toy → real gap − RL-focused − May miss LLM-specific issues |
Bias and Fairness Evals Test for demographic biases, unfair treatment, stereotyping, and discriminatory outputs. | MEDIUM Clear for some biases; complex for others | MEDIUM Many benchmarks; hard to cover all groups | HIGH Can reduce measured bias without fixing underlying issues | ●Discrimination - Direct measurement ●Stereotyping - Direct measurement ◐Systemic harm - Individual tests vs systemic impact | Continuous Throughout development and deployment | LOW Behavioral; any model | MEDIUM Can train against specific biases | HIGH | All major labsAI ethics researchers | BBQ (Parrish et al. 2022) On the Dangers of Stochastic Parrots BBQ WinoBias | + Legally required in some contexts + Clear harm + Public accountability | − Goodhart risk − Complex cultural context − May conflict with accuracy |
Persuasion / Manipulation Evals Test ability to change human beliefs and behaviors: political persuasion, sales, emotional manipulation. | MEDIUM Human studies needed; expensive | LOW Many persuasion vectors; hard to cover all | LOW Hard to game human studies | ◐Mass manipulation - Tests capability but not deployment ◐Election interference - Specific test domain ◐Radicalization - Ethical constraints on testing | Pre-deployment For major releases | LOW Behavioral; any model | LOW Capability exists; mitigation unclear | LOW | AnthropicOpenAIAcademic groups | Durmus et al. persuasion study AI and manipulation research Political persuasion studies Durmus et al. persuasion evals | + Measures real risk + Human ground truth + Policy-relevant | − Expensive − Ethical constraints − Lab vs real-world gap |
Emergent Behavior Detection Monitor for unexpected capabilities or behaviors that emerge at scale or in deployment. Anomaly detection. | LOW Hard to define "unexpected"; noisy | LOW Inherently limited to what we monitor | LOW Not optimized against | ◐Unknown unknowns - Only approach to this ◐Emergent capabilities - May detect new abilities ◐Phase transitions - Track capability jumps | Continuous Ongoing monitoring | LOW Behavioral monitoring | LOW Detection is easy; response is hard | HIGH | All major labs | Emergent Abilities of Large Language Models Are Emergent Abilities a Mirage? Capability elicitation User feedback monitoring | + Only way to catch surprises + Scales with deployment | − High false positive rate − May miss subtle emergence − Response unclear |