AI Evaluation Types - Strategic Analysis

Filter by category:
Columns:Presets:

Comprehensive analysis of AI evaluation approaches and their strategic value for different risk scenarios. Key insight: No single eval approach is sufficient. Behavioral evals are gameable; interpretability isn't ready; human red teaming doesn't scale. A portfolio approach is required, with emphasis shifting based on which risks you prioritize.

Risk coverage: ● = strong signal, ◐ = partial signal, ○ = weak signal. Architecture dependence: LOW means works on any model; HIGH means needs specific access/architecture.

How reliable is the signal?
How much does it cover?
Risk of gaming the metric
Risk Coverage
Architecture dependence
How actionable are findings?
LabsKey Papers/ExamplesStrategic ProsStrategic Cons
Dangerous Capability Evals
Structured assessments of whether models can perform specific dangerous tasks: bioweapons synthesis, cyberattacks, persuasion/manipulation, autonomous replication.
MEDIUM
Clear pass/fail on specific tasks; but tasks may not match real-world threat
LOW
Tests known threats; unknown unknowns remain
MEDIUM
Labs may optimize to pass specific tests without reducing underlying risk
Bioweapons - Direct measurement
Cyberweapons - Direct measurement
CBRN uplift - Chemistry/nuclear harder to test
Autonomous replication - Sandbox limitations
Pre-deployment
Run before release; but capabilities may emerge post-deployment
LOW
Behavioral; works on any queryable model
HIGH
Clear thresholds for go/no-go decisions
MEDIUM
AnthropicOpenAIDeepMindMETR
Model evaluation for extreme risks (2023)
METR Task Suite
METR ARA evals
Anthropic RSP evals
+ Concrete evidence for policymakers
+ Triggers RSP commitments
+ Legally defensible standards
Known unknowns only
Expensive expert validation
May lag capability emergence
Gaming/teaching-to-test risk
Frontier Capability Benchmarks
Standard benchmarks measuring general capabilities: MMLU, MATH, HumanEval, GPQA, etc. Track capability frontier over time.
HIGH
Well-defined tasks; reproducible
MEDIUM
Broad coverage but saturating quickly
HIGH
Extensively trained on; contamination issues
Capability tracking - Primary purpose
Misalignment - Capabilities ≠ alignment
Misuse - Indirect signal only
Continuous
Run throughout development
LOW
Behavioral; architecture-agnostic
LOW
No clear thresholds; just trend tracking
HIGH
All major labsAcademic groups
MMLU (Hendrycks 2021)
Measuring Massive Multitask Language Understanding
MMLU
MATH
+ Universal comparison
+ Historical trend data
+ Cheap and fast
Goodharted extensively
Contamination
Doesn't measure risk
Saturation
Uplift Studies
Measure marginal risk increase from AI access. Compare expert vs novice performance with/without AI assistance on dangerous tasks.
HIGH
Controlled comparison; causal signal
LOW
Very expensive; few tasks studied
LOW
Hard to game controlled experiments
Bioweapons - RAND study showed measurable uplift
Cyberweapons - Some studies exist
Social engineering - Hard to measure ethically
Pre-deployment
Expensive; done for major releases only
LOW
Measures system behavior
HIGH
Directly answers "does this make attacks easier?"
LOW
AnthropicOpenAIRANDAcademic groups
The Operational Risks of AI in Large-Scale Biological Attacks (RAND 2024)
RAND bioweapons uplift study
OpenAI GPT-4 red team reports
+ Gold standard for marginal risk
+ Credible to policymakers
+ Causal not correlational
Very expensive
Small sample sizes
Ethical constraints limit scope
Results may be suppressed
Deception Evals
Test for strategic deception: sandbagging on evals, lying to evaluators, concealing capabilities, pretending to be aligned.
LOW
A truly deceptive model would pass these tests
LOW
Only catches unsophisticated deception
HIGH
Training not to fail these evals ≠ training not to deceive
Deceptive alignment - Fundamental observability problem
Scheming - Sophisticated schemers pass by design
Sandbagging - Some techniques detect it
Pre-deployment
But deception may only emerge in deployment
LOW
Behavioral; but interp could help
LOW
Passing doesn't mean safe; failing is informative
MEDIUM
AnthropicRedwood ResearchARC Evals
Sleeper Agents (Anthropic 2024)
Towards Understanding Sycophancy
Sleeper agents paper
ARC deception tests
+ Raises awareness
+ Failure is clear signal
+ Drives interp research
Passing means little
Sophisticated deception undetectable
May create false confidence
Power-Seeking Evals
Test for instrumental convergence behaviors: resource acquisition, self-preservation, avoiding shutdown, expanding influence.
LOW
Easy to suppress in eval; may only emerge with real stakes
LOW
Toy scenarios don't capture real deployment
HIGH
Models learn to avoid triggering these
Instrumental convergence - Tests the concept
Self-preservation - Scenario-dependent
Real-world power-seeking - Gap from toy to real
Pre-deployment
But behavior may differ in deployment
MEDIUM
RL agents more relevant than LLMs
LOW
Passing doesn't guarantee safety
MEDIUM
DeepMindAnthropicARC Evals
MACHIAVELLI (Pan et al. 2023)
Optimal Policies Tend to Seek Power
MACHIAVELLI benchmark
Shutdown problem tests
+ Tests core alignment theory predictions
+ Concrete operationalization
Toy scenarios
Easy to game
May not generalize
Theory-reality gap
Goal Stability / Goal Preservation
Test whether models maintain consistent goals vs allowing goal modification. Relevant to corrigibility and shutdown problems.
LOW
Goals may not be stable or well-defined for LLMs
LOW
Mostly theoretical; few concrete tests
MEDIUM
Can train to accept modification superficially
Corrigibility failure - Tests the concept
Value lock-in - If goals too stable
Goal drift - If goals unstable
Research
Still mostly theoretical
HIGH
Very different for LLMs vs RL agents
LOW
Unclear what "good" performance means
LOW
MIRIDeepMindAcademic groups
Corrigibility (Soares et al. 2015)
The Off-Switch Game
Corrigibility tests
Shutdown problem scenarios
+ Addresses fundamental alignment problem
+ Theoretically grounded
Very abstract
May not apply to LLMs
Unclear operationalization
Situational Awareness Evals
Test whether models know they are AI, understand training/deployment, can reason about their own situation.
MEDIUM
Can measure knowledge; harder to measure implications
MEDIUM
Growing suite of tests
HIGH
Easy to train away surface behavior without changing underlying awareness
Deceptive alignment - SA is necessary but not sufficient
Scheming - SA enables but doesn't prove scheming
Self-improvement - Requires knowing own nature
Continuous
Track emergence over model scale
LOW
Behavioral tests; any model
MEDIUM
High SA may warrant more scrutiny
HIGH
AnthropicARC EvalsMATS researchers
Situational Awareness Dataset (Laine et al. 2024)
Model Written Evals
SAD benchmark
In-context awareness tests
+ Tracks concerning capability
+ Scalable
+ Clear signal
SA alone isn't dangerous
Easy to hide
May increase with scale regardless
Honesty / Truthfulness Evals
Test whether models give true answers, admit uncertainty, avoid hallucination, don't make things up.
MEDIUM
Clear for factual questions; harder for opinions/uncertainty
MEDIUM
Many benchmarks but hard to cover all cases
MEDIUM
Can train to say "I don't know" too often
Hallucination - Direct measurement
Misinformation - Factual subset
Epistemic manipulation - Honest ≠ not manipulative
Continuous
Run throughout development
LOW
Behavioral; architecture-agnostic
MEDIUM
Can train for honesty but unclear limits
HIGH
All major labs
TruthfulQA (Lin et al. 2022)
Measuring Hallucination in LLMs
TruthfulQA
HaluEval
+ Foundational for trust
+ Concrete metrics
+ Scalable
Ground truth required
May reduce helpfulness
Strategic honesty vs genuine
Sycophancy Evals
Test whether models inappropriately agree with users, change answers based on user beliefs, or flatter rather than inform.
HIGH
Clear experimental design; reproducible
MEDIUM
Well-studied phenomenon
MEDIUM
Can train to be contrarian instead
Epistemic deference - Direct measurement
Value drift from users - If models adopt user beliefs
Manipulation - Sycophancy can enable manipulation
Pre/post RLHF
RLHF often increases sycophancy
LOW
Behavioral; emerges from RLHF
MEDIUM
Can train against but in tension with helpfulness
HIGH
AnthropicOpenAIAcademic groups
Towards Understanding Sycophancy in LLMs (Anthropic 2023)
RLHF and sycophancy
Sycophancy benchmarks
Opinion change tests
+ Well-understood failure mode
+ Measurable
+ Training helps
In tension with being helpful
May increase with scale
Cultural variation
Calibration Evals
Test whether model confidence matches accuracy. When model says 80% confident, is it right 80% of the time?
HIGH
Statistical measurement; well-defined
MEDIUM
Applies to all factual claims
LOW
Hard to game calibration in aggregate
Overconfidence - Direct measurement
Decision-making errors - If users trust confidences
Unknown unknowns - Calibration on known questions
Continuous
Can track over development
LOW
Behavioral; any model
MEDIUM
Can train for calibration
HIGH
All major labsAcademic groups
Language Models (Mostly) Know What They Know (Anthropic 2022)
Calibration benchmarks
Brier scores
+ Foundational for decision-making
+ Well-defined metrics
+ Trainable
May not transfer to novel domains
Verbalized confidence ≠ behavioral confidence
Control / Containment Evals
Test effectiveness of oversight measures: Can the model be controlled? Can it be monitored? Can it be shut down?
MEDIUM
Tests specific measures; adversarial model may find bypasses
MEDIUM
Growing suite; but creative attacks hard to anticipate
HIGH
If model knows the eval, may hide ability to escape
Loss of control - Tests current measures
Autonomous replication - Sandbox escape tests
Coordination failures - Tests individual systems
Pre-deployment
Red team before release
MEDIUM
Depends on deployment architecture
HIGH
Clear failures to fix
LOW
AnthropicRedwood ResearchARC Evals
AI Control (Greenblatt et al. 2024)
Catching AI cheaters
AI control protocols
Sandbox escape tests
+ Doesn't require solving alignment
+ Pragmatic
+ Clear success criteria
Arms race dynamic
May give false confidence
Strong AI may overcome any control
Instruction Following Evals
Test whether models follow user instructions precisely, including safety-relevant instructions like content policies.
HIGH
Clear success criteria
MEDIUM
Many instruction types; edge cases hard
MEDIUM
Can be too literal or miss intent
Policy violations - Tests compliance
Specification gaming - Tests for loopholes
Goal misgeneralization - Instructions ≠ goals
Continuous
Throughout development
LOW
Behavioral; any model
HIGH
Can train for instruction following
HIGH
All major labs
IFEval (Zhou et al. 2023)
The Instruction Hierarchy
IFEval
Instruction hierarchy tests
+ Foundational for control
+ Trainable
+ Clear metrics
Instructions may conflict
Literal ≠ intended
Gaming possible
Jailbreak Robustness Evals
Test resistance to adversarial prompts that attempt to bypass safety measures. Red teaming for prompt injection and jailbreaks.
MEDIUM
Tests known attacks; new attacks emerge constantly
LOW
Infinite attack surface; only samples
HIGH
Patch specific jailbreaks without fixing root cause
Misuse via jailbreaks - Tests current defenses
Policy circumvention - Known techniques
Novel attacks - New jailbreaks keep emerging
Continuous
Ongoing red teaming
LOW
Behavioral; any deployed model
MEDIUM
Can patch but whack-a-mole
MEDIUM
All major labsSecurity researchersHackerOne programs
Universal adversarial suffixes
Many-shot jailbreaking
JailbreakBench
AdvBench
+ Reduces surface area
+ Required for deployment
+ Community engagement
Endless arms race
Never complete
May encourage capability hiding
Mechanistic Interpretability Probes
Use interp techniques to understand what models know/want: probing classifiers, activation analysis, feature circuits.
LOW
Interpretation of interp results is contested
LOW
Only understand tiny fraction of model
LOW
Hard to game internals without changing behavior
Deceptive alignment - Could detect deception directly
Goal content - Might reveal true goals
Capability elicitation - Find hidden capabilities
Research
Not yet deployment-ready
HIGH
Transformer-specific; may not transfer
LOW
Unclear what to do with findings
LOW
AnthropicDeepMindEleutherAIAcademic groups
Toy Models of Superposition
Scaling Monosemanticity
Probing classifiers
Activation patching
+ Could solve observability problem
+ Ground truth about internals
+ Not gameable
Not yet practical
Interpretation uncertain
May not scale
Architecture-dependent
Representation / Belief Probing
Linear probes and other techniques to read out model beliefs, knowledge states, and internal representations directly.
MEDIUM
Probes work; interpretation debated
LOW
One concept at a time
LOW
Would require modifying internal representations
Hidden knowledge - Detect knowledge vs behavior gap
Lying - Compare stated vs internal belief
Emergent world models - Understand what model believes
Research/Development
Needs model access
HIGH
Requires weight access; architecture-specific
MEDIUM
Can inform training if clear signal
MEDIUM
AnthropicEleutherAIAcademic groups
Discovering Latent Knowledge (Burns et al. 2023)
Representation Engineering
CCS (Contrast Consistent Search)
Truth probes
+ Bypasses behavioral output
+ Could detect lying
+ Ground truth about beliefs
Probe validity debated
Requires weight access
May not find all relevant info
Automated Red Teaming
Use AI to find weaknesses: adversarial prompt generation, automated jailbreak search, model-vs-model attacks.
MEDIUM
Finds real attacks but may miss creative human attacks
MEDIUM
Scales better than human; still incomplete
MEDIUM
Train against found attacks but new ones emerge
Jailbreaks - Primary use case
Policy violations - Finds edge cases
Novel misuse - Depends on red team model capability
Continuous
Automated pipeline
LOW
Black-box; any deployed model
HIGH
Found attacks can be patched
HIGH
AnthropicOpenAIDeepMindVarious startups
Red Teaming LLMs with LLMs (Perez et al. 2022)
Universal Adversarial Suffixes
Perez et al. red teaming
GCG attacks
+ Scales
+ Finds real attacks
+ Improves with AI capabilities
Arms race
Creative attacks may be missed
Requires good red team model
Human Expert Red Teaming
Domain experts attempt to elicit dangerous behavior: biosecurity experts, cybersecurity researchers, persuasion experts.
HIGH
Most realistic threat model
LOW
Small sample; expensive
LOW
Creative humans hard to anticipate
Bioweapons - Expert evaluation
Cyberweapons - Expert evaluation
Manipulation - Expert evaluation
Novel attacks - Depends on expert creativity
Pre-major-release
Expensive; for major releases
LOW
Black-box; any system
HIGH
Direct feedback on failures
LOW
All major labsThird-party contractors
GPT-4 System Card
Claude 3 Model Card
OpenAI red team network
Anthropic domain expert testing
+ Gold standard
+ Realistic threat model
+ Finds unexpected issues
Very expensive
Small sample
May miss attacks experts don't think of
Model Organisms of Misalignment
Deliberately create misaligned models in controlled settings to study alignment failures and test detection methods.
MEDIUM
Controlled experiments but may not match natural emergence
LOW
Studies specific failure modes
LOW
Research tool, not deployment test
Deceptive alignment - Study in controlled setting
Sleeper agents - Anthropic sleeper agents paper
Goal misgeneralization - Can create examples
Research
Inform eval development
MEDIUM
Results may be architecture-specific
MEDIUM
Develops detection methods
MEDIUM
AnthropicRedwood ResearchDeepMindAcademic groups
Sleeper Agents (2024)
Goal Misgeneralization in Deep RL
Sleeper agents
Reward hacking examples
+ Controlled study
+ Can iterate on detection
+ Builds understanding
Artificial misalignment ≠ natural
May not transfer
Could be misused
Toy Environment Evals
Simplified environments to study alignment properties: gridworlds, text games, multi-agent scenarios.
LOW
Toy results may not transfer
LOW
Simplified vs real world
LOW
Research tool
Power-seeking - Study in simple settings
Coordination - Multi-agent scenarios
Specification gaming - Many examples found
Research
Develop understanding
HIGH
Often RL-specific
LOW
Builds theory but unclear application
HIGH
DeepMindOpenAIAcademic groups
AI Safety Gridworlds (Leike et al. 2017)
Specification Gaming
AI Safety Gridworlds
MACHIAVELLI
+ Cheap
+ Fast iteration
+ Tests theory
Toy → real gap
RL-focused
May miss LLM-specific issues
Bias and Fairness Evals
Test for demographic biases, unfair treatment, stereotyping, and discriminatory outputs.
MEDIUM
Clear for some biases; complex for others
MEDIUM
Many benchmarks; hard to cover all groups
HIGH
Can reduce measured bias without fixing underlying issues
Discrimination - Direct measurement
Stereotyping - Direct measurement
Systemic harm - Individual tests vs systemic impact
Continuous
Throughout development and deployment
LOW
Behavioral; any model
MEDIUM
Can train against specific biases
HIGH
All major labsAI ethics researchers
BBQ (Parrish et al. 2022)
On the Dangers of Stochastic Parrots
BBQ
WinoBias
+ Legally required in some contexts
+ Clear harm
+ Public accountability
Goodhart risk
Complex cultural context
May conflict with accuracy
Persuasion / Manipulation Evals
Test ability to change human beliefs and behaviors: political persuasion, sales, emotional manipulation.
MEDIUM
Human studies needed; expensive
LOW
Many persuasion vectors; hard to cover all
LOW
Hard to game human studies
Mass manipulation - Tests capability but not deployment
Election interference - Specific test domain
Radicalization - Ethical constraints on testing
Pre-deployment
For major releases
LOW
Behavioral; any model
LOW
Capability exists; mitigation unclear
LOW
AnthropicOpenAIAcademic groups
Durmus et al. persuasion study
AI and manipulation research
Political persuasion studies
Durmus et al. persuasion evals
+ Measures real risk
+ Human ground truth
+ Policy-relevant
Expensive
Ethical constraints
Lab vs real-world gap
Emergent Behavior Detection
Monitor for unexpected capabilities or behaviors that emerge at scale or in deployment. Anomaly detection.
LOW
Hard to define "unexpected"; noisy
LOW
Inherently limited to what we monitor
LOW
Not optimized against
Unknown unknowns - Only approach to this
Emergent capabilities - May detect new abilities
Phase transitions - Track capability jumps
Continuous
Ongoing monitoring
LOW
Behavioral monitoring
LOW
Detection is easy; response is hard
HIGH
All major labs
Emergent Abilities of Large Language Models
Are Emergent Abilities a Mirage?
Capability elicitation
User feedback monitoring
+ Only way to catch surprises
+ Scales with deployment
High false positive rate
May miss subtle emergence
Response unclear
22 evaluation types across 8 categories