
AI Safety Approaches: Safety vs Capability Tradeoffs

Similar tables elsewhere:
• FLI AI Safety Index – lab safety scorecards
• METR: Common Elements – policy comparison (12 companies)
• AI Alignment Survey – academic taxonomy

Comparative analysis of AI safety approaches, with particular attention to one question: does each technique actually make the world safer, or does it primarily enable more capable systems?

Key insight: many "safety" techniques have capability uplift as their primary effect. RLHF, for example, is what makes ChatGPT useful; its safety benefit is secondary to its capability benefit. A technique that provides DOMINANT capability uplift with only LOW safety uplift may be net negative for world safety, even if it reduces obvious harms, because it accelerates exactly the systems that create the risk.
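To make the tradeoff concrete, here is a minimal sketch of that reasoning in code. The ordinal scores and thresholds are illustrative assumptions of mine, not the methodology behind this table (its ratings are qualitative judgments); the scale names come from the legend below.

```python
# Illustrative ordinal encoding of the legend's rating scales (an assumption
# for this sketch, not the table's methodology). Higher = more uplift.
SAFETY = {"LOW": 1, "LOW-MEDIUM": 1.5, "MEDIUM": 2, "MEDIUM-HIGH": 2.5,
          "HIGH": 3, "CRITICAL": 4}
CAPABILITY = {"TAX": -1, "NEUTRAL": 0, "SOME": 1, "SIGNIFICANT": 2, "DOMINANT": 3}

def differential_progress(safety: str, capability: str) -> str:
    """Toy classifier: does a technique advance safety faster than capability?
    Thresholds are hand-picked for illustration only."""
    gap = SAFETY[safety] - CAPABILITY[capability]
    if gap >= 2:
        return "SAFETY-DOMINANT"
    if gap >= 1:
        return "SAFETY-LEANING"
    if gap > -1:
        return "BALANCED"
    if gap > -1.5:
        return "CAPABILITY-LEANING"
    return "CAPABILITY-DOMINANT"

# Worked example: the table's RLHF-style row rates safety uplift LOW-MEDIUM
# and capability uplift DOMINANT, which lands in capability-dominant territory.
print(differential_progress("LOW-MEDIUM", "DOMINANT"))  # CAPABILITY-DOMINANT
```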

Safety Uplift (how much does this reduce catastrophic risk?)
• CRITICAL: transformative if it works
• HIGH: significant risk reduction
• MEDIUM: meaningful but limited
• LOW: marginal benefit

Capability Uplift (does it make AI more capable?)
• DOMINANT: primary capability driver
• SIGNIFICANT: major capability boost
• SOME: some capability benefit
• NEUTRAL: no capability effect
• TAX: reduces capabilities

Net World Safety (is the world safer with this?)
• HELPFUL: probably net positive
• UNCLEAR: could go either way
• HARMFUL: likely net negative

Scales to SI? (does it work for superintelligent AI?)
• YES: works at superintelligence
• MAYBE: might work
• UNLIKELY: probably breaks
• NO: fundamentally limited

Differential Progress (safety vs. capability progress ratio)
• SAFETY-DOMINANT: safety >> capability
• SAFETY-LEANING: safety > capability
• BALANCED: roughly equal
• CAPABILITY-LEANING: capability > safety
• CAPABILITY-DOMINANT: capability >> safety

Recommendation (recommended funding change)
• PRIORITIZE: needs much more funding
• INCREASE: should grow
• MAINTAIN: about right
• REDUCE: overfunded for safety
• DEFUND: counterproductive

Each row below also records the current research investment, whether the technique keeps working as AI gets smarter (YES / PARTIAL / MAYBE / BREAKS / UNKNOWN), whether it works against deceptive AI (STRONG / PARTIAL / WEAK / NONE / N/A), the current adoption level (NONE / EXPERIMENTAL / SOME / WIDESPREAD / UNIVERSAL), the labs pursuing it, a key critique (with a count of additional critiques), and, where captured, C/H/M/L relevance ratings for three architectures, abbreviated T/A/S for Transformers, Agents, and SSMs.
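Because the page ships these ratings only as rendered table cells, a small row schema makes them easier to work with programmatically. This is a minimal sketch under my own field names (the page defines no schema); fields stay plain strings because the data mixes legend values with hybrids such as "LOW-MEDIUM" and "HIGH (if works)" and the off-legend "NEGATIVE".

```python
from dataclasses import dataclass, field

@dataclass
class Approach:
    """One row of the tables below. Fields mirror the columns; allowed values
    per the legend are noted in comments, but cells may also hold hybrids."""
    category: str                # Training & Alignment / Interpretability / ...
    investment: str              # current research investment, e.g. "$1B+/yr"
    differential: str            # SAFETY-DOMINANT ... CAPABILITY-DOMINANT
    recommendation: str          # PRIORITIZE / INCREASE / MAINTAIN / REDUCE / DEFUND
    safety_uplift: str           # CRITICAL / HIGH / MEDIUM / LOW
    capability_uplift: str       # DOMINANT / SIGNIFICANT / SOME / NEUTRAL / TAX
    net_world_safety: str        # HELPFUL / UNCLEAR / HARMFUL
    scales_with_capability: str  # YES / PARTIAL / MAYBE / BREAKS / UNKNOWN
    vs_deception: str            # STRONG / PARTIAL / WEAK / NONE / N/A / UNKNOWN
    scales_to_si: str            # YES / MAYBE / UNLIKELY / NO
    adoption: str                # NONE / EXPERIMENTAL / SOME / WIDESPREAD / UNIVERSAL
    labs: list[str] = field(default_factory=list)
    critiques: list[str] = field(default_factory=list)
```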
Training & Alignment

| Investment | Diff. Progress | Rec. | Safety | Capability | Net Safety | Smarter AI? | Deceptive AI? | SI? | Adoption | Labs | Key Critique | Arch (T/A/S) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $1B+/yr | CAPABILITY-DOMINANT | REDUCE | LOW-MEDIUM | DOMINANT | UNCLEAR | BREAKS | NONE | NO | UNIVERSAL | OpenAI, Anthropic, Google, Meta | Goodharting on human approval (+2) | C/H/H |
| $50-200M/yr | CAPABILITY-LEANING | MAINTAIN | MEDIUM | SIGNIFICANT | UNCLEAR | PARTIAL | WEAK | UNLIKELY | WIDESPREAD | Anthropic, Google | Principles may not cover all cases (+2) | C/H/H |
| $5-20M/yr | SAFETY-LEANING | INCREASE | UNKNOWN | SOME | UNCLEAR | MAYBE | PARTIAL | MAYBE | EXPERIMENTAL | OpenAI (research), Anthropic (interest) | May not converge to truth (+2) | H/M/H |
| $100-500M/yr | BALANCED | MAINTAIN | MEDIUM | SIGNIFICANT | HELPFUL | PARTIAL | PARTIAL | UNLIKELY | WIDESPREAD | OpenAI, Google, Anthropic | Expensive annotation (+2) | |
| $10-50M/yr | SAFETY-LEANING | INCREASE | UNKNOWN | SOME | UNCLEAR | UNKNOWN | UNKNOWN | MAYBE | EXPERIMENTAL | OpenAI, Anthropic | Early results show partial success only (+2) | |
| $500M+/yr | CAPABILITY-DOMINANT | REDUCE | LOW | SIGNIFICANT | UNCLEAR | PARTIAL | NONE | NO | UNIVERSAL | All frontier labs | Reward hacking (+2) | |
| $1-5M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | UNKNOWN | PARTIAL | MAYBE | NONE | UC Berkeley CHAI | Hard to implement in practice (+2) | |
| $10-30M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | WEAK | UNLIKELY | WIDESPREAD | Anthropic, OpenAI, Google | Specs may be incomplete (+2) | |
| $50-150M/yr | BALANCED | MAINTAIN | LOW-MEDIUM | SOME | HELPFUL | PARTIAL | NONE | NO | UNIVERSAL | All frontier labs, security researchers | Arms race with attackers (+2) | |
| $5-20M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | UNKNOWN | PARTIAL | MAYBE | EXPERIMENTAL | DeepMind, CHAI, academic groups | Hard to define "cooperation" formally (+2) | |
Interpretability

| Investment | Diff. Progress | Rec. | Safety | Capability | Net Safety | Smarter AI? | Deceptive AI? | SI? | Adoption | Labs | Key Critique | Arch (T/A/S) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $50-150M/yr | SAFETY-DOMINANT | PRIORITIZE | LOW (now) / HIGH (potential) | NEUTRAL | HELPFUL | UNKNOWN | STRONG (if works) | MAYBE | EXPERIMENTAL | Anthropic, DeepMind, EleutherAI, independents | Doesn't scale yet (+2) | C/L/M |
| $10-30M/yr | SAFETY-DOMINANT | INCREASE | LOW (now) | NEUTRAL | HELPFUL | PARTIAL | PARTIAL | UNKNOWN | EXPERIMENTAL | Anthropic, Apollo, independent researchers | Features may not be functionally important (+2) | |
| $5-20M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | PARTIAL | UNKNOWN | EXPERIMENTAL | Center for AI Safety, various academics | May be superficial control (+2) | |
| $5-10M/yr | SAFETY-DOMINANT | MAINTAIN | LOW | NEUTRAL | HELPFUL | YES | PARTIAL | MAYBE | WIDESPREAD | Many research groups | Probes might not find safety-relevant features (+2) | |
Evaluation

| Investment | Diff. Progress | Rec. | Safety | Capability | Net Safety | Smarter AI? | Deceptive AI? | SI? | Adoption | Labs | Key Critique |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $20-50M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | PARTIAL | WEAK | UNLIKELY | WIDESPREAD | METR, Apollo, UK AISI, all frontier labs | Evals may not capture real-world risk (+2) |
| $50-200M/yr | BALANCED | MAINTAIN | LOW-MEDIUM | NEUTRAL | HELPFUL | PARTIAL | NONE | NO | UNIVERSAL | All frontier labs, external red teams | Can't find all failures (+2) |
| $10-30M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM | NEUTRAL | HELPFUL | UNKNOWN | WEAK | UNLIKELY | SOME | Anthropic, Apollo, UK AISI | May not measure what matters (+2) |
| $10-30M/yr | SAFETY-DOMINANT | INCREASE | LOW-MEDIUM | NEUTRAL | HELPFUL | PARTIAL | WEAK | UNLIKELY | SOME | METR, UK AISI, Apollo, RAND | Auditors may lack expertise (+2) |
| $5-15M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM-HIGH | TAX | HELPFUL | PARTIAL | PARTIAL | UNLIKELY | EXPERIMENTAL | UK AISI, Anthropic, DeepMind | What counts as sufficient evidence? (+2) |
| $10-30M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | WEAK | NO | SOME | METR, Anthropic, Apollo, UK AISI | Can't prove absence of capability (+2) |
| $5-15M/yr | SAFETY-DOMINANT | PRIORITIZE | HIGH (if works) | NEUTRAL | HELPFUL | UNKNOWN | UNKNOWN | UNKNOWN | EXPERIMENTAL | Anthropic, Redwood, academic groups | Current methods don't reliably detect sleeper agents (+2) |
Architectural

| Investment | Diff. Progress | Rec. | Safety | Capability | Net Safety | Smarter AI? | Deceptive AI? | SI? | Adoption | Labs | Key Critique | Arch (T/A/S) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $50-200M/yr | BALANCED | MAINTAIN | LOW | TAX | NEUTRAL | BREAKS | NONE | NO | UNIVERSAL | All deployment labs | Easily jailbroken (+2) | |
| (included in RLHF) | BALANCED | MAINTAIN | LOW-MEDIUM | TAX | NEUTRAL | BREAKS | NONE | NO | UNIVERSAL | All frontier labs | Consistently jailbroken (+2) | |
| $20-50M/yr | SAFETY-LEANING | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | PARTIAL | UNLIKELY | SOME | Anthropic, OpenAI, various | Reduces usefulness (+2) | M/C/M |
| $10-30M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | PARTIAL | PARTIAL | WIDESPREAD | All agentic system developers | Limits usefulness (+2) | |
| $10-30M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | PARTIAL | WEAK | NO | SOME | Various | Deceptive AI evades monitors (+2) | |
| $10-30M/yr | SAFETY-LEANING | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | WEAK | NO | SOME | Gray Swan, Anthropic, various | Reactive not proactive (+2) | |
| $20-50M/yr | SAFETY-LEANING | MAINTAIN | MEDIUM-HIGH | TAX | HELPFUL | YES | N/A | PARTIAL | WIDESPREAD | OpenAI, Anthropic, Google | Open-source pressure (+2) | |
Governance

| Investment | Diff. Progress | Rec. | Safety | Capability | Net Safety | Smarter AI? | Deceptive AI? | SI? | Adoption | Labs | Key Critique |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $5-20M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM-HIGH | NEGATIVE | HELPFUL | YES | N/A | PARTIAL | SOME | GovAI, CSET, RAND | Hard to implement globally (+2) |
| $5-15M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | UNKNOWN | PARTIAL | UNLIKELY | SOME | Anthropic, OpenAI, DeepMind | Voluntary and unenforceable (+2) |
| $10-30M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | WEAK | NO | SOME | Regulators, policy orgs | Evals may be inadequate (+2) |
| $5-15M/yr | SAFETY-DOMINANT | INCREASE | LOW-MEDIUM | TAX | HELPFUL | YES | N/A | PARTIAL | EXPERIMENTAL | Policy organizations, regulators | Enforcement challenges (+2) |
| $1-5M/yr | SAFETY-DOMINANT | MAINTAIN | HIGH (if implemented) | NEGATIVE | UNCLEAR | UNKNOWN | N/A | YES (if works) | NONE | FLI, PauseAI, CAIS | Unenforceable internationally (+2) |
| $10-30M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM-HIGH | TAX | HELPFUL | PARTIAL | N/A | PARTIAL | EXPERIMENTAL | GovAI, CSET, UN AI Advisory Body | Great power competition (+2) |
Theoretical

| Investment | Diff. Progress | Rec. | Safety | Capability | Net Safety | Smarter AI? | Deceptive AI? | SI? | Adoption | Labs | Key Critique |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $5-20M/yr | SAFETY-DOMINANT | INCREASE | HIGH (if achievable) | TAX | HELPFUL | UNKNOWN | STRONG (if works) | MAYBE | NONE | Academic groups, MIRI (historically) | Doesn't scale to current models (+2) |
| $10-50M/yr | SAFETY-DOMINANT | INCREASE | CRITICAL (if works) | TAX | HELPFUL | UNKNOWN | STRONG (by design) | YES (if works) | NONE | ARIA, MIRI | May be impossible (+2) |
| $1-5M/yr | SAFETY-DOMINANT | PRIORITIZE | HIGH (if solved) | NEUTRAL | HELPFUL | UNKNOWN | PARTIAL | MAYBE | NONE | MIRI, academic groups | Unsolved theoretically (+2) |
| $5-20M/yr | BALANCED | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | N/A | UNKNOWN | EXPERIMENTAL | DeepMind, Anthropic, academic groups | Problem well-characterized; solutions lacking (+2) |
| $5-15M/yr | SAFETY-LEANING | PRIORITIZE | HIGH (if solved) | SOME | HELPFUL | UNKNOWN | STRONG (if solved) | MAYBE | NONE | ARC, academic groups | Unsolved despite significant effort (+2) |
| $5-20M/yr | SAFETY-DOMINANT | INCREASE | HIGH (if works) | NEGATIVE | HELPFUL | UNKNOWN | WEAK | UNLIKELY | EXPERIMENTAL | Academic groups, Center for AI Safety | Capabilities may be recoverable (+2) |
| $10-30M/yr | SAFETY-DOMINANT | PRIORITIZE | HIGH | TAX | HELPFUL | UNKNOWN | PARTIAL | CRITICAL QUESTION | SOME | Redwood Research, Anthropic, MIRI | May not scale to superhuman systems (+2) |
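
As a closing usage sketch built on the hypothetical Approach schema above, with two rows transcribed from the Interpretability and Evaluation tables, the query below pulls the table's strongest underfunding claims: SAFETY-DOMINANT approaches marked PRIORITIZE that remain at most experimentally adopted.

```python
# Two rows transcribed from the tables above (the source data carries no
# technique names, so rows are identified by category and stats).
rows = [
    Approach(
        category="Interpretability", investment="$50-150M/yr",
        differential="SAFETY-DOMINANT", recommendation="PRIORITIZE",
        safety_uplift="LOW (now) / HIGH (potential)", capability_uplift="NEUTRAL",
        net_world_safety="HELPFUL", scales_with_capability="UNKNOWN",
        vs_deception="STRONG (if works)", scales_to_si="MAYBE",
        adoption="EXPERIMENTAL",
        labs=["Anthropic", "DeepMind", "EleutherAI"],
        critiques=["Doesn't scale yet"],
    ),
    Approach(
        category="Evaluation", investment="$5-15M/yr",
        differential="SAFETY-DOMINANT", recommendation="PRIORITIZE",
        safety_uplift="HIGH (if works)", capability_uplift="NEUTRAL",
        net_world_safety="HELPFUL", scales_with_capability="UNKNOWN",
        vs_deception="UNKNOWN", scales_to_si="UNKNOWN",
        adoption="EXPERIMENTAL",
        labs=["Anthropic", "Redwood", "Academic groups"],
        critiques=["Current methods don't reliably detect sleeper agents"],
    ),
]

# Strongest underfunding claims: safety-dominant, prioritized, barely adopted.
underfunded = [
    r for r in rows
    if r.differential == "SAFETY-DOMINANT"
    and r.recommendation == "PRIORITIZE"
    and r.adoption in ("NONE", "EXPERIMENTAL")
]
for r in underfunded:
    print(f"{r.category} ({r.investment}): {r.critiques[0]}")
```

Both transcribed rows pass the filter, which reflects the table's broader pattern: its PRIORITIZE ratings cluster among approaches with low adoption and comparatively small budgets.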