Similar tables elsewhere:
- FLI AI Safety Index – lab safety scorecards
- METR: Common Elements – policy comparison (12 companies)
- AI Alignment Survey – academic taxonomy
Comparative analysis of AI safety approaches, with particular attention to one question: does each technique actually make the world safer, or does it primarily enable more capable systems?

Key insight: many "safety" techniques have capability uplift as their primary effect. RLHF, for example, is what makes ChatGPT useful; its safety benefit is secondary to its capability benefit. A technique that provides DOMINANT capability uplift with only LOW safety uplift may be net negative for world safety, even if it reduces obvious harms. A minimal code sketch of this heuristic follows the legend below.
Safety Uplift – how much does this reduce catastrophic risk?
- CRITICAL: transformative if it works
- HIGH: significant risk reduction
- MEDIUM: meaningful but limited
- LOW: marginal benefit

Capability Uplift – does it make AI more capable?
- DOMINANT: primary capability driver
- SIGNIFICANT: major capability boost
- SOME: some capability benefit
- NEUTRAL: no capability effect
- TAX: reduces capabilities

Net World Safety – is the world safer with this technique?
- HELPFUL: probably net positive
- UNCLEAR: could go either way
- HARMFUL: likely net negative

Scales to SI? – does it work for superintelligent AI?
- YES: works at superintelligence
- MAYBE: might work
- UNLIKELY: probably breaks
- NO: fundamentally limited

Differential Progress – safety vs. capability progress ratio
- SAFETY-DOMINANT: safety >> capability
- SAFETY-LEANING: safety > capability
- BALANCED: roughly equal
- CAPABILITY-LEANING: capability > safety
- CAPABILITY-DOMINANT: capability >> safety

Recommendation – recommended funding change
- PRIORITIZE: needs much more funding
- INCREASE: should grow
- MAINTAIN: about right
- REDUCE: overfunded relative to its safety benefit
- DEFUND: counterproductive
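To make the rubric concrete, the sketch below encodes the two uplift scales and the introduction's heuristic: flag any technique whose capability uplift is DOMINANT while its safety uplift is only LOW. The enum values mirror the legend; the class and function names are hypothetical illustrations, not part of any published tooling.

```python
from dataclasses import dataclass
from enum import Enum

class SafetyUplift(Enum):
    LOW = 1        # marginal benefit
    MEDIUM = 2     # meaningful but limited
    HIGH = 3       # significant risk reduction
    CRITICAL = 4   # transformative if it works

class CapabilityUplift(Enum):
    TAX = 0          # reduces capabilities
    NEUTRAL = 1      # no capability effect
    SOME = 2         # some capability benefit
    SIGNIFICANT = 3  # major capability boost
    DOMINANT = 4     # primary capability driver

@dataclass
class Technique:
    name: str
    category: str
    safety: SafetyUplift
    capability: CapabilityUplift

def capability_dominant_concern(t: Technique) -> bool:
    """True when capability uplift is DOMINANT but safety uplift is only LOW:
    the pattern the introduction argues may be net negative for world safety."""
    return (t.capability is CapabilityUplift.DOMINANT
            and t.safety is SafetyUplift.LOW)

# Example from the introduction: RLHF as a dominant capability driver with
# low safety uplift (the table below rates its safety uplift LOW-MEDIUM).
rlhf = Technique("RLHF", "Training & Alignment",
                 SafetyUplift.LOW, CapabilityUplift.DOMINANT)
assert capability_dominant_concern(rlhf)
```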
| Category | Current research investment | Differential progress | Recommendation | Safety uplift | Capability uplift | Net world safety | Works as AI gets smarter? | Works against deceptive AI? | Scales to SI? | Current adoption | Labs | Critiques | Architectures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training & Alignment | $1B+/yr | CAPABILITY-DOMINANT | REDUCE | LOW-MEDIUM | DOMINANT | UNCLEAR | BREAKS | NONE | NO | UNIVERSAL | OpenAI, Anthropic, Google, Meta | Goodharting on human approval (+2 more) | Transformers: C, Agents: H, SSMs: H |
| Training & Alignment | $50-200M/yr | CAPABILITY-LEANING | MAINTAIN | MEDIUM | SIGNIFICANT | UNCLEAR | PARTIAL | WEAK | UNLIKELY | WIDESPREAD | Anthropic, Google | Principles may not cover all cases (+2 more) | Transformers: C, Agents: H, SSMs: H |
| Training & Alignment | $5-20M/yr | SAFETY-LEANING | INCREASE | UNKNOWN | SOME | UNCLEAR | MAYBE | PARTIAL | MAYBE | EXPERIMENTAL | OpenAI (research), Anthropic (interest) | May not converge to truth (+2 more) | Transformers: H, Agents: M, SSMs: H |
| Training & Alignment | $100-500M/yr | BALANCED | MAINTAIN | MEDIUM | SIGNIFICANT | HELPFUL | PARTIAL | PARTIAL | UNLIKELY | WIDESPREAD | OpenAI, Google, Anthropic | Expensive annotation (+2 more) | — |
| Training & Alignment | $10-50M/yr | SAFETY-LEANING | INCREASE | UNKNOWN | SOME | UNCLEAR | UNKNOWN | UNKNOWN | MAYBE | EXPERIMENTAL | OpenAI, Anthropic | Early results show partial success only (+2 more) | — |
| Training & Alignment | $500M+/yr | CAPABILITY-DOMINANT | REDUCE | LOW | SIGNIFICANT | UNCLEAR | PARTIAL | NONE | NO | UNIVERSAL | All frontier labs | Reward hacking (+2 more) | — |
| Training & Alignment | $1-5M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | UNKNOWN | PARTIAL | MAYBE | NONE | UC Berkeley CHAI | Hard to implement in practice (+2 more) | — |
| Training & Alignment | $10-30M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | WEAK | UNLIKELY | WIDESPREAD | Anthropic, OpenAI, Google | Specs may be incomplete (+2 more) | — |
| Training & Alignment | $50-150M/yr | BALANCED | MAINTAIN | LOW-MEDIUM | SOME | HELPFUL | PARTIAL | NONE | NO | UNIVERSAL | All frontier labs, security researchers | Arms race with attackers (+2 more) | — |
| Training & Alignment | $5-20M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | UNKNOWN | PARTIAL | MAYBE | EXPERIMENTAL | DeepMind, CHAI, academic groups | Hard to define "cooperation" formally (+2 more) | — |
| Interpretability | $50-150M/yr | SAFETY-DOMINANT | PRIORITIZE | LOW (now) / HIGH (potential) | NEUTRAL | HELPFUL | UNKNOWN | STRONG (if works) | MAYBE | EXPERIMENTAL | Anthropic, DeepMind, EleutherAI, independent researchers | Doesn't scale yet (+2 more) | Transformers: C, Agents: L, SSMs: M |
| Interpretability | $10-30M/yr | SAFETY-DOMINANT | INCREASE | LOW (now) | NEUTRAL | HELPFUL | PARTIAL | PARTIAL | UNKNOWN | EXPERIMENTAL | Anthropic, Apollo, independent researchers | Features may not be functionally important (+2 more) | — |
| Interpretability | $5-20M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | PARTIAL | UNKNOWN | EXPERIMENTAL | Center for AI Safety, various academics | May be superficial control (+2 more) | — |
| Interpretability | $5-10M/yr | SAFETY-DOMINANT | MAINTAIN | LOW | NEUTRAL | HELPFUL | YES | PARTIAL | MAYBE | WIDESPREAD | Many research groups | Probes might not find safety-relevant features (+2 more) | — |
| Evaluation | $20-50M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | PARTIAL | WEAK | UNLIKELY | WIDESPREAD | METR, Apollo, UK AISI, all frontier labs | Evals may not capture real-world risk (+2 more) | — |
| Evaluation | $50-200M/yr | BALANCED | MAINTAIN | LOW-MEDIUM | NEUTRAL | HELPFUL | PARTIAL | NONE | NO | UNIVERSAL | All frontier labs, external red teams | Can't find all failures (+2 more) | — |
| Evaluation | $10-30M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM | NEUTRAL | HELPFUL | UNKNOWN | WEAK | UNLIKELY | SOME | Anthropic, Apollo, UK AISI | May not measure what matters (+2 more) | — |
| Evaluation | $10-30M/yr | SAFETY-DOMINANT | INCREASE | LOW-MEDIUM | NEUTRAL | HELPFUL | PARTIAL | WEAK | UNLIKELY | SOME | METR, UK AISI, Apollo, RAND | Auditors may lack expertise (+2 more) | — |
| Evaluation | $5-15M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM-HIGH | TAX | HELPFUL | PARTIAL | PARTIAL | UNLIKELY | EXPERIMENTAL | UK AISI, Anthropic, DeepMind | What counts as sufficient evidence? (+2 more) | — |
| Evaluation | $10-30M/yr | SAFETY-LEANING | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | WEAK | NO | SOME | METR, Anthropic, Apollo, UK AISI | Can't prove absence of capability (+2 more) | — |
| Evaluation | $5-15M/yr | SAFETY-DOMINANT | PRIORITIZE | HIGH (if works) | NEUTRAL | HELPFUL | UNKNOWN | UNKNOWN | UNKNOWN | EXPERIMENTAL | Anthropic, Redwood, academic groups | Current methods don't reliably detect sleeper agents (+2 more) | — |
| Architectural | $50-200M/yr | BALANCED | MAINTAIN | LOW | TAX | NEUTRAL | BREAKS | NONE | NO | UNIVERSAL | All deployment labs | Easily jailbroken (+2 more) | — |
| Architectural | (included in RLHF) | BALANCED | MAINTAIN | LOW-MEDIUM | TAX | NEUTRAL | BREAKS | NONE | NO | UNIVERSAL | All frontier labs | Consistently jailbroken (+2 more) | — |
| Architectural | $20-50M/yr | SAFETY-LEANING | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | PARTIAL | UNLIKELY | SOME | Anthropic, OpenAI, various | Reduces usefulness (+2 more) | Transformers: M, Agents: C, SSMs: M |
| Architectural | $10-30M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | PARTIAL | PARTIAL | WIDESPREAD | All agentic system developers | Limits usefulness (+2 more) | — |
| Architectural | $10-30M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | PARTIAL | WEAK | NO | SOME | Various | Deceptive AI evades monitors (+2 more) | — |
| Architectural | $10-30M/yr | SAFETY-LEANING | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | WEAK | NO | SOME | Gray Swan, Anthropic, various | Reactive not proactive (+2 more) | — |
| Architectural | $20-50M/yr | SAFETY-LEANING | MAINTAIN | MEDIUM-HIGH | TAX | HELPFUL | YES | N/A | PARTIAL | WIDESPREAD | OpenAI, Anthropic, Google | Open-source pressure (+2 more) | — |
| Governance | $5-20M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM-HIGH | NEGATIVE | HELPFUL | YES | N/A | PARTIAL | SOME | GovAI, CSET, RAND | Hard to implement globally (+2 more) | — |
| Governance | $5-15M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | NEUTRAL | HELPFUL | UNKNOWN | PARTIAL | UNLIKELY | SOME | Anthropic, OpenAI, DeepMind | Voluntary and unenforceable (+2 more) | — |
| Governance | $10-30M/yr | SAFETY-DOMINANT | INCREASE | MEDIUM | TAX | HELPFUL | PARTIAL | WEAK | NO | SOME | Regulators, policy orgs | Evals may be inadequate (+2 more) | — |
| Governance | $5-15M/yr | SAFETY-DOMINANT | INCREASE | LOW-MEDIUM | TAX | HELPFUL | YES | N/A | PARTIAL | EXPERIMENTAL | Policy organizations, regulators | Enforcement challenges (+2 more) | — |
| Governance | $1-5M/yr | SAFETY-DOMINANT | MAINTAIN | HIGH (if implemented) | NEGATIVE | UNCLEAR | UNKNOWN | N/A | YES (if works) | NONE | FLI, PauseAI, CAIS | Unenforceable internationally (+2 more) | — |
| Governance | $10-30M/yr | SAFETY-DOMINANT | PRIORITIZE | MEDIUM-HIGH | TAX | HELPFUL | PARTIAL | N/A | PARTIAL | EXPERIMENTAL | GovAI, CSET, UN AI Advisory Body | Great power competition (+2 more) | — |
| Theoretical | $5-20M/yr | SAFETY-DOMINANT | INCREASE | HIGH (if achievable) | TAX | HELPFUL | UNKNOWN | STRONG (if works) | MAYBE | NONE | Academic groups, MIRI (historically) | Doesn't scale to current models (+2 more) | — |
| Theoretical | $10-50M/yr | SAFETY-DOMINANT | INCREASE | CRITICAL (if works) | TAX | HELPFUL | UNKNOWN | STRONG (by design) | YES (if works) | NONE | ARIA, MIRI | May be impossible (+2 more) | — |
| Theoretical | $1-5M/yr | SAFETY-DOMINANT | PRIORITIZE | HIGH (if solved) | NEUTRAL | HELPFUL | UNKNOWN | PARTIAL | MAYBE | NONE | MIRI, academic groups | Unsolved theoretically (+2 more) | — |
| Theoretical | $5-20M/yr | BALANCED | INCREASE | MEDIUM | SOME | HELPFUL | PARTIAL | N/A | UNKNOWN | EXPERIMENTAL | DeepMind, Anthropic, academic groups | Problem well-characterized; solutions lacking (+2 more) | — |
| Theoretical | $5-15M/yr | SAFETY-LEANING | PRIORITIZE | HIGH (if solved) | SOME | HELPFUL | UNKNOWN | STRONG (if solved) | MAYBE | NONE | ARC, academic groups | Unsolved despite significant effort (+2 more) | — |
| Theoretical | $5-20M/yr | SAFETY-DOMINANT | INCREASE | HIGH (if works) | NEGATIVE | HELPFUL | UNKNOWN | WEAK | UNLIKELY | EXPERIMENTAL | Academic groups, Center for AI Safety | Capabilities may be recoverable (+2 more) | — |
| Theoretical | $10-30M/yr | SAFETY-DOMINANT | PRIORITIZE | HIGH | TAX | HELPFUL | UNKNOWN | PARTIAL | CRITICAL QUESTION | SOME | Redwood Research, Anthropic, MIRI | May not scale to superhuman systems (+2 more) | — |
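For readers who want to slice the table programmatically (for example, to list every row recommended for PRIORITIZE funding), here is a minimal sketch. The `parse_rows` helper and the embedded two-row excerpt are illustrative assumptions, not part of any existing tool; in practice you would paste in the full table above.

```python
# Minimal sketch: parse a markdown table and filter by the Recommendation column.
TABLE_MD = """\
| Category | Current research investment | Differential progress | Recommendation |
|---|---|---|---|
| Interpretability | $50-150M/yr | SAFETY-DOMINANT | PRIORITIZE |
| Training & Alignment | $1B+/yr | CAPABILITY-DOMINANT | REDUCE |
"""

def parse_rows(md: str) -> list[dict]:
    # Drop the |---| separator line, keep header and data rows.
    lines = [l for l in md.strip().splitlines() if not set(l) <= set("|- ")]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    return [dict(zip(header, (c.strip() for c in line.strip("|").split("|"))))
            for line in lines[1:]]

priorities = [r for r in parse_rows(TABLE_MD) if r["Recommendation"] == "PRIORITIZE"]
print(priorities)  # -> the Interpretability row in this excerpt
```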