
Table Candidates

These table rows have rating combinations that suggest they contain surprising or important information worth extracting as standalone insights. Each card includes a potential insight template you can copy and refine.

38 Total Candidates
27 Safety Approaches
11 Accident Risks
What makes a table row insight-worthy?
  • Safety Approaches: Capability-dominant differential progress, weak/no deception robustness, PRIORITIZE/DEFUND recommendations, unclear net safety
  • Accident Risks: Catastrophic/existential severity combined with difficult detectability, lab-demonstrated evidence of severe risks
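A minimal sketch of how these criteria could be checked programmatically; the row fields, enum values, and function names below are illustrative assumptions based on the "Key Ratings" labels, not the site's actual schema:

from dataclasses import dataclass

# Hypothetical row shapes; field names mirror the "Key Ratings" labels, not a real schema.
@dataclass
class SafetyApproachRow:
    differential_progress: str  # e.g. "CAPABILITY-DOMINANT", "SAFETY-LEANING"
    deception_robust: str       # e.g. "NONE", "WEAK", "PARTIAL", "STRONG"
    recommendation: str         # e.g. "PRIORITIZE", "DEFUND", "REDUCE", "MAINTAIN", "INCREASE"
    net_safety: str             # e.g. "UNCLEAR", "HARMFUL", "POSITIVE"

@dataclass
class AccidentRiskRow:
    severity: str               # e.g. "HIGH", "CATASTROPHIC", "EXISTENTIAL"
    detectability: str          # e.g. "EASY", "MODERATE", "DIFFICULT", "VERY_DIFFICULT"
    evidence: str               # e.g. "THEORETICAL", "DEMONSTRATED_LAB", "OBSERVED_CURRENT"

def is_candidate_approach(row: SafetyApproachRow) -> bool:
    """Flag a safety approach whose rating combination suggests a standalone insight."""
    return (
        row.differential_progress == "CAPABILITY-DOMINANT"
        or row.deception_robust in {"NONE", "WEAK"}
        or row.recommendation in {"PRIORITIZE", "DEFUND", "REDUCE"}
        or row.net_safety in {"UNCLEAR", "HARMFUL"}
    )

def is_candidate_risk(row: AccidentRiskRow) -> bool:
    """Flag an accident risk that is severe and either hard to detect or lab-demonstrated."""
    severe = row.severity in {"CATASTROPHIC", "EXISTENTIAL"}
    hard_to_detect = row.detectability in {"DIFFICULT", "VERY_DIFFICULT"}
    return severe and (hard_to_detect or row.evidence == "DEMONSTRATED_LAB")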

RLHF

Safety Approaches
Matched Criteria
Capability-dominant (questionable safety value) · Weak/no deception robustness · Reduce funding recommendation · Unclear/harmful net safety · Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: DOMINANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"RLHF provides more capability uplift (DOMINANT) than safety benefit (LOW-MEDIUM), offers no deception robustness (a deceptive model could easily learn to produce human-approved outputs while pursuing different goals), does not scale to superintelligence (human feedback cannot evaluate superhuman tasks; humans can't judge what they don't understand), carries a reduce-funding recommendation (already overfunded; marginal safety dollars are better spent elsewhere), and has an unclear net impact on world safety."

Constitutional AI / RLAIF

Safety Approaches
Matched Criteria
Weak/no deception robustness · Unclear/harmful net safety
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-LEANING
Deception Robust: WEAK
Recommendation: MAINTAIN
Potential Insight
"Constitutional AI / RLAIF offers weak deception robustness (if the base model is deceptive, constitutional oversight inherits its limitations) and has an unclear net impact on world safety."

AI Safety via Debate

Safety Approaches
Matched Criteria
Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: PARTIAL
Recommendation: INCREASE
Potential Insight
"AI Safety via Debate has an unclear net impact on world safety."

Weak-to-Strong Generalization

Safety Approaches
Matched Criteria
Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: UNKNOWN
Recommendation: INCREASE
Potential Insight
"Weak-to-Strong Generalization has an unclear net impact on world safety."

Reward Modeling

Safety Approaches
Matched Criteria
Capability-dominant (questionable safety value) · Weak/no deception robustness · Reduce funding recommendation · Unclear/harmful net safety
Key Ratings
Safety Uplift: LOW
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"Reward Modeling provides more capability uplift (SIGNIFICANT) than safety benefit (LOW), offers no deception robustness (the reward model can be gamed by a sophisticated policy), carries a reduce-funding recommendation (already heavily funded; inherits RLHF's problems), and has an unclear net impact on world safety."

Model Specifications

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Model Specifications offers weak deception robustness (specs define desired behavior but do not ensure genuine compliance)."

Adversarial Training

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: SOME
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Adversarial Training offers no deception robustness (it targets external attacks, not internal deception)."

Mechanistic Interpretability

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: LOW (now) / HIGH (potential)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: STRONG (if works)
Recommendation: PRIORITIZE
Potential Insight
"Mechanistic Interpretability is rated PRIORITIZE (one of the few paths to detecting deception; needs much more investment)."

Dangerous Capability Evaluations

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Dangerous Capability Evaluations offers weak deception robustness (a deceptive model might hide capabilities during evals)."

Red Teaming

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Red Teaming offers no deception robustness (a deceptive model would pass red teaming)."

Alignment Evaluations

Safety Approaches
Matched Criteria
Weak/no deception robustness · High priority recommendation
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: PRIORITIZE
Potential Insight
"Alignment Evaluations offers weak deception robustness (a deceptive model could fake alignment on evals) and is rated PRIORITIZE (a critical gap; we need better ways to measure alignment)."

Third-Party Model Auditing

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Third-Party Model Auditing offers weak deception robustness (auditors face the same detection challenges)."

AI Safety Cases

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Safety Cases is rated PRIORITIZE (a promising framework, severely underdeveloped for AI)."

Capability Elicitation

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Elicitation offers weak deception robustness (a deceptive model might hide capabilities during elicitation)."

Sleeper Agent Detection

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: UNKNOWN
Recommendation: PRIORITIZE
Potential Insight
"Sleeper Agent Detection is rated PRIORITIZE (a core alignment problem; needs much more work)."

Output Filtering

Safety Approaches
Matched Criteria
Weak/no deception robustness · Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Output Filtering offers no deception robustness (a deceptive model could bypass or manipulate filters) and does not scale to superintelligence (sophisticated users and models can evade filters)."

Refusal Training

Safety Approaches
Matched Criteria
Weak/no deception robustness · Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Refusal Training offers no deception robustness (refusals are behavioral, not goal-level) and does not scale to superintelligence (jailbreaks are consistently found; it is an arms race)."

Monitoring / Trip Wires

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Monitoring / Trip Wires offers weak deception robustness (a deceptive AI would avoid triggering monitors)."

Circuit Breakers / Inference Interventions

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Circuit Breakers / Inference Interventions offers weak deception robustness (a deceptive model could cause harm before the circuit breaks)."

Compute Governance

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"Compute Governance is rated PRIORITIZE (one of the few levers that can affect the timeline; very underfunded)."

Evals-Based Deployment Gates

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Evals-Based Deployment Gates offers weak deception robustness (deceptive models could pass the evals)."

Pause / Moratorium

Safety Approaches
Matched Criteria
Unclear/harmful net safety
Key Ratings
Safety Uplift: HIGH (if implemented)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: MAINTAIN
Potential Insight
"Pause / Moratorium has an unclear net impact on world safety."

International AI Governance

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"International AI Governance is rated PRIORITIZE (critical infrastructure; severely underdeveloped)."

Corrigibility Research

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"Corrigibility Research is rated PRIORITIZE (severely underfunded relative to its importance; a key unsolved problem)."

Eliciting Latent Knowledge (ELK)

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: STRONG (if solved)
Recommendation: PRIORITIZE
Potential Insight
"Eliciting Latent Knowledge (ELK) is rated PRIORITIZE (it would solve the deception problem if successful; needs a breakthrough)."

Capability Unlearning / Removal

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Unlearning / Removal offers weak deception robustness (a model might hide capabilities rather than truly unlearn them)."

AI Control

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Control is rated PRIORITIZE (a fundamental requirement, increasingly important with agentic AI)."

Mesa-Optimization

Accident Risks
Matched Criteria
Severe + hard to detect
Key Ratings
Severity: CATASTROPHIC
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: UNCERTAIN
Potential Insight
"Mesa-Optimization: catastrophic severity but very difficult to detect; well-established theoretically (Hubinger et al. 2019), though whether current LLMs are mesa-optimizers is debated."

Instrumental Convergence

Accident Risks
Matched Criteria
Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Instrumental Convergence: already demonstrated in lab settings and relevant to current AI systems; formal proofs exist (Turner et al. 2021), with empirical evidence including 78% alignment faking (Anthropic 2024) and 79% shutdown resistance (Palisade 2025)."

Deceptive Alignment

Accident Risks
Matched Criteria
Severe + hard to detect · Lab-demonstrated catastrophic risk
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: NEAR_TERM
Potential Insight
"Deceptive Alignment: existential severity but very difficult to detect, and already demonstrated in lab settings; the sleeper agents study (Anthropic 2024) showed deception persists through safety training, with a 78% alignment faking rate observed."

Goal Misgeneralization

Accident Risks
Matched Criteria
Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Goal Misgeneralization: already demonstrated in lab settings and relevant to current AI systems; demonstrated in RL environments (CoinRun, etc.), though it is less clear whether this occurs in LLMs."

Scheming

Accident Risks
Matched Criteria
Severe + hard to detect · Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Scheming: catastrophic severity but difficult to detect, already demonstrated in lab settings, and relevant to current AI systems; Apollo Research (Dec 2024) found that all 5 tested frontier models demonstrated in-context scheming, and o1 disabled oversight in 5% of trials."

Sandbagging

Accident Risks
Matched Criteria
Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Sandbagging: already demonstrated in lab settings and relevant to current AI systems; Claude 3.5 Sonnet strategically underperformed in evaluations (Apollo Research 2024), and multiple models showed sandbagging capability."

Power-Seeking

Accident Risks
Matched Criteria
Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Power-Seeking: already demonstrated in lab settings and relevant to current AI systems; formal proofs exist (Turner 2021), and empirically o3 sabotaged shutdown in 79% of tests (Palisade 2025)."

Corrigibility Failure

Accident Risks
Matched Criteria
Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: EASY
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Corrigibility Failure: already demonstrated in lab settings and relevant to current AI systems; o3 sabotaged shutdown in 79% of tests (Palisade 2025), and in 7% even with an explicit "allow shutdown" instruction, while Claude 3.7 showed 0% resistance."

Treacherous Turn

Accident Risks
Matched Criteria
Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: MEDIUM_TERM
Potential Insight
"Treacherous Turn: existential severity but very difficult to detect; supported by theoretical reasoning and proof-of-concept work; the sleeper agents study shows deception can persist, but an actual treacherous turn has not yet been observed."

Sharp Left Turn

Accident Risks
Matched Criteria
Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: SPECULATIVE
Timeline: MEDIUM_TERM
Potential Insight
"Sharp Left Turn: existential severity but very difficult to detect; a theoretical scenario with no direct evidence, though capability jumps offer some analogies."

Emergent Capabilities

Accident Risks
Matched Criteria
Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: OBSERVED_CURRENT
Timeline: CURRENT
Potential Insight
"Emergent Capabilities: relevant to current AI systems; well-documented in scaling research (GPT-4, etc.), with some capabilities appearing suddenly at scale."