Table Candidates
These table rows have rating combinations that suggest they contain surprising or important information worth extracting as standalone insights. Each card includes a potential insight template you can copy and refine.
38 Total Candidates · 27 Safety Approaches · 11 Accident Risks
What makes a table row insight-worthy?
- Safety Approaches: Capability-dominant differential progress, weak/no deception robustness, PRIORITIZE/DEFUND recommendations, unclear net safety
- Accident Risks: Catastrophic/existential severity combined with difficult detectability, lab-demonstrated evidence of severe risks (one way these rules could be encoded is sketched below)
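A minimal sketch of these flagging rules, assuming hypothetical field names and rating labels inferred from the Key Ratings shown on the cards (the actual table schema is not given here), might look like this:

```python
# Hypothetical encoding of the flagging rules above. Field names and rating
# labels are assumptions inferred from the "Key Ratings" shown on each card.

def matched_criteria_for_approach(row: dict) -> list[str]:
    """Return matched criteria for a Safety Approaches row (empty list = not flagged)."""
    matched = []
    if row.get("differential_progress") == "CAPABILITY-DOMINANT":
        matched.append("Capability-dominant (questionable safety value)")
    if row.get("deception_robust") in ("NONE", "WEAK"):
        matched.append("Weak/no deception robustness")
    if row.get("recommendation") == "PRIORITIZE":
        matched.append("High priority recommendation")
    if row.get("recommendation") in ("REDUCE", "DEFUND"):
        matched.append("Reduce funding recommendation")
    if row.get("net_safety") in ("UNCLEAR", "HARMFUL"):
        matched.append("Unclear/harmful net safety")
    if row.get("scales_to_superintelligence") is False:
        matched.append("Does not scale to superintelligence")
    return matched


def matched_criteria_for_risk(row: dict) -> list[str]:
    """Return matched criteria for an Accident Risks row (empty list = not flagged)."""
    matched = []
    severe = row.get("severity") in ("HIGH", "CATASTROPHIC", "EXISTENTIAL")
    catastrophic = row.get("severity") in ("CATASTROPHIC", "EXISTENTIAL")
    hard_to_detect = row.get("detectability") in ("DIFFICULT", "VERY_DIFFICULT")
    if catastrophic and hard_to_detect:
        matched.append("Severe + hard to detect")
    if catastrophic and row.get("evidence") == "DEMONSTRATED_LAB":
        matched.append("Lab-demonstrated catastrophic risk")
    if severe and row.get("timeline") == "CURRENT":
        matched.append("Current timeline + severe")
    return matched
```

Applied to the cards below, these predicates reproduce the listed matched criteria; note that `net_safety` and `scales_to_superintelligence` are assumed fields, since they appear only in the criteria and insight text, not in the Key Ratings.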
RLHF
Safety Approaches
Matched Criteria: Capability-dominant (questionable safety value); Weak/no deception robustness; Reduce funding recommendation; Unclear/harmful net safety; Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: DOMINANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"RLHF provides more capability uplift (DOMINANT) than safety benefit (LOW-MEDIUM), offers none deception robustness - A deceptive model could easily learn to produce human-approved outputs while having different goals, does not scale to superintelligence - Human feedback can't scale to superhuman tasks; humans can't evaluate what they can't understand, is recommended to reduce funding (Already overfunded; marginal safety $ better spent elsewhere), has unclear net impact on world safety."
Constitutional AI / RLAIF
Safety Approaches
Matched Criteria: Weak/no deception robustness; Unclear/harmful net safety
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-LEANING
Deception Robust: WEAK
Recommendation: MAINTAIN
Potential Insight
"Constitutional AI / RLAIF offers weak deception robustness - If base model is deceptive, constitutional AI oversight inherits limitations, has unclear net impact on world safety."
AI Safety via Debate
Safety Approaches
Matched Criteria: Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: PARTIAL
Recommendation: INCREASE
Potential Insight
"AI Safety via Debate has unclear net impact on world safety."
Weak-to-Strong Generalization
Safety Approaches
Matched Criteria: Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: UNKNOWN
Recommendation: INCREASE
Potential Insight
"Weak-to-Strong Generalization has unclear net impact on world safety."
Reward Modeling
Safety Approaches
Matched Criteria: Capability-dominant (questionable safety value); Weak/no deception robustness; Reduce funding recommendation; Unclear/harmful net safety
Key Ratings
Safety Uplift: LOW
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"Reward Modeling provides more capability uplift (SIGNIFICANT) than safety benefit (LOW), offers none deception robustness - Reward model can be gamed by sophisticated policy, is recommended to reduce funding (Already heavily funded; inherits RLHF problems), has unclear net impact on world safety."
Model Specifications
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Model Specifications offers weak deception robustness - Specs define behavior; don't ensure genuine compliance."
Adversarial Training
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: SOME
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Adversarial Training offers none deception robustness - Adversarial training targets external attacks, not internal deception."
Mechanistic Interpretability
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: LOW (now) / HIGH (potential)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: STRONG (if works)
Recommendation: PRIORITIZE
Potential Insight
"Mechanistic Interpretability is rated PRIORITIZE (One of few paths to detecting deception; needs much more investment)."
Dangerous Capability Evaluations
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Dangerous Capability Evaluations offers weak deception robustness - Deceptive model might hide capabilities during evals."
Red Teaming
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Red Teaming offers none deception robustness - Deceptive model would pass red teaming."
Alignment Evaluations
Safety Approaches
Matched Criteria: Weak/no deception robustness; High priority recommendation
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: PRIORITIZE
Potential Insight
"Alignment Evaluations offers weak deception robustness - Deceptive model could fake alignment on evals, is rated PRIORITIZE (Critical gap; we need better ways to measure alignment)."
Third-Party Model Auditing
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Third-Party Model Auditing offers weak deception robustness - Auditors face same detection challenges."
AI Safety Cases
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Safety Cases is rated PRIORITIZE (Promising framework; severely underdeveloped for AI)."
Capability Elicitation
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Elicitation offers weak deception robustness - Deceptive model might hide capabilities during elicitation."
Sleeper Agent Detection
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: UNKNOWN
Recommendation: PRIORITIZE
Potential Insight
"Sleeper Agent Detection is rated PRIORITIZE (Core alignment problem; needs much more work)."
Output Filtering
Safety Approaches
Matched Criteria: Weak/no deception robustness; Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Output Filtering offers none deception robustness - Deceptive model could bypass or manipulate filters, does not scale to superintelligence - Sophisticated users/models can evade filters."
Refusal Training
Safety Approaches
Matched Criteria: Weak/no deception robustness; Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Refusal Training offers none deception robustness - Refusals are behavioral, not goal-level, does not scale to superintelligence - Jailbreaks consistently found; arms race."
Monitoring / Trip Wires
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Monitoring / Trip Wires offers weak deception robustness - Deceptive AI would avoid triggering monitors."
Circuit Breakers / Inference Interventions
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Circuit Breakers / Inference Interventions offers weak deception robustness - Deceptive model could generate harm before circuit breaks."
Compute Governance
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"Compute Governance is rated PRIORITIZE (One of few levers to affect timeline; very underfunded)."
Evals-Based Deployment Gates
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Evals-Based Deployment Gates offers weak deception robustness - Deceptive models could pass evals."
Pause / Moratorium
Safety Approaches
Matched Criteria: Unclear/harmful net safety
Key Ratings
Safety Uplift: HIGH (if implemented)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: MAINTAIN
Potential Insight
"Pause / Moratorium has unclear net impact on world safety."
International AI Governance
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"International AI Governance is rated PRIORITIZE (Critical infrastructure; severely underdeveloped)."
Corrigibility Research
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"Corrigibility Research is rated PRIORITIZE (Severely underfunded for importance; key unsolved problem)."
Eliciting Latent Knowledge (ELK)
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: STRONG (if solved)
Recommendation: PRIORITIZE
Potential Insight
"Eliciting Latent Knowledge (ELK) is rated PRIORITIZE (Solves deception problem if successful; needs breakthrough)."
Capability Unlearning / Removal
Safety Approaches
Matched Criteria: Weak/no deception robustness
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Unlearning / Removal offers weak deception robustness - Model might hide rather than truly unlearn capabilities."
AI Control
Safety Approaches
Matched Criteria: High priority recommendation
Key Ratings
Safety Uplift: HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Control is rated PRIORITIZE (Fundamental requirement; increasingly important with agentic AI)."
Mesa-Optimization
Accident Risks
Matched Criteria: Severe + hard to detect
Key Ratings
Severity: CATASTROPHIC
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: UNCERTAIN
Potential Insight
"Mesa-Optimization: catastrophic severity but very difficult to detect; Well-established theoretically (Hubinger et al. 2019). Whether current LLMs are mesa-optimizers is debated.."
Instrumental Convergence
Accident Risks
Matched Criteria: Lab-demonstrated catastrophic risk; Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Instrumental Convergence: already demonstrated in lab settings; relevant to current AI systems; Formal proofs (Turner et al. 2021). Empirical evidence: 78% alignment faking (Anthropic 2024), 79% shutdown resistance (Palisade 2025).."
Deceptive Alignment
Accident Risks
Matched Criteria: Severe + hard to detect; Lab-demonstrated catastrophic risk
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: NEAR_TERM
Potential Insight
"Deceptive Alignment: existential severity but very difficult to detect; already demonstrated in lab settings; Sleeper agents study (Anthropic 2024) showed deception persists through safety training. 78% alignment faking rate observed.."
Goal Misgeneralization
Accident Risks
Matched Criteria: Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Goal Misgeneralization: already demonstrated in lab settings; relevant to current AI systems; Demonstrated in RL environments (CoinRun, etc.). Less clear if this occurs in LLMs.."
Scheming
Accident Risks
Matched Criteria: Severe + hard to detect; Lab-demonstrated catastrophic risk; Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Scheming: catastrophic severity but difficult to detect; already demonstrated in lab settings; relevant to current AI systems; Apollo Research (Dec 2024): All 5 tested frontier models demonstrated in-context scheming. o1 disabled oversight in 5% of trials.."
Sandbagging
Accident Risks
Matched Criteria: Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Sandbagging: already demonstrated in lab settings; relevant to current AI systems; Claude 3.5 Sonnet strategically underperformed in evaluations (Apollo Research 2024). Multiple models showed sandbagging capability.."
Power-Seeking
Accident Risks
Matched Criteria: Lab-demonstrated catastrophic risk; Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Power-Seeking: already demonstrated in lab settings; relevant to current AI systems; Formal proofs (Turner 2021). Empirical: o3 sabotaged shutdown in 79% of tests (Palisade 2025).."
Corrigibility Failure
Accident Risks
Matched Criteria: Lab-demonstrated catastrophic risk; Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: EASY
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Corrigibility Failure: already demonstrated in lab settings; relevant to current AI systems; o3 sabotaged shutdown in 79% of tests (Palisade 2025). 7% even with explicit "allow shutdown" instruction. Claude 3.7 showed 0% resistance.."
Treacherous Turn
Accident Risks
Matched Criteria: Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: MEDIUM_TERM
Potential Insight
"Treacherous Turn: existential severity but very difficult to detect; Theoretical reasoning + proof-of-concept. Sleeper agents study shows deception can persist; actual treacherous turn not yet observed.."
Sharp Left Turn
Accident Risks
Matched Criteria: Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: SPECULATIVE
Timeline: MEDIUM_TERM
Potential Insight
"Sharp Left Turn: existential severity but very difficult to detect; Theoretical scenario. No direct evidence. Some analogies in capability jumps.."
Emergent Capabilities
Accident Risks
Matched Criteria: Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: OBSERVED_CURRENT
Timeline: CURRENT
Potential Insight
"Emergent Capabilities: relevant to current AI systems; Well-documented in scaling research (GPT-4, etc.). Some capabilities appear suddenly at scale.."