
Table Candidates

These table rows have rating combinations that suggest they contain surprising or important information worth extracting as standalone insights. Each card includes a potential insight template you can copy and refine.

38 Total Candidates
27 Safety Approaches
11 Accident Risks
What makes a table row insight-worthy?
  • Safety Approaches: Capability-dominant differential progress, weak/no deception robustness, PRIORITIZE/DEFUND recommendations, unclear net safety
  • Accident Risks: Catastrophic/existential severity combined with difficult detectability, lab-demonstrated evidence of severe risks
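A minimal sketch of how these criteria could be checked programmatically; the row fields, enum values, and function names below are illustrative assumptions based on the "Key Ratings" labels, not the site's actual schema:

from dataclasses import dataclass

# Hypothetical row shapes; field names mirror the "Key Ratings" labels, not a real schema.
@dataclass
class SafetyApproachRow:
    differential_progress: str  # e.g. "CAPABILITY-DOMINANT", "SAFETY-LEANING"
    deception_robust: str       # e.g. "NONE", "WEAK", "PARTIAL", "STRONG"
    recommendation: str         # e.g. "PRIORITIZE", "DEFUND", "REDUCE", "MAINTAIN", "INCREASE"
    net_safety: str             # e.g. "UNCLEAR", "HARMFUL", "POSITIVE"

@dataclass
class AccidentRiskRow:
    severity: str               # e.g. "HIGH", "CATASTROPHIC", "EXISTENTIAL"
    detectability: str          # e.g. "EASY", "MODERATE", "DIFFICULT", "VERY_DIFFICULT"
    evidence: str               # e.g. "THEORETICAL", "DEMONSTRATED_LAB", "OBSERVED_CURRENT"

def is_candidate_approach(row: SafetyApproachRow) -> bool:
    """Flag a safety approach whose rating combination suggests a standalone insight."""
    return (
        row.differential_progress == "CAPABILITY-DOMINANT"
        or row.deception_robust in {"NONE", "WEAK"}
        or row.recommendation in {"PRIORITIZE", "DEFUND", "REDUCE"}
        or row.net_safety in {"UNCLEAR", "HARMFUL"}
    )

def is_candidate_risk(row: AccidentRiskRow) -> bool:
    """Flag an accident risk that is severe and either hard to detect or lab-demonstrated."""
    severe = row.severity in {"CATASTROPHIC", "EXISTENTIAL"}
    hard_to_detect = row.detectability in {"DIFFICULT", "VERY_DIFFICULT"}
    return severe and (hard_to_detect or row.evidence == "DEMONSTRATED_LAB")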

RLHF

Safety Approaches
Matched Criteria
Capability-dominant (questionable safety value) · Weak/no deception robustness · Reduce funding recommendation · Unclear/harmful net safety · Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: DOMINANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"RLHF provides more capability uplift (DOMINANT) than safety benefit (LOW-MEDIUM), offers no deception robustness (a deceptive model could easily learn to produce human-approved outputs while pursuing different goals), does not scale to superintelligence (human feedback cannot evaluate superhuman tasks; humans can't judge what they don't understand), carries a reduce-funding recommendation (already overfunded; marginal safety dollars are better spent elsewhere), and has an unclear net impact on world safety."

Constitutional AI / RLAIF

Safety Approaches
Matched Criteria
Weak/no deception robustness · Unclear/harmful net safety
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-LEANING
Deception Robust: WEAK
Recommendation: MAINTAIN
Potential Insight
"Constitutional AI / RLAIF offers weak deception robustness (if the base model is deceptive, constitutional oversight inherits its limitations) and has an unclear net impact on world safety."

AI Safety via Debate

Safety Approaches
Matched Criteria
Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: PARTIAL
Recommendation: INCREASE
Potential Insight
"AI Safety via Debate has an unclear net impact on world safety."

Weak-to-Strong Generalization

Safety Approaches
Matched Criteria
Unclear/harmful net safety
Key Ratings
Safety Uplift: UNKNOWN
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: UNKNOWN
Recommendation: INCREASE
Potential Insight
"Weak-to-Strong Generalization has an unclear net impact on world safety."

Reward Modeling

Safety Approaches
Matched Criteria
Capability-dominant (questionable safety value) · Weak/no deception robustness · Reduce funding recommendation · Unclear/harmful net safety
Key Ratings
Safety Uplift: LOW
Capability Uplift: SIGNIFICANT
Differential Progress: CAPABILITY-DOMINANT
Deception Robust: NONE
Recommendation: REDUCE
Potential Insight
"Reward Modeling provides more capability uplift (SIGNIFICANT) than safety benefit (LOW), offers no deception robustness (the reward model can be gamed by a sophisticated policy), carries a reduce-funding recommendation (already heavily funded; inherits RLHF's problems), and has an unclear net impact on world safety."

Model Specifications

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Model Specifications offers weak deception robustness (specs define desired behavior but do not ensure genuine compliance)."

Adversarial Training

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: SOME
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Adversarial Training offers no deception robustness (it targets external attacks, not internal deception)."

Mechanistic Interpretability

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: LOW (now) / HIGH (potential)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: STRONG (if works)
Recommendation: PRIORITIZE
Potential Insight
"Mechanistic Interpretability is rated PRIORITIZE (one of the few paths to detecting deception; needs much more investment)."

Dangerous Capability Evaluations

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Dangerous Capability Evaluations offers weak deception robustness (a deceptive model might hide capabilities during evals)."

Red Teaming

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Red Teaming offers no deception robustness (a deceptive model would pass red teaming)."

Alignment Evaluations

Safety Approaches
Matched Criteria
Weak/no deception robustness · High priority recommendation
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: PRIORITIZE
Potential Insight
"Alignment Evaluations offers weak deception robustness (a deceptive model could fake alignment on evals) and is rated PRIORITIZE (a critical gap; we need better ways to measure alignment)."

Third-Party Model Auditing

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Third-Party Model Auditing offers weak deception robustness (auditors face the same detection challenges)."

AI Safety Cases

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Safety Cases is rated PRIORITIZE (a promising framework, severely underdeveloped for AI)."

Capability Elicitation

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Elicitation offers weak deception robustness (a deceptive model might hide capabilities during elicitation)."

Sleeper Agent Detection

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: UNKNOWN
Recommendation: PRIORITIZE
Potential Insight
"Sleeper Agent Detection is rated PRIORITIZE (a core alignment problem; needs much more work)."

Output Filtering

Safety Approaches
Matched Criteria
Weak/no deception robustness · Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Output Filtering offers no deception robustness (a deceptive model could bypass or manipulate filters) and does not scale to superintelligence (sophisticated users and models can evade filters)."

Refusal Training

Safety Approaches
Matched Criteria
Weak/no deception robustness · Does not scale to superintelligence
Key Ratings
Safety Uplift: LOW-MEDIUM
Capability Uplift: TAX
Differential Progress: BALANCED
Deception Robust: NONE
Recommendation: MAINTAIN
Potential Insight
"Refusal Training offers no deception robustness (refusals are behavioral, not goal-level) and does not scale to superintelligence (jailbreaks are consistently found; it is an arms race)."

Monitoring / Trip Wires

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Monitoring / Trip Wires offers weak deception robustness (a deceptive AI would avoid triggering monitors)."

Circuit Breakers / Inference Interventions

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-LEANING
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Circuit Breakers / Inference Interventions offers weak deception robustness (a deceptive model could cause harm before the circuit breaks)."

Compute Governance

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"Compute Governance is rated PRIORITIZE (one of the few levers that can affect the timeline; very underfunded)."

Evals-Based Deployment Gates

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: MEDIUM
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Evals-Based Deployment Gates offers weak deception robustness (deceptive models could pass the evals)."

Pause / Moratorium

Safety Approaches
Matched Criteria
Unclear/harmful net safety
Key Ratings
Safety Uplift: HIGH (if implemented)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: MAINTAIN
Potential Insight
"Pause / Moratorium has an unclear net impact on world safety."

International AI Governance

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: MEDIUM-HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: N/A
Recommendation: PRIORITIZE
Potential Insight
"International AI Governance is rated PRIORITIZE (critical infrastructure; severely underdeveloped)."

Corrigibility Research

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: NEUTRAL
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"Corrigibility Research is rated PRIORITIZE (severely underfunded relative to its importance; a key unsolved problem)."

Eliciting Latent Knowledge (ELK)

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH (if solved)
Capability Uplift: SOME
Differential Progress: SAFETY-LEANING
Deception Robust: STRONG (if solved)
Recommendation: PRIORITIZE
Potential Insight
"Eliciting Latent Knowledge (ELK) is rated PRIORITIZE (it would solve the deception problem if successful; needs a breakthrough)."

Capability Unlearning / Removal

Safety Approaches
Matched Criteria
Weak/no deception robustness
Key Ratings
Safety Uplift: HIGH (if works)
Capability Uplift: NEGATIVE
Differential Progress: SAFETY-DOMINANT
Deception Robust: WEAK
Recommendation: INCREASE
Potential Insight
"Capability Unlearning / Removal offers weak deception robustness (a model might hide capabilities rather than truly unlearn them)."

AI Control

Safety Approaches
Matched Criteria
High priority recommendation
Key Ratings
Safety Uplift: HIGH
Capability Uplift: TAX
Differential Progress: SAFETY-DOMINANT
Deception Robust: PARTIAL
Recommendation: PRIORITIZE
Potential Insight
"AI Control is rated PRIORITIZE (a fundamental requirement, increasingly important with agentic AI)."

Mesa-Optimization

Accident Risks
Matched Criteria
Severe + hard to detect
Key Ratings
Severity: CATASTROPHIC
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: UNCERTAIN
Potential Insight
"Mesa-Optimization: catastrophic severity but very difficult to detect; well-established theoretically (Hubinger et al. 2019), though whether current LLMs are mesa-optimizers is debated."

Instrumental Convergence

Accident Risks
Matched Criteria
Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Instrumental Convergence: already demonstrated in lab settings and relevant to current AI systems; formal proofs exist (Turner et al. 2021), with empirical evidence including 78% alignment faking (Anthropic 2024) and 79% shutdown resistance (Palisade 2025)."

Deceptive Alignment

Accident Risks
Matched Criteria
Severe + hard to detect · Lab-demonstrated catastrophic risk
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: NEAR_TERM
Potential Insight
"Deceptive Alignment: existential severity but very difficult to detect, and already demonstrated in lab settings; the sleeper agents study (Anthropic 2024) showed deception persists through safety training, with a 78% alignment faking rate observed."

Goal Misgeneralization

Accident Risks
Matched Criteria
Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Goal Misgeneralization: already demonstrated in lab settings and relevant to current AI systems; demonstrated in RL environments (CoinRun, etc.), though it is less clear whether this occurs in LLMs."

Scheming

Accident Risks
Matched Criteria
Severe + hard to detect · Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Scheming: catastrophic severity but difficult to detect, already demonstrated in lab settings, and relevant to current AI systems; Apollo Research (Dec 2024) found that all 5 tested frontier models demonstrated in-context scheming, and o1 disabled oversight in 5% of trials."

Sandbagging

Accident Risks
Matched Criteria
Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: DIFFICULT
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Sandbagging: already demonstrated in lab settings and relevant to current AI systems; Claude 3.5 Sonnet strategically underperformed in evaluations (Apollo Research 2024), and multiple models showed sandbagging capability."

Power-Seeking

Accident Risks
Matched Criteria
Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: EXISTENTIAL
Detectability: MODERATE
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Power-Seeking: already demonstrated in lab settings and relevant to current AI systems; formal proofs exist (Turner 2021), and empirically o3 sabotaged shutdown in 79% of tests (Palisade 2025)."

Corrigibility Failure

Accident Risks
Matched Criteria
Lab-demonstrated catastrophic risk · Current timeline + severe
Key Ratings
Severity: CATASTROPHIC
Detectability: EASY
Evidence: DEMONSTRATED_LAB
Timeline: CURRENT
Potential Insight
"Corrigibility Failure: already demonstrated in lab settings and relevant to current AI systems; o3 sabotaged shutdown in 79% of tests (Palisade 2025), and in 7% even with an explicit "allow shutdown" instruction, while Claude 3.7 showed 0% resistance."

Treacherous Turn

Accident Risks
Matched Criteria
Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: THEORETICAL
Timeline: MEDIUM_TERM
Potential Insight
"Treacherous Turn: existential severity but very difficult to detect; supported by theoretical reasoning and proof-of-concept work; the sleeper agents study shows deception can persist, but an actual treacherous turn has not yet been observed."

Sharp Left Turn

Accident Risks
Matched Criteria
Severe + hard to detect
Key Ratings
Severity: EXISTENTIAL
Detectability: VERY_DIFFICULT
Evidence: SPECULATIVE
Timeline: MEDIUM_TERM
Potential Insight
"Sharp Left Turn: existential severity but very difficult to detect; a theoretical scenario with no direct evidence, though capability jumps offer some analogies."

Emergent Capabilities

Accident Risks
Matched Criteria
Current timeline + severe
Key Ratings
Severity: HIGH
Detectability: MODERATE
Evidence: OBSERVED_CURRENT
Timeline: CURRENT
Potential Insight
"Emergent Capabilities: relevant to current AI systems; well-documented in scaling research (GPT-4, etc.), with some capabilities appearing suddenly at scale."