Safety Responses
Overview
This section documents interventions and approaches being developed to address AI safety risks. Responses span technical research, governance mechanisms, institutional development, and public engagement.
Response Categories
Technical Alignment
Research aimed at ensuring AI systems behave as intended:
Mechanistic Interpretability - Understanding model internals
RLHF - Reinforcement learning from human feedback (a toy reward-model loss is sketched after this list)
Constitutional AI - Training with explicit principles
AI Control - Limiting AI autonomy regardless of alignment
Evaluations - Testing for dangerous capabilities
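To make the RLHF entry concrete, here is a minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train reward models from human preference comparisons. The function name and the example reward values are illustrative, not taken from any particular implementation.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss for RLHF reward modeling:
    -log(sigmoid(r_chosen - r_rejected)). The loss shrinks as the
    model assigns higher reward to the human-preferred response."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative scores: the preferred completion already outscores
# the rejected one, so the loss is small.
print(preference_loss(reward_chosen=2.1, reward_rejected=0.4))  # ~0.17
```

Training a reward model on many such comparisons, then optimizing the policy against it, is the core loop the RLHF page describes.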
Governance
Policy and regulatory approaches:
Compute Governance - Controlling AI through hardware (a threshold check is sketched after this list)
Export Controls - Restricting chip access
Responsible Scaling Policies - Lab commitments
Legislation - Government regulation
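As a rough illustration of how compute-based governance triggers operate, the sketch below estimates training compute with the standard ~6 × parameters × tokens approximation and compares it against the EU (10^25 FLOP) and US (10^26 FLOP) threshold levels. The model size and token count are hypothetical.

```python
# Toy check of a training run against regulatory compute thresholds.
# Training FLOPs estimated with the common ~6 * params * tokens rule.

EU_THRESHOLD_FLOP = 1e25
US_THRESHOLD_FLOP = 1e26

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

# Hypothetical run: a 70B-parameter model trained on 15T tokens.
flops = training_flops(n_params=70e9, n_tokens=15e12)
print(f"{flops:.2e} FLOP")                                 # ~6.30e+24
print("Exceeds EU threshold:", flops > EU_THRESHOLD_FLOP)  # False (just under)
print("Exceeds US threshold:", flops > US_THRESHOLD_FLOP)  # False
```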
Institutions
Organizations and structures for AI safety:
AI Safety Institutes - Government research bodies
Standards Bodies - Technical standard development
Epistemic Tools
Technologies to preserve information integrity:
Coordination Technologies - Enabling cooperation
Content Authentication - Verifying authentic media
Prediction Markets - Aggregating forecasts (a scoring sketch follows this list)
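Prediction-market accuracy is typically measured with the Brier score, the mean squared error of probabilistic forecasts against resolved binary outcomes. A minimal sketch, with made-up market prices and outcomes:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilistic forecasts and binary
    outcomes (1 = event happened). Lower is better; an uninformative
    constant 50% forecast scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Illustrative market prices read as probabilities, plus resolved outcomes.
print(brier_score([0.8, 0.3, 0.9, 0.6], [1, 0, 1, 0]))  # 0.125
```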
Field Building
Growing the AI safety research community:
Training Programs - Researcher development
Corporate Influence - Engaging industry
AI Bio-Capability Evaluations - Measuring AI biological uplift (VCT, red-teaming)
Evaluating Responses
Each response page includes assessments of:
Tractability - How feasible is progress?
Neglectedness - How much attention is it getting?
Potential Impact - How much could it help if successful?
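To make the rubric concrete, here is a toy sketch of how an assessment along these three dimensions might be represented and ranked. The class name, the 0-1 scores, and the multiplicative priority rule are all assumptions for illustration, not a scoring scheme defined by these pages.

```python
from dataclasses import dataclass

@dataclass
class ResponseAssessment:
    """Illustrative rubric mirroring the three dimensions above."""
    name: str
    tractability: float   # 0-1: how feasible is progress?
    neglectedness: float  # 0-1: how little attention is it getting?
    impact: float         # 0-1: how much could it help if successful?

    def priority(self) -> float:
        # Multiplicative: an intervention weak on any dimension ranks low.
        return self.tractability * self.neglectedness * self.impact

# Hypothetical scores, chosen only to show the ranking mechanics.
candidates = [
    ResponseAssessment("Mechanistic interpretability", 0.5, 0.4, 0.8),
    ResponseAssessment("Compute governance", 0.6, 0.5, 0.7),
]
for c in sorted(candidates, key=ResponseAssessment.priority, reverse=True):
    print(f"{c.name}: {c.priority():.2f}")
```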
See the Intervention Portfolio for comparative analysis.
Related Pages
Policy
AI Standards Development
AI Safety Institutes (AISIs)
Responsible Scaling Policies (RSPs)
EU AI Act
Approaches
AI Safety Training Programs
Constitutional AI
AI Content Authentication
Prediction Markets (AI Forecasting)
Safety Research
AI Evaluations
Interpretability
Key Debates
Corporate Influence on AI Policy
Concepts
RLHF
Biosecurity Overview
Analysis
AI Risk Activation Timeline Model