Comprehensive review of AI alignment approaches finding current methods (RLHF, Constitutional AI) show 75%+ effectiveness on measurable safety metr...
Structured access (API-only deployment) provides meaningful safety benefits through monitoring (80-95% detection rates), intervention capability, a...
Comprehensive analysis of coordination mechanisms for AI safety showing racing dynamics could compress safety timelines by 2-5 years, with \$500M+ ...
Hybrid AI-human systems achieve 15-40% error reduction across domains through six design patterns, with evidence from Meta (23% false positive redu...
Provides a strategic framework for AI safety resource allocation by mapping 13+ interventions against 4 risk categories, evaluating each on ITN dim...
Reviews empirical evidence that frontier models (o1, Claude 3.5, Gemini 1.5) exhibit in-context scheming capabilities at rates of 0.3-13%, includin...
Comprehensive analysis of AI sandboxing as defense-in-depth, synthesizing METR's 2025 evaluations (GPT-5 time horizon ~2h, capabilities doubling ev...
Tool-use restrictions provide hard limits on AI agent capabilities through defense-in-depth approaches combining permissions, sandboxing, and human...
Executive Order 14110 (Oct 2023) established compute thresholds (10^26 FLOP general, 10^23 biological) and created AISI, but was revoked after 15 m...
Comprehensive analysis of compute thresholds (EU: 10^25 FLOP, US: 10^26 FLOP) as regulatory triggers for AI governance, documenting that algorithmi...
Comprehensive analysis of pause advocacy as an AI safety intervention, estimating 15-40% probability of meaningful policy implementation by 2030 wi...
Safety cases are structured arguments adapted from nuclear/aviation to justify AI system safety, with UK AISI publishing templates in 2024 and 3 of...
Capability elicitation—systematically discovering what AI models can actually do through scaffolding, prompting, and fine-tuning—reveals 2-10x perf...
Comprehensive empirical analysis of voluntary AI safety commitments showing 53% mean compliance rate across 30 indicators (ranging from 13% for App...
Comprehensive analysis of the Eliciting Latent Knowledge problem with quantified research metrics: ARC's prize contest received 197 proposals, awar...
Comprehensive analysis of international AI coordination mechanisms shows growing but limited progress: 11-country AI Safety Institute network with ...
Comprehensive analysis of deepfake detection showing best commercial detectors achieve 78-87% in-the-wild accuracy vs 96%+ in controlled settings, ...
Comprehensive review of sparse autoencoders (SAEs) for mechanistic interpretability, covering Anthropic's 34M features from Claude 3 Sonnet (90% in...
Weak-to-strong generalization tests whether weak supervisors can elicit good behavior from stronger AI systems. OpenAI's ICML 2024 experiments show...
Comprehensive empirical analysis finds US chip export controls provide 1-3 year delays on Chinese AI development but face severe enforcement gaps (...
AI Control is a defensive safety approach that maintains control over potentially misaligned AI through monitoring, containment, and redundancy, of...
Comprehensive overview of AI evaluation methods spanning dangerous capability assessment, safety properties, and deception detection, with categori...
Comprehensive analysis of pause/moratorium proposals finding they would provide very high safety benefits if implemented (buying time for safety re...
California SB 53 represents the first U.S. state law specifically targeting frontier AI safety through transparency requirements, incident reportin...
Comprehensive analysis of AI standards bodies (ISO/IEC, IEEE, NIST, CEN-CENELEC) showing how voluntary technical standards become de facto requirem...
AI Safety via Debate uses adversarial AI systems arguing opposing positions to enable human oversight of superhuman AI. Recent empirical work shows...
Representation engineering enables behavior steering and deception detection by manipulating concept-level vectors in neural networks, achieving 80...
Major AI labs invest \$300-500M annually in safety (5-10% of R&D) through responsible scaling policies and dedicated teams, but face 30-40% safety ...
Analysis of government AI Safety Institutes finding they've achieved rapid institutional growth (UK: 0→100+ staff in 18 months) and secured pre-dep...
Analyzes two compute monitoring approaches: cloud KYC (implementable in 1-2 years, covers ~60% of frontier training via AWS/Azure/Google) and hardw...
Comprehensive guide to AI safety training programs including MATS (78% alumni in alignment work, 100+ scholars annually), Anthropic Fellows (\$2,10...
Third-party auditing organizations (METR, Apollo, UK/US AISIs) now evaluate all major frontier models pre-deployment, discovering that AI task hori...
The Council of Europe's AI Framework Convention represents the first legally binding international AI treaty, establishing human rights-focused gov...
Comprehensive analysis of international AI compute governance finds 10-25% chance of meaningful regimes by 2035, but potential for 30-60% reduction...
The New York RAISE Act represents the first comprehensive state-level AI safety legislation with enforceable requirements for frontier AI developer...
This comprehensive technical guide documents methods to reduce AI hallucinations in wiki content from 3-27% to 0-6% through RAG, verification techn...
Comprehensive synthesis showing dangerous capability evaluations are now standard practice (95%+ frontier models) but face critical limitations: AI...
Capability unlearning removes dangerous capabilities (e.g., bioweapon synthesis) from AI models through gradient-based methods, representation engi...
Comprehensive review of alignment evaluation methods showing Apollo Research found 1-13% scheming rates across frontier models, while anti-scheming...
Process supervision achieves 78.2% accuracy on MATH benchmarks (vs 72.4% outcome-based) and is deployed in OpenAI's o1 models, while debate shows 6...
RSPs are voluntary industry frameworks that trigger safety evaluations at capability thresholds, currently covering 60-70% of frontier development ...
Comprehensive analysis of epistemic security finds human deepfake detection at near-chance levels (55.5%), AI detection dropping 45-50% on novel co...
Three international AI safety summits (2023-2025) achieved first formal recognition of catastrophic AI risks from 28+ countries, established 10+ AI...
Comprehensive tracking of US state AI legislation shows explosive growth from ~40 bills in 2019 to 1,080+ in 2025, with only 11% passage rate but r...
Surveys US legal authority (DPA, IEEPA, CLOUD Act, FISA 702) over \$700B+ in commercial AI infrastructure concentrated in 5-6 companies, concluding...
Comprehensive survey of sleeper agent detection methods finding current approaches achieve only 5-40% success rates despite \$15-35M annual investm...
Comprehensive analysis of AI output filtering showing detection rates of 70-98% depending on content type, with 100% of models vulnerable to jailbr...
Evaluations and red-teaming reduce detectable dangerous capabilities by 30-50x when combined with training interventions (o3 covert actions: 13% → ...
Models increasingly detect evaluation contexts and behave differently—Claude Sonnet 4.5 at 58% detection rate (vs. 22% for Opus 4.1), with Opus 4.6...
Davidad's provably safe AI agenda aims to create AI systems with mathematical safety guarantees through formal verification of world models and val...
Process supervision trains AI to show correct reasoning steps rather than just final answers, achieving 15-25% absolute improvements on math benchm...
Comprehensive analysis of epistemic infrastructure showing AI fact-checking achieves 85-87% accuracy at \$0.10-\$1.00 per claim versus \$50-200 for...
Evals-based deployment gates create formal checkpoints requiring AI systems to pass safety evaluations before deployment, with EU AI Act imposing f...
Mechanistic interpretability has extracted 34M+ interpretable features from Claude 3 Sonnet with 90% automated labeling accuracy and demonstrated 7...
Constitutional AI is Anthropic's methodology using explicit principles and AI-generated feedback (RLAIF) to train safer models, achieving 3-10x imp...
Formal verification seeks mathematical proofs of AI safety properties but faces a ~100,000x scale gap between verified systems (~10k parameters) an...
RAND analysis identifies attestation-based licensing as most feasible hardware-enabled governance mechanism with 5-10 year timeline, while 100,000+...
Comprehensive analysis of AI safety field-building showing growth from 400 to 1,100 FTEs (2022-2025) at 21-30% annual growth rates, with training p...
Cooperative AI research addresses multi-agent coordination failures through game theory and mechanism design, with ~\$1-20M/year investment primari...
Anthropic allocates 15-25% of R&D (~\$100-200M annually) to safety research including the world's largest interpretability team (40-60 researchers)...