Longterm Wiki

deployment-safetyapi-accessproliferation-control

Structured Access / API-Only

Structured access (API-only deployment) provides meaningful safety benefits through monitoring (80-95% detection rates), intervention capability, a...

game-theorygovernanceinternational-cooperation

2.9k words

AI Governance Coordination Technologies

Comprehensive analysis of coordination mechanisms for AI safety showing racing dynamics could compress safety timelines by 2-5 years, with $500M+ g...

thresholdsgovernanceus-aisi

4.5k words

US Executive Order on Safe, Secure, and Trustworthy AI

Executive Order 14110 (Oct 2023) established compute thresholds (10^26 FLOP general, 10^23 biological) and created AISI, but was revoked after 15 m...

human-ai-interactionai-controldecision-making

2.4k words

AI-Human Hybrid Systems

Hybrid AI-human systems achieve 15-40% error reduction across domains through six design patterns, with evidence from Meta (23% false positive redu...

resource-allocationfield-analysisfunding

2.8k words

AI Safety Intervention Portfolio

Provides a strategic framework for AI safety resource allocation by mapping 13+ interventions against 4 risk categories, evaluating each on ITN dim...

schemingdeception-detectionbehavioral-testing

3.3k words

Scheming & Deception Detection

Reviews empirical evidence that frontier models (o1, Claude 3.5, Gemini 1.5) exhibit in-context scheming capabilities at rates of 0.3-13%, includin...

containmentdefense-in-depthagent-safety

4.3k words

Sandboxing / Containment

Comprehensive analysis of AI sandboxing as defense-in-depth, synthesizing METR's 2025 evaluations (GPT-5 time horizon ~2h, capabilities doubling ev...

agent-safetycapability-restrictionsdefense-in-depth

3.9k words

Tool-Use Restrictions

Tool-use restrictions provide hard limits on AI agent capabilities through defense-in-depth approaches combining permissions, sandboxing, and human...

pausedevelopment-moratoriumpolitical-advocacy

5.3k words

Pause Advocacy

Comprehensive analysis of pause advocacy as an AI safety intervention, estimating 15-40% probability of meaningful policy implementation by 2030 wi...

safety-casesgovernancedeployment-decisions

4.1k words

AI Safety Cases

Safety cases are structured arguments adapted from nuclear/aviation to justify AI system safety, with UK AISI publishing templates in 2024 and 3 of...

elicitationsandbaggingscaffolding

Capability Elicitation

Capability elicitation—systematically discovering what AI models can actually do through scaffolding, prompting, and fine-tuning—reveals 2-10x perf...

self-regulationindustry-commitmentsresponsible-scaling

4.6k words

Voluntary AI Safety Commitments

Comprehensive empirical analysis of voluntary AI safety commitments showing 53% mean compliance rate across 30 indicators (ranging from 13% for App...

alignment-theorydeception-detectionbelief-extraction

2.5k words

Eliciting Latent Knowledge (ELK)

Comprehensive analysis of the Eliciting Latent Knowledge problem with quantified research metrics: ARC's prize contest received 197 proposals, awar...

2.9k words

Deepfake Detection

Comprehensive analysis of deepfake detection showing best commercial detectors achieve 78-87% in-the-wild accuracy vs 96%+ in controlled settings, ...

interpretabilityfeature-extractionmonosemanticity

3.2k words

Sparse Autoencoders (SAEs)

Comprehensive review of sparse autoencoders (SAEs) for mechanistic interpretability, covering Anthropic's 34M features from Claude 3 Sonnet (90% in...

weak-to-strongscalable-oversightsuperalignment

2.9k words

Weak-to-Strong Generalization

Weak-to-strong generalization tests whether weak supervisors can elicit good behavior from stronger AI systems. OpenAI's ICML 2024 experiments show...

4.2k words

US AI Chip Export Controls

Comprehensive empirical analysis finds US chip export controls provide 1-3 year delays on Chinese AI development but face severe enforcement gaps (...

ai-controlsafety-research

3.1k words

AI Control

AI Control is a defensive safety approach that maintains control over potentially misaligned AI through monitoring, containment, and redundancy, of...

evaluationsafety-testingdeployment-decisions

AI Evaluation

Comprehensive overview of AI evaluation methods spanning dangerous capability assessment, safety properties, and deception detection, with categori...

regulationstate-policyfrontier-models

2.5k words

California SB 53

California SB 53 represents the first U.S. state law specifically targeting frontier AI safety through transparency requirements, incident reportin...

scalable-oversightadversarial-methodssuperhuman-alignment

AI Safety via Debate

AI Safety via Debate uses adversarial AI systems arguing opposing positions to enable human oversight of superhuman AI. Recent empirical work shows...

behavior-steeringactivation-engineeringdeception-detection

1.8k words

Representation Engineering

Representation engineering enables behavior steering and deception detection by manipulating concept-level vectors in neural networks, achieving 80...

corporate-safetysafety-teamsvoluntary-commitments

1.3k words

Corporate AI Safety Responses

Major AI labs invest $300-500M annually in safety (5-10% of R&D) through responsible scaling policies and dedicated teams, but face 30-40% safety t...

training-programstalent-pipelinefield-building

2.2k words

AI Safety Training Programs

Comprehensive guide to AI safety training programs including MATS (78% alumni in alignment work, 100+ scholars annually), Anthropic Fellows ($2,100...

4.2k words

AI Safety Institutes (AISIs)

Analysis of government AI Safety Institutes finding they've achieved rapid institutional growth (UK: 0→100+ staff in 18 months) and secured pre-dep...

monitoringcloud-kyccompute-governance

4.4k words

Compute Monitoring

Analyzes two compute monitoring approaches: cloud KYC (implementable in 1-2 years, covers ~60% of frontier training via AWS/Azure/Google) and hardw...

third-party-auditingindependent-evaluationgovernance

3.8k words

Third-Party Model Auditing

Third-party auditing organizations (METR, Apollo, UK/US AISIs) now evaluate all major frontier models pre-deployment, discovering that AI task hori...

2.6k words

Council of Europe Framework Convention on Artificial Intelligence

The Council of Europe's AI Framework Convention represents the first legally binding international AI treaty, establishing human rights-focused gov...

4.1k words

Reducing Hallucinations in AI-Generated Wiki Content

This comprehensive technical guide documents methods to reduce AI hallucinations in wiki content from 3-27% to 0-6% through RAG, verification techn...

regulationstate-policyfrontier-models

2.7k words

New York RAISE Act

The New York RAISE Act represents the first comprehensive state-level AI safety legislation with enforceable requirements for frontier AI developer...

dangerous-capabilitiesbioweaponscybersecurity

3.6k words

Dangerous Capability Evaluations

Comprehensive synthesis showing dangerous capability evaluations are now standard practice (95%+ frontier models) but face critical limitations: AI...

unlearningcapability-removalmisuse-prevention

Capability Unlearning / Removal

Capability unlearning removes dangerous capabilities (e.g., bioweapon synthesis) from AI models through gradient-based methods, representation engi...

alignment-evaluationscheming-detectionsycophancy

3.8k words

Alignment Evaluations

Comprehensive review of alignment evaluation methods showing Apollo Research found 1-13% scheming rates across frontier models, while anti-scheming...

formal-methodsmathematical-guaranteesworld-modeling

2.3k words

Provably Safe AI (davidad agenda)

Davidad's provably safe AI agenda aims to create AI systems with mathematical safety guarantees through formal verification of world models and val...

scalable-oversightsafety-research

5.7k words

Scalable Oversight

Process supervision achieves 78.2% accuracy on MATH benchmarks (vs 72.4% outcome-based) and is deployed in OpenAI's o1 models, while debate shows 6...

evaluation-gamingdeceptionscheming

Evaluation Awareness

Models increasingly detect evaluation contexts and behave differently—Claude Sonnet 4.5 at 58% detection rate (vs. 22% for Opus 4.1), with Opus 4.6...

sleeper-agentsbackdoor-detectiondeceptive-alignment

4.3k words

Sleeper Agent Detection

Comprehensive survey of sleeper agent detection methods finding current approaches achieve only 5-40% success rates despite $15-35M annual investme...

disinformationdeepfakestrust

3.4k words

AI-Era Epistemic Security

Comprehensive analysis of epistemic security finds human deepfake detection at near-chance levels (55.5%), AI detection dropping 45-50% on novel co...

2.6k words

AI Output Filtering

Comprehensive analysis of AI output filtering showing detection rates of 70-98% depending on content type, with 100% of models vulnerable to jailbr...

evaluationssafety-research

2.7k words

AI Evaluations

Evaluations and red-teaming reduce detectable dangerous capabilities by 30-50x when combined with training interventions (o3 covert actions: 13% → ...

process-supervisionchain-of-thoughtreasoning-verification

Process Supervision

Process supervision trains AI to show correct reasoning steps rather than just final answers, achieving 15-25% absolute improvements on math benchm...

evaluationsdeployment-gateseu-ai-act

4.1k words

Evals-Based Deployment Gates

Evals-based deployment gates create formal checkpoints requiring AI systems to pass safety evaluations before deployment, with EU AI Act imposing f...

7 words

Value Learning

Training AI systems to infer and adopt human values from observation and interaction

value-learningalignment

interpretabilitysafety-research

3.7k words

Interpretability

Mechanistic interpretability has extracted 34M+ interpretable features from Claude 3 Sonnet with 90% automated labeling accuracy and demonstrated 7...

constitutional-airlaifharmlessness

1.5k words

Constitutional AI

Constitutional AI is Anthropic's methodology using explicit principles and AI-generated feedback (RLAIF) to train safer models, achieving 3-10x imp...

hardware-governancechip-trackingexport-controls

Hardware-Enabled Governance

RAND analysis identifies attestation-based licensing as most feasible hardware-enabled governance mechanism with 5-10 year timeline, while 100,000+...

formal-methodsmathematical-guaranteessafety-verification

2.1k words

Formal Verification (AI Safety)

Formal verification seeks mathematical proofs of AI safety properties but faces a ~100,000x scale gap between verified systems (~10k parameters) an...

regulationchinacontent-control

3.3k words

China AI Regulatory Framework

Comprehensive analysis of China's AI regulatory framework covering 5+ major regulations affecting 50,000+ companies, with enforcement focusing on c...

llm-as-judgeautomated-evalsred-teaming

Scalable Eval Approaches

Survey of practical approaches for scaling AI evaluation. LLM-as-judge has reached ~40% production adoption with 80%+ human agreement, but Dorner e...

knowledge-managementpublic-goodsinformation-infrastructure

2.7k words

AI-Era Epistemic Infrastructure

Comprehensive analysis of epistemic infrastructure showing AI fact-checking achieves 85-87% accuracy at $0.10-$1.00 per claim versus $50-200 for hu...

field-buildingtalent-pipelinetraining-programs

3.6k words

AI Safety Field Building Analysis

Comprehensive analysis of AI safety field-building showing growth from 400 to 1,100 FTEs (2022-2025) at 21-30% annual growth rates, with training p...

2.0k words

Cooperative AI

Cooperative AI research addresses multi-agent coordination failures through game theory and mechanism design, with ~$1-20M/year investment primaril...

whistlebloweremployee-protectionsinformation-asymmetry

2.6k words

AI Whistleblower Protections

Comprehensive analysis of AI whistleblower protections showing severe gaps in current law (no federal protection for AI safety disclosures) with bi...

responsible-scalingvoluntary-commitmentssafety-thresholds

3.4k words

Responsible Scaling Policies

Comprehensive analysis of Responsible Scaling Policies showing 20 companies with published frameworks as of Dec 2025, with SaferAI grading major po...

runtime-safetyinference-interventionjailbreak-defense

3.2k words

Circuit Breakers / Inference Interventions

Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference. Gray Swan's representation rerouting ac...

red-teamingsafety-testing

1.4k words

Red Teaming

Red teaming is a systematic adversarial evaluation methodology for identifying AI vulnerabilities and dangerous capabilities before deployment, wit...

dpopreference-optimizationrlhf

2.8k words

Preference Optimization Methods

DPO and related preference optimization methods reduce alignment training costs by 40-60% while matching RLHF performance on dialogue tasks, though...