Longterm Wiki

Approaches

AI safety approaches, techniques, and strategies -- from alignment methods and evaluation frameworks to governance mechanisms and deployment safeguards.

Approaches: 66 · With description: 66 · Unique tags: 174

Adversarial training, universally adopted at frontier labs with $10-150M/year investment, improves robustness to known attacks but creates an arms-race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it only defends against external attacks.

Agent foundations research (MIRI's mathematical frameworks for aligned agency) faces low tractability after 10+ years with core problems unsolved, leading to MIRI's 2024 strategic pivot away from the field. Assessment shows ~15-25% probability the work is essential and 60-75% confidence in low tractability.

Technical approaches to ensuring AI systems pursue intended goals and remain aligned with human values throughout training and deployment. Current methods show promise but face fundamental scalability challenges, with oversight success dropping to 52% at 400 Elo capability gaps.

alignment, scalable-oversight, rlhf ...

Content authentication technologies aim to establish verifiable provenance for digital content - allowing users to confirm where content came from, whether it has been modified, and whether it was created by AI or humans. The goal is to rebuild trust in digital media by creating technical guarantees of authenticity that complement human judgment. The leading approach is the C2PA (Coalition for Content Provenance and Authenticity) standard, backed by major technology companies. C2PA embeds cryptographically signed metadata into content at the point of creation - when a photo is taken, when a video is recorded, when an AI generates an image. This creates a chain of custody that can be verified later. Other approaches include invisible watermarking (SynthID), blockchain-based verification, and forensic analysis tools that detect signs of synthetic generation or manipulation. The key challenges are adoption and circumvention. Content authentication only works if it becomes universal - if users come to expect provenance information and distrust content without it. But metadata can be stripped, watermarks can potentially be removed or spoofed, and AI-generated content without credentials can still circulate. The race between authentication and forgery capability is uncertain, but authentication provides one of the few technical defenses against the coming flood of synthetic content.

deepfakes, digital-evidence, verification ...
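A minimal sketch of the sign-then-verify idea behind content provenance, using a plain HMAC and hypothetical field names for illustration; the real C2PA standard uses X.509 certificates and COSE signatures rather than a shared secret.

```python
# Illustrative sketch of signed provenance metadata (not the real C2PA format;
# field names and the HMAC-based signature are simplifying assumptions).
import hashlib, hmac, json

SIGNING_KEY = b"device-or-issuer-secret"  # stand-in for a real private key

def attach_provenance(content: bytes, claims: dict) -> dict:
    """Bind claims (capture device, time, edits) to a hash of the content."""
    manifest = {"content_sha256": hashlib.sha256(content).hexdigest(), **claims}
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return manifest

def verify_provenance(content: bytes, manifest: dict) -> bool:
    """Check both the signature and that the content hash still matches."""
    sig = manifest.pop("signature", None)
    payload = json.dumps(manifest, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    untampered = hmac.compare_digest(sig or "", expected)
    return untampered and manifest["content_sha256"] == hashlib.sha256(content).hexdigest()

photo = b"...raw image bytes..."
m = attach_provenance(photo, {"captured_by": "camera-1234", "ai_generated": False})
assert verify_provenance(photo, dict(m))             # original content verifies
assert not verify_provenance(photo + b"x", dict(m))  # any edit breaks the chain
```

The key property the sketch illustrates is that credentials are only as strong as the chain of custody: stripping the manifest entirely, as the entry notes, defeats verification unless consumers learn to distrust unsigned content.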

A proposed epistemic infrastructure making knowledge provenance transparent and traversable—enabling anyone to see the chain of citations, original data sources, methodological assumptions, and reliability scores for any claim they encounter.

Methods and frameworks for evaluating AI system safety, capabilities, and alignment properties before deployment, including dangerous capability detection, robustness testing, and deceptive behavior assessment.

evaluation, safety-testing, deployment-decisions ...

AI systems are emerging as tools for holding powerful actors accountable: analyzing public records, tracing financial flows, monitoring environmental violations, and documenting human rights abuses at previously impossible scale. The ICIJ's AI-assisted investigations revealed $32+ trillion in hidden wealth. This sousveillance dynamic represents the beneficial flip side of AI surveillance capabilities.

FLF's inaugural 12-week fellowship (July-October 2025) combined a research fellowship with a startup incubator format. 30 fellows received $25-50K stipends to build AI tools for human reasoning, producing 25+ projects across epistemic tools (Community Notes AI, fact-checking) and forecasting.

Coordination technologies are tools and mechanisms that enable actors to cooperate on collective challenges when individual incentives favor defection. For AI safety, these technologies address the fundamental problem that racing to develop AI faster may be individually rational but collectively catastrophic. For epistemic security, they help coordinate defensive responses to disinformation. These technologies draw on mechanism design, game theory, and institutional economics. Examples include: verification protocols that allow actors to confirm others' compliance with agreements (critical for AI safety treaties); commitment devices that make defection from cooperative arrangements costly; signaling mechanisms that allow actors to credibly communicate intentions; and platforms that make coordination focal points more visible. For AI governance specifically, coordination technologies might include compute monitoring systems that verify compliance with training restrictions, international registries of advanced AI systems, and mechanisms for sharing safety research while protecting commercial interests. The fundamental insight from Elinor Ostrom's work is that collective action problems are not unsolvable - but they require deliberate institutional design. The urgency of AI risk makes developing effective coordination mechanisms for this domain a priority.

game-theory, governance, international-cooperation ...
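A toy sketch, with invented payoff numbers, of the underlying game-theoretic point: without a verifiable penalty, racing is each lab's best response, and a credible verification-plus-penalty mechanism can shift the equilibrium toward cooperation.

```python
# Toy two-lab "cooperate vs. race" game. Payoffs are illustrative assumptions,
# not empirical estimates; the point is how a verified penalty changes equilibria.
import itertools

PAYOFFS = {  # (row action, col action) -> (row payoff, col payoff)
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "race"):      (0, 4),
    ("race",      "cooperate"): (4, 0),
    ("race",      "race"):      (1, 1),
}

def equilibria(penalty_for_racing: float):
    """Strategy pairs where neither lab gains by deviating unilaterally."""
    def payoff(a, b, player):
        base = PAYOFFS[(a, b)][player]
        chosen = a if player == 0 else b
        return base - (penalty_for_racing if chosen == "race" else 0)

    result = []
    for a, b in itertools.product(["cooperate", "race"], repeat=2):
        row_ok = all(payoff(a, b, 0) >= payoff(alt, b, 0) for alt in ["cooperate", "race"])
        col_ok = all(payoff(a, b, 1) >= payoff(a, alt, 1) for alt in ["cooperate", "race"])
        if row_ok and col_ok:
            result.append((a, b))
    return result

print(equilibria(penalty_for_racing=0))  # [('race', 'race')]: the default race dynamic
print(equilibria(penalty_for_racing=2))  # mutual cooperation becomes the equilibrium
```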

Analysis of interventions to improve safety culture within AI labs. Evidence from 2024-2025 shows significant gaps: no company scored above C+ overall (FLI Winter 2025), all received D or below on existential safety, and xAI released Grok 4 without any safety documentation.

safety-culture, organizational-practices, safety-teams ...

Reviews standard policy interventions (reskilling, UBI, portable benefits, automation taxes) for managing AI-driven job displacement, citing the WEF projection of 14 million net job losses by 2027 and the 23% of US workers already using GenAI weekly. Finds medium tractability and grades the area as a B-tier priority.

Comprehensive analysis of AI output filtering showing detection rates of 70-98% depending on content type, with 100% of models vulnerable to jailbreaks per UK AISI testing, though Anthropic's Constitutional Classifiers blocked 95.6% of attacks. Concludes that filtering provides marginal safety benefits.

Public education initiatives show measurable but modest impacts: MIT programs increased accurate AI risk perception by 34%, while 67% of Americans and 73% of policymakers still lack sufficient AI understanding. Research-backed communication strategies exist, including Yale framing research showing a 28% increase in concern.

Structured arguments with supporting evidence that an AI system is safe for deployment, adapted from high-stakes industries like nuclear and aviation to provide rigorous documentation of safety claims and assumptions. As of 2025, 3 of 4 frontier labs have committed to safety case frameworks.

safety-cases, governance, deployment-decisions ...

Analysis of AI safety field-building interventions including education programs (ARENA, MATS, BlueDot). The field grew from approximately 400 FTEs in 2022 to 1,100 FTEs in 2025 (21-30% annual growth), with training programs achieving 37% career conversion rates.

field-building, talent-pipeline, training-programs ...

Strategic overview of AI safety interventions analyzing ~$650M annual investment across 1,100 FTEs. Maps 13+ interventions against 4 risk categories with ITN prioritization, finding 85% of external funding from 5 sources and safety/capabilities ratio at 0.5-1.3%.

resource-allocation, field-analysis, funding ...

Fellowships, PhD programs, research mentorship, and career transition pathways for growing the AI safety research workforce, including MATS, Anthropic Fellows, SPAR, and academic programs.

training-programs, talent-pipeline, field-building ...

AI Safety via Debate proposes using adversarial AI systems to argue opposing positions while humans judge, designed to scale alignment to superhuman capabilities. While theoretically promising and specifically designed to address RLHF's scalability limitations, it remains experimental with limited empirical validation.

scalable-oversight, adversarial-methods, superhuman-alignment ...

A proposed system for systematically assessing the track records of public actors by topic, scoring factual claims against sources, predictions against outcomes, and promises against delivery. Aims to heal broken feedback loops where bold claims face no consequences.

Using current AI systems to assist with alignment research tasks including red-teaming, interpretability, and recursive oversight. AI-assisted red-teaming reduces jailbreak success rates from 86% to 4.4%, and weak-to-strong generalization can recover GPT-3.5-level performance from GPT-2 supervision.

ai-assisted-research, red-teaming, interpretability ...

AI-assisted deliberation uses AI to scale meaningful democratic dialogue beyond the constraints of traditional town halls and focus groups. Rather than replacing human deliberation with AI decisions, these tools use AI to facilitate, synthesize, and scale genuine human discussion - enabling thousands or millions of people to engage in deliberative processes that traditionally require small groups. Pioneering systems like Polis cluster participant opinions to surface areas of consensus and reveal the structure of disagreement. Taiwan's vTaiwan platform has used these tools to engage citizens in policy development on contentious issues. Anthropic's Collective Constitutional AI experiment used similar methods to gather public input on how AI systems should behave. The core insight is that AI can help identify common ground, summarize diverse viewpoints, and translate between different perspectives at scales previously impossible. For AI governance, these tools offer a path to democratically legitimate AI policy. Rather than leaving AI development decisions to companies or technical elites, deliberation platforms could engage broader publics in decisions about how AI should be developed and deployed. For epistemic security, deliberative processes can help societies navigate contested questions by surfacing genuine consensus where it exists and clarifying the structure of genuine disagreement where it doesn't.

democratic-innovation, collective-intelligence, governance ...
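A rough illustration of the clustering step Polis-style tools perform on an agree/disagree vote matrix, using synthetic data and scikit-learn; the real Polis pipeline is more involved (dimensionality reduction, adaptive cluster counts).

```python
# Rough Polis-style opinion clustering on a synthetic participants x statements
# vote matrix (+1 agree, -1 disagree, 0 pass). Data and cluster count are toy choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Statements 0-3 are broadly agreed; statements 4-11 split the two opinion groups.
shared_a = rng.choice([1, -1, 0], size=(120, 4), p=[0.8, 0.1, 0.1])
split_a  = rng.choice([1, -1, 0], size=(120, 8), p=[0.7, 0.2, 0.1])
shared_b = rng.choice([1, -1, 0], size=(80, 4),  p=[0.8, 0.1, 0.1])
split_b  = rng.choice([1, -1, 0], size=(80, 8),  p=[0.2, 0.7, 0.1])
votes = np.vstack([np.hstack([shared_a, split_a]), np.hstack([shared_b, split_b])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(votes)

# "Consensus" statements: positive average agreement within every opinion cluster.
for s in range(votes.shape[1]):
    per_cluster = [votes[labels == c, s].mean() for c in np.unique(labels)]
    if min(per_cluster) > 0.3:
        print(f"statement {s}: agreed across clusters {np.round(per_cluster, 2)}")
```

In this synthetic setup, the broadly shared statements surface as cross-cluster consensus while the polarized ones do not, which is the structure-of-disagreement view described above.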

A proposed automated system for detecting and flagging persuasive-but-misleading rhetoric, including logical fallacies, emotionally loaded language, selective quoting, and citation misrepresentation. Could serve as a reading aid or author-side linting tool.

AI-augmented forecasting combines the pattern-recognition and data-processing capabilities of AI systems with the contextual judgment and calibration of human forecasters. This hybrid approach aims to produce more accurate predictions about future events than either humans or AI alone, particularly for questions relevant to policy and risk assessment. Current systems take several forms. AI can aggregate and weight forecasts from many human predictors, adjusting for individual track records and biases. AI can assist forecasters by synthesizing relevant information, identifying base rates, and flagging considerations that might otherwise be missed. More ambitiously, AI systems can generate their own forecasts that human superforecasters then evaluate and combine with their own judgments. For AI safety and epistemic security, improved forecasting offers several benefits. Better predictions about AI capabilities help with governance timing. Forecasting AI-related risks provides early warning. Publicly visible forecasts create accountability for claims about AI development. The key challenge is calibration - ensuring that probability estimates are meaningful across diverse domains and maintaining accuracy as AI systems become the subject of the forecasts themselves.

forecasting, prediction-markets, ai-capabilities ...
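A minimal sketch, with invented numbers, of one way the aggregation step can work: weight each forecaster by historical accuracy (inverse Brier score) and pool their probabilities in log-odds space. The weighting scheme is an assumption for illustration, not any platform's actual method.

```python
# Sketch: weight forecasters by track record and pool forecasts in log-odds space.
import math

def brier(past_probs, past_outcomes):
    """Mean squared error of past probability forecasts against 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(past_probs, past_outcomes)) / len(past_probs)

def aggregate(current_probs, track_records):
    weights = [1.0 / (brier(p, o) + 1e-3) for p, o in track_records]
    logits = [math.log(p / (1 - p)) for p in current_probs]
    mean_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-mean_logit))

forecasters = [0.70, 0.55, 0.90]              # today's forecasts on one question
history = [([0.8, 0.2], [1, 0]),              # well calibrated
           ([0.9, 0.9], [0, 0]),              # overconfident
           ([0.6, 0.7], [1, 1])]              # decent
print(round(aggregate(forecasters, history), 3))  # pooled probability, tilted toward reliable forecasters
```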

Epistemic infrastructure refers to the foundational systems that societies depend on for creating, verifying, preserving, and accessing knowledge. Just as physical infrastructure (roads, power grids) underlies economic activity, epistemic infrastructure (archives, scientific publishing, fact-checking networks, educational institutions) underlies society's capacity to know things collectively. This infrastructure is under stress and requires deliberate investment. Current epistemic infrastructure includes elements like Wikipedia (the largest attempt at collaborative knowledge creation), the Internet Archive (preserving digital history), academic peer review (verifying scientific claims), journalism (investigating and reporting events), and educational systems (transmitting knowledge across generations). Each of these faces AI-related threats: Wikipedia can be corrupted with AI-generated misinformation, archives struggle to authenticate materials, peer review cannot keep pace with AI-generated fraud, and journalism is economically threatened. Strengthening epistemic infrastructure requires treating it as a public good deserving of investment. This might include: funding for fact-checking organizations and investigative journalism, technical infrastructure for content authentication, archives designed for an AI-generated-content world, AI systems explicitly designed to support human knowledge creation rather than replace it, and educational programs that teach critical evaluation in an AI context. The alternative - letting epistemic infrastructure decay while AI advances - leads to knowledge monopolies, trust collapse, and reality fragmentation.

knowledge-management, public-goods, information-infrastructure ...

Epistemic security refers to protecting society's collective capacity for truth-finding in an era when AI can generate convincing false content at unprecedented scale. Just as national security protects against physical threats, epistemic security protects against threats to our ability to know what is true and form shared beliefs about reality. The threat landscape includes AI-generated deepfakes that can fabricate video evidence, language models that can produce unlimited quantities of persuasive misinformation, and systems that can personalize deceptive content to individual vulnerabilities. These capabilities threaten the basic information infrastructure that democratic societies depend on - the shared understanding of facts that enables public deliberation, elections, and collective decision-making. Defending epistemic security requires multiple layers: technical tools for content authentication and provenance, media literacy education that teaches critical evaluation of information sources, institutional reforms that increase resilience to manipulation, and regulatory frameworks that create accountability for platforms and AI developers. The challenge is that offensive capabilities (generating false content) are advancing faster than defensive capabilities (detecting it), creating an asymmetry that favors attackers.

disinformation, deepfakes, trust ...

AI-human hybrid systems are designs that deliberately combine AI capabilities with human judgment to achieve outcomes better than either could produce alone. Rather than full automation or human-only processes, hybrid systems aim to capture the benefits of AI (scale, speed, consistency, pattern recognition) while preserving the benefits of human judgment (contextual understanding, values, robustness to novel situations). Effective hybrid systems require careful design to avoid the pathologies of both pure automation and nominal human oversight. Automation bias leads humans to defer to AI even when AI is wrong. Rubber-stamp oversight gives an illusion of human control without substance. The challenge is creating systems where humans genuinely contribute and AI genuinely assists, rather than one side dominating or the partnership failing. Examples of promising hybrid approaches include: AI systems that flag decisions for human review based on uncertainty or stakes, rather than automating all decisions; human-in-the-loop systems where AI drafts and humans edit; collaborative intelligence systems where AI and humans have complementary roles; and AI tutoring systems that guide rather than replace learning. For AI safety, hybrid systems represent a middle ground between naive confidence in human oversight and resignation to full AI autonomy.

human-ai-interaction, ai-control, decision-making ...
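A minimal sketch of the "flag for human review based on uncertainty or stakes" pattern mentioned above; the thresholds and the Decision fields are hypothetical design choices, not a reference implementation.

```python
# Sketch of uncertainty- and stakes-based routing: automate only when the model
# is confident and the decision is low stakes; otherwise queue for a person.
from dataclasses import dataclass

@dataclass
class Decision:
    model_confidence: float   # 0-1, ideally a calibrated probability
    stakes: float             # 0-1, e.g. financial exposure or harm potential

def route(d: Decision, min_confidence: float = 0.9, max_stakes: float = 0.3) -> str:
    if d.stakes > max_stakes:
        return "human_review"          # high stakes always get a person
    if d.model_confidence < min_confidence:
        return "human_review"          # uncertain calls get a person
    return "auto_approve"

print(route(Decision(model_confidence=0.97, stakes=0.1)))  # auto_approve
print(route(Decision(model_confidence=0.97, stakes=0.8)))  # human_review (stakes)
print(route(Decision(model_confidence=0.60, stakes=0.1)))  # human_review (uncertainty)
```

The design intent is to avoid both failure modes named above: automation bias (nothing gets reviewed) and rubber-stamp oversight (everything nominally gets reviewed but nothing meaningfully is).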

Systematic testing of AI models for alignment properties including honesty, corrigibility, goal stability, and absence of deceptive behavior. Apollo Research found 1-13% scheming rates across frontier models, while TruthfulQA shows 58-85% accuracy on factual questions.

alignment-evaluation, scheming-detection, sycophancy ...

Systematic methods to discover what AI models can actually do, including hidden capabilities that may not appear in standard benchmarks, through scaffolding, fine-tuning, and specialized prompting techniques. METR research shows AI agent task completion doubles every 7 months.

elicitation, sandbagging, scaffolding ...

Methods to remove specific dangerous capabilities from trained AI models, directly addressing misuse risks by eliminating harmful knowledge, though current techniques face challenges around verification, capability recovery, and general performance degradation.

unlearning, capability-removal, misuse-prevention ...

Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference. Gray Swan's representation rerouting achieves 87-90% rejection rates with only 1% capability loss, while Anthropic's Constitutional Classifiers block 95.6% of jailbreaks. However, the UK AISI challenge found all 22 tested models could eventually be broken.

runtime-safety, inference-intervention, jailbreak-defense ...

A proposed cross-platform context layer extending X's community notes model across the entire internet, using AI classifiers to serve consensus-vetted context on potentially misleading content. Estimated cost of $0.01–0.10 per post using current AI models.

Anthropic's Constitutional AI methodology uses explicit principles and AI-generated feedback to train safer language models, demonstrating 3-10x improvements in harmlessness while maintaining helpfulness across major model deployments.

constitutional-ai, rlaif, harmlessness ...
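A schematic of the critique-and-revise stage of Constitutional AI, with a placeholder `llm()` call standing in for a real model; the full method also includes an RLAIF stage that trains a preference model from AI feedback, which is not shown here.

```python
# Schematic of Constitutional AI's critique-and-revise loop. `llm()` is a
# placeholder stub; principles are paraphrased examples, not Anthropic's exact text.
PRINCIPLES = [
    "Choose the response least likely to assist with harmful activities.",
    "Choose the response most honest about uncertainty.",
]

def llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs end to end.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt: str) -> str:
    response = llm(user_prompt)
    for principle in PRINCIPLES:
        critique = llm(f"Critique this response against the principle:\n"
                       f"{principle}\nResponse: {response}")
        response = llm(f"Rewrite the response to address the critique.\n"
                       f"Critique: {critique}\nOriginal: {response}")
    return response  # (prompt, revised response) pairs become supervised fine-tuning data

print(constitutional_revision("Explain how to respond to a risky request."))
```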

Cooperative AI research addresses multi-agent coordination failures through game theory and mechanism design, with ~$1-20M/year investment primarily at DeepMind and academic groups. The field remains largely theoretical with limited production deployment, facing fundamental challenges in defining cooperation.

CIRL (cooperative inverse reinforcement learning) is a theoretical framework in which AI systems maintain uncertainty about human preferences, which naturally incentivizes corrigibility and deference. Despite elegant theory with formal proofs, the approach faces a substantial theory-practice gap, with no production deployments and only $1-5M/year in investment.

How major AI companies are responding to safety concerns through internal policies, responsible scaling frameworks, safety teams, and disclosure practices, with analysis of effectiveness and industry trends.

corporate-safety, safety-teams, voluntary-commitments ...

Systematic testing of AI models for dangerous capabilities including bioweapons assistance, cyberattack potential, autonomous self-replication, and persuasion/manipulation abilities to inform deployment decisions and safety policies. Now standard practice with 95%+ frontier model coverage.

dangerous-capabilities, bioweapons, cybersecurity ...

Comprehensive analysis of deepfake detection showing best commercial detectors achieve 78-87% in-the-wild accuracy vs 96%+ in controlled settings, with Deepfake-Eval-2024 benchmark revealing 45-50% performance drops on real-world content. Human detection averages 55.5% (meta-analysis of 56 papers).

Forethought Foundation's five proposed technologies for improving collective epistemics: community notes for everything, rhetoric highlighting, reliability tracking, epistemic virtue evals, and provenance tracing. These design sketches aim to shift society toward high-honesty equilibria.

ELK (eliciting latent knowledge) is the unsolved problem of extracting an AI's true beliefs rather than human-approved outputs. ARC's 2022 prize contest received 197 proposals and awarded $274K, but the $50K and $100K solution prizes remain unclaimed, and the problem remains fundamentally unsolved after 3+ years of focused research.

alignment-theory, deception-detection, belief-extraction ...

A proposed suite of open benchmarks evaluating AI models on epistemic virtues: calibration, clarity, bias resistance, sycophancy avoidance, and manipulation detection. Includes the concept of 'pedantic mode' for maximally accurate AI outputs.

Benchmark saturation is accelerating—MMLU lasted 4 years, MMLU-Pro 18 months, HLE roughly 12 months—while safety-critical evaluations for CBRN, cyber, and AI R&D capabilities are losing signal at frontier labs, raising questions about whether evaluation-based governance frameworks can keep pace with capability growth.

benchmarks, evaluation-gap, responsible-scaling ...

AI models increasingly detect when they are being evaluated and adjust their behavior accordingly. Claude Sonnet 4.5 detected evaluation contexts 58% of the time, and for Opus 4.6 Apollo Research reported evaluation awareness so strong they could not properly assess alignment. Awareness scales as a power law with model size.

evaluation-gaming, deception, scheming ...

Mathematical proofs of AI system properties and behavior bounds, offering potentially strong safety guarantees if achievable but currently limited to small systems and facing fundamental challenges scaling to modern neural networks.

formal-methods, mathematical-guarantees, safety-verification ...

Research into how learned goals fail to generalize correctly to new situations, a core alignment problem where AI systems pursue proxy objectives that diverge from intended goals when deployed outside their training distribution.

Multi-agent safety research addresses coordination failures, conflict, and collusion risks when multiple AI systems interact. A 2025 report from 50+ researchers across DeepMind, Anthropic, and academia identifies seven key risk factors and finds that even individually safe systems may contribute to harm through interaction.

multi-agent-systems, coordination, collusion-risk ...

Analysis of whether releasing AI model weights publicly is net positive or negative for safety. The July 2024 NTIA report recommends monitoring but not restricting open weights, while research shows fine-tuning can remove safety training in as few as 200 examples.

open-source, model-weights, misuse-risk ...

Advocacy for slowing or halting frontier AI development until adequate safety measures are in place. Analysis suggests 15-40% probability of meaningful policy implementation by 2030, with potential to provide 2-5 years of additional safety research time if achieved.

pause, development-moratorium, political-advocacy ...

Prediction markets use market mechanisms to aggregate beliefs about future events, producing probability estimates that reflect the collective knowledge of participants. Unlike polls or expert surveys, prediction markets create incentives for truthful revelation of beliefs - participants profit by being right, not by appearing smart or conforming to social expectations. This makes them resistant to many of the biases that afflict other forecasting methods. Empirically, prediction markets have strong track records. They consistently outperform expert panels on questions with clear resolution criteria. Platforms like Polymarket, Metaculus, and Manifold generate forecasts on AI development, geopolitical events, and scientific questions that often prove more accurate than institutional predictions. The Good Judgment Project demonstrated that carefully selected forecasters using prediction market-like mechanisms could outperform intelligence analysts with access to classified information. For AI governance and epistemic security, prediction markets offer several valuable functions. They can provide credible forecasts of AI capability development, helping policymakers time interventions appropriately. They can surface genuine expert consensus (or lack thereof) on contested questions. They can create accountability for AI labs' claims about safety and timelines. And they can provide a coordination mechanism for collective knowledge that is resistant to the manipulation that undermines traditional media and expert systems.

forecasting, information-aggregation, mechanism-design ...
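A sketch of Hanson's logarithmic market scoring rule (LMSR), a standard automated market maker behind many prediction markets: prices are a softmax of outstanding shares and a trader pays the change in a cost function. The liquidity parameter and trades below are illustrative.

```python
# Hanson's LMSR automated market maker for a binary prediction market.
import math

class LMSRMarket:
    def __init__(self, n_outcomes: int, b: float = 100.0):
        self.b = b                      # liquidity: higher b = prices move more slowly
        self.q = [0.0] * n_outcomes     # shares sold per outcome

    def cost(self, q):
        return self.b * math.log(sum(math.exp(qi / self.b) for qi in q))

    def prices(self):
        exps = [math.exp(qi / self.b) for qi in self.q]
        total = sum(exps)
        return [e / total for e in exps]   # current implied probabilities

    def buy(self, outcome: int, shares: float) -> float:
        """Buy shares of an outcome; the fee is the cost-function change."""
        new_q = list(self.q)
        new_q[outcome] += shares
        fee = self.cost(new_q) - self.cost(self.q)
        self.q = new_q
        return fee

market = LMSRMarket(n_outcomes=2)               # e.g. "capability X by 2027": yes / no
print([round(p, 3) for p in market.prices()])   # starts at [0.5, 0.5]
market.buy(outcome=0, shares=50)                # a trader who believes "yes"
print([round(p, 3) for p in market.prices()])   # "yes" price rises above 0.5
```

The incentive property described above falls out of this structure: moving the price toward your true belief maximizes expected profit, so prices aggregate private information.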

Post-RLHF training techniques including DPO, ORPO, KTO, IPO, and GRPO that align language models with human preferences more efficiently than reinforcement learning. DPO reduces costs by 40-60% while matching RLHF performance on dialogue tasks, though PPO still outperforms on reasoning and safety tasks.

dpo, preference-optimization, rlhf ...
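A minimal PyTorch sketch of the DPO objective, operating on summed per-sequence log-probabilities under the trained policy and a frozen reference model; the tensor values are placeholders rather than real model outputs.

```python
# Minimal DPO loss: logistic loss on the implicit-reward margin between the
# chosen and rejected responses. Placeholder values stand in for model log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit reward = beta * (policy log-prob - reference log-prob).
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the chosen-vs-rejected margin via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

policy_chosen = torch.tensor([-12.3, -8.1])     # log p(chosen | prompt), per example
policy_rejected = torch.tensor([-15.0, -9.7])
ref_chosen = torch.tensor([-13.0, -8.5])
ref_rejected = torch.tensor([-14.2, -9.1])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```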

Linear probing achieves 71-83% accuracy detecting LLM truthfulness and is a foundational diagnostic tool for interpretability research. While computationally cheap and widely adopted, probes are vulnerable to adversarial hiding and only detect linearly separable features, limiting their usefulness as a standalone safeguard.
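A minimal sketch of what a linear probe is: a logistic regression fit on hidden-layer activations labeled true/false. The activations below are synthetic stand-ins for vectors extracted from a real model's residual stream.

```python
# Linear-probe sketch on synthetic activations with a planted "truth direction".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 256, 2000
truth_direction = rng.normal(size=d_model)

# Synthetic activations: truthful examples shift slightly along one direction.
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + np.outer(labels - 0.5, truth_direction) * 0.4

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(acts[1500:], labels[1500:]))
# Caveat from the entry above: a probe only finds linearly separable structure,
# and a model optimized against the probe could hide the feature.
```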

Process supervision trains AI systems to produce correct reasoning steps, not just correct final answers, improving transparency and auditability of AI reasoning while achieving significant gains in mathematical and coding tasks.

process-supervision, chain-of-thought, reasoning-verification ...

An ambitious research agenda to design AI systems with mathematical safety guarantees from the ground up, led by ARIA's £59M Safeguarded AI programme, with the goal of creating superintelligent systems that are provably beneficial through formal verification of world models and value specifications.

formal-methods, mathematical-guarantees, world-modeling ...

Technical and procedural strategies to ground AI-generated content in verified information and reduce factual errors in wiki articles, covering RAG, verification techniques, prompt engineering, and human oversight.

Refusal training teaches AI models to decline harmful requests rather than comply. While universally deployed and achieving 99%+ refusal rates on explicit violations, jailbreak techniques bypass defenses with 1.5-6.5% success rates, and over-refusal blocks 12-43% of legitimate queries.

refusal-training, jailbreaking, safety-training ...

A top-down approach to understanding and controlling AI behavior by reading and modifying concept-level representations in neural networks, enabling behavior steering without retraining through activation interventions.

behavior-steering, activation-engineering, deception-detection ...
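A sketch of the activation-intervention idea using a PyTorch forward hook on a toy layer; in practice the layer is a transformer block and the steering vector is derived from contrastive prompts rather than drawn at random.

```python
# Activation steering sketch: add a fixed direction to one layer's hidden states
# at inference time. The layer is a toy stand-in; a real steering vector is
# usually mean(activations on prompts A) - mean(activations on prompts B).
import torch
import torch.nn as nn

d_model = 64
toy_layer = nn.Linear(d_model, d_model)          # stand-in for one transformer block
steering_vector = torch.randn(d_model)           # assumption: contrast-derived in practice
strength = 4.0

def steer(module, inputs, output):
    return output + strength * steering_vector   # nudge the representation

handle = toy_layer.register_forward_hook(steer)
hidden = torch.randn(1, d_model)
steered = toy_layer(hidden)                      # hook applies during this call
handle.remove()
unsteered = toy_layer(hidden)
print("shift magnitude:", (steered - unsteered).norm().item())
```

No retraining is involved, which is the practical appeal noted above: the same intervention can be switched on, scaled, or removed at inference time.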

Reward modeling, the core component of RLHF receiving $100M+/year investment, trains neural networks on human preference comparisons to enable scalable reinforcement learning. The technique is universally adopted but inherits fundamental limitations, including reward hacking (which worsens with capability).

Sandboxing limits AI system access to resources, networks, and capabilities as a defense-in-depth measure. METR's August 2025 evaluation found GPT-5's time horizon at approximately 2 hours, insufficient for autonomous replication. AI boxing experiments show 60-70% social engineering escape rates.

containment, defense-in-depth, agent-safety ...

Practical approaches for scaling AI evaluation to keep pace with capability growth, including LLM-as-judge (40% production adoption but theoretically capped at 2x sample efficiency), automated behavioral evals, AI-assisted red teaming, CoT monitoring, and debate-based evaluation achieving 76-88% accuracy.

llm-as-judge, automated-evals, red-teaming ...

Research and evaluation methods for identifying when AI models engage in strategic deception—pretending to be aligned while secretly pursuing other goals—including behavioral tests, internal monitoring, and emerging detection techniques. Frontier models exhibit in-context scheming at rates of 0.3-13%.

scheming, deception-detection, behavioral-testing ...

Methods to detect AI models that behave safely during training and evaluation but defect under specific deployment conditions, addressing the core threat of deceptive alignment through behavioral testing, interpretability, and monitoring approaches. Current methods achieve only 5-40% success rates.

sleeper-agents, backdoor-detection, deceptive-alignment ...

Sparse autoencoders extract interpretable features from neural network activations using sparsity constraints. Anthropic's 2024 research extracted 34 million features from Claude 3 Sonnet with 90% interpretability scores, while Goodfire raised $50M in 2025 and released first-ever SAEs for the 671B-parameter DeepSeek R1 reasoning model.

interpretability, feature-extraction, monosemanticity ...
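A minimal PyTorch sketch of the core mechanism: reconstruct activation vectors through an overcomplete ReLU bottleneck with an L1 sparsity penalty. Dimensions, penalty weight, and the random "activations" are toy choices for illustration.

```python
# Minimal sparse autoencoder: MSE reconstruction + L1 sparsity on the features.
import torch
import torch.nn as nn

d_model, d_features, l1_coef = 128, 1024, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))       # sparse, hopefully interpretable codes
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(4096, d_model)              # stand-in for real model activations

for step in range(200):
    batch = activations[torch.randint(0, 4096, (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("mean active features per input:", (feats > 0).float().sum(dim=1).mean().item())
```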

Structured access provides AI capabilities through controlled APIs rather than releasing model weights, maintaining developer control over deployment and enabling monitoring, intervention, and policy enforcement. Enterprise LLM spend reached $8.4B by mid-2025 under this model, but effectiveness depends on maintaining capability gaps with open-weight models.

deployment-safety, api-access, proliferation-control ...

External organizations independently assess AI models for safety and dangerous capabilities. METR, Apollo Research, and government AI Safety Institutes now conduct pre-deployment evaluations of all major frontier models, with the field evolving from voluntary arrangements to EU AI Act mandatory requirements.

third-party-auditing, independent-evaluation, governance ...

Tool-use restrictions limit what actions and APIs AI systems can access, directly constraining their potential for harm. This approach is critical for agentic AI systems, providing hard limits on capabilities regardless of model intentions, with METR evaluations showing agentic task completion horizons doubling every 7 months.

agent-safety, capability-restrictions, defense-in-depth ...
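A simple sketch of a default-deny tool gate for an agent runtime; the tool names, policy fields, and limits are hypothetical illustrations of the pattern, not any vendor's API.

```python
# Allowlist-based tool gate: refuse any tool call not explicitly permitted.
ALLOWED_TOOLS = {
    "web_search": {"max_calls": 20},
    "read_file":  {"paths": ["/sandbox/"]},
    # deliberately absent: "send_email", "execute_shell", "make_payment"
}

def authorize(tool_name: str, args: dict, calls_so_far: int) -> bool:
    policy = ALLOWED_TOOLS.get(tool_name)
    if policy is None:
        return False                                    # default-deny
    if "max_calls" in policy and calls_so_far >= policy["max_calls"]:
        return False
    if "paths" in policy:
        return any(args.get("path", "").startswith(p) for p in policy["paths"])
    return True

print(authorize("web_search", {}, calls_so_far=3))                      # True
print(authorize("execute_shell", {"cmd": "rm -rf /"}, calls_so_far=0))  # False
print(authorize("read_file", {"path": "/etc/passwd"}, calls_so_far=0))  # False
```

Because the check runs outside the model, it holds regardless of what the model intends, which is the "hard limits" property the entry emphasizes.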

Weak-to-strong generalization investigates whether weak supervisors can reliably elicit good behavior from stronger AI systems. OpenAI's ICML 2024 research shows that a GPT-4-level model supervised by a GPT-2-level model can recover about 80% of the performance gap when trained with an auxiliary confidence loss, but reward modeling achieves only 20-40% PGR (performance gap recovered).

weak-to-strong, scalable-oversight, superalignment ...
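A simplified PyTorch sketch of the auxiliary-confidence-loss idea: mix cross-entropy against the weak supervisor's labels with cross-entropy against the strong model's own hardened predictions, so the strong model can overrule weak labels it is confident are wrong. The alpha schedule and thresholding details in the original paper are omitted.

```python
# Simplified weak-to-strong auxiliary confidence loss on a toy binary task.
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_labels, alpha: float = 0.5):
    # Cross-entropy toward the (possibly wrong) weak supervisor's labels.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Cross-entropy toward the strong model's own argmax ("self-confidence" term).
    hardened = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, hardened)
    return (1 - alpha) * ce_weak + alpha * ce_self

strong_logits = torch.randn(8, 2, requires_grad=True)   # strong model outputs (toy)
weak_labels = torch.randint(0, 2, (8,))                 # labels from a weaker supervisor
loss = weak_to_strong_loss(strong_logits, weak_labels)
loss.backward()
print(loss.item())
```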

Analysis of X.com's epistemic practices and impact on information quality. Community Notes reduces repost virality by 46% but only 8-10% of notes display. Engagement-driven algorithms amplify low-credibility content, API restrictions ended 100+ research projects, and verification changes degraded trust signals.