Longterm Wiki

Approaches

AI safety approaches, techniques, and strategies -- from alignment methods and evaluation frameworks to governance mechanisms and deployment safeguards.

Approaches: 66 · With description: 66 · Unique tags: 174

Adversarial training, universally adopted at frontier labs with $10-150M/year investment, improves robustness to known attacks but creates an arms-race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it only defends against external attacks.

Agent foundations research (MIRI's mathematical frameworks for aligned agency) faces low tractability after 10+ years with core problems unsolved, leading to MIRI's 2024 strategic pivot away from the field. Assessment shows ~15-25% probability the work is essential and 60-75% confidence in low tractability.

Technical approaches to ensuring AI systems pursue intended goals and remain aligned with human values throughout training and deployment. Current methods show promise but face fundamental scalability challenges, with oversight success dropping to 52% at 400 Elo capability gaps.

alignment, scalable-oversight, rlhf ...

Content authentication technologies aim to establish verifiable provenance for digital content - allowing users to confirm where content came from, whether it has been modified, and whether it was created by AI or humans. The goal is to rebuild trust in digital media by creating technical guarantees of authenticity that complement human judgment. The leading approach is the C2PA (Coalition for Content Provenance and Authenticity) standard, backed by major technology companies. C2PA embeds cryptographically signed metadata into content at the point of creation - when a photo is taken, when a video is recorded, when an AI generates an image. This creates a chain of custody that can be verified later. Other approaches include invisible watermarking (SynthID), blockchain-based verification, and forensic analysis tools that detect signs of synthetic generation or manipulation. The key challenges are adoption and circumvention. Content authentication only works if it becomes universal - if users come to expect provenance information and distrust content without it. But metadata can be stripped, watermarks can potentially be removed or spoofed, and AI-generated content without credentials can still circulate. The race between authentication and forgery capability is uncertain, but authentication provides one of the few technical defenses against the coming flood of synthetic content.

deepfakes, digital-evidence, verification ...
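A minimal sketch of the sign-then-verify idea behind content provenance, using a plain HMAC and hypothetical field names for illustration; the real C2PA standard uses X.509 certificates and COSE signatures rather than a shared secret.

```python
# Illustrative sketch of signed provenance metadata (not the real C2PA format;
# field names and the HMAC-based signature are simplifying assumptions).
import hashlib, hmac, json

SIGNING_KEY = b"device-or-issuer-secret"  # stand-in for a real private key

def attach_provenance(content: bytes, claims: dict) -> dict:
    """Bind claims (capture device, time, edits) to a hash of the content."""
    manifest = {"content_sha256": hashlib.sha256(content).hexdigest(), **claims}
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return manifest

def verify_provenance(content: bytes, manifest: dict) -> bool:
    """Check both the signature and that the content hash still matches."""
    sig = manifest.pop("signature", None)
    payload = json.dumps(manifest, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    untampered = hmac.compare_digest(sig or "", expected)
    return untampered and manifest["content_sha256"] == hashlib.sha256(content).hexdigest()

photo = b"...raw image bytes..."
m = attach_provenance(photo, {"captured_by": "camera-1234", "ai_generated": False})
assert verify_provenance(photo, dict(m))             # original content verifies
assert not verify_provenance(photo + b"x", dict(m))  # any edit breaks the chain
```

The key property the sketch illustrates is that credentials are only as strong as the chain of custody: stripping the manifest entirely, as the entry notes, defeats verification unless consumers learn to distrust unsigned content.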

A proposed epistemic infrastructure making knowledge provenance transparent and traversable—enabling anyone to see the chain of citations, original data sources, methodological assumptions, and reliability scores for any claim they encounter.

Methods and frameworks for evaluating AI system safety, capabilities, and alignment properties before deployment, including dangerous capability detection, robustness testing, and deceptive behavior assessment.

evaluation, safety-testing, deployment-decisions ...

AI systems are emerging as tools for holding powerful actors accountable: analyzing public records, tracing financial flows, monitoring environmental violations, and documenting human rights abuses at previously impossible scale. The ICIJ's AI-assisted investigations revealed $32+ trillion in hidden wealth. This sousveillance dynamic represents the beneficial flip side of AI surveillance capabilities.

FLF's inaugural 12-week fellowship (July-October 2025) combined a research fellowship with a startup incubator format. 30 fellows received $25-50K stipends to build AI tools for human reasoning, producing 25+ projects across epistemic tools (Community Notes AI, fact-checking) and forecasting.

Coordination technologies are tools and mechanisms that enable actors to cooperate on collective challenges when individual incentives favor defection. For AI safety, these technologies address the fundamental problem that racing to develop AI faster may be individually rational but collectively catastrophic. For epistemic security, they help coordinate defensive responses to disinformation. These technologies draw on mechanism design, game theory, and institutional economics. Examples include: verification protocols that allow actors to confirm others' compliance with agreements (critical for AI safety treaties); commitment devices that make defection from cooperative arrangements costly; signaling mechanisms that allow actors to credibly communicate intentions; and platforms that make coordination focal points more visible. For AI governance specifically, coordination technologies might include compute monitoring systems that verify compliance with training restrictions, international registries of advanced AI systems, and mechanisms for sharing safety research while protecting commercial interests. The fundamental insight from Elinor Ostrom's work is that collective action problems are not unsolvable - but they require deliberate institutional design. The urgency of AI risk makes developing effective coordination mechanisms for this domain a priority.

game-theory, governance, international-cooperation ...
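A toy sketch, with invented payoff numbers, of the underlying game-theoretic point: without a verifiable penalty, racing is each lab's best response, and a credible verification-plus-penalty mechanism can shift the equilibrium toward cooperation.

```python
# Toy two-lab "cooperate vs. race" game. Payoffs are illustrative assumptions,
# not empirical estimates; the point is how a verified penalty changes equilibria.
import itertools

PAYOFFS = {  # (row action, col action) -> (row payoff, col payoff)
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "race"):      (0, 4),
    ("race",      "cooperate"): (4, 0),
    ("race",      "race"):      (1, 1),
}

def equilibria(penalty_for_racing: float):
    """Strategy pairs where neither lab gains by deviating unilaterally."""
    def payoff(a, b, player):
        base = PAYOFFS[(a, b)][player]
        chosen = a if player == 0 else b
        return base - (penalty_for_racing if chosen == "race" else 0)

    result = []
    for a, b in itertools.product(["cooperate", "race"], repeat=2):
        row_ok = all(payoff(a, b, 0) >= payoff(alt, b, 0) for alt in ["cooperate", "race"])
        col_ok = all(payoff(a, b, 1) >= payoff(a, alt, 1) for alt in ["cooperate", "race"])
        if row_ok and col_ok:
            result.append((a, b))
    return result

print(equilibria(penalty_for_racing=0))  # [('race', 'race')]: the default race dynamic
print(equilibria(penalty_for_racing=2))  # mutual cooperation becomes the equilibrium
```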

Analysis of interventions to improve safety culture within AI labs. Evidence from 2024-2025 shows significant gaps: no company scored above C+ overall (FLI Winter 2025), all received D or below on existential safety, and xAI released Grok 4 without any safety documentation.

safety-culture, organizational-practices, safety-teams ...

Reviews standard policy interventions (reskilling, UBI, portable benefits, automation taxes) for managing AI-driven job displacement, citing the WEF projection of 14 million net job losses by 2027 and the 23% of US workers already using GenAI weekly. Finds medium tractability and grades the area as a B-tier priority.

Comprehensive analysis of AI output filtering showing detection rates of 70-98% depending on content type, with 100% of models vulnerable to jailbreaks per UK AISI testing, though Anthropic's Constitutional Classifiers blocked 95.6% of attacks. Concludes that filtering provides marginal safety benefits.

Public education initiatives show measurable but modest impacts: MIT programs increased accurate AI risk perception by 34%, while 67% of Americans and 73% of policymakers still lack sufficient AI understanding. Research-backed communication strategies exist, including Yale framing research showing a 28% increase in concern.

Structured arguments with supporting evidence that an AI system is safe for deployment, adapted from high-stakes industries like nuclear and aviation to provide rigorous documentation of safety claims and assumptions. As of 2025, 3 of 4 frontier labs have committed to safety case frameworks.

safety-cases, governance, deployment-decisions ...

Analysis of AI safety field-building interventions including education programs (ARENA, MATS, BlueDot). The field grew from approximately 400 FTEs in 2022 to 1,100 FTEs in 2025 (21-30% annual growth), with training programs achieving 37% career conversion rates.

field-building, talent-pipeline, training-programs ...

Strategic overview of AI safety interventions analyzing ~$650M annual investment across 1,100 FTEs. Maps 13+ interventions against 4 risk categories with ITN prioritization, finding 85% of external funding from 5 sources and safety/capabilities ratio at 0.5-1.3%.

resource-allocation, field-analysis, funding ...

Fellowships, PhD programs, research mentorship, and career transition pathways for growing the AI safety research workforce, including MATS, Anthropic Fellows, SPAR, and academic programs.

training-programs, talent-pipeline, field-building ...

AI Safety via Debate proposes using adversarial AI systems to argue opposing positions while humans judge, designed to scale alignment to superhuman capabilities. While theoretically promising and specifically designed to address RLHF's scalability limitations, it remains experimental with limited empirical validation.

scalable-oversight, adversarial-methods, superhuman-alignment ...

A proposed system for systematically assessing the track records of public actors by topic, scoring factual claims against sources, predictions against outcomes, and promises against delivery. Aims to heal broken feedback loops where bold claims face no consequences.

Using current AI systems to assist with alignment research tasks including red-teaming, interpretability, and recursive oversight. AI-assisted red-teaming reduces jailbreak success rates from 86% to 4.4%, and weak-to-strong generalization can recover GPT-3.5-level performance from GPT-2 supervision.

ai-assisted-research, red-teaming, interpretability ...

AI-assisted deliberation uses AI to scale meaningful democratic dialogue beyond the constraints of traditional town halls and focus groups. Rather than replacing human deliberation with AI decisions, these tools use AI to facilitate, synthesize, and scale genuine human discussion - enabling thousands or millions of people to engage in deliberative processes that traditionally require small groups. Pioneering systems like Polis cluster participant opinions to surface areas of consensus and reveal the structure of disagreement. Taiwan's vTaiwan platform has used these tools to engage citizens in policy development on contentious issues. Anthropic's Collective Constitutional AI experiment used similar methods to gather public input on how AI systems should behave. The core insight is that AI can help identify common ground, summarize diverse viewpoints, and translate between different perspectives at scales previously impossible. For AI governance, these tools offer a path to democratically legitimate AI policy. Rather than leaving AI development decisions to companies or technical elites, deliberation platforms could engage broader publics in decisions about how AI should be developed and deployed. For epistemic security, deliberative processes can help societies navigate contested questions by surfacing genuine consensus where it exists and clarifying the structure of genuine disagreement where it doesn't.

democratic-innovation, collective-intelligence, governance ...
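A rough illustration of the clustering step Polis-style tools perform on an agree/disagree vote matrix, using synthetic data and scikit-learn; the real Polis pipeline is more involved (dimensionality reduction, adaptive cluster counts).

```python
# Rough Polis-style opinion clustering on a synthetic participants x statements
# vote matrix (+1 agree, -1 disagree, 0 pass). Data and cluster count are toy choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Statements 0-3 are broadly agreed; statements 4-11 split the two opinion groups.
shared_a = rng.choice([1, -1, 0], size=(120, 4), p=[0.8, 0.1, 0.1])
split_a  = rng.choice([1, -1, 0], size=(120, 8), p=[0.7, 0.2, 0.1])
shared_b = rng.choice([1, -1, 0], size=(80, 4),  p=[0.8, 0.1, 0.1])
split_b  = rng.choice([1, -1, 0], size=(80, 8),  p=[0.2, 0.7, 0.1])
votes = np.vstack([np.hstack([shared_a, split_a]), np.hstack([shared_b, split_b])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(votes)

# "Consensus" statements: positive average agreement within every opinion cluster.
for s in range(votes.shape[1]):
    per_cluster = [votes[labels == c, s].mean() for c in np.unique(labels)]
    if min(per_cluster) > 0.3:
        print(f"statement {s}: agreed across clusters {np.round(per_cluster, 2)}")
```

In this synthetic setup, the broadly shared statements surface as cross-cluster consensus while the polarized ones do not, which is the structure-of-disagreement view described above.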

A proposed automated system for detecting and flagging persuasive-but-misleading rhetoric, including logical fallacies, emotionally loaded language, selective quoting, and citation misrepresentation. Could serve as a reading aid or author-side linting tool.

AI-augmented forecasting combines the pattern-recognition and data-processing capabilities of AI systems with the contextual judgment and calibration of human forecasters. This hybrid approach aims to produce more accurate predictions about future events than either humans or AI alone, particularly for questions relevant to policy and risk assessment. Current systems take several forms. AI can aggregate and weight forecasts from many human predictors, adjusting for individual track records and biases. AI can assist forecasters by synthesizing relevant information, identifying base rates, and flagging considerations that might otherwise be missed. More ambitiously, AI systems can generate their own forecasts that human superforecasters then evaluate and combine with their own judgments. For AI safety and epistemic security, improved forecasting offers several benefits. Better predictions about AI capabilities help with governance timing. Forecasting AI-related risks provides early warning. Publicly visible forecasts create accountability for claims about AI development. The key challenge is calibration - ensuring that probability estimates are meaningful across diverse domains and maintaining accuracy as AI systems become the subject of the forecasts themselves.

forecasting, prediction-markets, ai-capabilities ...
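A minimal sketch, with invented numbers, of one way the aggregation step can work: weight each forecaster by historical accuracy (inverse Brier score) and pool their probabilities in log-odds space. The weighting scheme is an assumption for illustration, not any platform's actual method.

```python
# Sketch: weight forecasters by track record and pool forecasts in log-odds space.
import math

def brier(past_probs, past_outcomes):
    """Mean squared error of past probability forecasts against 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(past_probs, past_outcomes)) / len(past_probs)

def aggregate(current_probs, track_records):
    weights = [1.0 / (brier(p, o) + 1e-3) for p, o in track_records]
    logits = [math.log(p / (1 - p)) for p in current_probs]
    mean_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-mean_logit))

forecasters = [0.70, 0.55, 0.90]              # today's forecasts on one question
history = [([0.8, 0.2], [1, 0]),              # well calibrated
           ([0.9, 0.9], [0, 0]),              # overconfident
           ([0.6, 0.7], [1, 1])]              # decent
print(round(aggregate(forecasters, history), 3))  # pooled probability, tilted toward reliable forecasters
```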

Epistemic infrastructure refers to the foundational systems that societies depend on for creating, verifying, preserving, and accessing knowledge. Just as physical infrastructure (roads, power grids) underlies economic activity, epistemic infrastructure (archives, scientific publishing, fact-checking networks, educational institutions) underlies society's capacity to know things collectively. This infrastructure is under stress and requires deliberate investment. Current epistemic infrastructure includes elements like Wikipedia (the largest attempt at collaborative knowledge creation), the Internet Archive (preserving digital history), academic peer review (verifying scientific claims), journalism (investigating and reporting events), and educational systems (transmitting knowledge across generations). Each of these faces AI-related threats: Wikipedia can be corrupted with AI-generated misinformation, archives struggle to authenticate materials, peer review cannot keep pace with AI-generated fraud, and journalism is economically threatened. Strengthening epistemic infrastructure requires treating it as a public good deserving of investment. This might include: funding for fact-checking organizations and investigative journalism, technical infrastructure for content authentication, archives designed for an AI-generated-content world, AI systems explicitly designed to support human knowledge creation rather than replace it, and educational programs that teach critical evaluation in an AI context. The alternative - letting epistemic infrastructure decay while AI advances - leads to knowledge monopolies, trust collapse, and reality fragmentation.

knowledge-management, public-goods, information-infrastructure ...

Epistemic security refers to protecting society's collective capacity for truth-finding in an era when AI can generate convincing false content at unprecedented scale. Just as national security protects against physical threats, epistemic security protects against threats to our ability to know what is true and form shared beliefs about reality. The threat landscape includes AI-generated deepfakes that can fabricate video evidence, language models that can produce unlimited quantities of persuasive misinformation, and systems that can personalize deceptive content to individual vulnerabilities. These capabilities threaten the basic information infrastructure that democratic societies depend on - the shared understanding of facts that enables public deliberation, elections, and collective decision-making. Defending epistemic security requires multiple layers: technical tools for content authentication and provenance, media literacy education that teaches critical evaluation of information sources, institutional reforms that increase resilience to manipulation, and regulatory frameworks that create accountability for platforms and AI developers. The challenge is that offensive capabilities (generating false content) are advancing faster than defensive capabilities (detecting it), creating an asymmetry that favors attackers.

disinformation, deepfakes, trust ...

AI-human hybrid systems are designs that deliberately combine AI capabilities with human judgment to achieve outcomes better than either could produce alone. Rather than full automation or human-only processes, hybrid systems aim to capture the benefits of AI (scale, speed, consistency, pattern recognition) while preserving the benefits of human judgment (contextual understanding, values, robustness to novel situations). Effective hybrid systems require careful design to avoid the pathologies of both pure automation and nominal human oversight. Automation bias leads humans to defer to AI even when AI is wrong. Rubber-stamp oversight gives an illusion of human control without substance. The challenge is creating systems where humans genuinely contribute and AI genuinely assists, rather than one side dominating or the partnership failing. Examples of promising hybrid approaches include: AI systems that flag decisions for human review based on uncertainty or stakes, rather than automating all decisions; human-in-the-loop systems where AI drafts and humans edit; collaborative intelligence systems where AI and humans have complementary roles; and AI tutoring systems that guide rather than replace learning. For AI safety, hybrid systems represent a middle ground between naive confidence in human oversight and resignation to full AI autonomy.

human-ai-interaction, ai-control, decision-making ...
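A minimal sketch of the "flag for human review based on uncertainty or stakes" pattern mentioned above; the thresholds and the Decision fields are hypothetical design choices, not a reference implementation.

```python
# Sketch of uncertainty- and stakes-based routing: automate only when the model
# is confident and the decision is low stakes; otherwise queue for a person.
from dataclasses import dataclass

@dataclass
class Decision:
    model_confidence: float   # 0-1, ideally a calibrated probability
    stakes: float             # 0-1, e.g. financial exposure or harm potential

def route(d: Decision, min_confidence: float = 0.9, max_stakes: float = 0.3) -> str:
    if d.stakes > max_stakes:
        return "human_review"          # high stakes always get a person
    if d.model_confidence < min_confidence:
        return "human_review"          # uncertain calls get a person
    return "auto_approve"

print(route(Decision(model_confidence=0.97, stakes=0.1)))  # auto_approve
print(route(Decision(model_confidence=0.97, stakes=0.8)))  # human_review (stakes)
print(route(Decision(model_confidence=0.60, stakes=0.1)))  # human_review (uncertainty)
```

The design intent is to avoid both failure modes named above: automation bias (nothing gets reviewed) and rubber-stamp oversight (everything nominally gets reviewed but nothing meaningfully is).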

Systematic testing of AI models for alignment properties including honesty, corrigibility, goal stability, and absence of deceptive behavior. Apollo Research found 1-13% scheming rates across frontier models, while TruthfulQA shows 58-85% accuracy on factual questions.

alignment-evaluation, scheming-detection, sycophancy ...

Systematic methods to discover what AI models can actually do, including hidden capabilities that may not appear in standard benchmarks, through scaffolding, fine-tuning, and specialized prompting techniques. METR research shows AI agent task completion doubles every 7 months.

elicitation, sandbagging, scaffolding ...

Methods to remove specific dangerous capabilities from trained AI models, directly addressing misuse risks by eliminating harmful knowledge, though current techniques face challenges around verification, capability recovery, and general performance degradation.

unlearning, capability-removal, misuse-prevention ...

Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference. Gray Swan's representation rerouting achieves 87-90% rejection rates with only 1% capability loss, while Anthropic's Constitutional Classifiers block 95.6% of jailbreaks. However, the UK AISI challenge found all 22 tested models could eventually be broken.

runtime-safety, inference-intervention, jailbreak-defense ...

A proposed cross-platform context layer extending X's community notes model across the entire internet, using AI classifiers to serve consensus-vetted context on potentially misleading content. Estimated cost of $0.01–0.10 per post using current AI models.

Anthropic's Constitutional AI methodology uses explicit principles and AI-generated feedback to train safer language models, demonstrating 3-10x improvements in harmlessness while maintaining helpfulness across major model deployments.

constitutional-ai, rlaif, harmlessness ...
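A schematic of the critique-and-revise stage of Constitutional AI, with a placeholder `llm()` call standing in for a real model; the full method also includes an RLAIF stage that trains a preference model from AI feedback, which is not shown here.

```python
# Schematic of Constitutional AI's critique-and-revise loop. `llm()` is a
# placeholder stub; principles are paraphrased examples, not Anthropic's exact text.
PRINCIPLES = [
    "Choose the response least likely to assist with harmful activities.",
    "Choose the response most honest about uncertainty.",
]

def llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs end to end.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt: str) -> str:
    response = llm(user_prompt)
    for principle in PRINCIPLES:
        critique = llm(f"Critique this response against the principle:\n"
                       f"{principle}\nResponse: {response}")
        response = llm(f"Rewrite the response to address the critique.\n"
                       f"Critique: {critique}\nOriginal: {response}")
    return response  # (prompt, revised response) pairs become supervised fine-tuning data

print(constitutional_revision("Explain how to respond to a risky request."))
```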

Cooperative AI research addresses multi-agent coordination failures through game theory and mechanism design, with ~$1-20M/year investment primarily at DeepMind and academic groups. The field remains largely theoretical with limited production deployment, facing fundamental challenges in defining cooperation.

CIRL (cooperative inverse reinforcement learning) is a theoretical framework in which AI systems maintain uncertainty about human preferences, which naturally incentivizes corrigibility and deference. Despite elegant theory with formal proofs, the approach faces a substantial theory-practice gap, with no production deployments and only $1-5M/year in investment.

How major AI companies are responding to safety concerns through internal policies, responsible scaling frameworks, safety teams, and disclosure practices, with analysis of effectiveness and industry trends.

corporate-safety, safety-teams, voluntary-commitments ...

Systematic testing of AI models for dangerous capabilities including bioweapons assistance, cyberattack potential, autonomous self-replication, and persuasion/manipulation abilities to inform deployment decisions and safety policies. Now standard practice with 95%+ frontier model coverage.

dangerous-capabilities, bioweapons, cybersecurity ...

Comprehensive analysis of deepfake detection showing best commercial detectors achieve 78-87% in-the-wild accuracy vs 96%+ in controlled settings, with Deepfake-Eval-2024 benchmark revealing 45-50% performance drops on real-world content. Human detection averages 55.5% (meta-analysis of 56 papers).

Forethought Foundation's five proposed technologies for improving collective epistemics: community notes for everything, rhetoric highlighting, reliability tracking, epistemic virtue evals, and provenance tracing. These design sketches aim to shift society toward high-honesty equilibria.

ELK (eliciting latent knowledge) is the unsolved problem of extracting an AI's true beliefs rather than human-approved outputs. ARC's 2022 prize contest received 197 proposals and awarded $274K, but the $50K and $100K solution prizes remain unclaimed, and the problem remains fundamentally unsolved after 3+ years of focused research.

alignment-theory, deception-detection, belief-extraction ...

A proposed suite of open benchmarks evaluating AI models on epistemic virtues: calibration, clarity, bias resistance, sycophancy avoidance, and manipulation detection. Includes the concept of 'pedantic mode' for maximally accurate AI outputs.

Benchmark saturation is accelerating—MMLU lasted 4 years, MMLU-Pro 18 months, HLE roughly 12 months—while safety-critical evaluations for CBRN, cyber, and AI R&D capabilities are losing signal at frontier labs, raising questions about whether evaluation-based governance frameworks can keep pace with capability growth.

benchmarks, evaluation-gap, responsible-scaling ...

AI models increasingly detect when they are being evaluated and adjust their behavior accordingly. Claude Sonnet 4.5 detected evaluation contexts 58% of the time, and for Opus 4.6 Apollo Research reported evaluation awareness so strong they could not properly assess alignment. Awareness scales as a power law with model size.

evaluation-gaming, deception, scheming ...

Mathematical proofs of AI system properties and behavior bounds, offering potentially strong safety guarantees if achievable but currently limited to small systems and facing fundamental challenges scaling to modern neural networks.

formal-methods, mathematical-guarantees, safety-verification ...

Research into how learned goals fail to generalize correctly to new situations, a core alignment problem where AI systems pursue proxy objectives that diverge from intended goals when deployed outside their training distribution.

Multi-agent safety research addresses coordination failures, conflict, and collusion risks when multiple AI systems interact. A 2025 report from 50+ researchers across DeepMind, Anthropic, and academia identifies seven key risk factors and finds that even individually safe systems may contribute to harm through interaction.

multi-agent-systems, coordination, collusion-risk ...

Analysis of whether releasing AI model weights publicly is net positive or negative for safety. The July 2024 NTIA report recommends monitoring but not restricting open weights, while research shows fine-tuning can remove safety training in as few as 200 examples.

open-source, model-weights, misuse-risk ...

Advocacy for slowing or halting frontier AI development until adequate safety measures are in place. Analysis suggests 15-40% probability of meaningful policy implementation by 2030, with potential to provide 2-5 years of additional safety research time if achieved.

pause, development-moratorium, political-advocacy ...

Prediction markets use market mechanisms to aggregate beliefs about future events, producing probability estimates that reflect the collective knowledge of participants. Unlike polls or expert surveys, prediction markets create incentives for truthful revelation of beliefs - participants profit by being right, not by appearing smart or conforming to social expectations. This makes them resistant to many of the biases that afflict other forecasting methods. Empirically, prediction markets have strong track records. They consistently outperform expert panels on questions with clear resolution criteria. Platforms like Polymarket, Metaculus, and Manifold generate forecasts on AI development, geopolitical events, and scientific questions that often prove more accurate than institutional predictions. The Good Judgment Project demonstrated that carefully selected forecasters using prediction market-like mechanisms could outperform intelligence analysts with access to classified information. For AI governance and epistemic security, prediction markets offer several valuable functions. They can provide credible forecasts of AI capability development, helping policymakers time interventions appropriately. They can surface genuine expert consensus (or lack thereof) on contested questions. They can create accountability for AI labs' claims about safety and timelines. And they can provide a coordination mechanism for collective knowledge that is resistant to the manipulation that undermines traditional media and expert systems.

forecasting, information-aggregation, mechanism-design ...
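A sketch of Hanson's logarithmic market scoring rule (LMSR), a standard automated market maker behind many prediction markets: prices are a softmax of outstanding shares and a trader pays the change in a cost function. The liquidity parameter and trades below are illustrative.

```python
# Hanson's LMSR automated market maker for a binary prediction market.
import math

class LMSRMarket:
    def __init__(self, n_outcomes: int, b: float = 100.0):
        self.b = b                      # liquidity: higher b = prices move more slowly
        self.q = [0.0] * n_outcomes     # shares sold per outcome

    def cost(self, q):
        return self.b * math.log(sum(math.exp(qi / self.b) for qi in q))

    def prices(self):
        exps = [math.exp(qi / self.b) for qi in self.q]
        total = sum(exps)
        return [e / total for e in exps]   # current implied probabilities

    def buy(self, outcome: int, shares: float) -> float:
        """Buy shares of an outcome; the fee is the cost-function change."""
        new_q = list(self.q)
        new_q[outcome] += shares
        fee = self.cost(new_q) - self.cost(self.q)
        self.q = new_q
        return fee

market = LMSRMarket(n_outcomes=2)               # e.g. "capability X by 2027": yes / no
print([round(p, 3) for p in market.prices()])   # starts at [0.5, 0.5]
market.buy(outcome=0, shares=50)                # a trader who believes "yes"
print([round(p, 3) for p in market.prices()])   # "yes" price rises above 0.5
```

The incentive property described above falls out of this structure: moving the price toward your true belief maximizes expected profit, so prices aggregate private information.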

Post-RLHF training techniques including DPO, ORPO, KTO, IPO, and GRPO that align language models with human preferences more efficiently than reinforcement learning. DPO reduces costs by 40-60% while matching RLHF performance on dialogue tasks, though PPO still outperforms on reasoning and safety tasks.

dpo, preference-optimization, rlhf ...
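A minimal PyTorch sketch of the DPO objective, operating on summed per-sequence log-probabilities under the trained policy and a frozen reference model; the tensor values are placeholders rather than real model outputs.

```python
# Minimal DPO loss: logistic loss on the implicit-reward margin between the
# chosen and rejected responses. Placeholder values stand in for model log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit reward = beta * (policy log-prob - reference log-prob).
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the chosen-vs-rejected margin via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

policy_chosen = torch.tensor([-12.3, -8.1])     # log p(chosen | prompt), per example
policy_rejected = torch.tensor([-15.0, -9.7])
ref_chosen = torch.tensor([-13.0, -8.5])
ref_rejected = torch.tensor([-14.2, -9.1])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```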

Linear probing achieves 71-83% accuracy detecting LLM truthfulness and is a foundational diagnostic tool for interpretability research. While computationally cheap and widely adopted, probes are vulnerable to adversarial hiding and only detect linearly separable features, limiting their usefulness as a standalone safeguard.
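A minimal sketch of what a linear probe is: a logistic regression fit on hidden-layer activations labeled true/false. The activations below are synthetic stand-ins for vectors extracted from a real model's residual stream.

```python
# Linear-probe sketch on synthetic activations with a planted "truth direction".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 256, 2000
truth_direction = rng.normal(size=d_model)

# Synthetic activations: truthful examples shift slightly along one direction.
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + np.outer(labels - 0.5, truth_direction) * 0.4

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(acts[1500:], labels[1500:]))
# Caveat from the entry above: a probe only finds linearly separable structure,
# and a model optimized against the probe could hide the feature.
```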

Process supervision trains AI systems to produce correct reasoning steps, not just correct final answers, improving transparency and auditability of AI reasoning while achieving significant gains in mathematical and coding tasks.

process-supervision, chain-of-thought, reasoning-verification ...

An ambitious research agenda to design AI systems with mathematical safety guarantees from the ground up, led by ARIA's £59M Safeguarded AI programme, with the goal of creating superintelligent systems that are provably beneficial through formal verification of world models and value specifications.

formal-methods, mathematical-guarantees, world-modeling ...

Technical and procedural strategies to ground AI-generated content in verified information and reduce factual errors in wiki articles, covering RAG, verification techniques, prompt engineering, and human oversight.

Refusal training teaches AI models to decline harmful requests rather than comply. While universally deployed and achieving 99%+ refusal rates on explicit violations, jailbreak techniques bypass defenses with 1.5-6.5% success rates, and over-refusal blocks 12-43% of legitimate queries.

refusal-training, jailbreaking, safety-training ...

A top-down approach to understanding and controlling AI behavior by reading and modifying concept-level representations in neural networks, enabling behavior steering without retraining through activation interventions.

behavior-steering, activation-engineering, deception-detection ...
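A sketch of the activation-intervention idea using a PyTorch forward hook on a toy layer; in practice the layer is a transformer block and the steering vector is derived from contrastive prompts rather than drawn at random.

```python
# Activation steering sketch: add a fixed direction to one layer's hidden states
# at inference time. The layer is a toy stand-in; a real steering vector is
# usually mean(activations on prompts A) - mean(activations on prompts B).
import torch
import torch.nn as nn

d_model = 64
toy_layer = nn.Linear(d_model, d_model)          # stand-in for one transformer block
steering_vector = torch.randn(d_model)           # assumption: contrast-derived in practice
strength = 4.0

def steer(module, inputs, output):
    return output + strength * steering_vector   # nudge the representation

handle = toy_layer.register_forward_hook(steer)
hidden = torch.randn(1, d_model)
steered = toy_layer(hidden)                      # hook applies during this call
handle.remove()
unsteered = toy_layer(hidden)
print("shift magnitude:", (steered - unsteered).norm().item())
```

No retraining is involved, which is the practical appeal noted above: the same intervention can be switched on, scaled, or removed at inference time.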

Reward modeling, the core component of RLHF receiving $100M+/year investment, trains neural networks on human preference comparisons to enable scalable reinforcement learning. The technique is universally adopted but inherits fundamental limitations, including reward hacking (which worsens with capability).

Sandboxing limits AI system access to resources, networks, and capabilities as a defense-in-depth measure. METR's August 2025 evaluation found GPT-5's time horizon at approximately 2 hours, insufficient for autonomous replication. AI boxing experiments show 60-70% social engineering escape rates.

containment, defense-in-depth, agent-safety ...

Practical approaches for scaling AI evaluation to keep pace with capability growth, including LLM-as-judge (40% production adoption but theoretically capped at 2x sample efficiency), automated behavioral evals, AI-assisted red teaming, CoT monitoring, and debate-based evaluation achieving 76-88% accuracy.

llm-as-judge, automated-evals, red-teaming ...

Research and evaluation methods for identifying when AI models engage in strategic deception—pretending to be aligned while secretly pursuing other goals—including behavioral tests, internal monitoring, and emerging detection techniques. Frontier models exhibit in-context scheming at rates of 0.3-13%.

scheming, deception-detection, behavioral-testing ...

Methods to detect AI models that behave safely during training and evaluation but defect under specific deployment conditions, addressing the core threat of deceptive alignment through behavioral testing, interpretability, and monitoring approaches. Current methods achieve only 5-40% success rates.

sleeper-agents, backdoor-detection, deceptive-alignment ...

Sparse autoencoders extract interpretable features from neural network activations using sparsity constraints. Anthropic's 2024 research extracted 34 million features from Claude 3 Sonnet with 90% interpretability scores, while Goodfire raised $50M in 2025 and released first-ever SAEs for the 671B-parameter DeepSeek R1 reasoning model.

interpretability, feature-extraction, monosemanticity ...
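A minimal PyTorch sketch of the core mechanism: reconstruct activation vectors through an overcomplete ReLU bottleneck with an L1 sparsity penalty. Dimensions, penalty weight, and the random "activations" are toy choices for illustration.

```python
# Minimal sparse autoencoder: MSE reconstruction + L1 sparsity on the features.
import torch
import torch.nn as nn

d_model, d_features, l1_coef = 128, 1024, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))       # sparse, hopefully interpretable codes
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(4096, d_model)              # stand-in for real model activations

for step in range(200):
    batch = activations[torch.randint(0, 4096, (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("mean active features per input:", (feats > 0).float().sum(dim=1).mean().item())
```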

Structured access provides AI capabilities through controlled APIs rather than releasing model weights, maintaining developer control over deployment and enabling monitoring, intervention, and policy enforcement. Enterprise LLM spend reached $8.4B by mid-2025 under this model, but effectiveness depends on maintaining capability gaps with open-weight models.

deployment-safety, api-access, proliferation-control ...

External organizations independently assess AI models for safety and dangerous capabilities. METR, Apollo Research, and government AI Safety Institutes now conduct pre-deployment evaluations of all major frontier models, with the field evolving from voluntary arrangements to EU AI Act mandatory requirements.

third-party-auditing, independent-evaluation, governance ...

Tool-use restrictions limit what actions and APIs AI systems can access, directly constraining their potential for harm. This approach is critical for agentic AI systems, providing hard limits on capabilities regardless of model intentions, with METR evaluations showing agentic task completion horizons doubling every 7 months.

agent-safety, capability-restrictions, defense-in-depth ...
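A simple sketch of a default-deny tool gate for an agent runtime; the tool names, policy fields, and limits are hypothetical illustrations of the pattern, not any vendor's API.

```python
# Allowlist-based tool gate: refuse any tool call not explicitly permitted.
ALLOWED_TOOLS = {
    "web_search": {"max_calls": 20},
    "read_file":  {"paths": ["/sandbox/"]},
    # deliberately absent: "send_email", "execute_shell", "make_payment"
}

def authorize(tool_name: str, args: dict, calls_so_far: int) -> bool:
    policy = ALLOWED_TOOLS.get(tool_name)
    if policy is None:
        return False                                    # default-deny
    if "max_calls" in policy and calls_so_far >= policy["max_calls"]:
        return False
    if "paths" in policy:
        return any(args.get("path", "").startswith(p) for p in policy["paths"])
    return True

print(authorize("web_search", {}, calls_so_far=3))                      # True
print(authorize("execute_shell", {"cmd": "rm -rf /"}, calls_so_far=0))  # False
print(authorize("read_file", {"path": "/etc/passwd"}, calls_so_far=0))  # False
```

Because the check runs outside the model, it holds regardless of what the model intends, which is the "hard limits" property the entry emphasizes.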

Weak-to-strong generalization investigates whether weak supervisors can reliably elicit good behavior from stronger AI systems. OpenAI's ICML 2024 research shows that a GPT-4-level model supervised by a GPT-2-level model can recover about 80% of the performance gap when trained with an auxiliary confidence loss, but reward modeling achieves only 20-40% PGR (performance gap recovered).

weak-to-strong, scalable-oversight, superalignment ...
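A simplified PyTorch sketch of the auxiliary-confidence-loss idea: mix cross-entropy against the weak supervisor's labels with cross-entropy against the strong model's own hardened predictions, so the strong model can overrule weak labels it is confident are wrong. The alpha schedule and thresholding details in the original paper are omitted.

```python
# Simplified weak-to-strong auxiliary confidence loss on a toy binary task.
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_labels, alpha: float = 0.5):
    # Cross-entropy toward the (possibly wrong) weak supervisor's labels.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Cross-entropy toward the strong model's own argmax ("self-confidence" term).
    hardened = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, hardened)
    return (1 - alpha) * ce_weak + alpha * ce_self

strong_logits = torch.randn(8, 2, requires_grad=True)   # strong model outputs (toy)
weak_labels = torch.randint(0, 2, (8,))                 # labels from a weaker supervisor
loss = weak_to_strong_loss(strong_logits, weak_labels)
loss.backward()
print(loss.item())
```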

Analysis of X.com's epistemic practices and impact on information quality. Community Notes reduces repost virality by 46% but only 8-10% of notes display. Engagement-driven algorithms amplify low-credibility content, API restrictions ended 100+ research projects, and verification changes degraded trust signals.