Summary
Comprehensive analysis of Responsible Scaling Policies showing 20 companies with published frameworks as of Dec 2025, with SaferAI grading major policies 1.9-2.2/5 for specificity. Evidence suggests moderate effectiveness hindered by voluntary nature, competitive pressure among 3+ labs, and ~7-month capability doubling potentially outpacing evaluation science, though third-party verification (METR evaluated 5+ models) and Seoul Summit commitments (16 signatories) represent meaningful coordination progress.
Responsible Scaling Policies
Overview
Responsible Scaling Policies (RSPs) are self-imposed commitments by AI labs to tie AI development to safety progress. The core idea is simple: before scaling to more capable systems, labs commit to demonstrating that their safety measures are adequate for the risks those systems would pose. If evaluations reveal dangerous capabilities without adequate safeguards, development should pause until safety catches up.
Anthropic introduced the first RSP in September 2023, establishing "AI Safety Levels" (ASL-1 through ASL-4+) analogous to biosafety levels. OpenAI followed with its Preparedness Framework in December 2023, and Google DeepMind published its Frontier Safety Framework in May 2024. By late 2024, twelve major AI companies had published some form of frontier AI safety policy, and the Seoul Summit secured voluntary commitments from sixteen companies.
RSPs represent a significant governance innovation because they create a mechanism for safety-capability coupling without requiring external regulation. As of December 2025, 20 companies have published frontier AI safety policies, up from 12 at the May 2024 Seoul Summit. Third-party evaluators like METR have conducted pre-deployment assessments of 5+ major models. However, RSPs face fundamental challenges: they are 100% voluntary with no legal enforcement, labs set their own thresholds (leading to SaferAI grades of only 1.9-2.2 out of 5), competitive pressure among 3+ frontier labs creates incentives to interpret policies permissively, and capability doubling times of approximately 7 months may outpace evaluation science.
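The ~7-month doubling figure compounds quickly, which is why evaluation cadence matters. As a rough illustration (the function name and the example numbers below are ours, not METR's), exponential growth at that rate can be extrapolated as:

```python
from math import log2

def months_to_reach(current_horizon_min: float,
                    target_horizon_min: float,
                    doubling_months: float = 7.0) -> float:
    """Months until AI task horizons reach a target length, assuming
    exponential growth at the estimated ~7-month doubling time."""
    return doubling_months * log2(target_horizon_min / current_horizon_min)

# Illustrative only: moving from a 1-hour to an 8-hour task horizon
# is three doublings, i.e. about 21 months at this rate.
print(months_to_reach(60, 480))  # 21.0
```

Under this (simplified, constant-rate) model, any evaluation regime that revisits thresholds less often than every few months risks lagging a full capability generation behind.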
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Adoption Rate | High | 20 companies with published policies as of Dec 2025; 16 original Seoul signatories |
| Third-Party Verification | Growing | METR evaluated GPT-4.5, Claude 3.5, o3/o4-mini; UK/US AISIs conducting evaluations |
| Threshold Specificity | Medium-Low | SaferAI grade: dropped from 2.2 to 1.9 after Oct 2024 RSP update |
| Compliance Track Record | Mixed | Anthropic self-reported evaluations 3 days late; no major policy violations yet documented |
| Enforcement Mechanism | None | 100% voluntary; no legal penalties for non-compliance |
| Competitive Pressure Risk | High | Racing dynamics incentivize permissive interpretation; 3+ major labs competing |
| Evaluation Coverage | Partial | 12 of 20 companies with published policies have external eval arrangements |
Risk Assessment & Impact
| Dimension | Rating | Assessment |
|---|---|---|
| Safety Uplift | Medium | Creates tripwires; effectiveness depends on follow-through |
| Capability Uplift | Neutral | Not capability-focused |
| Net World Safety | Helpful | Better than nothing; implementation uncertain |
| Lab Incentive | Moderate | PR value; may become required; some genuine commitment |
| Scalability | Unknown | Depends on whether commitments are honored |
| Deception Robustness | Partial | External policy; but evals could be fooled |
| SI Readiness | Unlikely | Pre-SI intervention; can't constrain SI itself |
Research Investment
| Dimension | Estimate | Source |
|---|---|---|
| Lab Policy Team Size | 5-20 FTEs per major lab | Industry estimates |
| External Policy Orgs | $5-15M/yr combined | METR, Apollo, policy institutes |
| Government Evaluation | $20-50M/yr | UK AISI (≈$100M budget), US AISI |
| Total Ecosystem | $50-100M/yr | Cross-sector estimate |
Recommendation: Increase 3-5x (needs enforcement mechanisms and external verification capacity)
Differential Progress: Safety-dominant (pure governance; no capability benefit)
Comparison of Major Scaling Policies
The three leading frontier AI labs have published distinct but conceptually similar frameworks. All share the core structure of capability thresholds triggering escalating safeguards, but differ in specificity, governance, and scope.
RSPs create a framework linking capability levels to safety requirements. The core mechanism involves three interconnected processes: capability evaluation, safeguard assessment, and escalation decisions.
RSP Ecosystem
The effectiveness of RSPs depends on a network of actors providing oversight, verification, and accountability:
Key Components
| Component | Description | Purpose |
|---|---|---|
| Capability Thresholds | Defined capability levels that trigger requirements | Create clear tripwires |
| Safety Levels | Required safeguards for each capability tier | Ensure safety scales with capability |
| Evaluations | Tests to determine capability and safety level | Provide evidence for decisions |
| Pause Commitments | Agreement to halt if safety is insufficient | Core accountability mechanism |
| Public Commitment | Published policy creates external accountability | Enable monitoring |
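These components compose into a simple decision rule: evaluate capability, check whether safeguards have kept pace, and escalate or pause accordingly. A minimal sketch of that tripwire logic (hypothetical names and levels; not any lab's actual policy code):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    capability_level: int  # highest capability threshold the model crossed
    safeguard_level: int   # highest safeguard standard currently in place

def scaling_decision(result: EvalResult) -> str:
    """Hypothetical RSP tripwire: scaling and deployment proceed only
    if implemented safeguards match or exceed evaluated capability."""
    if result.safeguard_level >= result.capability_level:
        return "deploy"
    return "pause"  # halt further scaling until safety catches up

# A model at capability level 3 with only level-2 safeguards must pause.
print(scaling_decision(EvalResult(capability_level=3, safeguard_level=2)))
```

The real policies hinge on how these two levels are measured, which is exactly where the evaluation-quality concerns discussed below enter.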
Anthropic's AI Safety Levels (ASL)
Anthropic's ASL system is modeled after Biosafety Levels (BSL-1 through BSL-4) used for handling dangerous pathogens. Each level specifies both capability thresholds and required safeguards.
| Level | Capability Definition | Deployment Safeguards | Security Standard |
|---|---|---|---|
| ASL-1 | No meaningful catastrophic risk | Standard terms of service | Basic security hygiene |
| ASL-2 | Meaningful uplift but not beyond publicly available info | Content filtering, usage policies | Current security measures |
| ASL-3 | Significantly enhances non-state actor capabilities beyond public sources; could substantially accelerate CBRN development or enable autonomous harm | Nation-state level protections (details TBD) | Air-gapped systems, extensive vetting |
Current Status (January 2026): Most Claude models operate under ASL-2; Anthropic activated ASL-3 safeguards in May 2025 following evaluations of Claude Opus 4.
RSP v2.0 Changes: The October 2024 update separated "ASL" to refer to safeguard standards rather than model categories, introducing distinct "Capability Thresholds" and "Required Safeguards." Critics argue this reduced specificity compared to v1.
OpenAI's Preparedness Framework
OpenAI's Preparedness Framework underwent a major revision in April 2025 (v2.0), simplifying from four risk levels to two actionable thresholds.
| Risk Domain | High Threshold | Critical Threshold |
|---|---|---|
| Bio/Chemical | Meaningful assistance to novices creating known threats | — |

Key v2.0 changes:

- Simplified from Low/Medium/High/Critical to just High and Critical
- Removed "Persuasion" as a tracked category (now handled through standard safety processes)
- Added an explicit threshold for recursive self-improvement: achieving generational improvement (e.g., o1 to o3) in 1/5th the 2024 development time
- Safety Advisory Group (SAG) now oversees all threshold determinations
Recent Evaluations: OpenAI's January 2026 o3/o4-mini system card reported that neither model reached the High threshold in any tracked category, though biological and cyber capabilities continue trending upward.
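The recursive self-improvement trigger above is one of the few quantitative thresholds in any published framework. A sketch of how such a check might look (the function and all numbers are illustrative assumptions; the framework does not publish an implementation):

```python
def crosses_rsi_threshold(observed_dev_months: float,
                          baseline_2024_months: float,
                          speedup_factor: float = 5.0) -> bool:
    """Hypothetical check for a recursive self-improvement trigger:
    a generational capability jump achieved in 1/5th (or less) of the
    2024 baseline development time crosses the threshold."""
    return observed_dev_months <= baseline_2024_months / speedup_factor

# Illustrative numbers only: a 2-month generational jump against a
# 12-month 2024 baseline would cross the threshold; 6 months would not.
print(crosses_rsi_threshold(2.0, 12.0), crosses_rsi_threshold(6.0, 12.0))
```

Even a crisp rule like this depends on contestable inputs (what counts as a "generation", how the 2024 baseline is measured), so threshold determinations still route through the SAG.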
Timeline

| Date | Event | Description |
|---|---|---|
| Oct 2024 | Framework update | Anthropic RSP v2.0 (criticized for reduced specificity) |
| Apr 2025 | Framework update | OpenAI Preparedness v2.0 (simplified to High/Critical) |
| May 2025 | First ASL-3 | Anthropic activates elevated safeguards for Claude Opus 4 |
| Oct 2025 | Policy count | 20 companies with published policies |
| Dec 2025 | Third-party coverage | 12 companies with METR arrangements |
Seoul Summit Commitments (May 2024)
The Seoul AI Safety Summit achieved a historic first: 16 frontier AI companies from the US, Europe, Middle East, and Asia signed voluntary frontier AI safety commitments. Signatories included Amazon, Anthropic, Cohere, G42, Google, IBM, Inflection AI, Meta, Microsoft, Mistral AI, Naver, OpenAI, Samsung, Technology Innovation Institute, xAI, and Zhipu.ai.
| Commitment | Description | Compliance Verification |
|---|---|---|
| Safety Framework Publication | Publish framework by France Summit 2025 | Public disclosure |
| Pre-deployment Evaluations | Test models for severe risks before deployment | Self-reported system cards |
| Dangerous Capability Reporting | Report discoveries to governments and other labs | Voluntary disclosure |
| Non-deployment Commitment | Do not deploy if risks cannot be mitigated | Self-assessed |
| Red-teaming | Internal and external adversarial testing | Third-party verification emerging |
| Cybersecurity | Protect model weights from theft | Industry standards |
Follow-up: An additional 4 companies have joined since May 2024. The France AI Action Summit (February 2025) reviewed compliance and expanded commitments.
Third-Party Evaluation Ecosystem
METR (Model Evaluation and Threat Research) has emerged as the leading independent evaluator, having conducted pre-deployment assessments for both Anthropic and OpenAI. Founded by Beth Barnes (a former OpenAI alignment researcher) in December 2023, METR does not accept compensation for evaluations in order to maintain independence.
METR's Role: METR's GPT-4.5 pre-deployment evaluation piloted a new form of third-party oversight: verifying developers' internal evaluation results rather than conducting fully independent assessments. This approach may scale better while maintaining accountability.
Coverage Gap: As of late 2025, METR's analysis found that while 12 companies have published frontier safety policies, third-party evaluation coverage remains inconsistent, with most evaluations occurring only for the largest US labs.
Limitations and Challenges
Structural Issues
| Issue | Description | Severity |
|---|---|---|
| Voluntary | No legal enforcement mechanism | High |
| Self-defined thresholds | Labs set their own standards | High |
| Competitive pressure | Incentive to interpret permissively | High |
| Evaluation limitations | Evals may miss important risks | High |
| Public commitment only | Limited verification of compliance | Medium |
| Evolving policies | Policies can be changed by labs | Medium |
The Evaluation Problem
RSPs are only as good as the evaluations that trigger them:
| Challenge | Explanation |
|---|---|
| Unknown risks | Can't test for capabilities we haven't imagined |
| Sandbagging | Models might hide capabilities during evaluation |
| Elicitation difficulty | True capabilities may not be revealed |
| Threshold calibration | Hard to know where thresholds should be |
| Deceptive alignment | Sophisticated models may game evaluations |
Competitive Dynamics
| Scenario | Lab Behavior | Safety Outcome |
|---|---|---|
| Mutual commitment | All labs follow RSPs | Good |
| One defector | Others follow, one cuts corners | Bad (defector advantages) |
| Many defectors | Race to bottom | Very Bad |
| External pressure | Regulation enforces standards | Potentially Good |
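The defection problem in this table has the structure of a prisoner's dilemma: whatever the other lab does, interpreting one's own policy permissively looks individually attractive. A stylized sketch with made-up payoff numbers (ours, purely illustrative):

```python
# Two-lab game: each chooses "comply" with its RSP or "defect"
# (permissive interpretation). Payoffs are illustrative, not empirical.
PAYOFFS = {  # (lab_a, lab_b) -> (lab_a payoff, lab_b payoff)
    ("comply", "comply"): (3, 3),  # mutual commitment: good safety outcome
    ("comply", "defect"): (1, 4),  # defector gains a capability lead
    ("defect", "comply"): (4, 1),
    ("defect", "defect"): (2, 2),  # race to the bottom
}

def best_response(opponent_action: str) -> str:
    """With these payoffs, defection dominates -- the structural reason
    voluntary RSPs are fragile absent external enforcement."""
    comply = PAYOFFS[("comply", opponent_action)][0]
    defect = PAYOFFS[("defect", opponent_action)][0]
    return "defect" if defect > comply else "comply"

print(best_response("comply"), best_response("defect"))
```

This is why the "External pressure" row matters: regulation can change the payoffs so that compliance becomes the dominant strategy.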
Key Cruxes
Summary of Disagreements
| Crux | Optimistic View | Pessimistic View | Key Evidence |
|---|---|---|---|
| Lab Commitment | Reputational stake, genuine safety motivation | No enforcement, commercial pressure dominates | 0 documented major violations; 3 procedural issues self-reported; 12 companies published policies with significant variation in specificity |
| Anthropic v2.0 update | Public tracking; external pressure for strengthening | Reduced specificity from quantitative to qualitative thresholds | SaferAI: RSP Update Critique |
Evaluation Methodologies
RSP effectiveness depends on the quality of evaluations that trigger safeguard requirements. Current approaches include:
Capability Evaluation Approaches
| Evaluation Type | Description | Strengths | Weaknesses |
|---|---|---|---|
| Benchmark suites | Standardized tests (MMLU, HumanEval, etc.) | Reproducible, comparable | May not capture dangerous capabilities |
| Red-teaming | Adversarial testing by experts | Finds real-world attack vectors | Expensive, not comprehensive |
| Uplift studies | Compare AI-assisted vs. unassisted task completion | Directly measures counterfactual risk | Hard to simulate real adversaries |
| Autonomous agent tasks | Long-horizon task completion (METR) | — | — |
Related Pages

Organizations: Alignment Research Center
Risks: Bioweapons Risk, Emergent Capabilities
Approaches: Dangerous Capability Evaluations, Corporate AI Safety Responses, AI Safety Cases
Analysis: Long-Term Benefit Trust (Anthropic), Short AI Timeline Policy Implications, Anthropic IPO
Key Debates: Corporate Influence on AI Policy, AI Governance and Policy, Why Alignment Might Be Hard
Other: Dario Amodei, Elon Musk, Beth Barnes
Concepts: Alignment Policy Overview
Policy: International AI Safety Summit Series
Safety Research: Anthropic Core Views, AI Evaluations