Responsible Scaling Policies


Responsible Scaling Policies (RSPs) are self-imposed commitments by AI labs to tie AI development to safety progress. The core idea is simple: before scaling to more capable systems, labs commit to demonstrating that their safety measures are adequate for the risks those systems would pose. If evaluations reveal dangerous capabilities without adequate safeguards, development should pause until safety catches up.

Anthropic introduced the first RSP in September 2023, establishing “AI Safety Levels” (ASL-1 through ASL-4+) analogous to biosafety levels. OpenAI followed with its Preparedness Framework in December 2023, and Google DeepMind published its Frontier Safety Framework in May 2024. The May 2024 Seoul Summit secured voluntary commitments from sixteen companies, and by late 2024 twelve major AI companies had published some form of frontier AI safety policy.

RSPs represent a significant governance innovation because they create a mechanism for safety-capability coupling without requiring external regulation. As of December 2025, 20 companies have published frontier AI safety policies, up from 12 at the May 2024 Seoul Summit. Third-party evaluators like METR have conducted pre-deployment assessments of 5+ major models. However, RSPs face fundamental challenges: they are 100% voluntary with no legal enforcement, labs set their own thresholds (leading to SaferAI grades of only 1.9-2.2 out of 5), competitive pressure among 3+ frontier labs creates incentives to interpret policies permissively, and capability doubling times of approximately 7 months may outpace evaluation science.
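
To make the pacing concern concrete, here is a back-of-the-envelope sketch (illustrative only: it uses the ≈7-month doubling figure cited on this page and assumes steady exponential growth, which is our simplifying assumption, not a claim from any lab's policy):

```python
# Back-of-the-envelope only: how quickly capabilities compound under the ~7-month
# doubling time cited on this page, assuming steady exponential growth.
DOUBLING_TIME_MONTHS = 7

def growth_factor(months: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Multiplicative capability growth over `months` at a fixed doubling time."""
    return 2 ** (months / doubling_time)

for horizon in (7, 12, 24, 36):
    print(f"{horizon:>2} months -> ~{growth_factor(horizon):.1f}x")
# ~2.0x at 7 months, ~3.3x at 12, ~10.8x at 24, ~35.3x at 36
```

On these assumptions, an evaluation suite or safeguard standard that takes a year or two to mature ends up being applied to systems several times more capable than those it was designed around, which is the substance of the concern that capabilities may outpace evaluation science.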

| Dimension | Assessment | Evidence |
|---|---|---|
| Adoption Rate | High | 20 companies with published policies as of Dec 2025; 16 original Seoul signatories |
| Third-Party Verification | Growing | METR evaluated GPT-4.5, Claude 3.5, o3/o4-mini; UK/US AISIs conducting evaluations |
| Threshold Specificity | Medium-Low | SaferAI grade dropped from 2.2 to 1.9 after Oct 2024 RSP update |
| Compliance Track Record | Mixed | Anthropic self-reported evaluations 3 days late; no major policy violations yet documented |
| Enforcement Mechanism | None | 100% voluntary; no legal penalties for non-compliance |
| Competitive Pressure Risk | High | Racing dynamics incentivize permissive interpretation; 3+ major labs competing |
| Evaluation Coverage | Partial | 12 of 20 companies with published policies have external eval arrangements |

| Dimension | Rating | Assessment |
|---|---|---|
| Safety Uplift | Medium | Creates tripwires; effectiveness depends on follow-through |
| Capability Uplift | Neutral | Not capability-focused |
| Net World Safety | Helpful | Better than nothing; implementation uncertain |
| Lab Incentive | Moderate | PR value; may become required; some genuine commitment |
| Scalability | Unknown | Depends on whether commitments are honored |
| Deception Robustness | Partial | External policy; but evals could be fooled |
| SI Readiness | Unlikely | Pre-SI intervention; can’t constrain SI itself |

| Dimension | Estimate | Source |
|---|---|---|
| Lab Policy Team Size | 5-20 FTEs per major lab | Industry estimates |
| External Policy Orgs | $5-15M/yr combined | METR, Apollo, policy institutes |
| Government Evaluation | $20-50M/yr | UK AISI (≈$100M budget), US AISI |
| Total Ecosystem | $50-100M/yr | Cross-sector estimate |

  • Recommendation: Increase 3-5x (needs enforcement mechanisms and external verification capacity)
  • Differential Progress: Safety-dominant (pure governance; no capability benefit)

The three leading frontier AI labs have published distinct but conceptually similar frameworks. All share the core structure of capability thresholds triggering escalating safeguards, but differ in specificity, governance, and scope.

| Aspect | Anthropic RSP | OpenAI Preparedness | DeepMind FSF |
|---|---|---|---|
| First Published | September 2023 | December 2023 | May 2024 |
| Current Version | v2.2 (May 2025) | v2.0 (April 2025) | v3.0 (October 2025) |
| Level Structure | ASL-1 through ASL-4+ | High / Critical | CCL-1 through CCL-4+ |
| Risk Domains | CBRN, AI R&D, Autonomy | Bio/Chem, Cyber, Self-improvement | Autonomy, Bio, Cyber, ML R&D, Manipulation |
| Governance Body | Responsible Scaling Officer | Safety Advisory Group (SAG) | Frontier Safety Team |
| Third-Party Evals | METR, UK AISI | METR, UK AISI | Internal primarily |
| Pause Commitment | Explicit if safeguards insufficient | Implicit (must have safeguards) | Explicit for CCL thresholds |
| Board Override | Board can override RSO | SAG advises; leadership decides | Not specified |

| Lab | CBRN Threshold | Cyber Threshold | Autonomy/AI R&D Threshold |
|---|---|---|---|
| Anthropic ASL-3 | “Significantly enhances capabilities of non-state actors” beyond publicly available info | Autonomous cyberattacks on hardened targets | “Substantially accelerates” AI R&D timeline |
| OpenAI High | “Meaningful counterfactual assistance to novice actors” creating known threats | “New risks of scaled cyberattacks” | Self-improvement creating “new challenges for human control” |
| OpenAI Critical | “Unprecedented new pathways to severe harm” | Novel attack vectors at scale | Recursive self-improvement; 5x speed improvement |
| DeepMind CCL | “Heightened risk of severe harm” from bio capabilities | “Sophisticated cyber capabilities” | “Exceptional agency” and ML research capabilities |

Sources: Anthropic RSP, OpenAI Preparedness Framework v2, DeepMind FSF v3

RSPs create a framework linking capability levels to safety requirements. The core mechanism involves three interconnected processes: capability evaluation, safeguard assessment, and escalation decisions.

The effectiveness of RSPs depends on a network of actors providing oversight, verification, and accountability.

| Component | Description | Purpose |
|---|---|---|
| Capability Thresholds | Defined capability levels that trigger requirements | Create clear tripwires |
| Safety Levels | Required safeguards for each capability tier | Ensure safety scales with capability |
| Evaluations | Tests to determine capability and safety level | Provide evidence for decisions |
| Pause Commitments | Agreement to halt if safety is insufficient | Core accountability mechanism |
| Public Commitment | Published policy creates external accountability | Enable monitoring |
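
A minimal sketch of how these components compose into an escalation decision (the threshold names and numeric safeguard tiers are hypothetical, loosely inspired by the ASL structure described below, not any lab's actual implementation):

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed: no threshold crossed"
    RESTRICT = "deploy only with required safeguards in place"
    PAUSE = "pause: safeguards below required standard"

@dataclass
class CapabilityThreshold:
    name: str                # hypothetical label, e.g. a CBRN-uplift or autonomy threshold
    required_safeguard: int  # safeguard tier that must be met once this threshold is crossed

def escalation_decision(crossed: list[CapabilityThreshold],
                        current_safeguard_tier: int) -> Decision:
    """Core RSP logic: compare evaluated capabilities against implemented safeguards."""
    if not crossed:
        return Decision.PROCEED
    required = max(t.required_safeguard for t in crossed)
    if current_safeguard_tier >= required:
        return Decision.RESTRICT
    return Decision.PAUSE  # the pause commitment: scaling waits until safety catches up

# Example: evaluations show a hypothetical "bio-uplift" threshold (requiring tier 3)
# has been crossed, but only tier-2 safeguards are in place -> the policy calls for a pause.
print(escalation_decision([CapabilityThreshold("bio-uplift", 3)], current_safeguard_tier=2).value)
```

The sketch makes the coupling explicit: evaluations produce the list of crossed thresholds, safeguard assessment produces the current tier, and the comparison between the two drives the proceed/restrict/pause decision.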

Anthropic’s ASL system is modeled after Biosafety Levels (BSL-1 through BSL-4) used for handling dangerous pathogens. Each level specifies both capability thresholds and required safeguards.

| Level | Capability Definition | Deployment Safeguards | Security Standard |
|---|---|---|---|
| ASL-1 | No meaningful catastrophic risk | Standard terms of service | Basic security hygiene |
| ASL-2 | Meaningful uplift but not beyond publicly available info | Content filtering, usage policies | Current security measures |
| ASL-3 | Significantly enhances non-state actor capabilities beyond public sources | Enhanced refusals, red-teaming, monitoring | Hardened infrastructure, insider threat protections |
| ASL-4 | Could substantially accelerate CBRN development or enable autonomous harm | Nation-state level protections (details TBD) | Air-gapped systems, extensive vetting |

Current Status (January 2026): Anthropic activated ASL-3 safeguards in May 2025 following evaluations of Claude Opus 4. Claude Opus 4, Opus 4.1, and Sonnet 4.5 operate under the ASL-3 standard, while earlier Claude models remain at ASL-2.

RSP v2.0 Changes: The October 2024 update (v2.0) redefined “ASL” to refer to safeguard standards rather than model categories, introducing distinct “Capability Thresholds” and “Required Safeguards.” Critics argue this reduced specificity compared to v1.

OpenAI’s Preparedness Framework underwent a major revision in April 2025 (v2.0), simplifying from four risk levels to two actionable thresholds.

| Risk Domain | High Threshold | Critical Threshold |
|---|---|---|
| Bio/Chemical | Meaningful assistance to novices creating known threats | Unprecedented pathways to severe harm |
| Cybersecurity | New risks of scaled attacks and exploitation | Novel attack vectors threatening critical infrastructure |
| AI Self-improvement | Challenges for human control | Recursive improvement; 5x development speed |

Framework v2.0 Key Changes:

  • Simplified from Low/Medium/High/Critical to just High and Critical
  • Removed “Persuasion” as tracked category (now handled through standard safety)
  • Added explicit threshold for recursive self-improvement: achieving generational improvement (e.g., o1 to o3) in 1/5th the 2024 development time
  • Safety Advisory Group (SAG) now oversees all threshold determinations

Recent Evaluations: OpenAI’s January 2026 o3/o4-mini system card reported that neither model reached the High threshold in any tracked category, though biological and cyber capabilities continue trending upward.

| Lab | Policy Name | Initial | Latest Version | Key Features |
|---|---|---|---|---|
| Anthropic | Responsible Scaling Policy | Sep 2023 | v2.2 (May 2025) | ASL levels, deployment/security standards, external evals |
| OpenAI | Preparedness Framework | Dec 2023 | v2.0 (Apr 2025) | High/Critical thresholds, SAG governance, tracked categories |
| Google DeepMind | Frontier Safety Framework | May 2024 | v3.0 (Oct 2025) | CCL levels, manipulation risk domain added |
| xAI | Safety Framework | 2024 | v1.0 | Evaluation and deployment procedures |
| Meta | Frontier Model Safety | 2024 | v1.0 | Purple-team evaluations, staged deployment |

| Date | Milestone | Companies/Details |
|---|---|---|
| Sep 2023 | First RSP published | Anthropic RSP v1.0 |
| Dec 2023 | Second framework | OpenAI Preparedness Framework |
| May 2024 | Seoul Summit | 16 companies sign commitments |
| May 2024 | Third framework | Google DeepMind FSF |
| Oct 2024 | Major revision | Anthropic RSP v2.0 (criticized for reduced specificity) |
| Apr 2025 | Framework update | OpenAI Preparedness v2.0 (simplified to High/Critical) |
| May 2025 | First ASL-3 | Anthropic activates elevated safeguards for Claude Opus 4 |
| Oct 2025 | Policy count | 20 companies with published policies |
| Dec 2025 | Third-party coverage | 12 companies with external evaluation arrangements |

The Seoul AI Safety Summit achieved a historic first: 16 frontier AI companies from the US, Europe, Middle East, and Asia signed voluntary frontier AI safety commitments. Signatories included Amazon, Anthropic, Cohere, G42, Google, IBM, Inflection AI, Meta, Microsoft, Mistral AI, Naver, OpenAI, Samsung, Technology Innovation Institute, xAI, and Zhipu.ai.

| Commitment | Description | Compliance Verification |
|---|---|---|
| Safety Framework Publication | Publish framework by France Summit 2025 | Public disclosure |
| Pre-deployment Evaluations | Test models for severe risks before deployment | Self-reported system cards |
| Dangerous Capability Reporting | Report discoveries to governments and other labs | Voluntary disclosure |
| Non-deployment Commitment | Do not deploy if risks cannot be mitigated | Self-assessed |
| Red-teaming | Internal and external adversarial testing | Third-party verification emerging |
| Cybersecurity | Protect model weights from theft | Industry standards |

Follow-up: An additional 4 companies have joined since May 2024. The France AI Action Summit (February 2025) reviewed compliance and expanded commitments.

METR (Model Evaluation and Threat Research) has emerged as the leading independent evaluator, having conducted pre-deployment assessments for both Anthropic and OpenAI. Founded by Beth Barnes (former OpenAI alignment researcher) in December 2023, METR does not accept compensation for evaluations to maintain independence.

| Organization | Role | Labs Evaluated | Key Focus Areas |
|---|---|---|---|
| METR | Third-party capability evals | Anthropic, OpenAI | Dangerous capability evaluations, autonomous agent tasks |
| Apollo Research | Alignment and scheming evals | Anthropic, Google | In-context scheming, deceptive alignment detection |
| UK AI Safety Institute | Government evaluation body | Multiple labs | Independent testing, joint evaluation protocols |
| US AI Safety Institute (NIST) | US government coordination | Multiple labs | AISIC consortium, standards development |

METR’s Role: METR’s GPT-4.5 pre-deployment evaluation piloted a new form of third-party oversight: verifying developers’ internal evaluation results rather than conducting fully independent assessments. This approach may scale better while maintaining accountability.

Coverage Gap: METR’s analysis finds third-party evaluation coverage remains inconsistent: as of late 2025, only 12 of the 20 companies with published frontier safety policies have external evaluation arrangements, and most pre-deployment evaluations have occurred only for the largest US labs.

| Issue | Description | Severity |
|---|---|---|
| Voluntary | No legal enforcement mechanism | High |
| Self-defined thresholds | Labs set their own standards | High |
| Competitive pressure | Incentive to interpret permissively | High |
| Evaluation limitations | Evals may miss important risks | High |
| Public commitment only | Limited verification of compliance | Medium |
| Evolving policies | Policies can be changed by labs | Medium |

RSPs are only as good as the evaluations that trigger them:

| Challenge | Explanation |
|---|---|
| Unknown risks | Can’t test for capabilities we haven’t imagined |
| Sandbagging | Models might hide capabilities during evaluation |
| Elicitation difficulty | True capabilities may not be revealed |
| Threshold calibration | Hard to know where thresholds should be |
| Deceptive alignment | Sophisticated models may game evaluations |

| Scenario | Lab Behavior | Safety Outcome |
|---|---|---|
| Mutual commitment | All labs follow RSPs | Good |
| One defector | Others follow, one cuts corners | Bad (defector advantages) |
| Many defectors | Race to bottom | Very Bad |
| External pressure | Regulation enforces standards | Potentially Good |

| Crux | Optimistic View | Pessimistic View | Key Evidence |
|---|---|---|---|
| Lab Commitment | Reputational stake, genuine safety motivation | No enforcement, commercial pressure dominates | 0 documented major violations; 3 procedural issues self-reported |
| Threshold Appropriateness | Expert judgment, iterative improvement | Conflict of interest, designed non-binding | SaferAI grades 1.9-2.2/5 for specificity |
| Evaluation Effectiveness | 5+ pre-deployment evals conducted; science improving | Can’t detect unknown unknowns; sandbagging possible | METR found o3 “prone to reward hacking” |
| Competitive Dynamics | Mutual commitment creates equilibrium | Race to bottom under pressure | 3+ frontier labs; ≈7-month capability doubling |
| Timeline | Governance can keep pace | Capabilities outrun safeguards | 20 policies published in 26 months |

Crux 1: Will Labs Honor Their Commitments?

| Position: Yes | Position: No |
|---|---|
| Reputational stake in commitment | Competitive pressure to continue |
| Some genuine safety motivation | No enforcement mechanism |
| Third-party verification helps | History of moving goalposts |
| Public accountability creates pressure | Commercial interests dominate |

Crux 2: Are RSP Thresholds Set Appropriately?

| Position: Appropriate | Position: Too Permissive |
|---|---|
| Based on expert judgment | Labs set their own standards |
| Updated as understanding improves | Conflict of interest |
| Better than no thresholds | May be designed to be non-binding |
| Include safety margins | Racing pressure to minimize |

Crux 3: Can Evaluations Trigger RSPs Effectively?

| Position: Yes | Position: No |
|---|---|
| Eval science is improving | Can’t detect what we don’t test for |
| Third-party evals add accountability | Deceptive models could sandbag |
| Explicit triggers create clarity | Thresholds may be wrong |
| Better than pure judgment calls | Gaming evaluations is incentivized |

| Metric | Value | Source | Trend |
|---|---|---|---|
| Companies with published policies | 20 (Dec 2025) | METR Common Elements | ↑ from 12 in May 2024 |
| Seoul Summit signatories | 16 (May 2024) | UK Gov | +4 since summit |
| Third-party pre-deployment evals | 5+ models (2024-25) | METR | GPT-4.5, Claude 3.5, o3, o4-mini |
| SaferAI Policy Grades | 1.9-2.2/5 | SaferAI | All major labs in “weak” category |
| Capability doubling time | ≈7 months | METR | Task length agents can complete |
| Lab-reported compliance issues | 3+ procedural | Anthropic RSP | Self-reported in 2024 review |
| Models at elevated safety levels | 3 (Claude Opus 4, 4.1, Sonnet 4.5) | Anthropic | ASL-3 activated May 2025 |

| Strength | Explanation |
|---|---|
| Explicit commitments | Creates accountability through specificity |
| Public pressure | Visible commitments enable monitoring |
| Third-party verification | External evaluation adds credibility |
| Adaptive framework | Can update as understanding improves |
| Industry coordination | Creates shared standards |

| Weakness | Explanation |
|---|---|
| Voluntary nature | No legal consequences for violations |
| Self-defined thresholds | Conflict of interest in setting standards |
| Competitive pressure | Racing incentives undermine commitment |
| Evaluation limitations | Evals may not catch real dangers |
| Policy evolution | Labs can change policies over time |

| Improvement | Mechanism | Feasibility |
|---|---|---|
| Third-party verification | Independent audit of compliance | High |
| Standardized thresholds | Industry-wide capability definitions | Medium |
| Mandatory reporting | Legal requirements for disclosure | Medium |
| Binding commitments | Legal liability for violations | Low-Medium |
| International coordination | Cross-border standards | Low |

| Improvement | Description |
|---|---|
| Regulatory backstop | Government enforcement if voluntary fails |
| Standardized evals | Shared evaluation suites across labs |
| International treaty | Binding international commitments |
| Continuous verification | Ongoing monitoring rather than point-in-time |

Good fit if you believe:

  • Industry self-governance can work with proper incentives
  • Creating accountability structures is valuable
  • Incremental governance improvements help
  • RSPs can evolve into stronger mechanisms

Less relevant if you believe:

  • Voluntary commitments are inherently unreliable
  • Labs will never meaningfully constrain themselves
  • Focus should be on mandatory regulation
  • Evaluations can’t capture real risks

| Document | Organization | Latest Version | URL |
|---|---|---|---|
| Responsible Scaling Policy | Anthropic | v2.2 (May 2025) | anthropic.com/responsible-scaling-policy |
| RSP Announcement & Updates | Anthropic | Ongoing | anthropic.com/news/rsp-updates |
| Preparedness Framework | OpenAI | v2.0 (Apr 2025) | cdn.openai.com/preparedness-framework-v2.pdf |
| Frontier Safety Framework | Google DeepMind | v3.0 (Oct 2025) | deepmind.google/frontier-safety-framework |
| Seoul Summit Commitments | UK Government | May 2024 | gov.uk/frontier-ai-safety-commitments |

| Source | Focus | Key Finding |
|---|---|---|
| METR: Common Elements Analysis | Cross-lab comparison | 12 companies published policies; significant variation in specificity |
| SaferAI: RSP Update Critique | Anthropic v2.0 | Reduced specificity from quantitative to qualitative thresholds |
| FAS: Can Preparedness Frameworks Pull Their Weight? | Framework effectiveness | Questions whether voluntary commitments can constrain behavior |
| METR: RSP Analysis (2023) | Original RSP assessment | Early evaluation of the RSP concept and implementation |

| Critique | Explanation | Counterargument |
|---|---|---|
| Voluntary and unenforceable | No legal mechanism to ensure compliance | Reputational costs and potential regulatory backstop |
| Labs set their own thresholds | Inherent conflict of interest | Third-party input and public accountability |
| Competitive pressure | Racing dynamics undermine commitment | Mutual commitment creates coordination equilibrium |
| Evaluation limitations | Can’t test for unknown capabilities | Improving eval science; multiple redundant assessments |
| Policy evolution | Labs can weaken policies over time | Public tracking; external pressure for strengthening |

RSP effectiveness depends on the quality of evaluations that trigger safeguard requirements. Current approaches include:

| Evaluation Type | Description | Strengths | Weaknesses |
|---|---|---|---|
| Benchmark suites | Standardized tests (MMLU, HumanEval, etc.) | Reproducible, comparable | May not capture dangerous capabilities |
| Red-teaming | Adversarial testing by experts | Finds real-world attack vectors | Expensive, not comprehensive |
| Uplift studies | Compare AI-assisted vs. unassisted task completion | Directly measures counterfactual risk | Hard to simulate real adversaries |
| Autonomous agent tasks | Long-horizon task completion | Tests agentic capabilities | Scaffolding matters; hard to standardize |
| Expert knowledge tests | Domain-specific Q&A (e.g., virology) | Measures depth in dangerous domains | Experts may not know all dangerous knowledge |

| Metric | Current Benchmark | ASL-3 Trigger (Anthropic) | High Trigger (OpenAI) |
|---|---|---|---|
| Bio knowledge | Expert-level Q&A | Exceeds 95th percentile virologist | Meaningful uplift for novices |
| Cyber capability | CTF performance | Autonomous exploitation of hardened targets | Scaled attack assistance |
| AI R&D automation | RE-Bench performance | Substantially accelerates timeline | 5x speedup threshold |
| Autonomous task length | 1-hour tasks | Multi-day autonomous operation | Extended resource acquisition |
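
To make the “uplift studies” approach listed above concrete, here is a minimal sketch of how a counterfactual-uplift score could be computed and compared against a trigger (the participant numbers and the 3x trigger are invented for illustration; real studies involve far more careful task design and statistics):

```python
# Sketch of an uplift-study score: how much more often do study participants succeed
# at a dangerous task with model assistance than with public resources alone?
# All numbers and the 3x trigger below are invented for illustration.
def uplift_ratio(assisted_successes: int, assisted_trials: int,
                 unassisted_successes: int, unassisted_trials: int) -> float:
    """Relative risk: assisted success rate divided by unassisted success rate."""
    assisted_rate = assisted_successes / assisted_trials
    unassisted_rate = max(unassisted_successes / unassisted_trials, 1e-9)  # avoid divide-by-zero
    return assisted_rate / unassisted_rate

UPLIFT_TRIGGER = 3.0  # hypothetical threshold: >3x uplift over the public-baseline group

ratio = uplift_ratio(assisted_successes=20, assisted_trials=40,
                     unassisted_successes=5, unassisted_trials=40)
print(f"uplift ≈ {ratio:.1f}x ->",
      "threshold crossed" if ratio > UPLIFT_TRIGGER else "below threshold")
# (20/40) / (5/40) = 4.0x -> threshold crossed
```

The appeal of this style of measurement is that it targets the counterfactual question RSP thresholds actually pose (how much does the model add beyond publicly available resources?), at the cost of being expensive and hard to run against realistic adversaries.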

Current evaluations face fundamental challenges that limit RSP effectiveness:

  1. Unknown unknowns: Cannot test for capabilities not yet imagined
  2. Sandbagging risk: Models may underperform intentionally during evaluation
  3. Elicitation gap: True capabilities may exceed measured capabilities
  4. Threshold calibration: Optimal threshold placement is uncertain
  5. Combinatorial risks: Safe capabilities may combine dangerously

RSPs affect the AI Transition Model through multiple pathways:

| Parameter | Impact | Mechanism |
|---|---|---|
| Safety Culture Strength | Positive | Creates explicit accountability mechanisms and public commitments |
| Racing Dynamics | Mixed | Could reduce racing if mutually honored; or create false confidence |
| Human Oversight Quality | Positive | Formalizes oversight requirements and third-party evaluation |
| International Coordination | Positive | Seoul commitments demonstrate cross-border coordination feasibility |

RSPs represent an important governance innovation that creates explicit links between capabilities and safety requirements. Their current contribution to safety is moderate but improving: the 2025 policy updates and Seoul commitments demonstrate industry convergence on the RSP concept, while third-party evaluation coverage expands. However, effectiveness depends critically on:

  1. Voluntary compliance in the absence of legal enforcement
  2. Evaluation quality and ability to detect dangerous capabilities
  3. Competitive dynamics and whether labs will honor commitments under pressure
  4. Governance structures within labs that can override commercial interests

RSPs should be understood as a foundation for stronger governance rather than a complete solution. Their greatest value may be in establishing precedents and norms that can later be codified into binding regulation.