Provably Safe AI (davidad agenda)

Provably Safe AI represents one of the most ambitious research agendas in AI safety: designing advanced AI systems where safety is guaranteed by mathematical construction rather than verified through empirical testing. Led primarily by David “davidad” Dalrymple---the youngest-ever MIT graduate degree recipient at age 16, former Filecoin co-inventor, and current Programme Director at ARIA---this agenda seeks to create AI systems whose safety properties can be formally proven before deployment, potentially enabling safe development even of superintelligent systems.

The foundational paper Towards Guaranteed Safe AI, published in May 2024 by Dalrymple, Joar Skalse, and co-authors including Yoshua Bengio, Stuart Russell, Max Tegmark, and Sanjit Seshia, introduces the core framework of “guaranteed safe (GS) AI.” The approach requires three components: a world model (mathematical description of how the AI affects the outside world), a safety specification (mathematical description of acceptable effects), and a verifier (auditable proof certificate that the AI satisfies the specification relative to the world model).
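
To make the framework concrete, the sketch below renders the three components as Python interfaces. It is a minimal illustration under our own naming assumptions (class and method names such as `WorldModel.predict` are hypothetical, not taken from the paper); it only shows how the pieces are meant to fit together.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class WorldModel(Protocol):
    """Mathematical description of how the AI's outputs affect the world."""

    def predict(self, state: Any, action: Any) -> Any:
        """Return the set of possible next world states after `action`."""
        ...


class SafetySpec(Protocol):
    """Mathematical description of which effects on the world are acceptable."""

    def acceptable(self, state: Any) -> bool:
        ...


@dataclass
class ProofCertificate:
    """Auditable evidence that a policy meets the spec relative to the model."""
    claim: str
    derivation: Any  # e.g. a machine-checkable proof term


class Verifier(Protocol):
    """Small, trusted checker for proof certificates."""

    def check(self, cert: ProofCertificate, model: WorldModel,
              spec: SafetySpec) -> bool:
        ...
```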

The core insight motivating this work is that current AI safety approaches fundamentally rely on empirical testing and behavioral observation, which may be insufficient for systems capable of strategic deception or operating in novel circumstances. If we could instead construct AI systems where safety properties are mathematically guaranteed, we could have much stronger assurance that these properties will hold regardless of the system’s capabilities or intentions. This parallels how we trust airplanes because their structural properties are verified through engineering analysis, not just flight testing. Crucially, this approach does not require interpretability to be solved and could still rule out deceptive alignment.

The agenda faces extraordinary technical challenges. It requires solving multiple hard problems including: constructing formal world models that accurately represent relevant aspects of reality, specifying human values in mathematically precise terms, building AI systems that can reason within these formal frameworks, and proving that the resulting systems satisfy desired safety properties. Many researchers consider one or more of these components to be fundamentally intractable. However, proponents argue that even partial progress could provide valuable safety assurances, and that the potential payoff justifies significant research investment.

| Dimension | Assessment | Evidence | Timeline |
|---|---|---|---|
| Safety Uplift | Critical (if works) | Would provide mathematical safety guarantees | Long-term (10+ years) |
| Capability Uplift | Tax | Constraints likely reduce capabilities | Expected |
| Net World Safety | Helpful | Best-case: solves alignment; current: research only | Long-term |
| Lab Incentive | Weak | Very long-term; no near-term commercial value | Current |
| Research Investment | $10-50M/yr | ARIA funding; some academic work | Current |
| Current Adoption | None | Research only; ARIA programme | Current |

The Guaranteed Safe AI framework operates through a verification pipeline that checks every action before execution:

[Diagram: the Guaranteed Safe AI verification pipeline, in which each proposed action must receive a checked safety proof before execution]

The key insight is that the proof checker can be much simpler than the proof generator. While generating proofs of safety may require sophisticated AI reasoning, verifying those proofs can be done by a small, formally verified checker---similar to how a calculator can verify a proof even if generating it requires human creativity.
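
A toy example makes the asymmetry concrete. In the sketch below (all names, dynamics, and numbers are illustrative assumptions, not drawn from the GS AI literature), proving safety means finding an inductive invariant, which in general requires an expensive search, while checking the resulting certificate reduces to three inequalities that a tiny, auditable program can verify.

```python
# Toy world model:       x_{t+1} = 0.5 * x_t + u_t, with |u_t| <= U_MAX
# Safety specification:  |x_t| <= SAFE_BOUND for every t
# Proof certificate:     an invariant bound B such that
#   (1) |x_0| <= B             (invariant holds initially)
#   (2) 0.5 * B + U_MAX <= B   (one step cannot leave the invariant set)
#   (3) B <= SAFE_BOUND        (invariant implies the safety spec)

U_MAX = 1.0
SAFE_BOUND = 5.0


def check_certificate(x0: float, invariant_bound: float) -> bool:
    """Trusted checker: three inequalities, easy to audit and formally verify."""
    holds_initially = abs(x0) <= invariant_bound
    closed_under_step = 0.5 * invariant_bound + U_MAX <= invariant_bound
    implies_safety = invariant_bound <= SAFE_BOUND
    return holds_initially and closed_under_step and implies_safety


# The untrusted proposer (in GS AI, a powerful model searching over candidate
# invariants) only has to hand over a single number for the checker to accept.
assert check_certificate(x0=0.0, invariant_bound=2.0)      # accepted
assert not check_certificate(x0=0.0, invariant_bound=1.0)  # not closed under a step
```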

| Component | Description | Challenge Level |
|---|---|---|
| Formal World Model | Mathematical representation of relevant reality | Extreme |
| Value Specification | Precise formal statement of human values/safety | Extreme |
| Bounded Agent | AI system operating within formal constraints | High |
| Proof Generation | AI produces proofs that actions are safe | Very High |
| Proof Verification | Independent checking of safety proofs | Medium |

| Requirement | Description | Current Status |
|---|---|---|
| World Model Expressiveness | Must capture relevant aspects of reality | Theoretical |
| World Model Accuracy | Model must correspond to actual world | Unknown how to guarantee |
| Value Completeness | Must specify all relevant values | Philosophical challenge |
| Proof Tractability | Proofs must be feasible to generate | Open question |
| Verification Independence | Checker must be simpler than prover | Possible in principle |

ARIA’s Safeguarded AI programme is the primary funder of provably safe AI research globally. The programme aims to develop quantitative safety guarantees for AI “in the way we have come to expect for nuclear power and passenger aviation.”

| Aspect | Detail |
|---|---|
| Total Funding | £59 million over 3-5 years |
| Lead | David “davidad” Dalrymple (Programme Director) |
| Focus | Guaranteed safe AI through formal methods |
| TA1 (Theory) | £18M for core research agenda (Phase 2) |
| TA3 (Applications) | £5.4M for cyber-physical domains |
| TA1.4 (Sociotechnical) | £3.4M for integration research |

| Theme | Goal | Approach |
|---|---|---|
| World Modelling | Formal models of physical and social reality | Category theory, physics-based models |
| Value Learning | Formally specifying human preferences | Utility theory, formal ethics |
| Safe Agents | AI that reasons within formal constraints | Theorem-proving AI |
| Verification | Proving safety of AI actions | Formal verification, proof assistants |

| Property | Requirement | Challenge |
|---|---|---|
| Soundness | Claims provable in the model must hold in the real world | Critical |
| Completeness | Model covers relevant phenomena | Extremely hard |
| Computability | Reasoning in the model is tractable | Trade-off with expressiveness |
| Updateability | Can refine the model with evidence | Must maintain guarantees |
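
In practice, soundness is usually bought at the price of completeness by making the world model a conservative over-approximation: it may permit behaviours the real system never exhibits, but it must never exclude behaviours that can actually occur. The interval-based toy below is our own illustration of that trade-off (the dynamics and the disturbance bound `D_MAX` are made up), not part of the ARIA programme.

```python
import random

D_MAX = 0.1  # assumed bound on an unmodelled disturbance


def step_true(x: float, u: float, d: float) -> float:
    """Reality: the actual dynamics, driven by a disturbance the model never sees."""
    return x + u + d


def step_model(lo: float, hi: float, u: float) -> tuple[float, float]:
    """World model: over-approximate the reachable states by an interval."""
    return lo + u - D_MAX, hi + u + D_MAX


# Soundness on a random trajectory: the true state never escapes the model's
# interval. The model is not complete: the interval keeps widening and
# includes many states the system will never actually reach.
x, (lo, hi) = 0.0, (0.0, 0.0)
for _ in range(100):
    u = random.uniform(-1.0, 1.0)
    d = random.uniform(-D_MAX, D_MAX)
    x = step_true(x, u, d)
    lo, hi = step_model(lo, hi, u)
    assert lo <= x <= hi
```
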
| Approach | Mechanism | Limitations |
|---|---|---|
| Utility Functions | Numerical representation of preferences | May miss important values |
| Constraint-Based | Hard constraints on behavior | Doesn't capture positive goals |
| Formal Ethics | Encode ethical frameworks | Philosophy is contested |
| Learned Preferences | Infer from human behavior | May not generalize |
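
The difference between the first two rows is easiest to see side by side. In the hypothetical sketch below (the `Outcome` fields and weights are invented for illustration), a utility function ranks every outcome and can silently trade safety against other goals, whereas a constraint-based specification only partitions outcomes into acceptable and unacceptable and says nothing about which acceptable outcome is better.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """A toy description of an outcome; the fields are illustrative only."""
    task_reward: float
    humans_harmed: int
    resources_acquired: float


def utility(o: Outcome) -> float:
    """Utility-function approach: everything is traded off numerically,
    so a large enough task_reward can outweigh harm."""
    return o.task_reward - 100.0 * o.humans_harmed - 0.1 * o.resources_acquired


def satisfies_constraints(o: Outcome) -> bool:
    """Constraint-based approach: hard limits with no trade-offs,
    but no notion of which acceptable outcome is preferable."""
    return o.humans_harmed == 0 and o.resources_acquired <= 10.0
```
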
| Aspect | Description | Status |
|---|---|---|
| Action Verification | Prove each action is safe before execution | Theoretical |
| Policy Verification | Prove entire policy satisfies properties | Theoretical |
| Runtime Verification | Check proofs at execution time | More tractable |
| Incremental Verification | Build on previous proofs | Efficiency technique |

Proponents point to the remarkable success of formal verification in other domains. CompCert, a formally verified C compiler, was tested for six CPU-years by the Csmith random testing tool without finding a single wrong-code error---the only compiler tested for which this was true. seL4, a formally verified operating system microkernel, has machine-checked mathematical proofs for Arm, RISC-V, and Intel architectures and is used in aerospace, autonomous aviation, and IoT platforms. These demonstrate that formal verification can scale to real systems.

| Argument | Support | Proponents |
|---|---|---|
| Formal methods have scaled before | CompCert, seL4 showed scaling possible | Formal methods community |
| AI can help with proofs | Automated theorem proving advancing rapidly | AI researchers |
| Partial solutions valuable | Even limited guarantees help | davidad |
| Alternative to behavioral safety | Fundamentally different approach | Safety researchers |

| Concern | Argument | Proponents |
|---|---|---|
| World model problem | Can't formally specify reality | Many researchers |
| Value specification problem | Human values resist formalization | Philosophers |
| Capability tax | Constraints make systems unusable | Capability researchers |
| Verification complexity | Proofs may be intractable | Complexity theorists |

| Uncertainty | Range of Views | Critical For |
|---|---|---|
| World model feasibility | Possible to impossible | Entire agenda |
| Value specification feasibility | Possible to impossible | Entire agenda |
| Capability preservation | Minimal to prohibitive tax | Practical deployment |
| Proof tractability | Feasible to intractable | Runtime safety |

| Approach | Similarity | Key Difference |
|---|---|---|
| Formal Verification | Both use proofs | Provably safe AI designs systems to be safe from scratch rather than verifying them after the fact |
| Constitutional AI | Both specify rules | Mathematical vs. natural-language rules |
| AI Control | Both aim for safety | Internal guarantees vs. external constraints |
| Interpretability | Both seek understanding | Formal specification vs. empirical analysis |

| Dimension | Assessment | Rationale |
|---|---|---|
| Technical Scalability | Unknown | Core question of the research agenda |
| Deception Robustness | Strong (by design) | Proofs rule out deception |
| Superintelligence Readiness | Yes (if works) | Designed for this; success uncertain |
| Capability Scalability | Unknown | Capability tax may be prohibitive |

| Dimension | Rating | Notes |
|---|---|---|
| Tractability | Low | Core components (world modeling, value specification) may be fundamentally intractable; requires major breakthroughs |
| Scalability | Unknown | If solved, would apply to arbitrarily capable systems by design |
| Current Maturity | Early Research | No working prototypes; theoretical frameworks only |
| Time Horizon | 10-20+ years | Long-term moonshot research program |
| Key Proponents | ARIA, davidad | £59M committed; some academic groups |
| Effectiveness if Achieved | Very High | Would provide mathematical guarantees against misalignment and deception |

If successful, provably safe AI would provide mathematical guarantees against several core AI safety risks:

| Risk | Relevance | How It Helps |
|---|---|---|
| Misalignment Potential | High | Safety specification formally encodes human values; verifier proves actions satisfy the specification |
| Deceptive Alignment | High | Proofs preclude deception by construction: no gap between training and deployment behavior |
| Power-Seeking AI | High | Actions mathematically constrained to a proven-safe subset; cannot acquire unauthorized resources |
| Goal Misgeneralization | High | Formal world model and specification prevent unintended optimization targets |
| Scheming | Medium | Bounded agent cannot pursue hidden agendas if all actions must pass verification |

As of early 2025, provably safe AI remains in the early research phase with no working prototypes:

| Milestone | Status | Details |
|---|---|---|
| Foundational Theory | Published | Towards Guaranteed Safe AI outlines the framework (May 2024) |
| ARIA Funding | Active | £59M programme launched; TA1 Phase 2 opens June 2025 |
| Research Teams | Forming | Oxford researchers awarded initial funding (April 2025) |
| Neural Network Verification | Advancing | VNN-COMP competition driving tool development |
| Working Prototype | Not Yet | No demonstrations of GS AI components on realistic systems |

The International Neural Network Verification Competition represents related work on verifying properties of neural networks, with tools like α,β-CROWN achieving significant speedups over traditional algorithms. However, this work addresses a narrower problem than full GS AI.
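
To give a concrete sense of this narrower problem, the sketch below certifies a simple robustness property of a tiny two-layer ReLU network using interval bound propagation, the loosest of the bounding techniques that tools such as α,β-CROWN refine with far tighter relaxations. The network weights and the input box are invented for illustration.

```python
import numpy as np

# A tiny ReLU network with made-up weights: f(x) = W2 @ relu(W1 @ x + b1) + b2
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
b1 = np.array([0.0, -0.5])
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.2])


def interval_affine(lo, hi, W, b):
    """Propagate an axis-aligned box through the affine map y = W @ x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius


def interval_relu(lo, hi):
    """ReLU is monotone, so it maps interval endpoints to interval endpoints."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)


def certified_output_bounds(x, eps):
    """Sound (but loose) bounds on f over the box ||x' - x||_inf <= eps."""
    lo, hi = x - eps, x + eps
    lo, hi = interval_affine(lo, hi, W1, b1)
    lo, hi = interval_relu(lo, hi)
    return interval_affine(lo, hi, W2, b2)


lo, hi = certified_output_bounds(np.array([0.5, 0.5]), eps=0.05)
# If hi < 0 (or lo > 0), the sign of the output is *proved* for every input in
# the box: a guarantee that no amount of sampling could provide.
print(lo, hi)
```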

  • May Be Impossible: Core components may be fundamentally intractable
  • Capability Tax: Constraints may make systems unusable
  • World Model Problem: Cannot formally specify reality comprehensively
  • Value Specification Problem: Human values resist precise formalization
  • Long Timeline: Decades of research may be needed
  • Alternative Approaches May Suffice: Empirical safety might be enough
  • Verification Complexity: Proofs may be too expensive to generate

| Question | Importance | Status |
|---|---|---|
| Can world models be expressive enough? | Critical | Active research; no consensus |
| How do we specify human values formally? | Critical | Philosophical and technical challenge |
| Can proofs scale to complex actions? | High | Unknown; depends on proof automation advances |
| What capability tax is acceptable? | High | Depends on risk level and alternatives |
| Can AI help generate its own safety proofs? | High | Promising, but creates a bootstrapping problem |
| What partial guarantees are valuable? | Medium | May enable incremental progress |

| Document | Author | Contribution |
|---|---|---|
| Towards Guaranteed Safe AI | Dalrymple et al. (2024) | Foundational agenda paper with Bengio, Russell, Tegmark, Seshia |
| ARIA Safeguarded AI Programme | ARIA | Official £59M programme description |
| davidad’s Programme Thesis | David Dalrymple | Comprehensive technical exposition on LessWrong |
| The Gradient Interview | The Gradient | In-depth interview on the agenda |

| Organization | Role | Contribution |
|---|---|---|
| ARIA | Funding body | £59M Safeguarded AI programme |
| University of Oxford | Research | Multiple ARIA-funded projects |
| MIRI (historically) | Research | Early theoretical work on agent foundations |

| System | Achievement | Relevance |
|---|---|---|
| seL4 | Formally verified OS microkernel | Proves verification can scale to real systems; used in aerospace |
| CompCert | Formally verified C compiler | Zero wrong-code bugs found in six CPU-years of testing |
| CertiKOS | Formally verified concurrent OS kernel | Developed at Yale; extends verification to concurrent kernels |

Provably safe AI affects the AI Transition Model through fundamental safety guarantees:

| Factor | Parameter | Impact |
|---|---|---|
| Alignment Robustness | Verification strength | Would provide mathematical alignment guarantees |
| Misalignment Potential | Alignment approach | Eliminates misalignment by construction |
| Safety-Capability Gap | Gap closure | Verified systems have provable safety properties |

The provably safe AI agenda represents a moonshot approach that could fundamentally solve AI safety if successful. The extreme difficulty of the technical challenges means success is uncertain, but the potential value of even partial progress justifies significant research investment. This approach is particularly valuable as a complement to empirical safety work, providing a fundamentally different path to safe AI.