
XPT (Existential Risk Persuasion Tournament)

| Dimension | Assessment | Evidence |
|---|---|---|
| Methodological Innovation | Exceptional | First structured adversarial collaboration tournament on existential risk |
| Research Quality | Peer-reviewed | Published in International Journal of Forecasting (2025) |
| Participant Quality | World-class | 89 superforecasters, 80 domain experts across AI, bio, nuclear, climate |
| Key Finding | Striking | Superforecasters severely underestimated AI progress; 8x gap on AI extinction risk |
| Impact | Significant | Informs how to weight expert vs superforecaster predictions on AI |
| Replicability | High | Methodology documented, could be applied to other domains |
| Attribute | Details |
|---|---|
| Full Name | Existential Risk Persuasion Tournament |
| Abbreviation | XPT |
| Organization | Forecasting Research Institute (FRI) |
| Chief Scientist | Philip Tetlock |
| Timeline | June–October 2022 (4 months) |
| Participants | 169 total: 89 superforecasters, 80 domain experts |
| Published | International Journal of Forecasting, Vol. 41, Issue 2 (2025) |
| Funding | Coefficient Giving grants to FRI (part of $16M+ total) |

The Existential Risk Persuasion Tournament (XPT) represents a landmark attempt to improve forecasting rigor on catastrophic risks through structured adversarial collaboration. Running from June through October 2022, the tournament brought together 169 participants—89 superforecasters with proven track records from the Good Judgment Project and 80 domain experts recruited from AI safety, biosecurity, nuclear security, and climate organizations—to forecast existential threats facing humanity.

Unlike traditional forecasting tournaments that simply aggregate independent predictions, the XPT required participants to:

  1. Make initial anonymous forecasts
  2. Explain their reasoning publicly
  3. Engage in adversarial collaboration with those who disagreed
  4. Update forecasts based on debate outcomes

The methodology aimed to extract genuine disagreements (cruxes) and potentially reduce them through structured dialogue with monetary incentives for persuasion and accuracy.

The results, published in the International Journal of Forecasting in 2025, produced a striking finding: superforecasters systematically underestimated AI progress. On questions about AI achieving gold-level performance on the International Mathematical Olympiad, superforecasters gave only a 2.3% probability to an outcome that actually occurred in July 2025, compared with 8.6% from domain experts.

Perhaps more importantly, the tournament documented an 8x gap between superforecasters (0.38%) and domain experts (3%) on AI-caused extinction risk by 2100, with minimal convergence despite four months of structured debate.

| Stage | Activity | Purpose | Duration |
|---|---|---|---|
| Stage 1 | Anonymous probability judgments | Establish baseline forecasts without social influence | Initial |
| Stage 2 | Explain reasoning | Expose assumptions and evidence transparently | Weeks 1-4 |
| Stage 3 | Adversarial collaboration | Structured debate to identify cruxes and potential agreement | Weeks 5-12 |
| Stage 4 | Final forecasts | Update based on debate outcomes with accuracy incentives | Final |

Participants received financial incentives for:

  • Persuasion: Payment if you convinced others to update toward your view
  • Accuracy: Payment if your final forecast was well-calibrated on resolvable questions
  • Participation: Base compensation for completing all stages

This incentive structure was designed to encourage both honest belief expression and genuine engagement with opposing views.
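The accuracy component can be grounded in a proper scoring rule. The sketch below is a hypothetical illustration using the Brier score, not the tournament's actual payment formula; `budget_per_question` is an invented parameter.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome.

    Lower is better. Because the Brier score is a proper scoring rule,
    the expected payout below is maximized by reporting honest beliefs.
    """
    return (forecast - outcome) ** 2


def accuracy_payment(forecasts, outcomes, budget_per_question=10.0):
    """Hypothetical linear payout: full budget for a perfect forecast,
    nothing for the worst possible Brier score of 1.0."""
    return sum(
        budget_per_question * (1 - brier_score(f, o))
        for f, o in zip(forecasts, outcomes)
    )


# A confident correct forecast (0.9) earns more than a 50/50 hedge.
print(accuracy_payment([0.9, 0.5], [1, 1]))  # 9.9 + 7.5 = 17.4
```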

| Group | Number | Selection Criteria | Expertise |
|---|---|---|---|
| Superforecasters | 89 | Top performers from Good Judgment Project with track records of predictive accuracy | Generalists with proven calibration |
| Domain Experts | 80 | Specialists recruited via organizations, academic departments, and research labs | Subject matter experts |

Domain expert breakdown:

| Expert Category | Number | Focus Areas |
|---|---|---|
| AI Experts | 32 | Machine learning, AI safety, computer science |
| General X-risk Experts | 15 | Long-run threats, existential risk research |
| Biorisk Experts | 12 | Pandemics, biosecurity, biotechnology |
| Nuclear Experts | 12 | Arms control, nuclear policy, deterrence |
| Climate Experts | 9 | Climate science, environmental policy |

A 2025 follow-up analysis evaluated how well XPT participants predicted three years of AI progress since summer 2022:

| Benchmark | Superforecasters | Domain Experts | Actual Outcome |
|---|---|---|---|
| IMO Gold by 2025 | 2.3% | 8.6% | Achieved July 2025 |
| MATH benchmark | 9.3% | 21.4% | Exceeded expectations |
| MMLU benchmark | 7.2% | 25.0% | Exceeded expectations |
| QuALITY benchmark | 20.1% | 43.5% | Exceeded expectations |
| Average across benchmarks | 9.7% | 24.6% | All exceeded predictions |

Critical insight: Both groups systematically underestimated AI progress, but domain experts were 2.5x more accurate than superforecasters. Superforecasters' median forecast had an AI achieving IMO Gold only by 2035, a decade later than the actual 2025 result.

The only strategy that reliably worked was aggregating everyone’s forecasts: taking the median of all predictions (both superforecasters and domain experts combined) produced substantially more accurate forecasts than any individual or group.
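As a minimal sketch of that pooled-median aggregation (the individual forecast values below are hypothetical; only the 9.7% and 24.6% cross-benchmark averages come from the table above):

```python
from statistics import median

# Hypothetical individual forecasts on one resolvable question; the real
# tournament pooled forecasts from up to 169 participants per question.
superforecasters = [0.02, 0.05, 0.08, 0.10, 0.15]
domain_experts = [0.15, 0.20, 0.25, 0.30, 0.40]

# The aggregation the paper found most reliable: ignore group labels
# and take the median of every forecast together.
pooled_median = median(superforecasters + domain_experts)
print(pooled_median)  # 0.15

# Sanity check on the reported ~2.5x accuracy gap, using the published
# cross-benchmark averages (9.7% vs 24.6%).
print(0.246 / 0.097)  # ~2.54
```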

| Risk Category | Superforecasters (Median) | Domain Experts (Median) | Ratio (Expert/SF) |
|---|---|---|---|
| Any catastrophe | 9% | 20% | 2.2x |
| Any extinction | 1% | 6% | 6x |
| AI-caused extinction | 0.38% | 3% | 7.9x |
| Nuclear extinction | 0.1% | 0.3% | 3x |
| Bio extinction | 0.08% | 1% | 12.5x |

The ~8x gap between superforecasters and domain experts on AI-caused extinction represents one of the largest disagreements in the tournament.

The XPT revealed how conditional framing affects risk estimates:

| Framing | Superforecaster Estimate | Domain Expert Estimate |
|---|---|---|
| Unconditional AI extinction by 2100 | 0.38% | 3% |
| Conditional on AGI by 2070 | 1% | |
| Increase factor | 2.6x | |
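The 2.6x factor is simply the ratio of the two superforecaster medians; the sketch below also shows the law-of-total-probability relationship these framings imply. Note that `p_agi` is an assumed value for illustration, not an XPT estimate.

```python
# Superforecaster medians from the table above.
unconditional = 0.0038       # P(AI extinction by 2100) = 0.38%
conditional_on_agi = 0.01    # P(AI extinction by 2100 | AGI by 2070) = 1%

print(conditional_on_agi / unconditional)  # ~2.6x increase factor

# Law of total probability: the unconditional estimate should be a
# mixture of the conditional branches:
#   P(ext) = P(ext | AGI) * P(AGI) + P(ext | no AGI) * (1 - P(AGI))
p_agi = 0.35  # hypothetical P(AGI by 2070), not an XPT estimate
p_ext_given_no_agi = (unconditional - conditional_on_agi * p_agi) / (1 - p_agi)
print(p_ext_given_no_agi)  # ~0.00046, i.e. ~0.05% implied risk absent AGI
```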

A striking finding was the minimal convergence of beliefs despite four months of structured debate with monetary incentives:

“Despite incentives to share their best arguments during four months of discussion, neither side materially moved the other’s views.”

The published paper suggests this would be puzzling if participants were Bayesian agents but is less puzzling if participants were “boundedly rational agents searching for confirmatory evidence as the risks of embarrassing accuracy feedback receded.”

Strong AI-risk proponents made particularly extreme long-range forecasts but not short-range forecasts, suggesting possible motivated reasoning or genuine structural uncertainty about distant futures.

The XPT results have significant implications for how the AI safety community interprets forecasts:

| Implication | Evidence | Recommended Action |
|---|---|---|
| Superforecasters may systematically underestimate AI progress | 2.5x gap on benchmark predictions; forecast IMO Gold for 2035 vs 2025 actual | Weight superforecaster AI timeline estimates with skepticism |
| Domain experts better calibrated on AI specifically | Closer to actual outcomes on MATH, MMLU, QuALITY, IMO | Give more weight to AI researcher predictions on AI questions |
| Aggregation outperforms individuals | Combined median was most accurate forecast | Use wisdom-of-crowds rather than individual expert opinions |
| Structured debate has limited impact | Minimal belief updating despite four months of discussion | Don’t expect debates to resolve fundamental disagreements |
| Long-range forecasts particularly unreliable | Extreme positions taken on 2100 but not 2025 questions | Focus policy on near-term measurable outcomes |

The XPT reveals a paradox: superforecasters are selected specifically for their calibration on historical questions, yet they significantly underperformed domain experts on AI progress questions. This suggests that:

  1. Base-rate reasoning fails for unprecedented change: Superforecasters may anchor on historical rates of technological progress that don’t account for potential AI acceleration
  2. Domain expertise matters for novel domains: On questions requiring deep understanding of AI capabilities, specialists outperformed generalists
  3. Neither group is reliable for extinction risk: With no feedback available, even the best forecasters may be poorly calibrated

Comparison with Other Forecasting Approaches

| Approach | Methodology | Strengths | Limitations |
|---|---|---|---|
| XPT | Multi-stage adversarial collaboration | Structured debate, expert+SF mix, incentivized | Small scale (169 participants), expensive |
| Metaculus | Continuous community aggregation | Large scale (50K+ users), real-time updates | Less structured debate, selection bias |
| Delphi Surveys | Anonymous multi-round polling | Simple, expert-focused | No debate, poor track record |
| Prediction Markets | Financial incentives via trading | Strong incentives, liquidity signals | Requires funding, regulatory issues |
| Good Judgment | Superforecaster panels | Proven track record | Commercial, smaller question set |

The XPT’s innovation is the structured adversarial collaboration component, which aims to extract genuine cruxes rather than just aggregate opinions.

The XPT introduced several methodological innovations applicable beyond existential risk:

| Innovation | Description | Potential Application |
|---|---|---|
| Adversarial collaboration | Pairing forecasters with opposite views | Any high-stakes forecasting domain |
| Persuasion incentives | Payment for moving others’ beliefs | Encouraging genuine engagement |
| Expert-SF comparison | Direct head-to-head on same questions | Determining when to trust whom |
| Conditional forecasting | Separating P(outcome) from P(outcome given condition) | Identifying key cruxes |

The XPT methodology is highly replicable:

  • Documented four-stage process
  • Incentive structure specified
  • Participant selection criteria clear
  • Question operationalization transparent

Future tournaments could apply this methodology to:

  • Climate change scenarios
  • Biosecurity threats
  • Geopolitical risks
  • Technology adoption forecasts

| Strength | Evidence |
|---|---|
| World-class participants | 89 proven superforecasters + 80 domain experts |
| Structured methodology | Four-stage process with clear incentives |
| Peer-reviewed publication | International Journal of Forecasting (2025) |
| Documented cruxes | Explicit disagreements between groups |
| Retrospective validation | Could check 2025 AI predictions against reality |
| Methodological innovation | Adversarial collaboration approach |

| Limitation | Impact |
|---|---|
| Small sample | 169 participants vs thousands on platforms like Metaculus |
| Expensive | Intensive methodology requires significant resources per study |
| Four-month snapshot | May not capture long-term belief evolution |
| Minimal updating | Debate had limited impact on beliefs |
| Selection bias | Volunteers may not represent broader expert/SF populations |
| Long-horizon uncertainty | No ground truth for 2100 forecasts to validate against |

The XPT was conducted by FRI with funding from Coefficient Giving:

| Grant | Amount | Purpose |
|---|---|---|
| Initial FRI funding | $175K | Planning (2021) |
| Science of Forecasting | $1.3M (over 3 years) | Core research program including XPT (2022) |
| General Support | $10M (over 3 years) | Expanded research (2023) |

The XPT likely consumed a significant portion of FRI’s 2022-2023 budget given the intensive participant engagement and analysis required.

FRI’s follow-up project ForecastBench addresses some XPT limitations:

  • Continuous validation: New questions resolve weekly/monthly
  • AI benchmarking: Compares LLMs to human forecasters
  • Larger scale: 1,000 questions vs ~100 in XPT
  • Dynamic: Ongoing rather than one-time tournament

The XPT methodology could be applied to other domains:

| Domain | Key Questions | Expert Communities |
|---|---|---|
| Climate | Tipping points, geoengineering risks | Climate scientists vs policy experts |
| Biosecurity | Pandemic likelihood, lab leak scenarios | Virologists vs biosecurity specialists |
| Geopolitics | Taiwan invasion, nuclear escalation | IR scholars vs intelligence analysts |
| Technology | Quantum computing, fusion timelines | Physicists vs technology forecasters |