
Forecasting Research Institute

| Dimension | Assessment | Evidence |
|---|---|---|
| Research Quality | Exceptional | Peer-reviewed publications in International Journal of Forecasting, ICLR |
| Methodology Innovation | High | XPT persuasion tournament methodology, ForecastBench benchmark |
| Influence | Growing | Cited in policy discussions, academic forecasting research |
| Leadership | World-class | Philip Tetlock (Chief Scientist), author of Superforecasting |
| Scale | Moderate | 169 participants in XPT, growing ForecastBench community |
| AI Relevance | Central | AI progress forecasting is major research focus |
| Key Finding | Striking | Superforecasters severely underestimated AI progress |
| Attribute | Details |
|---|---|
| Full Name | Forecasting Research Institute |
| Founded | 2021 |
| Chief Scientist | Philip Tetlock (author of Superforecasting and Expert Political Judgment) |
| CEO | Josh Rosenberg |
| Research Director | Ezra Karger (also Senior Economist, Federal Reserve Bank of Chicago) |
| Location | Philadelphia area / Remote |
| Status | 501(c)(3) research nonprofit |
| Website | forecastingresearch.org |
| Funding | Over $16M from Coefficient Giving (2021-present) |
| Key Outputs | XPT Tournament, ForecastBench (ICLR 2025), FRI-ONN Nuclear Study |
| Focus | Forecasting methodology for high-stakes decisions and existential risk |

The Forecasting Research Institute develops advanced forecasting methods to improve decision-making on high-stakes issues, with particular emphasis on existential risks and AI development. Founded in 2021 with initial support from Coefficient Giving and led by Chief Scientist Philip Tetlock—whose research established the field of superforecasting—FRI represents the next generation of forecasting research, moving from establishing accuracy standards to channeling forecasting into real-world policy relevance.

FRI’s flagship project, the Existential Risk Persuasion Tournament (XPT), introduced a multi-stage methodology designed to improve the rigor of debates about catastrophic risks. Unlike traditional forecasting tournaments that simply aggregate independent predictions, the XPT required participants to engage in structured debates, explain their reasoning, and update their forecasts through adversarial collaboration. Running from June through October 2022, the tournament brought together 169 participants who made forecasts about existential threats including AI, biosecurity, climate change, and nuclear war. The results, published in the International Journal of Forecasting in 2025, produced striking findings about the limits of current forecasting on AI progress.

The institute has documented a significant gap between superforecaster and domain expert predictions on AI, with superforecasters systematically underestimating the pace of AI progress. On questions about AI achieving gold-level performance on the International Mathematical Olympiad, superforecasters gave only 2.3% probability to outcomes that actually occurred in July 2025, compared to 8.6% from domain experts. Across four AI benchmarks (MATH, MMLU, QuALITY, and IMO Gold), superforecasters assigned an average probability of just 9.7% to outcomes that actually occurred, compared to 24.6% from domain experts. This finding has important implications for how AI timeline forecasts should be interpreted and weighted.

FRI’s more recent work includes ForecastBench, a dynamic benchmark for evaluating LLM forecasting capabilities published at ICLR 2025, and a collaboration with the Open Nuclear Network on nuclear catastrophe risk forecasting presented at the 2024 NPT PrepCom in Geneva.

Philip Tetlock’s foundational research provides the intellectual basis for FRI’s work. Understanding his four decades of forecasting research is essential to understanding FRI’s approach.

Tetlock’s landmark study, summarized in Expert Political Judgment: How Good Is It? How Can We Know? (Princeton University Press, 2005), examined 28,000 forecasts from 284 experts across government, academia, and journalism over two decades. The sobering findings established core principles that still guide FRI’s methodology:

| Finding | Implication |
|---|---|
| Experts were often only slightly more accurate than chance | Traditional expertise is insufficient for forecasting |
| Simple extrapolation algorithms often beat expert forecasts | Formal methods can outperform intuition |
| Media-prominent forecasters performed worse than low-profile colleagues | Fame and accuracy are inversely correlated |
| “Foxes” (eclectic thinkers) outperformed “hedgehogs” (single-theory adherents) | Cognitive style matters more than credentials |

The “hedgehog vs. fox” framework, adapted from Isaiah Berlin’s essay, became a cornerstone of forecasting research. Hedgehogs “know one big thing”—they have a grand theory (Marxist, Libertarian, or otherwise) that they extend into many domains with great confidence. Foxes “know many little things”—they draw from eclectic traditions and improvise in response to changing events. Tetlock found that foxes demonstrated significantly better calibration and discrimination scores, particularly on long-term forecasts.

Building on these findings, Tetlock co-led the Good Judgment Project (GJP), a multi-year IARPA-funded study of probability judgment accuracy. The project tested whether forecasting accuracy could be systematically improved through selection, training, and team structure.

| Component | Approach | Result |
|---|---|---|
| Participant Pool | Thousands of volunteer forecasters | Enabled large-scale experimentation |
| Training | Simple probability training exercises | Improved Brier scores significantly |
| Selection | Personality-trait tests for cognitive bias | Identified consistent top performers |
| Superforecasters | Top 2% across multiple seasons | Maintained accuracy over time and topics |
| Team Structure | Collaborative forecasting groups | Teams outperformed individuals |

Key findings included:

  • Training exercises substantially improved forecast accuracy as measured by Brier scores
  • The best forecasters (“superforecasters”) maintained consistent performance across years and question categories
  • A log-odds extremizing aggregation algorithm outperformed competing aggregation methods (sketched after this list)
  • GJP forecasts were reportedly 30% more accurate than those of intelligence analysts with access to classified information
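
The log-odds extremizing aggregator mentioned above can be sketched in a few lines. The extremizing exponent used here (a = 2.5) is an illustrative value rather than GJP's fitted parameter, and the forecasts are made up:

```python
import math

def extremized_mean(probs, a=2.5):
    """Average probability forecasts in log-odds space, then extremize by
    exponent `a` (push the aggregate away from 0.5). `a` is illustrative."""
    log_odds = [math.log(p / (1 - p)) for p in probs]
    mean_log_odds = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-a * mean_log_odds))

# Five forecasters who all lean the same direction:
forecasts = [0.65, 0.70, 0.60, 0.72, 0.68]
print(round(sum(forecasts) / len(forecasts), 2))  # simple mean: 0.67
print(round(extremized_mean(forecasts), 2))       # extremized aggregate: ~0.86
```

The intuition is that independent forecasters each see only part of the available evidence, so when they agree, the aggregate can justifiably be more confident than the average individual.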

The project resulted in Superforecasting: The Art and Science of Prediction (2015), co-authored with Dan Gardner, which distilled principles of good forecasting: gather evidence from diverse sources, think probabilistically, work in teams, keep score, and remain willing to admit error.

FRI represents the third phase of Tetlock’s research program. While the first phase established that experts are poorly calibrated and the second identified characteristics of accurate forecasters, FRI’s mission focuses on applying these insights to high-stakes policy questions—particularly existential risks where feedback loops are weak or nonexistent.


For detailed XPT methodology, participant breakdown, and full analysis, see the dedicated XPT (Existential Risk Persuasion Tournament) page.

A 2025 follow-up analysis by Tetlock, Rosenberg, Kučinskas, Ceppas de Castro, Jacobs, and Karger evaluated how well XPT participants predicted three years of AI progress since summer 2022:

| Benchmark | Superforecasters | Domain Experts | Actual Outcome |
|---|---|---|---|
| IMO Gold by 2025 | 2.3% | 8.6% | Achieved July 2025 |
| MATH benchmark | 9.3% | 21.4% | Exceeded |
| MMLU benchmark | 7.2% | 25.0% | Exceeded |
| QuALITY benchmark | 20.1% | 43.5% | Exceeded |
| Average across benchmarks | 9.7% | 24.6% | All exceeded predictions |

Both groups systematically underestimated AI progress, but domain experts were closer to reality. Superforecasters initially expected AI to achieve IMO Gold around 2035, a decade after it actually occurred. The only strategy that reliably worked was aggregating everyone's forecasts: taking the median of all predictions produced substantially more accurate forecasts than any individual or group.
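
The group averages in the table can be reproduced with a short script; the individual forecasts in the second half are invented for illustration, not XPT data:

```python
import statistics

# Group averages from the table above (probabilities in %):
superforecasters = [2.3, 9.3, 7.2, 20.1]   # IMO Gold, MATH, MMLU, QuALITY
domain_experts = [8.6, 21.4, 25.0, 43.5]
print(round(statistics.mean(superforecasters), 1))  # 9.7
print(round(statistics.mean(domain_experts), 1))    # 24.6

# Toy illustration of pooling (invented forecasts, not XPT data): the median
# of all individual forecasts sits between pessimistic and optimistic clusters.
all_forecasts = [0.03, 0.05, 0.08, 0.20, 0.30, 0.45]
print(round(statistics.median(all_forecasts), 2))   # 0.14
```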

| Risk Category | Superforecasters (Median) | Domain Experts (Median) | Ratio |
|---|---|---|---|
| Any catastrophe by 2100 | 9% | 20% | 2.2x |
| Any extinction by 2100 | 1% | 6% | 6x |
| AI-caused extinction by 2100 | 0.38% | 3% | 7.9x |
| Nuclear extinction by 2100 | 0.1% | 0.3% | 3x |
| Bio extinction by 2100 | 0.08% | 1% | 12.5x |

The roughly 8x gap between superforecasters (0.38%) and domain experts (3%) on AI-caused extinction represents one of the largest disagreements in the tournament. Notably, superforecasters gave a higher probability to nuclear catastrophe (4%) than to AI catastrophe (2.13%) by 2100, yet rated extinction risk from AI (0.38%) well above extinction risk from nuclear weapons (0.1%), possibly because an AI could “deliberately hunt down survivors.”

For comparison, existential risk researcher Toby Ord estimated a 16% total chance of extinction by 2100—16x higher than superforecasters and 2.5x higher than domain experts.

The XPT revealed how conditional framing affects risk estimates:

| Framing | Superforecaster Estimate |
|---|---|
| Unconditional AI extinction by 2100 | 0.38% |
| Conditional on AGI by 2070 | 1% |
| Increase factor | 2.6x |

A striking finding was the minimal convergence of beliefs despite four months of structured debate with monetary incentives:

“Despite incentives to share their best arguments during four months of discussion, neither side materially moved the other’s views.”

The paper suggests this would be puzzling if participants were Bayesian agents but is less puzzling if participants were “boundedly rational agents searching for confirmatory evidence as the risks of embarrassing accuracy feedback receded.” Strong AI-risk proponents made particularly extreme long- but not short-range forecasts.

ForecastBench is FRI’s dynamic, contamination-free benchmark for evaluating large language model forecasting capabilities, published at ICLR 2025.

The benchmark was designed to solve the data contamination problem that plagues static AI benchmarks:

| Feature | Description |
|---|---|
| Dynamic Questions | 1,000 questions, continuously updated with new future-dated questions |
| Contamination-Free | All questions about events with no known answer at submission time |
| Multiple Baselines | Compares LLMs to superforecasters, public forecasters, and random chance |
| Open Submission | Public leaderboard for model comparison |
| Question Sources | Market questions (Manifold, Metaculus, Polymarket, RAND) and dataset questions (ACLED, DBnomics, FRED, Wikipedia, Yahoo Finance) |
| Funding | Supported by Coefficient Giving until mid-2027 |

The authors (Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip E. Tetlock) designed ForecastBench as a “valuable proxy for general intelligence” since forecasting requires integrating diverse knowledge sources and reasoning under uncertainty.

| Forecaster | Difficulty-Adjusted Brier Score | Notes |
|---|---|---|
| Superforecasters | 0.081 | Best overall performance |
| GPT-4.5 (Feb 2025) | 0.101 | Best LLM performance |
| GPT-4 (Mar 2023) | 0.131 | Baseline frontier model |
| Public Participants | ≈0.12 | LLMs now outperform non-experts |
| Random Baseline | 0.25 | Chance performance |
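
The Brier scores on this leaderboard are (difficulty-adjusted) mean squared errors between probability forecasts and binary outcomes, so lower is better and an always-0.5 forecaster scores 0.25. The difficulty adjustment itself is not reproduced in this sketch, and the data are invented:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 0, 1, 1, 0]
confident = [0.9, 0.2, 0.8, 0.7, 0.1]   # sharp, well-calibrated forecaster
uniform = [0.5] * 5                      # corresponds to the random baseline
print(round(brier_score(confident, outcomes), 3))  # 0.038
print(brier_score(uniform, outcomes))              # 0.25
```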

Key findings from ForecastBench:

| Finding | Evidence |
|---|---|
| Superforecasters still lead | The 0.054 Brier score gap between superforecasters and GPT-4o is larger than the 0.026 gap between GPT-4o and GPT-4 |
| Rapid LLM improvement | State-of-the-art LLM performance improves by ≈0.016 difficulty-adjusted Brier points annually |
| Projected parity | Linear extrapolation suggests LLMs will match superforecaster performance in November 2026 (95% CI: December 2025 – January 2028) |
| Initial models underperformed | Claude-3.5 Sonnet and GPT-4 Turbo initially performed roughly as well as a simple median of public forecasts |
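
As a rough consistency check on the projected-parity row (not the paper's regression), the leaderboard numbers above imply a similar date: GPT-4.5's 0.020-point gap to superforecasters, closed at roughly 0.016 points per year, suggests parity a bit over a year after the February 2025 measurement, i.e. around mid-2026, within the paper's confidence interval:

```python
superforecaster_brier = 0.081  # leaderboard value above
gpt45_brier = 0.101            # GPT-4.5, measured Feb 2025
annual_improvement = 0.016     # approximate yearly gain for frontier LLMs

years_remaining = (gpt45_brier - superforecaster_brier) / annual_improvement
print(round(years_remaining, 2))  # ~1.25 years from Feb 2025, i.e. roughly mid-2026
```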

ForecastBench provides important empirical grounding for claims about AI forecasting capabilities, demonstrating measurable progress while showing that complex geopolitical and scientific questions remain challenging for LLMs.

| Publication | Year | Venue | Key Contribution |
|---|---|---|---|
| Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament | 2023 | Working paper | XPT methodology and initial findings |
| Subjective-probability forecasts of existential risk | 2025 | International Journal of Forecasting | Peer-reviewed XPT results (Vol. 41, Issue 2, pp. 499-516) |
| ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities | 2025 | ICLR | LLM forecasting benchmark |
| Improving Judgments of Existential Risk | 2022 | SSRN Working Paper | Framework for better forecasts, questions, explanations, and policies |
| Can Humanity Achieve a Century of Nuclear Peace? | 2024 | FRI-ONN Report | Nuclear catastrophe probability estimates |
| Assessing Near-Term Accuracy in the XPT | 2025 | FRI Report | Retrospective accuracy analysis of 2022 forecasts |

FRI’s research addresses a critical methodological challenge: forecasting low-probability, high-consequence events like existential risks where traditional calibration feedback is unavailable.

| Challenge | Issue | Manifestation in XPT |
|---|---|---|
| Base Rate Anchoring | Forecasters anchor too heavily on historical rates | May explain superforecaster underestimation of novel AI progress |
| Probability Compression | All “unlikely” events collapsed to similar estimates | Extinction estimates cluster near 0-1% despite very different underlying mechanisms |
| Feedback Delays | Can’t learn from rare events | No extinction has occurred to calibrate against |
| Horizon Effects | Extreme estimates for distant futures | Strong AI-risk proponents gave extreme long- but not short-range forecasts |
| Confirmatory Search | Seeking evidence that confirms existing views | Neither side updated materially despite structured debate |
Methods FRI uses or has proposed to address these challenges include:

| Method | Description | Application |
|---|---|---|
| Structured Scenario Analysis | Break down complex events into component paths | Decompose “AI extinction” into specific mechanisms |
| Adversarial Collaboration | Pair forecasters with opposing views | XPT Stage 3 debate structure |
| Cross-domain Calibration | Use accuracy on resolvable questions to weight long-run forecasts | Compare 2025-resolvable vs 2100 forecasts |
| Reciprocal Scoring | Methods for forecasting questions that may never resolve | Karger (2021) methodology paper |
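
Structured scenario analysis can be made concrete with a toy decomposition: a headline probability such as "AI-caused extinction by 2100" is expressed as a product of conditional steps along one pathway (and summed across pathways). The pathway and numbers below are invented for illustration; they are not FRI or XPT estimates:

```python
# One invented pathway; each value is P(step | all previous steps).
pathway = {
    "transformative AI developed by 2100": 0.60,
    "system is badly misaligned": 0.10,
    "misalignment leads to loss of control": 0.20,
    "loss of control leads to extinction": 0.15,
}

p = 1.0
for step, conditional_prob in pathway.items():
    p *= conditional_prob
print(f"{p:.4f}")  # 0.0018, i.e. 0.18% for this single pathway
```

Decomposing this way forces forecasters to state which step carries the most uncertainty, rather than compressing everything into a single small number.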

FRI collaborated with the Open Nuclear Network (ONN) in association with the University of Pennsylvania on a comprehensive nuclear catastrophe forecasting study.

| Aspect | Details |
|---|---|
| Partner Organizations | FRI, Open Nuclear Network, University of Pennsylvania |
| Methodology | XPT-style structured elicitation with superforecasters and nuclear experts |
| Definition of Catastrophe | Event causing over 10 million deaths |
| Time Horizon | Probability estimates through 2045 |
| Presentation | 2024 NPT PrepCom in Geneva (July 25, 2024) |
| Publication | Can Humanity Achieve a Century of Nuclear Peace? |
Key findings included:

| Finding | Estimate |
|---|---|
| Median expert probability of nuclear catastrophe by 2045 | 5% |
| Superforecaster probability | 1% |
| Most likely geopolitical source | Russia-NATO/USA tensions |
| Potential risk reduction | 50% if six key policies fully implemented |

The study identified six policies that could collectively reduce nuclear catastrophe risk by 50%:

  1. Establishing a secure crisis communications network
  2. Conducting comprehensive failsafe reviews of nuclear protocols
  3. Implementing enhanced early warning cooperation
  4. Adopting no-first-use declarations
  5. Reducing nuclear arsenal sizes
  6. Strengthening non-proliferation verification

The July 2024 side event “A Gamble of Our Own Choosing: Forecasting Nuclear Risks” highlighted:

  • Forecasting combined with qualitative analysis is invaluable for understanding nuclear risks
  • More dynamic risk assessment methods are needed
  • Findings must be communicated effectively to decision-makers
  • Focusing on near-term events enhances methodology credibility

FRI’s work is primarily funded by Coefficient Giving, which launched forecasting as an independent cause area in 2024.

| Grant | Amount | Purpose | Date |
|---|---|---|---|
| Initial Planning Support | $175,000 | Planning work by Tetlock | Oct 2021 |
| Science of Forecasting | $1.3M (over 3 years) | Core research program, forecasting platform development | 2022 |
| General Support | $10M (over 3 years) | Expanded research program | 2023 |
| AI Progress Forecasting Panel | $1.07M (over 2 years) | Panel of AI experts forecasting capabilities, adoption, impacts | 2024 |
| Red-line Evaluations | $125,000 | Operationalizing AI red-line evaluations | 2024 |
| Tripwire Capability Evaluations | $158,850 | AI capability tripwire forecasting | 2024 |
| Forecasting Benchmark | $100,000 | Collaboration with Steinhardt lab on ForecastBench | 2024 |
| XPT Recognition Prize | $15,000 | Recognition for XPT publication | 2023 |
| Analysis of Historical Forecasts | $10,000 | Forecasting accuracy analysis | 2024 |
| AI Risk Discussion Project | $150,000 | Bringing together forecasters who disagree on AI x-risk | 2024 |

Total Coefficient Giving funding: Over $16 million

| Organization | Primary Method | Strength | Limitation |
|---|---|---|---|
| FRI | Methodology research, structured tournaments | Scientific rigor, peer-reviewed publications | Smaller scale, research-focused |
| Metaculus | Prediction aggregation platform | Scale, continuous questions, public access | Less methodological innovation |
| Epoch AI | Empirical AI trends analysis | Data quality, quantitative rigor | Less forecasting focus |
| Good Judgment Inc. | Commercial superforecaster panels | Proven accuracy, operational focus | Commercial rather than research mission |
| Polymarket | Prediction markets | Real-money incentives, liquidity | Regulatory constraints, short-term focus |

The XPT findings have significant implications for how the AI safety community should interpret forecasts:

| Implication | Evidence | Action |
|---|---|---|
| Superforecasters may systematically underestimate AI progress | 2.5x gap on benchmark predictions; thought IMO Gold would occur in 2035 | Weight superforecaster AI timeline estimates with skepticism |
| Domain experts may be better calibrated on AI specifically | Closer to actual outcomes on MATH, MMLU, QuALITY, IMO | Give more weight to AI researcher predictions on AI questions |
| Aggregation outperforms individuals | Combined median was most accurate forecast | Use wisdom-of-crowds rather than individual expert opinions |
| Structured debate has limited impact | Minimal belief updating despite four months of discussion | Don’t expect debates to resolve fundamental disagreements |
| Long-range forecasts are particularly unreliable | Extreme positions taken on 2100 but not 2025 questions | Focus policy on near-term measurable outcomes |

FRI’s research reveals a paradox: superforecasters are selected specifically for their calibration on historical questions, yet they significantly underperformed on AI progress. This suggests that:

  1. Base-rate reasoning fails for unprecedented change: Superforecasters may anchor on historical rates of technological progress that don’t account for potential AI acceleration
  2. Domain expertise matters for novel domains: On questions requiring deep understanding of AI capabilities, specialists outperformed generalists
  3. Neither group is reliable for extinction risk: With no feedback available, even the best forecasters may be poorly calibrated
These findings point to several practical recommendations:

| Recommendation | Rationale |
|---|---|
| Weight domain expertise higher on AI | Experts outperformed superforecasters on AI questions |
| Use structured elicitation | Reduces some biases vs. simple aggregation |
| Decompose complex questions | Helps calibrate low-probability estimates |
| Track calibration by domain | Forecaster accuracy varies across topics |
| Invest in resolvable benchmarks | Near-term forecasts provide calibration feedback |
| Combine multiple forecaster types | Aggregation across groups worked best |
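
"Track calibration by domain" can be operationalized with a few lines that group a forecaster's resolved questions by topic and compare Brier scores across domains. The domain labels and records below are invented for illustration:

```python
from collections import defaultdict

# (domain, probability forecast, resolved outcome) - invented records.
records = [
    ("geopolitics", 0.7, 1), ("geopolitics", 0.3, 0), ("geopolitics", 0.6, 1),
    ("ai_progress", 0.2, 1), ("ai_progress", 0.1, 1), ("ai_progress", 0.4, 0),
]

squared_errors = defaultdict(list)
for domain, forecast, outcome in records:
    squared_errors[domain].append((forecast - outcome) ** 2)

for domain, errors in squared_errors.items():
    print(domain, round(sum(errors) / len(errors), 3))
# geopolitics 0.113, ai_progress 0.537: well calibrated on one domain only
```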
| Role | Person | Background |
|---|---|---|
| Chief Scientist | Philip Tetlock | Annenberg University Professor at UPenn, author of Superforecasting and Expert Political Judgment, Good Judgment Project co-founder, elected to American Philosophical Society (2019) |
| CEO | Josh Rosenberg | Organizational leadership and operations |
| Research Director | Ezra Karger | Senior Economist at Federal Reserve Bank of Chicago, research in labor economics, public economics, and forecasting |

According to FRI’s team page, the organization includes:

| Team Member | Focus Area |
|---|---|
| Michael Page | Research operations |
| Tegan McCaslin | Research |
| Zachary Jacobs | Research, ForecastBench development |
| Various contractors | External collaborators in forecasting |
| Institution | Affiliation |
|---|---|
| University of Pennsylvania | Tetlock’s primary appointment (Wharton School + School of Arts and Sciences) |
| Federal Reserve Bank of Chicago | Karger’s primary appointment |
| NBER | Karger is NBER affiliate |
| Date | Event |
|---|---|
| 1984-2003 | Tetlock conducts Expert Political Judgment study (284 experts, 28,000 forecasts) |
| 2005 | Expert Political Judgment published by Princeton University Press |
| 2011 | Good Judgment Project launched with IARPA funding |
| 2015 | Superforecasting published; GJP concludes after winning the IARPA forecasting competition |
| October 2021 | FRI founded with $175K Coefficient Giving planning grant |
| June-October 2022 | XPT tournament conducted (169 participants, 4 months) |
| 2022 | Coefficient Giving provides $1.3M multi-year grant |
| August 2023 | XPT working paper released |
| 2023 | Coefficient Giving provides $10M general support grant |
| July 2024 | FRI-ONN nuclear risk study presented at NPT PrepCom in Geneva |
| September 2024 | ForecastBench launched |
| October 2024 | Nuclear risk report published |
| January 2025 | ForecastBench paper published at ICLR 2025 |
| 2025 | XPT results published in International Journal of Forecasting |
| September 2025 | XPT near-term accuracy follow-up published |
| Strength | Evidence |
|---|---|
| Methodological rigor | Peer-reviewed publications in top venues (ICLR, Int. Journal of Forecasting) |
| Leadership credentials | Tetlock’s four decades of forecasting research, American Philosophical Society membership |
| Innovation | XPT methodology, ForecastBench, structured elicitation techniques |
| Policy relevance | Nuclear risk work presented at NPT PrepCom, AI policy applications |
| Independence | Research nonprofit with philanthropic rather than commercial funding |
| Quantitative findings | Specific probability estimates with documented methodology |
| Limitation | Context |
|---|---|
| Scale | 169 XPT participants vs. thousands on platforms like Metaculus |
| Speed | Research focus means slower output than real-time forecasting platforms |
| Cost | Intensive methodology requires significant resources per study |
| Generalizability | Tournament findings may not transfer to all forecasting contexts |
| Long-range uncertainty | No ground truth available for existential risk calibration |
| Minimal updating | XPT showed debates had limited impact on beliefs |
| Question | Relevance |
|---|---|
| Should policy weight superforecasters or domain experts on AI? | XPT suggests experts may be better calibrated for AI specifically |
| Can LLMs eventually match superforecasters? | ForecastBench suggests parity by late 2026 |
| How should we interpret minimal belief updating? | May reflect genuine irreducible uncertainty or cognitive limitations |
| What forecasting methods work for unprecedented events? | Neither group was well-calibrated on AI progress |