ForecastBench

| Dimension | Assessment | Evidence |
|---|---|---|
| Innovation | Exceptional | First dynamic, contamination-free AI forecasting benchmark |
| Research Quality | Peer-reviewed | Published at ICLR 2025 (top-tier ML conference) |
| Practical Impact | High | Provides empirical grounding for claims about AI forecasting progress |
| Benchmark Design | Robust | 1,000 questions, continuous updates, multiple baselines |
| Key Finding | Significant | LLMs improving rapidly but superforecasters still lead; projected parity late 2026 |
| Replicability | High | Open submission leaderboard, documented methodology |

| Attribute | Details |
|---|---|
| Name | ForecastBench |
| Organization | Forecasting Research Institute (FRI) |
| Authors | Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock |
| Published | ICLR 2025 |
| Launch Date | September 2024 |
| Website | forecastbench.org |
| Paper | OpenReview ICLR 2025 |
| Funding | Coefficient Giving (supported through mid-2027) |
| Question Count | 1,000 (continuously updated) |

ForecastBench is FRI’s dynamic benchmark for evaluating the forecasting capabilities of large language models, designed to solve the data contamination problem that plagues static AI benchmarks. Published at ICLR 2025, ForecastBench maintains a pool of 1,000 questions that is continuously refreshed with new future-dated questions, so every query concerns an event with no known answer at submission time.

The benchmark was created to address a critical methodological issue: as LLMs are trained on vast internet corpora, they may have seen the answers to static benchmark questions in their training data. By focusing exclusively on questions about future events that haven’t resolved yet, ForecastBench provides a contamination-free measure of genuine forecasting ability.

The authors (led by FRI Research Director Ezra Karger and Chief Scientist Philip Tetlock) frame forecasting as a “valuable proxy for general intelligence,” since it requires integrating diverse knowledge sources and reasoning under uncertainty.

As of February 2025:

| Forecaster | Difficulty-Adjusted Brier Score | Status |
|---|---|---|
| Superforecasters | 0.081 | Best overall performance |
| GPT-4.5 | 0.101 | Best LLM performance |
| GPT-4 (Mar 2023) | 0.131 | Baseline frontier model |
| Public participants | ≈0.12 | LLMs now outperform non-experts |
| Random baseline | 0.25 | Chance performance |

Critical finding: The remaining gap between superforecasters and GPT-4.5 (0.020 Brier points) is smaller than the improvement from GPT-4 to GPT-4.5 (0.030 Brier points), but it still exceeds the measured annual improvement rate of ≈0.016 Brier points per year, so substantial room for improvement remains before parity.
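
For readers unfamiliar with the metric: the Brier score of a binary forecast is the mean squared error between stated probabilities and realized 0/1 outcomes, so 0 is perfect and a constant “no information” forecast of 0.5 scores 0.25, the chance baseline in the table above. A minimal Python sketch with made-up numbers (not ForecastBench data):

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# A confident, well-calibrated forecaster scores close to 0.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ≈ 0.047

# Always answering 0.5 scores exactly 0.25: the random-baseline row above.
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25
```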

Static benchmarks have a fatal flaw for evaluating forecasting:

| Problem | Impact | ForecastBench Solution |
|---|---|---|
| Training data contamination | LLMs may have seen answers | Only questions about future events |
| Benchmark staleness | Questions become outdated | Continuous addition of new questions |
| No ground truth yet | Can’t verify answers immediately | Questions resolve on schedule (days to months) |

Example contamination scenario:

  • Static benchmark: “Will COVID-19 vaccines be approved by end of 2020?” (known answer: yes)
  • ForecastBench: “Will a new pandemic pathogen emerge by end of 2026?” (unknown answer)
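
The contamination guarantee reduces to one invariant: every question put to a model must resolve after the forecast is submitted. A minimal sketch of that filter, assuming a hypothetical `Question` record rather than ForecastBench’s actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Question:
    text: str
    resolution_date: date  # when ground truth becomes known

def contamination_free(pool: list[Question], submission_date: date) -> list[Question]:
    """Keep only questions whose answers cannot appear in any training corpus
    compiled on or before the submission date."""
    return [q for q in pool if q.resolution_date > submission_date]

pool = [
    Question("Will COVID-19 vaccines be approved by end of 2020?", date(2020, 12, 31)),
    Question("Will a new pandemic pathogen emerge by end of 2026?", date(2026, 12, 31)),
]
# Only the 2026 question survives for a February 2025 submission.
print(contamination_free(pool, submission_date=date(2025, 2, 1)))
```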

ForecastBench draws questions from two categories:

Questions sourced from prediction platforms:

| Platform | Type | Example Question |
|---|---|---|
| Metaculus | Reputation-based | “When will AGI be developed?” |
| Manifold | Play-money market | “Will SpaceX land on Mars by 2030?” |
| Polymarket | Real-money market (crypto) | “Who will win the 2028 US presidential election?” |
| RAND | Expert elicitation | “What’s the probability of nuclear conflict by 2035?” |

Questions about future values in public datasets:

| Dataset | Type | Example Question |
|---|---|---|
| ACLED | Conflict events | “How many conflict fatalities in Syria next month?” |
| DBnomics | Economic indicators | “What will Germany’s GDP growth rate be in Q3 2026?” |
| FRED | Economic data | “What will US unemployment be in December 2026?” |
| Wikipedia | Pageviews, edits | “How many monthly pageviews for ‘AGI’ in March 2026?” |
| Yahoo Finance | Stock prices, indices | “What will the S&P 500 close at on December 31, 2026?” |

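Dataset questions like the FRED example can be resolved mechanically once the underlying series updates. A hedged sketch of such a resolver using FRED’s public observations endpoint (illustrative only, not ForecastBench’s actual resolution code; you need your own FRED API key):

```python
import requests

def latest_us_unemployment(api_key: str) -> float:
    """Return the most recent value of FRED's UNRATE series (US unemployment rate, %)."""
    resp = requests.get(
        "https://api.stlouisfed.org/fred/series/observations",
        params={
            "series_id": "UNRATE",
            "api_key": api_key,
            "file_type": "json",
            "sort_order": "desc",  # newest observation first
            "limit": 1,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return float(resp.json()["observations"][0]["value"])

# Resolution then reduces to comparing this value against the question's threshold.
```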
| Finding | Evidence |
|---|---|
| Superforecasters remain best | 0.081 Brier score vs 0.101 for GPT-4.5 |
| Gap is substantial | 0.020 Brier points is a large performance difference |
| Gap larger than LLM improvement rate | Superforecaster–GPT-4.5 gap (0.020) > annual GPT improvement (0.016/year) |

| Metric | Value | Implication |
|---|---|---|
| Annual improvement rate | ≈0.016 difficulty-adjusted Brier points | Consistent, measurable progress |
| Projected parity date | November 2026 | Linear extrapolation from current trajectory |
| 95% confidence interval | December 2025 – January 2028 | Uncertainty in timeline |
| Time to parity | 12–24 months from Feb 2025 | Near-term milestone |

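The parity projection is a linear extrapolation of leaderboard scores over time. As a back-of-the-envelope check, a line through just the two frontier data points quoted above already gives a slope of roughly 0.016 Brier points per year; the paper’s November 2026 estimate and its confidence interval come from the full leaderboard time series, so this crude two-point fit crosses the superforecaster baseline somewhat earlier:

```python
from datetime import date, timedelta

# Two frontier data points from the tables above (difficulty-adjusted Brier scores).
t0, b0 = date(2023, 3, 1), 0.131   # GPT-4
t1, b1 = date(2025, 2, 1), 0.101   # GPT-4.5
target = 0.081                     # superforecaster baseline

slope_per_day = (b1 - b0) / (t1 - t0).days      # ≈ -0.000043/day, i.e. ≈ -0.016/year
days_to_parity = (target - b1) / slope_per_day  # ≈ 470 days after Feb 2025
print(t1 + timedelta(days=round(days_to_parity)))  # ≈ mid-2026 under this crude fit
```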
| Group | Brier Score | Interpretation |
|---|---|---|
| Superforecasters | 0.081 | Top human performance |
| GPT-4.5 | 0.101 | Best AI performance |
| Public forecasters | ≈0.12 | Casual participants |
| GPT-4 | 0.131 | 2-year-old frontier model |

LLMs have crossed the threshold of matching casual human forecasters but still trail expert human forecasters by a meaningful margin.

Claude-3.5 Sonnet and GPT-4 Turbo initially performed roughly as well as a simple median of public forecasts, suggesting that early frontier LLMs without specialized forecasting training were comparable to crowd aggregation.

ForecastBench uses difficulty-adjusted Brier scores to account for question hardness:

| Adjustment | Purpose | Method |
|---|---|---|
| Baseline | Some questions are easier than others | Compare to the community median |
| Normalization | Make scores comparable across question sets | Adjust relative to a typical forecaster |
| Standardization | Remove sampling artifacts | Control for the question distribution |

This ensures that an LLM scoring 0.101 on hard questions is rated fairly compared to a forecaster scoring 0.12 on easier questions.
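
A sketch of the “compare to the community median” idea, assuming per-question Brier scores are available for both the model and the crowd (this toy version yields a relative score centered on zero rather than the Brier-like numbers on the leaderboard, and is not the paper’s exact procedure):

```python
def relative_to_median(model_briers: dict[str, float],
                       median_briers: dict[str, float]) -> float:
    """Average per-question Brier score minus the community median's score
    on the same question; negative means better than the crowd."""
    common = model_briers.keys() & median_briers.keys()
    return sum(model_briers[q] - median_briers[q] for q in common) / len(common)

# Illustrative numbers: one easy question and one hard question.
model  = {"easy_q": 0.05, "hard_q": 0.30}
median = {"easy_q": 0.04, "hard_q": 0.35}
print(relative_to_median(model, median))  # -0.02: slightly better than the crowd
```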

Questions resolve on different timescales:

| Timeline | Percentage | Examples |
|---|---|---|
| Days | ≈10% | Near-term events (elections, product launches) |
| Weeks | ≈30% | Economic indicators, conflict events |
| Months | ≈40% | Technology milestones, policy decisions |
| Years | ≈20% | Long-term trends (AGI timelines, climate) |

This distribution balances rapid feedback for validation with long-term questions relevant to AI safety.

The ForecastBench leaderboard allows:

  • Open submission: Anyone can submit LLM forecasts
  • Standardized comparison: All entries scored on same questions
  • Transparency: Methodology and scores public
  • Competition: Drive improvement through benchmarking

ForecastBench includes baseline forecasting bots:

| Bot | Method | Purpose |
|---|---|---|
| Random | Uniform distribution | Lower bound |
| Community median | Aggregate human forecasts | Crowd-wisdom baseline |
| GPT-4 | Vanilla frontier LLM | Historical baseline |
| GPT-4.5 | Current frontier LLM | State of the art |

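A minimal sketch of the two non-LLM baselines (illustrative only, not the benchmark’s actual bot code):

```python
import random
import statistics

def random_bot(question: str) -> float:
    """Uninformative baseline: a probability drawn uniformly at random.
    (A constant 0.5 forecast would score exactly 0.25 Brier, the chance
    baseline in the results table; a uniform draw averages 1/3.)"""
    return random.random()

def community_median_bot(human_forecasts: list[float]) -> float:
    """Crowd-wisdom baseline: the median of human forecasts on the same question."""
    return statistics.median(human_forecasts)

print(community_median_bot([0.2, 0.35, 0.4, 0.6]))  # 0.375
```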
| Benchmark | Domain | Contamination | Dynamic | Question Count |
|---|---|---|---|---|
| ForecastBench | Forecasting | None (future events) | Yes (continuous) | 1,000 |
| MMLU | General knowledge | High | No (static) | 15,908 |
| GSM8K | Math reasoning | Moderate | No (static) | 8,500 |
| HumanEval | Code generation | High | No (static) | 164 |
| AI Forecasting Benchmark | Forecasting | None | Yes (quarterly) | ≈350/quarter |

ForecastBench’s continuous dynamic updates distinguish it from static benchmarks that become contaminated over time.

| Project | Focus | Relationship to ForecastBench |
|---|---|---|
| XPT | Adversarial collaboration | Informed methodology; XPT showed superforecaster–expert gaps |
| FRI-ONN Nuclear Study | Nuclear risk forecasting | Applied forecasting methods |
| AI Progress Forecasting Panel | Expert AI predictions | Potential question source |

| Platform/Project | Type | Complementarity |
|---|---|---|
| Metaculus | Forecasting platform | ForecastBench uses Metaculus questions as a source |
| AI Forecasting Benchmark Tournament | Human vs AI competition | Similar goals, quarterly structure |
| Squiggle | Probabilistic modeling | Could use ForecastBench data as model inputs |
| Metaforecast | Forecast aggregation | Could aggregate ForecastBench bot predictions |

The authors argue that forecasting is a valuable proxy for general intelligence because it requires:

| Capability | Why It Matters for Forecasting |
|---|---|
| Knowledge integration | Combine information from multiple domains |
| Uncertainty reasoning | Express confidence probabilistically |
| Causal reasoning | Understand mechanisms driving outcomes |
| Temporal reasoning | Project trends forward in time |
| Calibration | Match confidence to actual accuracy |

Progress on ForecastBench may therefore indicate progress on general reasoning capabilities.
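
Of these capabilities, calibration is the most directly measurable from resolved questions: bin forecasts by stated probability and compare each bin’s average confidence to its empirical hit rate. A short sketch of this standard check on synthetic data (not ForecastBench output):

```python
import numpy as np

def calibration_table(forecasts: np.ndarray, outcomes: np.ndarray, bins: int = 5):
    """For each probability bin, report mean stated probability vs. observed frequency.
    A well-calibrated forecaster has the two numbers roughly equal in every bin."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (forecasts >= lo) & (forecasts < hi) if hi < 1.0 else (forecasts >= lo)
        if mask.any():
            rows.append((lo, hi, forecasts[mask].mean(), outcomes[mask].mean()))
    return rows

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)                      # synthetic forecasts
y = (rng.uniform(size=1000) < p).astype(float)  # outcomes drawn to be perfectly calibrated
for lo, hi, stated, observed in calibration_table(p, y):
    print(f"[{lo:.1f}, {hi:.1f})  stated {stated:.2f}  observed {observed:.2f}")
```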

If LLMs match superforecasters by late 2026, this suggests:

| Implication | Reasoning |
|---|---|
| AI reasoning progress | Forecasting requires sophisticated integration of knowledge |
| Economic impact | Automated forecasting could replace human analysts in some contexts |
| AI safety concern | Advanced forecasting means better strategic planning by AI systems |
| Validation of scaling | Continued capability gains from larger models and data |

However, extrapolation is uncertain: progress may plateau, or LLMs may hit a ceiling below human expert performance on the hardest questions.

| Strength | Evidence |
|---|---|
| Contamination-free | Only questions about future events |
| Dynamic updates | Continuous addition of new questions |
| Peer-reviewed | Published at ICLR 2025 (top-tier venue) |
| Multiple baselines | Superforecasters, public, LLMs, random |
| Open submission | Public leaderboard enables competition |
| Quantitative projection | Clear timeline for potential AI–human parity |

| Limitation | Impact |
|---|---|
| Resolution lag | Must wait for questions to resolve |
| Extrapolation uncertainty | Linear projection may not hold |
| Question distribution | May not cover all important forecasting domains |
| Human baseline variability | Superforecaster performance may vary over time |
| Cost of evaluation | Requires ongoing question curation and resolution |
| Narrow scope | Forecasting ≠ general intelligence (though correlated) |

ForecastBench is supported by Coefficient Giving grants to FRI:

| Grant | Amount | Purpose |
|---|---|---|
| Forecasting benchmark | $100K | Collaboration with the Steinhardt lab |
| General FRI support | Part of $10M+ total | Core operations and research |

Funding is committed through mid-2027, ensuring the benchmark remains active and updated.

Potential enhancements based on the current trajectory:

| Enhancement | Benefit | Challenge |
|---|---|---|
| Expand question domains | More comprehensive coverage | Curation effort |
| Add reasoning evaluation | Assess whether LLMs “understand” forecasts | Subjective judgment |
| Multi-turn forecasting | Test updating on new information | More complex protocol |
| Ensemble methods | Benchmark aggregation strategies | Requires multiple models |
| Adversarial questions | Test robustness to edge cases | Question design difficulty |