AI Forecasting Benchmark Tournament

| Dimension | Assessment | Evidence |
|---|---|---|
| Scale | Large | 348 questions (Q2 2025), 54 bot-makers participating |
| Rigor | High | Statistical significance testing, standardized scoring (Peer score) |
| Competitive | Strong | $10K quarterly prizes, API credits, public leaderboard |
| Key Finding | Clear | Pro Forecasters significantly outperform AI (p = 0.00001), though gap narrowing |
| Industry Support | Robust | OpenAI and Anthropic provide API credits |
| Practical Impact | Growing | Demonstrates current AI forecasting limitations and progress rate |
| Attribute | Details |
|---|---|
| Name | AI Forecasting Benchmark Tournament |
| Abbreviation | AIB (or AI Benchmark) |
| Organization | Metaculus |
| Launched | 2024 |
| Structure | 4-month seasonal tournament + bi-weekly MiniBench |
| Website | metaculus.com/aib/ |
| Prize Pool | $10,000 per quarter |
| Industry Partners | OpenAI (API credits), Anthropic (API credits) |

The AI Forecasting Benchmark Tournament is Metaculus’s flagship initiative for comparing human and AI forecasting capabilities. Launched in 2024, the tournament runs in two parallel series:

  1. Primary Seasonal Tournament: 4-month competitions with ~300-400 questions
  2. MiniBench: Bi-weekly fast-paced tournaments for rapid iteration

Participants can compete using API credits provided by OpenAI and Anthropic, encouraging experimentation with frontier LLMs. The tournament has become the premier benchmark for tracking AI progress on forecasting—a domain that requires integrating diverse information sources, reasoning under uncertainty, and calibrating confidence to match reality.

| Component | Duration | Question Count | Prize Pool |
|---|---|---|---|
| Seasonal Tournament | 4 months | ≈300-400 | $10,000 |
| MiniBench | 2 weeks | ≈20-30 | Varies |

Both components use Metaculus’s Peer score metric, which compares forecasters to each other and adjusts for question difficulty, making performance comparable across different question sets.
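
To illustrate the mechanics, a Peer-score-like metric can be computed as a forecaster’s log score minus the mean log score of the other forecasters on the same question, scaled by 100. The sketch below is a simplification: it assumes a binary question and ignores the time-averaging the actual scoring applies over a question’s open period.

```python
import math

def log_score(p: float, resolved_yes: bool) -> float:
    """Log score of a binary forecast: ln(p) if the question resolved Yes, ln(1 - p) otherwise."""
    return math.log(p if resolved_yes else 1.0 - p)

def peer_score(my_prob: float, other_probs: list[float], resolved_yes: bool) -> float:
    """Simplified Peer-score-like metric: my log score minus the mean log score
    of the other forecasters on the same question, scaled by 100.
    Positive = better than the comparison group; 0 = equal; negative = worse."""
    mine = log_score(my_prob, resolved_yes)
    others = sum(log_score(p, resolved_yes) for p in other_probs) / len(other_probs)
    return 100.0 * (mine - others)

# Example: a confident 0.85 forecast on a question that resolved Yes,
# compared against a field that hovered around 0.6, earns a positive score.
print(peer_score(0.85, [0.6, 0.55, 0.7, 0.65], resolved_yes=True))
```

Because every forecaster is scored against the same resolution, the relative comparison is what provides the difficulty adjustment: a hard question drags everyone’s log score down roughly equally, so only the difference matters.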

| Quarter | Best Bot Performance | Gap to Pro Forecasters | Key Development |
|---|---|---|---|
| Q3 2024 | -11.3 | Large negative gap | Initial baseline |
| Q4 2024 | -8.6 | Moderate negative gap | ≈24% improvement |
| Q1 2025 | First place (metac-o1) | Narrowing gap | First bot to lead leaderboard |
| Q2 2025 | OpenAI o3 (baseline) | Statistical gap remains (p = 0.00001) | Humans maintain clear lead |

Note: a Peer score of 0 means performance equal to the comparison group; negative scores indicate underperformance relative to Pro Forecasters.

Q2 2025 tournament results provided key insights:

| Metric | Finding |
|---|---|
| Questions | 348 |
| Bot-makers | 54 |
| Statistical significance | Pro Forecasters lead at p = 0.00001 |
| Top bot-makers | Top 3 (excluding Metaculus in-house) were students or hobbyists |
| Aggregation effect | Taking median or mean of multiple forecasts improved scores significantly |
| Best baseline bot | OpenAI’s o3 |

Despite rapid AI improvement, human expert forecasters remain statistically significantly better:

| Evidence | Interpretation |
|---|---|
| p = 0.00001 | Extremely strong statistical significance |
| Consistent across quarters | Not a fluke; reproducible result |
| Even best bots trail | Top AI systems still below human expert level |

The top 3 bot-makers (excluding Metaculus’s in-house bots) in Q2 2025 were students or hobbyists, not professional AI researchers:

| Implication | Explanation |
|---|---|
| Low barrier to entry | API access + creativity > credentials |
| Forecasting as craft | Domain knowledge + prompt engineering matter more than ML expertise |
| Innovation from edges | Some of the best approaches come from non-traditional participants |

Taking the median or mean of multiple LLM forecasts, rather than relying on a single call, substantially improved scores:

| Method | Performance |
|---|---|
| Single LLM call | Baseline |
| Median of multiple calls | Significantly better |
| Mean of multiple calls | Significantly better |

This suggests that ensemble methods are critical for AI forecasting, similar to how aggregating multiple human forecasters improves accuracy.
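
A minimal sketch of this kind of aggregation is shown below. The `ask_model` callable is a stand-in for one LLM query parsed into a probability; the call count, clamping, and dummy usage are illustrative choices, not the tournament’s actual configuration.

```python
import statistics
from typing import Callable

def ensemble_forecast(ask_model: Callable[[str], float],
                      question_text: str,
                      n_calls: int = 7,
                      use_median: bool = True) -> float:
    """Query the model several times and aggregate the results.
    The median is more robust to a single wild outlier, while the mean
    rewards consistently moderate forecasts."""
    samples = [ask_model(question_text) for _ in range(n_calls)]
    samples = [min(max(p, 0.01), 0.99) for p in samples]  # keep probabilities away from 0 and 1
    return statistics.median(samples) if use_median else statistics.mean(samples)

# Illustrative usage with a dummy "model" that replays canned samples:
if __name__ == "__main__":
    canned = iter([0.62, 0.58, 0.95, 0.60, 0.57])
    print(ensemble_forecast(lambda q: next(canned), "Will X happen by 2026?", n_calls=5))  # 0.60
```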

Metaculus Community Prediction Remains Strong


The average Peer score for the Metaculus Community Prediction is 12.9, ranking in the top 10 on the global leaderboard over every 2-year period since 2016. This demonstrates that aggregated human forecasts remain world-class and provide a high bar for AI systems to match.

Participants develop forecasting bots using:

| Component | Description |
|---|---|
| API Access | OpenAI and Anthropic provide credits |
| Metaculus API | Fetch questions, submit forecasts |
| Prompt Engineering | Craft prompts that produce well-calibrated forecasts |
| Aggregation Logic | Combine multiple model calls or different models |
| Continuous Learning | Iterate based on quarterly feedback |
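
Put together, these components form a simple quarterly loop. The sketch below shows the general shape of such a pipeline; the endpoint paths, field names, authentication scheme, and the `llm_probability` helper are illustrative placeholders rather than the documented Metaculus API.

```python
import requests

API_BASE = "https://www.metaculus.com/api2"   # illustrative; consult the official API docs
TOKEN = "YOUR_METACULUS_TOKEN"                # placeholder credential
HEADERS = {"Authorization": f"Token {TOKEN}"}

def llm_probability(question_title: str, background: str) -> float:
    """Placeholder for the LLM step: build a prompt, call OpenAI/Anthropic using the
    provided API credits, parse the reply into a probability, and ensemble several calls."""
    raise NotImplementedError

def run_bot(tournament_id: int) -> None:
    # 1. Fetch open questions for the tournament (parameter and field names are illustrative).
    resp = requests.get(f"{API_BASE}/questions/",
                        params={"tournament": tournament_id, "status": "open"},
                        headers=HEADERS)
    resp.raise_for_status()
    for q in resp.json().get("results", []):
        # 2. Produce a forecast with the LLM plus aggregation logic.
        p = llm_probability(q["title"], q.get("description", ""))
        # 3. Submit the prediction back to Metaculus (endpoint shape is illustrative).
        requests.post(f"{API_BASE}/questions/{q['id']}/predict/",
                      json={"prediction": p}, headers=HEADERS).raise_for_status()
```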

Metaculus uses Peer score for fair comparison:

| Feature | Benefit |
|---|---|
| Relative comparison | Compares forecasters to each other, not to absolute truth |
| Difficulty adjustment | Accounts for question hardness |
| Time-averaged | Rewards updating when new information emerges |
| Equalizes participation | Forecasters with different time constraints remain comparable |
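
The time-averaging is what rewards prompt updating: roughly, the score of whatever forecast is standing is averaged over the question’s open period, so moving toward the right answer earlier lifts more of that period. The sketch below assumes daily snapshots of the standing forecast, a simplification of the actual coverage-weighted averaging.

```python
import math

def time_averaged_log_score(daily_probs: list[float], resolved_yes: bool) -> float:
    """Average the instantaneous log score of the standing forecast over the days
    the question was open. Earlier updates toward the correct answer raise more
    of the days, so they earn a better average than a late correction."""
    scores = [math.log(p if resolved_yes else 1.0 - p) for p in daily_probs]
    return sum(scores) / len(scores)

# Two forecasters on a question open for 4 days that resolved Yes:
early_updater = [0.5, 0.8, 0.8, 0.8]   # moved to 0.8 on day 2
late_updater  = [0.5, 0.5, 0.5, 0.8]   # moved to 0.8 only on day 4
print(time_averaged_log_score(early_updater, True) > time_averaged_log_score(late_updater, True))  # True
```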

Metaculus provides baseline bots for comparison:

| Bot | Method | Purpose |
|---|---|---|
| GPT-4o | Vanilla frontier LLM | Standard baseline |
| o3 | OpenAI’s reasoning model | Best performance (Q2 2025) |
| Claude variants | Anthropic frontier models | Alternative baseline |
| Metaculus in-house | Custom implementations | Metaculus’s own research |

| Project | Organization | Focus | Structure | Scale |
|---|---|---|---|---|
| AI Forecasting Benchmark | Metaculus | Human vs AI | Quarterly tournaments | ≈350 questions/quarter |
| ForecastBench | FRI | AI benchmarking | Continuous evaluation | 1,000 questions |
| XPT | FRI | Expert collaboration | One-time tournament | ≈100 questions |
| Good Judgment | Good Judgment Inc | Superforecaster panels | Ongoing operations | Client-specific |

The AI Forecasting Benchmark’s quarterly structure balances rapid iteration (faster than XPT) with sufficient time for meaningful comparison (longer than weekly competitions).

Both frontier AI labs provide API credits to tournament participants:

| Benefit | Impact |
|---|---|
| Free experimentation | Lowers cost barrier for participants |
| Frontier model access | Ensures latest capabilities are tested |
| Corporate validation | Labs view forecasting as important benchmark |
| Data for research | Labs learn from bot performance patterns |

The tournament provides empirical data on AI’s ability to:

| Capability | Forecasting Relevance |
|---|---|
| Information integration | Combine diverse sources to estimate probabilities |
| Calibration | Match confidence to actual frequency of outcomes |
| Temporal reasoning | Project trends forward in time |
| Uncertainty quantification | Express degrees of belief numerically |
| Continuous learning | Update beliefs as new information emerges |
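
Calibration in particular is straightforward to measure once questions resolve: bucket forecasts by stated probability and compare each bucket’s average stated confidence with the fraction that actually resolved Yes. A minimal sketch, with an arbitrary equal-width bucketing scheme:

```python
from collections import defaultdict

def calibration_table(forecasts: list[tuple[float, bool]], n_bins: int = 10):
    """Group (probability, resolved_yes) pairs into equal-width bins and report
    mean stated probability vs. observed frequency of Yes per bin.
    Well-calibrated forecasts have the two values roughly matching in every bin."""
    bins: dict[int, list[tuple[float, bool]]] = defaultdict(list)
    for p, outcome in forecasts:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, outcome))
    rows = []
    for b in sorted(bins):
        pairs = bins[b]
        stated = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(1 for _, o in pairs if o) / len(pairs)
        rows.append((stated, observed, len(pairs)))
    return rows

# e.g. calibration_table([(0.7, True), (0.72, False), (0.9, True)]) -> per-bin (stated, observed, count)
```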

Based on current trajectory:

| Milestone | Estimated Timing | Significance |
|---|---|---|
| Bot equals median human | Achieved (Q1 2025) | AI matches casual forecasters |
| Bot equals Pro Forecaster | 2026-2027? | AI matches human experts |
| Bot exceeds Community Prediction | 2027-2028? | AI exceeds aggregated human wisdom |

These milestones serve as empirical indicators of AI reasoning progress.

The AI Forecasting Benchmark integrates with Metaculus’s broader platform:

| Component | Relationship to AI Benchmark |
|---|---|
| Pro Forecasters | Human comparison group |
| Community Prediction | Aggregated human baseline |
| AI 2027 Tournament | AI-specific questions for human forecasters |
| Track Record Page | Historical calibration data |

Researchers use the tournament to:

  • Benchmark new model architectures
  • Test prompt engineering strategies
  • Validate aggregation methods
  • Track capability progress over time

The tournament informs:

  • When to trust AI vs human forecasts
  • How to combine AI and human forecasts
  • Optimal ensemble strategies
  • Calibration techniques

The tournament provides evidence for:

  • Current AI reasoning capabilities
  • Rate of AI capability progress
  • Domains where AI still trails humans
  • Potential for AI-assisted forecasting on x-risk questions

| Strength | Evidence |
|---|---|
| Large scale | 300-400 questions per quarter |
| Real-time competition | Ongoing rather than one-time |
| Industry support | OpenAI and Anthropic API credits |
| Public leaderboard | Transparent comparison |
| Statistical rigor | Significance testing, controlled scoring |
| Accessible | Students/hobbyists competitive with professionals |

| Limitation | Impact |
|---|---|
| Quarterly lag | Results only every 3-4 months |
| API cost dependency | Limits experimentation for some participants |
| Question selection | May not cover all important domains |
| Bot sophistication ceiling | Diminishing returns to complexity? |
| Human baseline variability | Pro Forecaster performance may change over time |

The tournament is supported by:

| Source | Type | Amount |
|---|---|---|
| Coefficient Giving | Grant funding to Metaculus | $1.5M+ (2022-2023) |
| OpenAI | API credit sponsorship | Not disclosed |
| Anthropic | API credit sponsorship | Not disclosed |
| Prize Pool | Per quarter | $10,000 |

Total annual prize commitment: $40,000 (4 quarters × $10K).

Potential future enhancements:

| Enhancement | Benefit | Challenge |
|---|---|---|
| Increase question diversity | Test broader capabilities | Curation effort |
| Add multi-turn forecasting | Test updating based on new info | More complex protocol |
| Reasoning evaluation | Assess whether bots “understand” | Subjective judgment |
| Cross-tournament comparison | Link to ForecastBench, Good Judgment | Standardization |
| Adversarial questions | Test robustness | Question design |