AI Forecasting Benchmark Tournament
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Scale | Large | 348 questions (Q2 2025), 54 bot-makers participating |
| Rigor | High | Statistical significance testing, standardized scoring (Peer score) |
| Competitive | Strong | $10K quarterly prizes, API credits, public leaderboard |
| Key Finding | Clear | Pro Forecasters significantly outperform AI (p = 0.00001), though gap narrowing |
| Industry Support | Robust | OpenAI and Anthropic provide API credits |
| Practical Impact | Growing | Demonstrates current AI forecasting limitations and progress rate |
Tournament Details
| Attribute | Details |
|---|---|
| Name | AI Forecasting Benchmark Tournament |
| Abbreviation | AIB (or AI Benchmark) |
| Organization | Metaculus |
| Launched | 2024 |
| Structure | 4-month seasonal tournament + bi-weekly MiniBench |
| Website | metaculus.com/aib/ |
| Prize Pool | $10,000 per quarter |
| Industry Partners | OpenAI (API credits), Anthropic (API credits) |
Overview
The AI Forecasting Benchmark Tournament is Metaculus’s flagship initiative for comparing human and AI forecasting capabilities. Launched in 2024, the tournament runs in two parallel series:
- Primary Seasonal Tournament: 4-month competitions with ~300-400 questions
- MiniBench: Bi-weekly fast-paced tournaments for rapid iteration
Participants can compete using API credits provided by OpenAI and Anthropic, encouraging experimentation with frontier LLMs. The tournament has become the premier benchmark for tracking AI progress on forecasting—a domain that requires integrating diverse information sources, reasoning under uncertainty, and calibrating confidence to match reality.
Structure
| Component | Duration | Question Count | Prize Pool |
|---|---|---|---|
| Seasonal Tournament | 4 months | ≈300-400 | $10,000 |
| MiniBench | 2 weeks | ≈20-30 | Varies |
Both components use Metaculus’s Peer score metric, which compares forecasters against each other and adjusts for question difficulty, making performance comparable across different question sets.
Historical Results
Quarterly Performance Trajectory
| Quarter | Best Bot Performance | Gap to Pro Forecasters | Key Development |
|---|---|---|---|
| Q3 2024 | -11.3 | Large negative gap | Initial baseline |
| Q4 2024 | -8.6 | Moderate negative gap | ≈24% improvement |
| Q1 2025 | First place (metac-o1) | Narrowing gap | First bot to lead leaderboard |
| Q2 2025 | OpenAI o3 (baseline) | Statistical gap remains (p = 0.00001) | Humans maintain clear lead |
Note: Score of 0 = equal to comparison group. Negative scores mean underperformance relative to Pro Forecasters.
Q2 2025 Detailed Results
Q2 2025 tournament results provided key insights:
| Metric | Finding |
|---|---|
| Questions | 348 |
| Bot-makers | 54 |
| Statistical significance | Pro Forecasters lead at p = 0.00001 |
| Top bot-makers | Top 3 (excluding Metaculus in-house) were students or hobbyists |
| Aggregation effect | Taking median or mean of multiple forecasts improved scores significantly |
| Best baseline bot | OpenAI’s o3 |
Key Findings
Pro Forecasters Maintain Lead
Despite rapid AI improvement, human expert forecasters maintain a statistically significant lead:
| Evidence | Interpretation |
|---|---|
| p = 0.00001 | Extremely strong statistical significance |
| Consistent across quarters | Not a fluke; reproducible result |
| Even best bots trail | Top AI systems still below human expert level |
Students and Hobbyists Competitive
The top 3 bot-makers (excluding Metaculus’s in-house bots) in Q2 2025 were students or hobbyists, not professional AI researchers:
| Implication | Explanation |
|---|---|
| Low barrier to entry | API access + creativity > credentials |
| Forecasting as craft | Domain knowledge + prompt engineering matters more than ML expertise |
| Innovation from edges | Some of the best approaches come from non-traditional participants |
Aggregation Helps Significantly
Taking the median or mean of multiple LLM forecasts, rather than relying on a single call, substantially improved scores:
| Method | Performance |
|---|---|
| Single LLM call | Baseline |
| Median of multiple calls | Significantly better |
| Mean of multiple calls | Significantly better |
This suggests that ensemble methods are critical for AI forecasting, similar to how aggregating multiple human forecasters improves accuracy.
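As a concrete illustration, a bot might sample a model several times and take the median of the returned probabilities. The sketch below assumes a hypothetical `query_model` callable (a stand-in for whatever API wrapper a bot uses); the sample count and clamping bounds are arbitrary choices, not the tournament's prescribed method.

```python
import statistics

def aggregate_forecast(query_model, question: str, n_samples: int = 8) -> float:
    """Query an LLM several times and aggregate the resulting probabilities.

    `query_model` is a hypothetical stand-in for an API wrapper: it takes the
    question text and returns a probability in [0, 1].
    """
    samples = [query_model(question) for _ in range(n_samples)]
    # The median is robust to the occasional extreme sample; swap in
    # statistics.mean(samples) to compare the two aggregators.
    prob = statistics.median(samples)
    # Keep forecasts away from 0 and 1, since log-based scores punish
    # confident misses severely.
    return min(max(prob, 0.01), 0.99)
```

The same pooling can be applied across different models, mirroring how aggregating multiple human forecasters improves accuracy.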
Metaculus Community Prediction Remains Strong
The average Peer score for the Metaculus Community Prediction is 12.9, ranking in the top 10 on the global leaderboard over every 2-year period since 2016. This demonstrates that aggregated human forecasts remain world-class and provide a high bar for AI systems to match.
Technical Implementation
Bot Development Process
Participants develop forecasting bots using the components below (a minimal scaffolding sketch follows the table):
| Component | Description |
|---|---|
| API Access | OpenAI and Anthropic provide credits |
| Metaculus API | Fetch questions, submit forecasts |
| Prompt Engineering | Craft prompts that produce well-calibrated forecasts |
| Aggregation Logic | Combine multiple model calls or different models |
| Continuous Learning | Iterate based on quarterly feedback |
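Putting these components together, a bot is essentially a loop: fetch open questions, prompt one or more models, aggregate, and submit. The sketch below shows only the fetch/submit scaffolding; the endpoint paths, query parameters, and payload fields are illustrative placeholders rather than the documented Metaculus API, so consult the current API reference before relying on them.

```python
import requests

API_BASE = "https://www.metaculus.com/api2"   # placeholder; confirm against current API docs
TOKEN = "YOUR_METACULUS_TOKEN"                # placeholder credential

def fetch_open_questions(tournament_id: int) -> list[dict]:
    """Fetch open questions for a tournament (endpoint and params are illustrative)."""
    resp = requests.get(
        f"{API_BASE}/questions/",
        params={"tournaments": tournament_id, "status": "open"},
        headers={"Authorization": f"Token {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

def submit_forecast(question_id: int, probability: float) -> None:
    """Submit a probability for a binary question (payload shape is illustrative)."""
    resp = requests.post(
        f"{API_BASE}/questions/{question_id}/predict/",
        json={"prediction": probability},
        headers={"Authorization": f"Token {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

# A full bot would loop over fetch_open_questions(...), build a prompt for each
# question, aggregate several model calls (see the aggregation sketch above),
# and call submit_forecast(...) with the result.
```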
Scoring: Peer Score
Metaculus uses the Peer score for fair comparison (a simplified formula sketch follows the table):
| Feature | Benefit |
|---|---|
| Relative comparison | Compares forecasters to each other, not absolute truth |
| Difficulty adjustment | Accounts for question hardness |
| Time-averaged | Rewards updating when new information emerges |
| Equalizes participation | Forecasters with different time constraints comparable |
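In spirit, the Peer score compares a forecaster's log score to everyone else's on the same question. The sketch below captures that structure only; the exact constants, time-averaging, and coverage handling follow Metaculus's published definition rather than this simplified form.

```latex
% Simplified peer-style score for forecaster i on one resolved binary question.
% o is the realized outcome, p_k(o) is forecaster k's (time-averaged) probability
% for that outcome, and N is the number of forecasters on the question.
\[
  \text{Peer}_i \;=\; \frac{100}{N-1} \sum_{j \ne i} \bigl( \ln p_i(o) - \ln p_j(o) \bigr)
\]
% A score of 0 means doing exactly as well as the comparison group, which is why
% negative entries in the results table above indicate underperformance.
```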
Baseline Bots
Metaculus provides baseline bots for comparison:
| Bot | Method | Purpose |
|---|---|---|
| GPT-4o | Vanilla frontier LLM | Standard baseline |
| o3 | OpenAI’s reasoning model | Best performance (Q2 2025) |
| Claude variants | Anthropic frontier models | Alternative baseline |
| Metaculus in-house | Custom implementations | Metaculus’s own research |
Comparison with Other Projects
| Project | Organization | Focus | Structure | Scale |
|---|---|---|---|---|
| AI Forecasting Benchmark | Metaculus | Human vs AI | Quarterly tournaments | ≈350 questions/quarter |
| ForecastBench | FRI | AI benchmarking | Continuous evaluation | 1,000 questions |
| XPT | FRI | Expert collaboration | One-time tournament | ≈100 questions |
| Good Judgment | Good Judgment Inc | Superforecaster panels | Ongoing operations | Client-specific |
The AI Forecasting Benchmark’s quarterly structure balances rapid iteration (faster than XPT) with sufficient time for meaningful comparison (longer than weekly competitions).
Industry Partnerships
OpenAI and Anthropic Support
Both frontier AI labs provide API credits to tournament participants:
| Benefit | Impact |
|---|---|
| Free experimentation | Lowers cost barrier for participants |
| Frontier model access | Ensures latest capabilities are tested |
| Corporate validation | Labs view forecasting as important benchmark |
| Data for research | Labs learn from bot performance patterns |
Implications for AI Development
Forecasting as Intelligence Proxy
The tournament provides empirical data on AI’s ability to handle the following (a calibration check is sketched after the table):
| Capability | Forecasting Relevance |
|---|---|
| Information integration | Combine diverse sources to estimate probabilities |
| Calibration | Match confidence to actual frequency of outcomes |
| Temporal reasoning | Project trends forward in time |
| Uncertainty quantification | Express degrees of belief numerically |
| Continuous learning | Update beliefs as new information emerges |
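Of these capabilities, calibration is the most mechanical to verify once questions resolve: group forecasts by stated probability and compare each group's average probability with its observed resolution frequency. The sketch below is generic bookkeeping rather than tournament code; the function name and bin count are arbitrary.

```python
from collections import defaultdict

def calibration_table(forecasts: list[float], outcomes: list[int], bins: int = 10):
    """Compare stated probabilities against observed frequencies.

    `forecasts` are probabilities in [0, 1]; `outcomes` are 0/1 resolutions.
    A well-calibrated forecaster's observed frequency in each bin should sit
    close to that bin's average stated probability.
    """
    grouped = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        grouped[min(int(p * bins), bins - 1)].append((p, y))
    rows = []
    for b in sorted(grouped):
        pairs = grouped[b]
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((avg_p, freq, len(pairs)))
    return rows
```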
Near-Term Milestones
Based on current trajectory:
| Milestone | Estimated Timing | Significance |
|---|---|---|
| Bot equals median human | Achieved (Q1 2025) | AI matches casual forecasters |
| Bot equals Pro Forecaster | 2026-2027? | AI matches human experts |
| Bot exceeds Community Prediction | 2027-2028? | AI exceeds aggregated human wisdom |
These milestones serve as empirical indicators of AI reasoning progress.
Relationship to Metaculus Ecosystem
The AI Forecasting Benchmark integrates with Metaculus’s broader platform:
| Component | Relationship to AI Benchmark |
|---|---|
| Pro Forecasters | Human comparison group |
| Community Prediction | Aggregated human baseline |
| AI 2027 Tournament | AI-specific questions for human forecasters |
| Track Record Page | Historical calibration data |
Use Cases
AI Research
Researchers use the tournament to:
- Benchmark new model architectures
- Test prompt engineering strategies
- Validate aggregation methods
- Track capability progress over time
Forecasting Methodology
The tournament informs:
- When to trust AI vs human forecasts
- How to combine AI and human forecasts
- Optimal ensemble strategies
- Calibration techniques
AI Safety
The tournament provides evidence for:
- Current AI reasoning capabilities
- Rate of AI capability progress
- Domains where AI still trails humans
- Potential for AI-assisted forecasting on x-risk questions
Strengths and Limitations
Strengths
| Strength | Evidence |
|---|---|
| Large scale | 300-400 questions per quarter |
| Real-time competition | Ongoing rather than one-time |
| Industry support | OpenAI and Anthropic API credits |
| Public leaderboard | Transparent comparison |
| Statistical rigor | Significance testing, controlled scoring |
| Accessible | Students/hobbyists competitive with professionals |
Limitations
| Limitation | Impact |
|---|---|
| Quarterly lag | Results only every 3-4 months |
| API cost dependency | Limits experimentation for some participants |
| Question selection | May not cover all important domains |
| Bot sophistication ceiling | Diminishing returns to complexity? |
| Human baseline variability | Pro Forecaster performance may change over time |
Funding
The tournament is supported by:
| Source | Type | Amount |
|---|---|---|
| Coefficient Giving | Grant funding to Metaculus | $1.5M+ (2022-2023) |
| OpenAI | API credit sponsorship | Not disclosed |
| Anthropic | API credit sponsorship | Not disclosed |
| Prize Pool | Per quarter | $10,000 |
Total annual prize commitment: $40,000 (4 quarters × $10K).
Future Directions
Potential enhancements based on current trajectory:
| Enhancement | Benefit | Challenge |
|---|---|---|
| Increase question diversity | Test broader capabilities | Curation effort |
| Add multi-turn forecasting | Test updating based on new info | More complex protocol |
| Reasoning evaluation | Assess whether bots “understand” | Subjective judgment |
| Cross-tournament comparison | Link to ForecastBench, Good Judgment | Standardization |
| Adversarial questions | Test robustness | Question design |