ForecastBench
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Innovation | Exceptional | First dynamic, contamination-free AI forecasting benchmark |
| Research Quality | Peer-reviewed | Published at ICLR 2025 (top-tier ML conference) |
| Practical Impact | High | Provides empirical grounding for claims about AI forecasting progress |
| Benchmark Design | Robust | 1,000 questions, continuous updates, multiple baselines |
| Key Finding | Significant | LLMs improving rapidly but superforecasters still lead; projected parity late 2026 |
| Replicability | High | Open submission leaderboard, documented methodology |
Project Details
| Attribute | Details |
|---|---|
| Name | ForecastBench |
| Organization | Forecasting Research Institute (FRI) |
| Authors | Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock |
| Published | ICLR 2025 |
| Launch Date | September 2024 |
| Website | forecastbench.org |
| Paper | OpenReview ICLR 2025 |
| Funding | Coefficient Giving (supported through mid-2027) |
| Question Count | 1,000 (continuously updated) |
Overview
ForecastBench is FRI’s dynamic benchmark for evaluating the forecasting capabilities of large language models, designed to solve the data contamination problem that plagues static AI benchmarks. Published at ICLR 2025, ForecastBench maintains a pool of 1,000 questions, continuously refreshed with new future-dated questions so that every question concerns an event with no known answer at submission time.
The benchmark was created to address a critical methodological issue: as LLMs are trained on vast internet corpora, they may have seen the answers to static benchmark questions in their training data. By focusing exclusively on questions about future events that haven’t resolved yet, ForecastBench provides a contamination-free measure of genuine forecasting ability.
The authors (led by FRI Research Director Ezra Karger and Chief Scientist Philip Tetlock) designed ForecastBench as a “valuable proxy for general intelligence” since forecasting requires integrating diverse knowledge sources and reasoning under uncertainty.
Current Results
As of February 2025:
| Forecaster | Difficulty-Adjusted Brier Score | Status |
|---|---|---|
| Superforecasters | 0.081 | Best overall performance |
| GPT-4.5 | 0.101 | Best LLM performance |
| GPT-4 (Mar 2023) | 0.131 | Baseline frontier model |
| Public Participants | ≈0.12 | LLMs now outperform non-experts |
| Random Baseline | 0.25 | Chance performance |
Critical finding: The remaining gap between superforecasters and GPT-4.5 (0.020 Brier points) is smaller than the improvement from GPT-4 to GPT-4.5 (0.030 Brier points), but larger than the estimated annual rate of LLM improvement (≈0.016 Brier points per year), so meaningful room for improvement remains despite rapid progress.
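For readers unfamiliar with the metric, the sketch below shows how a plain Brier score is computed for binary questions; lower is better, and always answering 50% yields 0.25. ForecastBench reports difficulty-adjusted Brier scores (see Methodology below), so this illustrates the underlying metric rather than the benchmark's exact scoring code.

```python
# Minimal sketch of plain Brier scoring for binary questions. ForecastBench
# reports difficulty-adjusted scores; this only illustrates the base metric.

def brier_score(prob: float, outcome: int) -> float:
    """Squared error between a probability forecast and the 0/1 outcome."""
    return (prob - outcome) ** 2

def mean_brier(forecasts: list[tuple[float, int]]) -> float:
    """Average Brier score over (probability, outcome) pairs; lower is better."""
    return sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)

# Three resolved questions forecast at 0.9, 0.2, and 0.6, resolving yes, no, yes:
print(mean_brier([(0.9, 1), (0.2, 0), (0.6, 1)]))  # ≈ 0.07
```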
Design Philosophy
Solving the Contamination Problem
Static benchmarks have a fatal flaw for evaluating forecasting:
| Problem | Impact | ForecastBench Solution |
|---|---|---|
| Training data contamination | LLMs may have seen answers | Only questions about future events |
| Benchmark staleness | Questions become outdated | Continuous addition of new questions |
| No ground truth yet | Can’t verify answers immediately | Questions resolve on schedule (days to months) |
Example contamination scenario:
- Static benchmark: “Will COVID-19 vaccines be approved by end of 2020?” (known answer: yes)
- ForecastBench: “Will a new pandemic pathogen emerge by end of 2026?” (unknown answer)
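The selection rule this implies can be sketched in a few lines: only questions whose resolution date lies after the forecast deadline are served. The field names and data structure below are illustrative assumptions, not the actual ForecastBench schema.

```python
# Illustrative sketch (field names are assumptions, not the ForecastBench
# schema): a dynamic benchmark stays contamination-free by only serving
# questions that have not yet resolved at forecast time.
from dataclasses import dataclass
from datetime import date

@dataclass
class Question:
    text: str
    resolution_date: date   # date the ground truth becomes known
    resolved: bool = False

def eligible_questions(pool: list[Question], forecast_due: date) -> list[Question]:
    """Keep only questions whose answer cannot appear in any training corpus yet."""
    return [q for q in pool if not q.resolved and q.resolution_date > forecast_due]

pool = [
    Question("Will COVID-19 vaccines be approved by end of 2020?", date(2020, 12, 31), resolved=True),
    Question("Will a new pandemic pathogen emerge by end of 2026?", date(2026, 12, 31)),
]
print([q.text for q in eligible_questions(pool, date(2025, 2, 1))])
# Only the unresolved, future-dated question survives the filter.
```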
Question Sources
ForecastBench draws questions from two categories:
Market Questions
Questions sourced from prediction platforms:
| Platform | Type | Example Questions |
|---|---|---|
| Metaculus | Reputation-based | “When will AGI be developed?” |
| Manifold | Play money market | “Will SpaceX land on Mars by 2030?” |
| Polymarket | Real money (crypto) | “Who will win the 2028 US presidential election?” |
| RAND | Expert elicitation | “What’s the probability of nuclear conflict by 2035?” |
Dataset Questions
Questions about future values in public datasets:
| Dataset | Type | Example Questions |
|---|---|---|
| ACLED | Conflict events | “How many conflict fatalities in Syria next month?” |
| DBnomics | Economic indicators | “What will Germany’s GDP growth rate be in Q3 2026?” |
| FRED | Economic data | “What will US unemployment be in December 2026?” |
| Wikipedia | Pageviews, edits | “How many monthly pageviews for ‘AGI’ in March 2026?” |
| Yahoo Finance | Stock prices, indices | “What will the S&P 500 close at on December 31, 2026?” |
Key Findings
Superforecasters Still Lead
| Finding | Evidence |
|---|---|
| Superforecasters remain best | 0.081 Brier score vs 0.101 for GPT-4.5 |
| Gap is substantial | 0.020 Brier points = large performance difference |
| Gap larger than LLM improvement rate | SF-GPT gap (0.020) > GPT improvement (0.016/year) |
Rapid LLM Improvement
| Metric | Value | Implication |
|---|---|---|
| Annual improvement rate | ≈0.016 difficulty-adjusted Brier points | Consistent, measurable progress |
| Projected parity date | November 2026 | Linear extrapolation from current trajectory |
| 95% Confidence Interval | December 2025 – January 2028 | Uncertainty in timeline |
| Time to parity | 12-24 months from Feb 2025 | Near-term milestone |
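A back-of-envelope check of these figures, assuming a constant improvement rate: the 0.020-point gap divided by ≈0.016 points per year gives roughly 1.25 years from February 2025. The published November 2026 projection and its confidence interval come from the authors' own fit over the full score history, so the snippet below is only a sanity check on the arithmetic.

```python
# Naive constant-rate extrapolation using the figures in the table above.
# The authors' published projection (November 2026, 95% CI Dec 2025 - Jan 2028)
# comes from their own regression; this is only an arithmetic sanity check.
sf_score = 0.081            # superforecaster difficulty-adjusted Brier score
best_llm = 0.101            # GPT-4.5, February 2025
annual_improvement = 0.016  # estimated Brier-point improvement per year

gap = best_llm - sf_score                   # ≈ 0.020
years_to_parity = gap / annual_improvement  # ≈ 1.25 years from Feb 2025
print(f"gap={gap:.3f}, years to parity ≈ {years_to_parity:.2f}")
```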
LLMs Now Outperform Non-Experts
| Group | Brier Score | Interpretation |
|---|---|---|
| Superforecasters | 0.081 | Top human performance |
| GPT-4.5 | 0.101 | Best AI performance |
| Public forecasters | ≈0.12 | Casual participants |
| GPT-4 | 0.131 | 2-year-old frontier model |
LLMs have crossed the threshold of matching casual human forecasters but still trail expert human forecasters by a meaningful margin.
Initial Models Underperformed
Claude-3.5 Sonnet and GPT-4 Turbo initially performed roughly as well as a simple median of public forecasts, suggesting that early frontier LLMs without specialized forecasting training were comparable to crowd aggregation.
Methodology
Difficulty Adjustment
ForecastBench uses difficulty-adjusted Brier scores to account for question hardness:
| Adjustment | Purpose | Method |
|---|---|---|
| Baseline | Some questions easier than others | Compare to community median |
| Normalization | Make scores comparable across question sets | Adjust relative to typical forecaster |
| Standardization | Remove sampling artifacts | Control for question distribution |
This ensures that an LLM scoring 0.101 on hard questions is rated fairly compared to a forecaster scoring 0.12 on easier questions.
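The exact adjustment procedure is described in the paper; the sketch below illustrates the simplest version suggested by the table above, scoring each forecaster relative to the community median on the same questions. Treat it as an illustration of the idea, not the ForecastBench formula.

```python
# Simplified sketch of one way to adjust for question difficulty: compare each
# forecaster's per-question Brier score to the community median on the same
# question. This is an illustration, not the exact ForecastBench method.

def difficulty_adjusted(forecaster_briers: list[float], median_briers: list[float]) -> float:
    """Average per-question gap between a forecaster and the community median.

    Negative values mean the forecaster beat the median on average, regardless
    of whether the questions drawn happened to be easy or hard.
    """
    assert len(forecaster_briers) == len(median_briers)
    diffs = [f - m for f, m in zip(forecaster_briers, median_briers)]
    return sum(diffs) / len(diffs)

# A forecaster scored on one easy and one hard question:
print(difficulty_adjusted([0.05, 0.20], [0.08, 0.30]))  # -0.065: better than median
```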
Resolution Timelines
Questions resolve on different timescales:
| Timeline | Percentage | Examples |
|---|---|---|
| Days | ≈10% | Near-term events (elections, product launches) |
| Weeks | ≈30% | Economic indicators, conflict events |
| Months | ≈40% | Technology milestones, policy decisions |
| Years | ≈20% | Long-term trends (AGI timelines, climate) |
This distribution balances rapid feedback for validation with long-term questions relevant to AI safety.
Leaderboard and Submissions
Public Leaderboard
The ForecastBench leaderboard allows:
- Open submission: Anyone can submit LLM forecasts
- Standardized comparison: All entries scored on same questions
- Transparency: Methodology and scores public
- Competition: Drive improvement through benchmarking
Baseline Bots
ForecastBench includes baseline forecasting bots:
| Bot | Method | Purpose |
|---|---|---|
| Random | Uniform distribution | Lower bound |
| Community median | Aggregate human forecasts | Crowd wisdom baseline |
| GPT-4 | Vanilla frontier LLM | Historical baseline |
| GPT-4.5 | Current frontier LLM | State-of-the-art |
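The two simplest baselines are easy to reproduce: a uniform 0.5 forecast on binary questions yields an expected Brier score of 0.25 (the Random row in the results table), and the community-median bot simply aggregates public forecasts. The sketch below implements both as plain functions; it is illustrative, not the benchmark's actual bot code.

```python
# Sketch of the two simplest baselines. A uniform 0.5 forecast on binary
# questions scores (0.5 - 1)^2 = (0.5 - 0)^2 = 0.25 whatever the outcome,
# matching the Random baseline; the median bot aggregates human forecasts.
import statistics

def random_baseline(_question: str) -> float:
    """Maximally uncertain forecast: probability 0.5 regardless of question."""
    return 0.5

def median_baseline(human_forecasts: list[float]) -> float:
    """Crowd-wisdom baseline: median of public forecasts on the question."""
    return statistics.median(human_forecasts)

print(random_baseline("Will X happen by 2026?"))   # 0.5
print(median_baseline([0.3, 0.45, 0.6, 0.7]))      # 0.525
```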
Comparison with Other Benchmarks
| Benchmark | Domain | Contamination | Dynamic | Question Count |
|---|---|---|---|---|
| ForecastBench | Forecasting | None (future events) | Yes (continuous) | 1,000 |
| MMLU | General knowledge | High | No (static) | 15,908 |
| GSM8K | Math reasoning | Moderate | No (static) | 8,500 |
| HumanEval | Code generation | High | No (static) | 164 |
| AI Forecasting Benchmark | Forecasting | None | Yes (quarterly) | ≈350/quarter |
ForecastBench’s continuous dynamic updates distinguish it from static benchmarks that become contaminated over time.
Relationship to Other Projects
FRI Ecosystem
| Project | Focus | Relationship to ForecastBench |
|---|---|---|
| XPT (Existential Risk Persuasion Tournament) | Adversarial collaboration | Informed methodology; XPT showed superforecaster-expert gaps |
| FRI-ONN Nuclear Study | Nuclear risk forecasting | Applied forecasting methods |
| AI Progress Forecasting Panel | Expert AI predictions | Potential question source |
Broader Forecasting Ecosystem
Section titled “Broader Forecasting Ecosystem”| Platform/Project | Type | Complementarity |
|---|---|---|
| Metaculus | Forecasting platform | ForecastBench uses Metaculus questions as a source |
| AI Forecasting Benchmark Tournament | Human vs AI competition | Similar goals, quarterly structure |
| Squiggle | Probabilistic modeling | Could use ForecastBench data as model inputs |
| Metaforecast | Forecast aggregation | Could aggregate ForecastBench bot predictions |
Implications for AI Development
Section titled “Implications for AI Development”Forecasting as Proxy for Intelligence
The authors argue that forecasting is a valuable proxy for general intelligence because it requires:
| Capability | Why It Matters for Forecasting |
|---|---|
| Knowledge integration | Combine information from multiple domains |
| Uncertainty reasoning | Express confidence probabilistically |
| Causal reasoning | Understand mechanisms driving outcomes |
| Temporal reasoning | Project trends forward in time |
| Calibration | Match confidence to actual accuracy |
Progress on ForecastBench may therefore indicate progress on general reasoning capabilities.
Projected Parity Implications
If LLMs match superforecasters by late 2026, this suggests:
| Implication | Reasoning |
|---|---|
| AI reasoning progress | Forecasting requires sophisticated integration of knowledge |
| Economic impact | Automated forecasting could replace human analysts in some contexts |
| AI safety concern | Advanced forecasting = better strategic planning for AI systems |
| Validation of scaling | Continued capability gains from larger models/data |
However, extrapolation is uncertain: progress may plateau, or LLMs may hit a ceiling below human expert performance on the hardest questions.
Strengths and Limitations
Strengths
| Strength | Evidence |
|---|---|
| Contamination-free | Only questions about future events |
| Dynamic updates | Continuous addition of new questions |
| Peer-reviewed | Published at ICLR 2025 (top-tier venue) |
| Multiple baselines | Superforecasters, public, LLMs, random |
| Open submission | Public leaderboard enables competition |
| Quantitative projection | Clear timeline for potential AI-human parity |
Limitations
| Limitation | Impact |
|---|---|
| Resolution lag | Must wait for questions to resolve |
| Extrapolation uncertainty | Linear projection may not hold |
| Question distribution | May not cover all important forecasting domains |
| Human baseline variability | Superforecaster performance may vary over time |
| Cost of evaluation | Requires ongoing question curation and resolution |
| Narrow scope | Forecasting ≠ general intelligence (though correlated) |
Funding and Support
ForecastBench is supported by Coefficient Giving grants to FRI:
| Grant | Amount | Purpose |
|---|---|---|
| Forecasting Benchmark | $100K | Collaboration with Steinhardt lab |
| General FRI support | Part of $10M+ total | Core operations and research |
Funding is committed through mid-2027, ensuring the benchmark remains active and updated.
Future Directions
Potential enhancements based on the current trajectory:
| Enhancement | Benefit | Challenge |
|---|---|---|
| Expand question domains | More comprehensive coverage | Curation effort |
| Add reasoning evaluation | Assess whether LLMs “understand” forecasts | Subjective judgment |
| Multi-turn forecasting | Test updating based on new information | More complex protocol |
| Ensemble methods | Benchmark aggregation strategies | Requires multiple models |
| Adversarial questions | Test robustness to edge cases | Question design difficulty |