ARC Prize - Leaderboard
arcprize.org/leaderboard
The ARC Prize benchmark, created by François Chollet, is widely cited in AI safety and capabilities discussions as a meaningful test of general reasoning that is difficult to solve via brute-force scaling, making it relevant for tracking genuine AGI progress.
Metadata
Importance: 55/100 · tool page · reference
Summary
The ARC Prize leaderboard tracks AI system performance on the Abstraction and Reasoning Corpus (ARC-AGI) benchmark, a test designed to measure general fluid intelligence and reasoning capabilities that current AI systems struggle with. It provides a public ranking of models and approaches attempting to solve ARC tasks, serving as a key benchmark for measuring progress toward human-level abstract reasoning.
Key Points
- Tracks competitive performance on ARC-AGI, a benchmark specifically designed to resist memorization and test genuine reasoning and generalization
- ARC tasks require inferring abstract patterns from only a few examples (see the sketch after this list), making the benchmark a proxy for general intelligence rather than narrow skill
- Highlights the gap between current AI capabilities and human-level performance on novel reasoning tasks
- Serves as a reference point for judging whether AI systems are making genuine progress on abstract reasoning versus gaming the benchmark
- The competition incentivizes novel approaches to program synthesis, inductive reasoning, and general problem-solving
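To make the few-shot setup concrete: each ARC task is a JSON object with "train" and "test" lists of input/output grid pairs, where a grid is a list of rows of integers 0-9 (colors). The sketch below uses that public task layout; the tiny task and the color-swap rule are invented for illustration, not an actual ARC task.

```python
# Minimal sketch of the public ARC-AGI task format. The toy task below is
# made up: its hidden rule is "swap colors 1 and 2 cell-wise".
task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[2, 1], [1, 2]]},
        {"input": [[0, 1], [2, 0]], "output": [[0, 2], [1, 0]]},
    ],
    "test": [
        {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
    ],
}

def solve(grid):
    """Hypothetical candidate program: swap colors 1 and 2 cell-wise."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

# A candidate program should reproduce every training output exactly...
assert all(solve(p["input"]) == p["output"] for p in task["train"])
# ...and is scored on whether it produces the exact held-out test output(s).
assert all(solve(p["input"]) == p["output"] for p in task["test"])
print("candidate program consistent with all train and test pairs")
```

Because solvers see only a handful of demonstration pairs per task, memorizing a training corpus does not help; each task demands inferring a fresh transformation.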
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 2 KB
ARC Prize - Leaderboard

Understanding the Leaderboard

ARC-AGI has evolved from its first versions (ARC-AGI-1 and 2), which measured passive fluid intelligence, to ARC-AGI-3, which challenges AI agents to adapt on the fly to novel interactive environments. The scatter plot above visualizes the critical relationship between cost-per-task and performance, a key measure of efficiency. True intelligence isn't just about solving problems, but solving them efficiently with minimal resources.

Interpreting the data

Reasoning Systems solutions display connected points representing the same model at different reasoning levels. These trend lines illustrate how increased reasoning time affects performance, typically showing asymptotic behavior as thinking time increases.

Base LLMs solutions represent single-shot inference from standard language models like GPT-4.5 and Claude 3.7, without extended reasoning capabilities. These points demonstrate raw model performance without additional reasoning enhancements.

Kaggle Systems solutions showcase competition-grade submissions from the Kaggle challenge, operating under strict computational constraints ($50 compute budget for 120 evaluation tasks). These represent purpose-built, efficient methods specifically designed for the ARC Prize.

Verification Policy

For more information, see our testing policy.

Leaderboard Breakdown Notes

Only systems that required less than $10,000 to run are shown. For models that were not able to produce full test outputs, remaining tasks were marked as incorrect. Results marked as "preview" are unofficial and may be based on incomplete testing.

1. ARC-AGI-2 score estimate based on partial testing results and o1-pro pricing.
2. Provisional cost estimates based on Gemini 3 Pro pricing. Model to be retested once released.
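The breakdown notes pin down two display rules: entries costing $10,000 or more are hidden, and tasks with no produced output count as incorrect, so the score denominator is always the full evaluation set. The sketch below encodes those rules under stated assumptions; the Entry class, its field names, and all example numbers are invented, not the leaderboard's actual data model.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical leaderboard entry; fields and values are illustrative."""
    name: str
    correct: int          # tasks solved with an exact-match output
    produced: int         # tasks for which the system emitted any output
    total_tasks: int      # size of the evaluation set
    total_cost_usd: float

    def score(self) -> float:
        # Unproduced tasks count as incorrect, so divide by the full set.
        return self.correct / self.total_tasks

    def cost_per_task(self) -> float:
        return self.total_cost_usd / self.total_tasks

entries = [
    Entry("toy-reasoner-high", correct=54, produced=120, total_tasks=120, total_cost_usd=9_600.0),
    Entry("toy-base-llm", correct=11, produced=118, total_tasks=120, total_cost_usd=240.0),
    Entry("toy-big-sweep", correct=70, produced=120, total_tasks=120, total_cost_usd=25_000.0),
]

# Only systems that cost less than $10,000 to run are shown.
shown = [e for e in entries if e.total_cost_usd < 10_000]
for e in sorted(shown, key=lambda e: e.score(), reverse=True):
    print(f"{e.name}: {e.score():.1%} at ${e.cost_per_task():.2f}/task")
```

Note that "toy-big-sweep" is filtered out despite the highest raw score, reflecting the page's emphasis that cost-per-task efficiency, not accuracy alone, is the quantity being tracked.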
Resource ID: a27f2ad202a2b5a7 | Stable ID: OGU5MzBmOD