Back
ARC-AGI-2 Benchmark
arcprize.org/arc-agi/2/
ARC-AGI-2 is a leading benchmark for measuring progress toward general reasoning and AGI; relevant for AI safety researchers tracking the gap between human and AI reasoning capabilities and evaluating whether scaling laws are sufficient.
Metadata
Importance: 72/100 · tool page · dataset
Summary
ARC-AGI-2 is a 2025 benchmark designed to stress-test AI reasoning systems, where pure LLMs score 0% and frontier reasoning systems achieve only single-digit percentages despite humans solving all tasks. It targets three core capability gaps—symbolic interpretation, compositional reasoning, and contextual rule application—demonstrating that scaling alone is insufficient and new architectures or test-time adaptation methods are required.
Key Points
- Pure LLMs score 0% and the best AI reasoning systems score only single-digit percentages; all tasks are solvable by humans, highlighting a major capability gap.
- The benchmark targets three specific weaknesses: symbolic interpretation, compositional reasoning, and contextual rule application.
- Demonstrates that log-linear scaling is insufficient; new test-time adaptation algorithms or novel AI architectures are required.
- The dataset includes 1,000 training tasks and 360 calibrated evaluation tasks split across public, semi-private, and private sets; the challenge goal is 85% accuracy.
- Successor to ARC-AGI-1 (2019), which saw little progress for five years until test-time adaptation methods emerged in late 2024.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 5 KB
ARC-AGI-2
2025 - Challenges Reasoning Models
Links
Play ARC-AGI-2
Official ARC-AGI-2 Repo
Explore ARC-AGI-2 Tasks
Launch Video
Technical Paper
About
ARC-AGI-1 was created in 2019 (before the rise of LLMs). It endured five years of global competitions and a 50,000x scale-up of base LLMs, yet saw little progress until late 2024, with the introduction of test-time adaptation methods pioneered by ARC Prize 2024 entrants and OpenAI.
ARC-AGI-2 - the next iteration of the benchmark - is designed to stress-test the capabilities of state-of-the-art AI reasoning systems, provide useful signal on AGI progress, and inspire researchers to work on new ideas.
Can you create a system that can reach 85% accuracy?
> Learn more
Efficiency Test
ARC-AGI-2: Scale is Not Enough
Log-linear scaling is insufficient to beat ARC-AGI-2.
New test-time adaptation algorithms or novel AI systems are needed to
bring AI efficiency in line with human performance.
Capability Test
ARC-AGI-2: Symbolic Interpretation
Tasks requiring symbols to be interpreted as having meaning beyond their visual patterns.
Current systems attempt to check symmetry, mirroring, and other transformations, and even recognize
connecting elements, but fail to assign semantic significance to the symbols themselves.
Try this task
Capability Test
ARC-AGI-2: Compositional Reasoning
Tasks requiring the simultaneous application of multiple rules, or the application of multiple rules that interact with each other.
In contrast, if a task has only a few global rules, current systems can consistently discover and apply them.
Try this task
Capability Test
ARC-AGI-2: Contextual Rule Application
Tasks where rules must be applied differently based on context.
Systems tend to fixate on superficial patterns rather than understanding the underlying selection principles.
Try this task
Dataset Structure

| Dataset | Tasks | Description |
|---|---|---|
| Training Set | 1,000 tasks | Uncalibrated, public; a spectrum of difficulty ranging from very easy to very difficult for both humans and AI; designed to expose and teach Core Knowledge Priors; use to train your systems. |
| Public Eval Set | 120 tasks | Calibrated, public; all tasks solved pass@2 by at least two humans; use to test your systems. |
| Semi-Private Eval Set | 120 tasks | Calibrated, not public; all tasks solved pass@2 by at least two humans; used for the Kaggle live contest leaderboard and the ARC Prize leaderboard. "Semi" means these tasks may have been exposed to limited third parties, e.g. via API. |
| Private Eval Set | 120 tasks | Calibrated, not public; all tasks solved pass@2 by at least two humans; used for the Kaggle final contest leaderboard. "Private" means these tasks have not been exposed to third parties. |
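The pass@2 grading mentioned in the table can be sketched in Python. This is a minimal illustration, assuming the JSON-style task layout used by the public ARC-AGI repositories ("train"/"test" lists of {"input", "output"} integer-grid pairs); the tiny task below is invented for illustration, not a real benchmark task.

```python
# Minimal sketch of pass@2 grading for an ARC-style task.
# Assumption: tasks are dicts with "train"/"test" lists of
# {"input", "output"} grid pairs, as in the public ARC-AGI repository.

def grids_equal(a, b):
    # Grids are lists of lists of ints; grading requires an exact match.
    return a == b

def score_task(task, attempts_per_test):
    """attempts_per_test: one list of candidate output grids per test pair.
    A task counts as solved only if every test pair is matched exactly by
    one of the first two attempts (pass@2)."""
    return all(
        any(grids_equal(g, pair["output"]) for g in attempts[:2])
        for pair, attempts in zip(task["test"], attempts_per_test)
    )

task = {
    "train": [{"input": [[0, 1]], "output": [[1, 0]]}],  # demonstration pair
    "test": [{"input": [[1, 1]], "output": [[1, 1]]}],   # held-out pair
}
print(score_task(task, [[[[0, 0]], [[1, 1]]]]))  # second attempt matches: True
```

A benchmark-level score is then the fraction of tasks solved under this rule, which is what the 85% challenge goal refers to.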
Calibration
The eval sets (Public, Semi-Private, Private) are "calibrated," meaning tasks are statistically similar (IID). Scores are comparable across these sets (<1 pp expected difference), assuming no overfitting. Calibration was done via controlled
... (truncated, 5 KB total)
Resource ID: 28167998c7d9c6b2 | Stable ID: sid_fZHBPnVYvp