Longterm Wiki

ARC-AGI-2 Benchmark


ARC-AGI-2 is a leading benchmark for measuring progress toward general reasoning and AGI; relevant for AI safety researchers tracking the gap between human and AI reasoning capabilities and evaluating whether scaling laws are sufficient.

Metadata

Importance: 72/100 · tool page · dataset

Summary

ARC-AGI-2 is a 2025 benchmark designed to stress-test AI reasoning systems, where pure LLMs score 0% and frontier reasoning systems achieve only single-digit percentages despite humans solving all tasks. It targets three core capability gaps—symbolic interpretation, compositional reasoning, and contextual rule application—demonstrating that scaling alone is insufficient and new architectures or test-time adaptation methods are required.

Key Points

  • Pure LLMs score 0% and best AI reasoning systems score only single-digit percentages; all tasks are solvable by humans, highlighting a major capability gap.
  • Benchmark targets three specific weaknesses: symbolic interpretation, compositional reasoning, and contextual rule application.
  • Demonstrates log-linear scaling is insufficient; new test-time adaptation algorithms or novel AI architectures are required.
  • Dataset includes 1,000 training tasks and 360 calibrated evaluation tasks split across public, semi-private, and private sets; challenge goal is 85% accuracy.
  • Successor to ARC-AGI-1 (2019), which saw little progress for 5 years until test-time adaptation methods emerged in late 2024.

Cited by 1 page

Page: AI Capability Threshold Model | Type: Analysis | Quality: 72.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 5 KB
ARC-AGI-2

 2025 - Challenges Reasoning Models

 Links

 Play ARC-AGI-2 
 Official ARC-AGI-2 Repo 
 Explore ARC-AGI-2 Tasks 
 Launch Video 
 Technical Paper 
 About

 ARC-AGI-1 was created in 2019 (before the rise of LLMs). It endured five years of global competitions and a 50,000x scale-up of base LLMs, yet saw little progress until late 2024, with the introduction of test-time adaptation methods pioneered by ARC Prize 2024 entrants and OpenAI.

 ARC-AGI-2 - the next iteration of the benchmark - is designed to stress-test the capabilities of state-of-the-art AI reasoning systems, provide useful signal on AGI progress, and inspire researchers to work on new ideas.

 Can you create a system that can reach 85% accuracy?

 > Learn more 

 Efficiency Test

 ARC-AGI-2: Scale is Not Enough

 Log-linear scaling is insufficient to beat ARC-AGI-2.

 New test-time adaptation algorithms or novel AI systems are needed to
bring AI efficiency in line with human performance.

 Capability Test

 ARC-AGI-2: Symbolic Interpretation

 Tasks requiring symbols to be interpreted as having meaning beyond their visual patterns.

 Current systems attempt to check symmetry, mirroring, and other transformations, and even recognize
connecting elements, but fail to assign semantic significance to the symbols themselves.

 Try this task 

 Capability Test

 ARC-AGI-2: Compositional Reasoning

 Tasks requiring simultaneous application of multiple rules, or application of multiple rules that
interact with each other.

 In contrast, if a task has very few global rules, current systems can consistently discover and apply
them.

 Try this task 

 Capability Test

 ARC-AGI-2: Contextual Rule Application

 Tasks where rules must be applied differently based on context.

 Systems tend to fixate on superficial patterns rather than understanding the underlying selection principles.

 Try this task 

 Dataset Structure

 Training Set: 1,000 tasks. Uncalibrated, public; a spectrum of difficulty
ranging from very easy to very difficult for both humans and AI, designed
to expose and teach Core Knowledge Priors. Use to train your systems.

 Public Eval Set: 120 tasks. Calibrated, public; all tasks solved pass@2
by at least two humans. Use to test your systems.

 Semi-Private Eval Set: 120 tasks. Calibrated, not public; all tasks solved
pass@2 by at least two humans. Used for the Kaggle live contest leaderboard
and the ARC Prize leaderboard. "Semi" means these tasks may have been
exposed to limited third parties, e.g. via API.

 Private Eval Set: 120 tasks. Calibrated, not public; all tasks solved
pass@2 by at least two humans. Used for the Kaggle final contest
leaderboard. "Private" means these tasks have not been exposed to third
parties.

 Calibration

 The eval sets (Public, Semi-Private, Private) are "calibrated," meaning their tasks are statistically similar (IID). Scores are comparable across these sets (expected difference of less than 1 percentage point), assuming no overfitting. Calibration was done via controlled

... (truncated, 5 KB total)