o3 scores 87.5% on ARC-AGI

web

arcprize.org·arcprize.org/blog/oai-o3-pub-breakthrough

Landmark announcement by ARC Prize documenting o3's surprising performance on ARC-AGI-1, widely cited in AI safety and capabilities discussions as evidence of a qualitative shift in AI reasoning abilities as of late 2024.

Metadata

Importance: 82/100 · blog post · primary source

Summary

François Chollet reports that OpenAI's o3 model scored 87.5% on the ARC-AGI-1 Semi-Private Evaluation set using high compute (1024 samples), and 75.7% under the $10k budget constraint, representing a dramatic step-function improvement over previous AI systems. This result challenges prior intuitions about AI capabilities, as ARC-AGI-1 took four years to progress from 0% with GPT-3 to only 5% with GPT-4o. The post also announces ARC-AGI-2 and ARC Prize 2025 as next-generation benchmarks targeting AGI progress.

Key Points

• o3 scored 75.7% on ARC-AGI Semi-Private Eval within $10k compute budget and 87.5% at 172x higher compute (~$456k).
• This is a massive jump from GPT-4o's ~5%, representing a qualitative shift in novel task adaptation ability not seen before in GPT-family models.
• o3 was trained on 75% of the ARC-AGI Public Training set; impact of this fine-tuning on results has not been fully disentangled.
• High-compute configuration cost ~$4,560 per task, highlighting that compute efficiency is now a critical metric alongside raw benchmark scores.
• ARC-AGI-2 and ARC Prize 2025 were announced to maintain a rigorous AGI benchmark as o3 approaches saturation of ARC-AGI-1.

Cited by 5 pages

Page	Type	Quality
Large Language Models	Concept	62.0
Reasoning and Planning	Capability	65.0
Self-Improvement and Recursive Enhancement	Capability	69.0
Is Scaling All You Need?	Crux	42.0
Emergent Capabilities	Risk	61.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 14 KB

OpenAI o3 Breakthrough High Score on ARC-AGI-Pub | ARC Prize
By François Chollet
Published 20 Dec 2024

 OpenAI has released a new version of o3. Read our analysis to learn how it differs from the preview below.

 
 Updated (April 16, 2025): OpenAI has officially released o3. OpenAI has confirmed that this version is not the same as the one we tested in this original post. See more information on this. We will publish updated results for released o3 shortly.

 OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.

 

 This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models. For context, ARC-AGI-1 took 4 years to go from 0% with GPT-3 in 2020 to 5% in 2024 with GPT-4o. All intuition about AI capabilities will need to get updated for o3.

 The mission of ARC Prize goes beyond our first benchmark: to be a North Star towards AGI. And we're excited to be working with the OpenAI team and others next year to continue to design next-gen, enduring AGI benchmarks.

 ARC-AGI-2 (same format - verified easy for humans, harder for AI) will launch alongside ARC Prize 2025. We're committed to running the Grand Prize competition until a high-efficiency, open-source solution scoring 85% is created.

 Read on for the full testing report.

 
 OpenAI o3 ARC-AGI Results

 
 Update 12/20/2024: ARC Prize presented o3's performance results in person with OpenAI's Sam Altman (CEO) and Mark Chen (SVP Research) during the final "12 Days of OpenAI" event. Watch the recording here.

 
 We tested o3 against two ARC-AGI datasets:

 
 Semi-Private Eval: 100 private tasks used to assess overfitting

 Public Eval: 400 public tasks

 
 At OpenAI's direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute).
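 The relationship between the two sample sizes and the quoted 172x figure can be checked with a small sketch. It assumes compute scales roughly linearly with the number of samples, which the post does not state; the constant and function names are illustrative only.

```python
# Sample counts and the compute multiple as quoted in the post.
HIGH_EFFICIENCY_SAMPLES = 6
LOW_EFFICIENCY_SAMPLES = 1024
REPORTED_COMPUTE_MULTIPLE = 172


def sample_ratio(low_eff_samples: int, high_eff_samples: int) -> float:
    """Ratio of samples drawn per task between the two configurations."""
    return low_eff_samples / high_eff_samples


ratio = sample_ratio(LOW_EFFICIENCY_SAMPLES, HIGH_EFFICIENCY_SAMPLES)

# Relative gap between the raw sample ratio and the reported multiple.
# ASSUMPTION: linear scaling of compute with samples; the post gives no formula.
relative_gap = abs(ratio - REPORTED_COMPUTE_MULTIPLE) / REPORTED_COMPUTE_MULTIPLE

if __name__ == "__main__":
    print(f"sample ratio: {ratio:.1f}x, reported: {REPORTED_COMPUTE_MULTIPLE}x")
    print(f"relative gap: {relative_gap:.2%}")
```

 Under that assumption the raw sample ratio lands within about one percent of the reported multiple, so the 172x figure is broadly consistent with the sample counts, though it may have been derived differently.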

 Here are the results.

Set	Tasks	Efficiency	Score	Retail Cost*	Samples	Tokens	Cost/Task*	Time/Task (mins)
Semi-Private	100	High	75.7%	$2,680	6	33.5M	$26	1.3
Semi-Private	100	Low	87.5%	$456,000	1024	5.7B	$4,560	13.8
Public	400	High	82.8%	$66,772	6	111M	$167	N/A
Public	400	Low	91.5%	$760,000	1024	9.5B	$1,900	N/A

 * Note 3/24/2025: o3 pricing costs have been updated to use o1-pro pricing. We will update again once official o3 pricing is publicly available. The amount of compute was roughly 172x the low-compute configuration.

 * Note 12/10/2025: o3 pricing costs have been updated to use o3-pro pricing of $80 per million tokens.
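 As a rough cross-check of the pricing notes, the sketch below recomputes each row's retail cost from its token count at the quoted $80 per million tokens and compares it with the reported figure. The token counts and costs are copied from the results table; the function names and the 1% tolerance are illustrative assumptions, not part of the original report.

```python
# USD per million tokens, the o3-pro rate quoted in the 12/10/2025 note.
PRICE_PER_M_TOKENS = 80.0

ROWS = [
    # (set, efficiency, tasks, tokens in millions, reported retail cost in USD)
    ("Semi-Private", "High", 100, 33.5, 2_680),
    ("Semi-Private", "Low", 100, 5_700.0, 456_000),
    ("Public", "High", 400, 111.0, 66_772),
    ("Public", "Low", 400, 9_500.0, 760_000),
]


def implied_cost(tokens_millions: float, price: float = PRICE_PER_M_TOKENS) -> float:
    """Retail cost implied by a flat per-million-token price."""
    return tokens_millions * price


def check_rows(rows=ROWS, tolerance: float = 0.01):
    """Return (set, efficiency, implied cost, implied cost per task, matches reported)."""
    results = []
    for name, efficiency, tasks, tokens_m, reported in rows:
        implied = implied_cost(tokens_m)
        # ASSUMPTION: a row "matches" if the implied cost is within 1% of the reported one.
        matches = abs(implied - reported) <= tolerance * reported
        results.append((name, efficiency, implied, implied / tasks, matches))
    return results


if __name__ == "__main__":
    for name, efficiency, implied, per_task, matches in check_rows():
        print(f"{name:12} {efficiency:4} implied=${implied:>10,.0f} "
              f"per_task=${per_task:>8,.2f} matches_reported={matches}")
```

 Three of the four rows reproduce at this rate. The Public, High-efficiency row does not (111M tokens at $80 per million implies $8,880, against a reported $66,772); the preview does not explain the difference, so that row is best treated with caution.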

 Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performanc

... (truncated, 14 KB total)

Resource ID: 457fa3b0b79d8812 | Stable ID: OTVmZjBjMz