
GPT-5.2 Benchmarks: Coding, Reasoning, Math, and Multimodal Performance Analysis


This blog post analyzes benchmark performance of GPT-5.2 across coding, reasoning, math, and multimodal tasks, relevant to AI safety as it tracks rapid capability advancement and helps evaluate frontier model progress.

Metadata

Importance: 28/100 · blog post · analysis

Summary

Vellum AI analyzes GPT-5.2's benchmark performance across key domains including coding (SWE-Bench Pro: 55.6%), reasoning (ARC-AGI-2: 52.9%, GPQA Diamond: 92.4%), math (AIME 2025 perfect score), and multimodal tasks. The post compares GPT-5.2 against Gemini 3 Pro and Claude Opus 4.5, highlighting significant capability jumps especially in abstract reasoning. It positions GPT-5.2 as a strong contender for production agents and complex workflow automation.

Key Points

  • GPT-5.2 Thinking scores 52.9% on ARC-AGI-2, far ahead of Gemini 3 Pro (31.1%) and Claude Opus 4.5 (37.6%).
  • Achieves 92.4% on GPQA Diamond (PhD-level science), slightly ahead of Gemini 3 Pro (91.9%) and well above Claude Opus 4.5 (87%).
  • SWE-Bench Pro score of 55.6% sets a new state of the art for multi-language real-world software engineering tasks.
  • GDPval results show GPT-5.2 beats or ties industry professionals on 70.9% of knowledge work tasks, indicating strong long-horizon planning.
  • High multimodal scores (MMMU-Pro: 86.5%, Video-MMMU: 90.5%) suggest a robust, natively multimodal architecture.

Cited by 1 page

Page                   | Type       | Quality
Reasoning and Planning | Capability | 65.0

Cached Content Preview

HTTP 200 · Fetched Apr 24, 2026 · 8 KB
OpenAI dropped GPT-5.2 quickly after Gemini 3 Pro stunned the AI space. It appears to be built to power teams pushing real workloads, with serious upgrades over GPT-5.1 in performance across coding, math, planning, and multimodal tasks.

 It’s available now through the API, with Instant, Thinking, and Pro variants rolling out across ChatGPT paid plans.
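 For teams that want to try it right away, here is a minimal sketch of an API call, assuming the official OpenAI Python SDK. The model identifier "gpt-5.2" below is an assumption for illustration; the actual Instant/Thinking/Pro identifiers may differ.

```python
# Minimal sketch using the OpenAI Python SDK.
# ASSUMPTION: "gpt-5.2" is an illustrative model identifier; check
# OpenAI's published model list for the real Instant/Thinking/Pro names.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2",  # hypothetical identifier
    messages=[
        {"role": "user", "content": "Outline a migration plan for our payments service."},
    ],
)
print(response.choices[0].message.content)
```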

 If you’re evaluating models for production agents or complex workflow automation, GPT-5.2 is positioned as a serious contender. Here’s what the numbers actually show.

 💡 Want to see how GPT-5.2 compares to Gemini 3 Pro, Claude Opus 4.5, or Grok 4.1 for your use case? Compare them in Vellum!

 Key observations of reported benchmarks

 While benchmarks are inherently limited and may not fully capture real-world utility, they are our only quantifiable way to measure progress. From the reported data, we can conclude a few things:

 Reasoning: The most compelling data points are the high scores on ARC-AGI-2 (52.9%) and GPQA Diamond (92.4%). This leap, beating out Gemini 3 Pro (31.1%) and Claude Opus 4.5 (37.6%), indicates a core improvement in academic and abstract reasoning.

 Coding: A new score of 55.6% on the challenging SWE-Bench Pro benchmark confirms a superior ability to handle real-world software engineering tasks across four programming languages rather than Python alone.

 Math: A perfect score on AIME 2025 is impressive, but the strong performance on the new FrontierMath benchmark (40.3% on Tiers 1-3) is more indicative of a robust intrinsic base for mathematical logic, even without relying on coding tools.

 Long-Horizon Planning: The results on GDPval are arguably the most indicative of practical utility. Beating or tying industry professionals on 70.9% of knowledge work tasks shows an unprecedented ability to handle long-horizon planning and coherent execution in a professional context.

 Vision: High scores on MMMU-Pro (86.5%) and Video-MMMU (90.5%) suggest a powerful, natively multimodal architecture capable of reasoning across temporal and spatial dimensions simultaneously.
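 To make these gaps concrete, the snippet below tabulates the figures reported in this post and prints each benchmark's leader with its margin over the runner-up. The numbers are transcribed from the post itself; the script is illustrative only.

```python
# Reported benchmark scores (%), transcribed from this post; illustrative only.
scores = {
    "ARC-AGI-2":          {"GPT-5.2 Thinking": 52.9, "Gemini 3 Pro": 31.1, "Claude Opus 4.5": 37.6},
    "GPQA Diamond":       {"GPT-5.2 Thinking": 92.4, "Gemini 3 Pro": 91.9, "Claude Opus 4.5": 87.0},
    "SWE-Bench Verified": {"GPT-5.2 Thinking": 80.0, "Gemini 3 Pro": 76.2, "Claude Opus 4.5": 80.9},
}

for bench, by_model in scores.items():
    (leader, top), (runner_up, second) = sorted(
        by_model.items(), key=lambda kv: kv[1], reverse=True
    )[:2]
    print(f"{bench}: {leader} leads at {top}% (+{top - second:.1f} pts over {runner_up})")
```

 Note that on SWE-Bench Verified the leader flips: Claude Opus 4.5 edges out GPT-5.2 Thinking by under a point.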

 Coding capabilities

 SWE-Bench Pro and SWE-Bench Verified evaluate a model's ability to resolve real-world software issues from GitHub repositories. Unlike the Python-only Verified version, SWE-Bench Pro tests across four languages and is designed to be more challenging and industrially relevant.
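 As a rough mental model of how these harnesses produce a score: each instance pairs a GitHub issue with a pinned repository snapshot, the model proposes a patch, and the instance counts as resolved only if the designated fail-to-pass tests succeed afterward. The sketch below is a simplification with hypothetical types and callables, not the actual SWE-Bench evaluation code, which runs each instance in an isolated environment.

```python
# Simplified sketch of SWE-Bench-style scoring; hypothetical types and
# callables, not the real harness.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Instance:
    issue_text: str                # the GitHub issue to resolve
    repo_snapshot: str             # identifier of the pinned repo state
    fail_to_pass_tests: List[str]  # tests that must flip from fail to pass

def resolution_rate(
    instances: List[Instance],
    generate_patch: Callable[[Instance], str],            # the model under test
    patch_passes_tests: Callable[[Instance, str], bool],  # apply patch, run tests
) -> float:
    """Percentage of instances whose fail-to-pass tests pass after patching."""
    resolved = sum(
        patch_passes_tests(inst, generate_patch(inst)) for inst in instances
    )
    return 100.0 * resolved / len(instances)
```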

 While GPT-5.2 Thinking sets a new state of the art of 55.6% on SWE-Bench Pro, it scores 80.0% on the more established SWE-Bench Verified.

 This is neck and neck with Claude Opus 4.5 (80.9%) and surpasses Gemini 3 Pro (76.2%). It also improves on GPT-5.1 (76.3%) in complex, multi-language bug fixing, positioning the model excellently for professional development workflows.

 Reasoning capabilities

 Reasoning benchmarks evaluate a model's ability to solve complex and novel problems. GPQA Diamond assesses PhD-level scientific knowledge

... (truncated, 8 KB total)
Resource ID: 5906b44c69898bf2 | Stable ID: sid_LbNHUtyWlQ