Longterm Wiki

AI Benchmarks

Directory of AI evaluation benchmarks tracked in the knowledge base, with model scores and leaderboards.

Benchmarks: 12 · Model scores: 65 · Categories: 6 · With results: 10

MMLU

Massive Multitask Language Understanding — a multiple-choice benchmark covering 57 academic subjects from STEM to humanities.

Category: Knowledge · Model scores: 18 · Metric: accuracy · Released: 2021-01 · Source: Dan Hendrycks et al.
SWE-bench Verified

A curated subset of SWE-bench with human-verified task instances for evaluating AI systems on real-world software engineering tasks from GitHub issues.

Category: Coding · Model scores: 12 · Metric: percentage · Released: 2024-08 · Source: OpenAI / Princeton NLP
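
How "resolved" is decided on SWE-bench-style tasks is worth spelling out. Below is a minimal sketch of a task instance and the resolution check; the field names mirror the public SWE-bench dataset layout but are simplified, and the example ID and helper function are illustrative rather than the official harness code.

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    instance_id: str        # e.g. "django__django-12345" (made-up ID)
    repo: str               # GitHub repository the issue comes from
    base_commit: str        # commit to check out before applying a model's patch
    problem_statement: str  # the GitHub issue text shown to the model
    fail_to_pass: list[str] = field(default_factory=list)  # tests that must flip to passing
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing

def is_resolved(inst: SWEBenchInstance,
                passed_after: set[str], failed_after: set[str]) -> bool:
    """A patch resolves the instance only if every fail-to-pass test now
    passes and no pass-to-pass test regresses."""
    return (all(t in passed_after for t in inst.fail_to_pass)
            and all(t not in failed_after for t in inst.pass_to_pass))
```
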
MATH

A dataset of 12,500 competition mathematics problems testing mathematical reasoning across difficulty levels 1-5.

Category: Math · Model scores: 10 · Metric: accuracy · Released: 2021-03 · Source: Dan Hendrycks et al.
HumanEval

A benchmark of 164 hand-written Python programming problems with unit tests, evaluating code generation from docstrings.

Category: Coding · Model scores: 10 · Metric: pass_at_1 · Released: 2021-07 · Source: OpenAI
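
The pass_at_1 metric follows the HumanEval evaluation protocol: sample n completions per problem, run the unit tests, and estimate the probability that at least one of k samples passes. A minimal sketch of the standard unbiased estimator (the benchmark-level helper is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn and c of them passed the unit tests."""
    if n - c < k:
        return 1.0  # too few failing samples to fill k draws, so at least one passes
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean per-problem estimate; with one sample per problem this reduces to
    the fraction of problems whose completion passed all unit tests."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```
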
GPQA Diamond

Graduate-level Google-Proof Q&A, Diamond subset: extremely difficult questions in physics, chemistry, and biology that skilled non-experts cannot reliably answer even with unrestricted web access.

Category: Reasoning · Model scores: 8 · Metric: accuracy · Released: 2023-11 · Source: David Rein et al.
ARC-AGI

Abstraction and Reasoning Corpus — a benchmark of visual pattern recognition tasks designed to test fluid intelligence and novel reasoning.

Category: Reasoning · Model scores: 2 · Metric: accuracy · Released: 2019-11 · Source: Francois Chollet
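
For context, an ARC task is a small JSON object: a few "train" input/output grid pairs demonstrating a transformation rule, plus "test" inputs whose output grids the solver must reproduce exactly. The sketch below assumes that layout and a single attempt per test input (official scoring may allow more); the file path is a placeholder.

```python
import json

Grid = list[list[int]]  # cells are integers 0-9 (colours)

def load_task(path: str) -> dict:
    # {"train": [{"input": Grid, "output": Grid}, ...], "test": [...]}
    with open(path) as f:
        return json.load(f)

def score_task(task: dict, predictions: list[Grid]) -> float:
    """Accuracy on one task: a prediction counts only if it matches the
    expected output grid cell for cell."""
    expected = [pair["output"] for pair in task["test"]]
    correct = sum(p == e for p, e in zip(predictions, expected))
    return correct / len(expected)
```
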
OSWorld

A benchmark for multimodal agents on real-world computer tasks across operating systems, testing GUI interaction and task completion.

Category: Agentic · Model scores: 2 · Metric: percentage · Released: 2024-04
ARC-AGI-2

Second iteration of the ARC benchmark with harder tasks, designed to remain challenging as AI capabilities improve.

Category: Reasoning · Model scores: 1 · Metric: accuracy · Released: 2025-01 · Source: Francois Chollet / ARC Prize Foundation
Terminal-Bench Hard

A benchmark evaluating AI agents on complex terminal-based tasks requiring multi-step reasoning and system administration skills.

Category: Agentic · Model scores: 1 · Metric: percentage · Released: 2025-01
Artificial Analysis Intelligence Index

A composite intelligence index from Artificial Analysis that aggregates performance across multiple benchmarks to provide a single capability score.

Category: General · Model scores: 1 · Metric: points · Released: 2024-06 · Source: Artificial Analysis
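
As a purely illustrative sketch of how a composite index can be built, the snippet below assumes per-benchmark scores already on a common 0-100 scale and takes a weighted average. The benchmark names, weights, and numbers are made-up placeholders and do not reflect Artificial Analysis's actual methodology or published results.

```python
def composite_index(scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores already on a 0-100 scale."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical inputs, not real model results.
print(composite_index(
    {"knowledge": 80.0, "coding": 55.0, "reasoning": 60.0},
    {"knowledge": 1.0, "coding": 1.0, "reasoning": 1.0},
))
```
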
Terminal-Bench 2

Second version of the Terminal-Bench benchmark, with expanded task coverage and increased difficulty.

Category: Agentic · Model scores: none yet · Metric: percentage · Released: 2025-06
AIME 2025

American Invitational Mathematics Examination 2025 — a prestigious math competition used as a benchmark for advanced mathematical reasoning.

Category: Math · Model scores: none yet · Metric: accuracy · Released: 2025-02 · Source: Mathematical Association of America