AI Benchmarks
Directory of AI evaluation benchmarks tracked in the knowledge base, with model scores and leaderboards.
| Benchmark | Category | Scores | Scoring | Released | Creator |
|---|---|---|---|---|---|
| **MMLU**: Massive Multitask Language Understanding — a multiple-choice benchmark covering 57 academic subjects from STEM to humanities. | Knowledge | 18 | accuracy | 2021-01 | Dan Hendrycks et al. |
| **SWE-bench Verified**: a curated subset of SWE-bench with human-verified task instances for evaluating AI systems on real-world software engineering tasks from GitHub issues. | Coding | 12 | percentage | 2024-08 | OpenAI / Princeton NLP |
| **MATH**: a dataset of 12,500 competition mathematics problems testing mathematical reasoning across difficulty levels 1-5. | Math | 10 | accuracy | 2021-03 | Dan Hendrycks et al. |
| **HumanEval**: a benchmark of 164 hand-written Python programming problems with unit tests, evaluating code generation from docstrings. | Coding | 10 | pass_at_1 | 2021-07 | OpenAI |
| **GPQA Diamond**: the Diamond subset of Graduate-level Google-Proof Q&A — extremely difficult questions in physics, chemistry, and biology that skilled non-experts cannot answer even with unrestricted web access. | Reasoning | 8 | accuracy | 2023-11 | David Rein et al. |
| **ARC-AGI**: Abstraction and Reasoning Corpus — a benchmark of visual pattern recognition tasks designed to test fluid intelligence and novel reasoning. | Reasoning | 2 | accuracy | 2019-11 | François Chollet |
| **OSWorld**: a benchmark for multimodal agents on real-world computer tasks across operating systems, testing GUI interaction and task completion. | Agentic | 2 | percentage | 2024-04 | |
| **ARC-AGI-2**: second iteration of the ARC benchmark with harder tasks, designed to remain challenging as AI capabilities improve. | Reasoning | 1 | accuracy | 2025-01 | François Chollet / ARC Prize Foundation |
| **Terminal-Bench Hard**: a benchmark evaluating AI agents on complex terminal-based tasks requiring multi-step reasoning and system administration skills. | Agentic | 1 | percentage | 2025-01 | |
| **Artificial Analysis Intelligence Index**: a composite intelligence index from Artificial Analysis that aggregates performance across multiple benchmarks to provide a single capability score. | General | 1 | points | 2024-06 | Artificial Analysis |
| **Terminal-Bench 2**: second version of the Terminal-Bench benchmark with expanded task coverage and difficulty. | Agentic | | percentage | 2025-06 | |
| **AIME 2025**: American Invitational Mathematics Examination 2025 — a prestigious math competition used as a benchmark for advanced mathematical reasoning. | Math | | accuracy | 2025-02 | Mathematical Association of America |
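Most scoring methods in the table (accuracy, percentage) report the fraction of tasks solved, while HumanEval's pass_at_1 is conventionally computed with the unbiased pass@k estimator introduced alongside the benchmark (Chen et al., 2021). The sketch below is a minimal illustration of that estimator, not code from any particular evaluation harness tracked here; the sample counts in the usage example are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that passed the unit tests
    k: sample budget being scored (k=1 gives pass_at_1)
    """
    if n - c < k:
        # Any draw of k samples must contain at least one passing completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 37 of which pass.
print(pass_at_k(n=200, c=37, k=1))  # 0.185; for k=1 this reduces to c/n
```

For k=1 the estimator reduces to the fraction of passing samples, but the general form matters when pass@10 or pass@100 is reported from the same pool of samples.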