Test Scores AI vs Humans - Our World in Data
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Our World in Data
A widely-cited empirical resource for tracking AI capability trajectories; useful when grounding claims about AI progress, timeline discussions, or benchmarking in the AI safety and governance literature.
Metadata
Summary
An interactive dataset and visualization tracking AI performance across multiple domains—including language understanding, image recognition, and reasoning—relative to human-level baselines. It charts the historical progression of AI capabilities, illustrating where and when AI systems have surpassed human benchmarks. Useful for understanding the pace and trajectory of AI capability development.
Key Points
- Tracks AI test scores across diverse domains, including language, vision, and problem-solving, relative to human performance baselines.
- Visualizes the historical trend of AI surpassing human-level performance on various standardized benchmarks.
- Provides a comparative framework useful for assessing progress in AI capabilities over time.
- Highlights how rapidly AI has advanced in specific domains, relevant to discussions of transformative AI timelines.
- Data sourced from prominent AI benchmarks, making it a useful reference for empirical claims about capability growth.
Review
Cached Content Preview
Test scores of AI systems on various capabilities relative to human performance - Our World in Data
About this data
Human performance, as the benchmark, is set to zero. The capability of each AI system is normalized to an initial performance of -100.

Source: Kiela et al. (2023) – with minor processing by Our World in Data
Last updated: April 2, 2024
Next expected update: May 2026
Date range: 1998–2023

Related research and writing
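The normalization described above amounts to a linear rescaling in which the human baseline maps to 0 and each system's initial score maps to -100. A minimal sketch of that rescaling (an illustration consistent with the description, not the authors' actual processing code, and using made-up numbers):

```python
def normalize_score(score, initial_score, human_score):
    """Linearly rescale a raw benchmark score so that the human
    baseline maps to 0 and the system's initial score maps to -100.

    Illustrative only; the exact processing in Kiela et al. (2023)
    may differ.
    """
    return (score - human_score) / (human_score - initial_score) * 100

# Hypothetical example: human baseline 90, initial AI score 40.
assert normalize_score(40, 40, 90) == -100  # initial performance
assert normalize_score(90, 40, 90) == 0     # human-level performance
assert normalize_score(95, 40, 90) == 10    # surpasses the human baseline
```

Under this scheme, a positive value means the system has exceeded the human benchmark, which is how the chart marks the crossover points.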
The brief history of artificial intelligence: the world has changed fast — what might be next?
Max Roser
Sources and processing
This data is based on the following sources
Kiela et al. – Dynabench: Rethinking Benchmarking in NLP
This dataset captures the progression of AI evaluation benchmarks, reflecting their adaptation to the rapid advancements in AI technology. The benchmarks cover a wide range of tasks, from language understanding to image processing, and are designed to test AI models' capabilities in various domains. The dataset includes performance metrics for each benchmark, providing insights into AI models' proficiency in different areas of machine learning research.
BBH (BIG-Bench Hard): This benchmark serves as a rigorous evaluation framework for advanced language models, targeting their capacity for complex reasoning and problem-solving. It identifies tasks where AI models traditionally underperform compared to human benchmarks, emphasizing the enhancement of AI reasoning through innovative prompting methods like Chain-of-Thought.
GLUE (General Language Understanding Evaluation): GLUE is a comprehensive benchmark suite designed to assess the breadth of an AI model's language understanding capabilities across a variety of tasks, including sentiment analysis, textual entailment, and question answering. It aims to advance the field towards more generalized models of language comprehension.
GSM8K: This dataset challenges AI models with a collection of grade-school math word problems, designed to test computational and reasoning abilities. By requiring models to perform a sequence of arithmetic operations, GSM8K evaluates the AI's capacity for multi-step mathematical problem-solving.
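To make "multi-step" concrete, here is a GSM8K-style word problem (invented here for illustration, not drawn from the actual dataset) with the chain of arithmetic steps a model is expected to carry out:

```python
# Hypothetical GSM8K-style problem (not from the actual dataset):
# "A bakery sells muffins for $3 each. On Monday it sells 14 muffins,
#  and on Tuesday it sells twice as many. What is the total revenue?"

price = 3
monday = 14
tuesday = 2 * monday              # step 1: "twice as many" -> 28
total_muffins = monday + tuesday  # step 2: 14 + 28 = 42
revenue = price * total_muffins   # step 3: 3 * 42 = 126
print(revenue)  # 126
```

Each problem requires chaining several such operations correctly; a single arithmetic slip anywhere in the sequence yields a wrong final answer, which is what makes the benchmark a test of reasoning rather than recall.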
HellaSwag: HellaSwag assesses AI models on their ability to predict the continuation of scenarios, demanding a nuanced understanding of context and narrative. This benchmark pushes the boundaries of predictive modeling and contextual comprehension within AI systems.
HumanEval: Targeting the intersection of AI and software development, HumanEval presents programming challenges to evaluate the code generation capabilities of AI models. This benchmark tests models' understanding of coding logic
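A HumanEval-style task pairs a function signature and docstring with hidden unit tests; the model must produce a body that passes them. A toy example in that format (invented here, not an actual HumanEval problem):

```python
def running_sum(nums):
    """Return the list of cumulative sums of nums.

    E.g. running_sum([1, 2, 3]) -> [1, 3, 6].
    """
    # A model's completion is graded by whether it passes
    # unit tests like the ones below.
    out, total = [], 0
    for n in nums:
        total += n
        out.append(total)
    return out

assert running_sum([1, 2, 3]) == [1, 3, 6]
assert running_sum([]) == []
```

Scoring by test execution rather than text similarity is what lets the benchmark measure functional correctness of generated code.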
... (truncated, 15 KB total)