Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Our World in Data

A widely-cited empirical resource for tracking AI capability trajectories; useful when grounding claims about AI progress, timeline discussions, or benchmarking in the AI safety and governance literature.

Metadata

Importance: 62/100
Type: dataset

Summary

An interactive dataset and visualization tracking AI performance across multiple domains—including language understanding, image recognition, and reasoning—relative to human-level baselines. It charts the historical progression of AI capabilities, illustrating where and when AI systems have surpassed human benchmarks. Useful for understanding the pace and trajectory of AI capability development.

Key Points

  • Tracks AI test scores across diverse domains including language, vision, and problem-solving relative to human performance baselines.
  • Visualizes the historical trend of AI surpassing human-level performance on various standardized benchmarks.
  • Provides a comparative framework useful for assessing progress in AI capabilities over time.
  • Highlights how rapidly AI has advanced in specific domains, relevant to discussions of transformative AI timelines.
  • Data sourced from prominent AI benchmarks, making it a useful reference for empirical claims about capability growth.

Review

This source is a systematic compilation of AI benchmark data, tracking the progression of artificial intelligence capabilities across multiple domains. By setting human performance to zero and normalizing each AI system's initial score to -100, the dataset makes progress directly comparable across areas such as language understanding, image recognition, mathematical reasoning, and code generation. The compilation matters for AI safety because it provides empirical evidence of how quickly AI capabilities have evolved, highlighting both rapid progress and persistent limitations. Benchmarks like BBH, MMLU, and HumanEval demonstrate AI's growing sophistication in complex reasoning, knowledge application, and problem-solving. At the same time, the uneven performance across domains underscores the need for comprehensive evaluation and careful development of AI systems to ensure alignment with human values and capabilities.
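A minimal sketch of how such a normalization could be computed is shown below. It assumes per-benchmark raw scores, a human baseline, and the first recorded AI score; the function name, data layout, and example numbers are illustrative assumptions, not Our World in Data's or Kiela et al.'s actual processing pipeline.

```python
def normalize_score(raw: float, human_baseline: float, initial_ai: float) -> float:
    """Rescale a raw benchmark score so the human baseline maps to 0
    and the first recorded AI result maps to -100 (illustrative only)."""
    if human_baseline == initial_ai:
        raise ValueError("Human baseline and initial AI score must differ")
    return 100 * (raw - human_baseline) / (human_baseline - initial_ai)


# Hypothetical benchmark: humans score 90 raw points, the first AI system
# scored 40, and later systems score 85 and 92.
print(normalize_score(40, human_baseline=90, initial_ai=40))  # -100.0 (initial AI system)
print(normalize_score(85, human_baseline=90, initial_ai=40))  # -10.0 (approaching human level)
print(normalize_score(92, human_baseline=90, initial_ai=40))  # 4.0 (above the human baseline)
```

Under this scheme a crossing of the zero line marks the point where an AI system first matches the human baseline on that benchmark, which is what the chart visualizes.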

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 15 KB
Test scores of AI systems on various capabilities relative to human performance – Our World in Data

 About this data

 Human performance, as the benchmark, is set to zero. The capability of each AI system is normalized to an initial performance of -100.

 Source: Kiela et al. (2023) – with minor processing by Our World in Data
 Last updated: April 2, 2024
 Next expected update: May 2026
 Date range: 1998–2023

 Related research and writing

 The brief history of artificial intelligence: the world has changed fast — what might be next?

 Max Roser
 Sources and processing

 This data is based on the following sources

 Kiela et al. – Dynabench: Rethinking Benchmarking in NLP

 This dataset captures the progression of AI evaluation benchmarks, reflecting their adaptation to the rapid advancements in AI technology. The benchmarks cover a wide range of tasks, from language understanding to image processing, and are designed to test AI models' capabilities in various domains. The dataset includes performance metrics for each benchmark, providing insights into AI models' proficiency in different areas of machine learning research.

 
 BBH (BIG-Bench Hard): This benchmark serves as a rigorous evaluation framework for advanced language models, targeting their capacity for complex reasoning and problem-solving. It identifies tasks where AI models traditionally underperform compared to human benchmarks, emphasizing the enhancement of AI reasoning through innovative prompting methods like Chain-of-Thought.

 GLUE (General Language Understanding Evaluation): GLUE is a comprehensive benchmark suite designed to assess the breadth of an AI model's language understanding capabilities across a variety of tasks, including sentiment analysis, textual entailment, and question answering. It aims to advance the field towards more generalized models of language comprehension.

 GSM8K: This dataset challenges AI models with a collection of grade-school math word problems, designed to test computational and reasoning abilities. By requiring models to perform a sequence of arithmetic operations, GSM8K evaluates the AI's capacity for engaging in multi-step mathematical problem-solving.

 HellaSwag: HellaSwag assesses AI models on their ability to predict the continuation of scenarios, demanding a nuanced understanding of context and narrative. This benchmark pushes the boundaries of predictive modeling and contextual comprehension within AI systems.

 HumanEval: Targeting the intersection of AI and software development, HumanEval presents programming challenges to evaluate the code generation capabilities of AI models. This benchmark tests models' understanding of coding logic 

... (truncated, 15 KB total)
Resource ID: 653a55bdf7195c0c | Stable ID: sid_BMkMwmjm23