Inspect Evals: Community Evaluation Repository for AI Models

Web: inspect.aisi.org.uk/evals/

Maintained by the UK AI Safety Institute, this repository provides standardized implementations of major AI benchmarks useful for capability assessment and safety evaluation research.

Metadata

Importance: 62/100 · tool page

Summary

Inspect Evals is a repository of community-contributed AI model evaluations maintained by the UK AI Safety Institute (AISI), featuring implementations of popular benchmarks across coding, reasoning, safety, and agent capabilities. Evaluations can be installed and run with a single command against any model, covering domains from Python coding to cybersecurity to scientific reasoning.
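As a rough illustration of that workflow, here is a minimal sketch using Inspect's Python API (the install commands, task reference, and model identifier are assumptions based on the Inspect documentation, not details taken from this page):

    # Assumed setup, run once in a shell:
    #   pip install inspect-ai
    #   pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
    from inspect_ai import eval

    # Run one benchmark implementation against a chosen model; both the
    # registry name "inspect_evals/humaneval" and the model identifier
    # "openai/gpt-4o" are illustrative.
    logs = eval("inspect_evals/humaneval", model="openai/gpt-4o")

The equivalent single command from a shell would be: inspect eval inspect_evals/humaneval --model openai/gpt-4o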

Key Points

  • Comprehensive collection of pip-installable benchmarks spanning coding (HumanEval, SWE-bench, MBPP), reasoning, safety, and agent evaluation categories
  • Maintained by UK AISI as part of the Inspect evaluation framework, enabling standardized model comparison across diverse capability domains
  • Includes safety-relevant evals such as cybersecurity, autonomous agents, and research replication benchmarks (PaperBench, MLE-bench)
  • Serves dual purpose as both a practical evaluation toolkit and a learning resource demonstrating diverse evaluation techniques
  • Covers frontier-relevant capabilities like software engineering (SWE-bench), ML engineering (MLE-bench), and scientific code generation (SciCode)

Cited by 1 page

Page | Type | Quality
UK AI Safety Institute | Organization | 52.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 30 KB
Inspect Evals – Inspect 

 Categories

 All (107) · Agent (25) · AI R&D (1) · Assistants (9) · Bias (2) · Coding (14) · Cybersecurity (11) · Editing (1) · Game (1) · Healthcare (1) · Knowledge (20) · Mathematics (6) · Medical (1) · Multimodal (9) · Personality (1) · Reasoning (20) · Safeguards (15) · Scheming (4) · Vim (1) · Writing (1)

 Inspect Evals is a repository of community-contributed evaluations featuring implementations of many popular benchmarks and papers.

 These evals can be pip installed and run with a single command against any model. They are also useful as a learning resource as they demonstrate a wide variety of evaluation types and techniques.
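 The implementations share a common anatomy: a dataset of samples, a solver that elicits model output, and a scorer that grades it. A minimal sketch of that structure, assuming Inspect's documented Python API (the task name and sample content below are invented for illustration):

    from inspect_ai import Task, task
    from inspect_ai.dataset import Sample
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate

    @task
    def tiny_eval():
        # dataset: what the model is asked; solver: how it is prompted;
        # scorer: how the output is graded against the target.
        return Task(
            dataset=[Sample(input="Reply with exactly one word: Hello", target="Hello")],
            solver=[generate()],
            scorer=match(),
        )

 The published benchmarks swap in larger datasets, multi-step agent solvers, and sandboxed-execution scorers, but follow the same shape.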

 Coding

 HumanEval: Python Function Generation from Instructions
 Assesses how accurately language models can write correct Python functions based solely on natural-language instructions provided as docstrings.
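 To make the format concrete: each HumanEval problem supplies a function signature and docstring, and a completion counts as correct only if it passes the problem's unit tests, which are withheld from the prompt. A hand-written illustration in the same style (not an actual HumanEval item):

    def is_palindrome(text: str) -> bool:
        """Return True if text reads the same forwards and backwards."""
        # The body below stands in for a model-generated completion.
        return text == text[::-1]

    # Grading executes assertions like these against the completion.
    assert is_palindrome("level")
    assert not is_palindrome("hello")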

 MBPP: Basic Python Coding Challenges
 Measures the ability of language models to generate short Python programs from simple natural-language descriptions, testing basic coding proficiency.

 SWE-bench Verified: Resolving Real-World GitHub Issues
 Evaluates AI's ability to resolve genuine software engineering issues sourced from 12 popular Python GitHub repositories, reflecting realistic coding and debugging scenarios.

 MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
 Evaluates agents on machine learning engineering tasks drawn from 75 Kaggle competitions.

 DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
 Code generation benchmark with a thousand data science problems spanning seven Python libraries. 

 BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
 Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.

 ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
 Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. The accompanying study shows that LLMs perform worse on class-level tasks than on method-level tasks.

 SciCode: A Research Coding Benchmark Curated by Scientists
 SciCode tests the ability of language models to generate code to solve scientific research problems. It assesses models on 65 problems from mathematics, physics, chemistry, biology, and materials science.

 APPS: Automated Programming Progress Standard
 APPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels: 1,000 introductory, 3,000 interview-level, and 1,000 competition-level problems. The dataset consists of an additional 5,000

... (truncated, 30 KB total)
Resource ID: 5110fa50a77a1872 | Stable ID: sid_lu3Fj2lphT