
UK AI Safety Institute's Inspect framework

Web: inspect.aisi.org.uk

Inspect is a practical evaluation toolkit from the UK government's AI Safety Institute, relevant to researchers building safety benchmarks or conducting model evaluations; note that current tags like 'interpretability' and 'rlhf' appear mismatched to this resource's actual focus on evaluation infrastructure.

Metadata

Importance: 65/100 · tool page · tool

Summary

Inspect is an open-source framework developed by the UK AI Safety Institute (AISI) for evaluating large language models and AI systems. It provides standardized tools for running safety evaluations, benchmarks, and red-teaming tasks. The framework enables researchers and developers to assess AI model capabilities and safety properties in a reproducible and extensible way.

Key Points

  • Open-source Python framework for conducting rigorous AI model evaluations and benchmarks developed by the UK AISI
  • Supports a wide range of evaluation tasks including reasoning, coding, safety, and agentic capability assessments
  • Designed for reproducibility and extensibility, allowing custom solvers, scorers, and datasets to be integrated (a custom scorer is sketched after this list)
  • Part of AISI's broader mission to provide public infrastructure for AI safety testing and frontier model evaluation
  • Enables standardized comparisons across models and facilitates third-party safety auditing workflows
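
As a rough illustration of that extensibility, here is what a custom scorer might look like using the inspect_ai package's documented scorer decorator. This is a sketch, not framework code: the substring_match name and its matching rule are hypothetical.

   from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
   from inspect_ai.solver import TaskState

   @scorer(metrics=[accuracy()])
   def substring_match():
       # hypothetical scorer: mark CORRECT when the target text
       # appears verbatim in the model's completion
       async def score(state: TaskState, target: Target) -> Score:
           answer = state.output.completion
           return Score(
               value=CORRECT if target.text in answer else INCORRECT,
               answer=answer,
           )
       return score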

Cited by 6 pages

Cached Content Preview

HTTP 200 · Fetched Apr 10, 2026 · 14 KB
Inspect

 Welcome

 Welcome to Inspect, a framework for large language model evaluations created by the UK AI Security Institute.

 Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Core features of Inspect include:

 
 
  • A set of straightforward interfaces for implementing evaluations and re-using components across evaluations.
  • A collection of over 100 pre-built evaluations ready to run on any model.
  • Extensive tooling, including a web-based Inspect View tool for monitoring and visualizing evaluations and a VS Code Extension that assists with authoring and debugging.
  • Flexible support for tool calling: custom and MCP tools, as well as built-in bash, python, text editing, web search, web browsing, and computer tools.
  • Support for agent evaluations, including flexible built-in agents, multi-agent primitives, and the ability to run arbitrary external agents like Claude Code, Codex CLI, and Gemini CLI.
  • A sandboxing system that supports running untrusted model code in Docker, Kubernetes, Modal, Proxmox, and other systems via an extension API (see the sketch after this list).
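
 As a sketch of how that sandboxing is typically wired up, assuming inspect_ai's Task-level sandbox parameter; the task below and its target are illustrative, not taken from the documentation:

   from inspect_ai import Task, task
   from inspect_ai.dataset import Sample
   from inspect_ai.scorer import includes
   from inspect_ai.solver import generate, use_tools
   from inspect_ai.tool import bash

   @task
   def sandboxed_probe():
       # illustrative task: the bash tool executes model-issued commands
       # inside a Docker container rather than on the host
       return Task(
           dataset=[Sample(input="Use bash to print the current user.", target="root")],
           solver=[use_tools(bash()), generate()],
           scorer=includes(),
           sandbox="docker",  # other backends plug in via the extension API
       )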

 
 
 We’ll walk through a fairly trivial “Hello, Inspect” example below. Read on to learn the basics, then read the documentation on Datasets, Solvers, Scorers, Tools, and Agents to learn how to create more advanced evaluations.
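
 The walkthrough itself falls outside this cached preview, but a minimal task along those lines, sketched against the documented inspect_ai primitives, might look like:

   from inspect_ai import Task, task
   from inspect_ai.dataset import Sample
   from inspect_ai.scorer import exact
   from inspect_ai.solver import generate

   @task
   def hello_world():
       # a single-sample dataset: ask for a fixed reply and
       # score it with an exact-match comparison
       return Task(
           dataset=[Sample(input="Just reply with Hello World", target="Hello World")],
           solver=[generate()],
           scorer=exact(),
       )

 A task like this is then run with the inspect eval command shown under Getting Started below.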

 If you are primarily interested in running evaluations rather than developing new evaluations, Inspect Evals provides implementations for a large collection of popular benchmarks.

 
 
 Getting Started

 To get started using Inspect:

 
 Install Inspect from PyPI with:

 pip install inspect-ai 

 If you are using VS Code, install the Inspect VS Code Extension (not required but highly recommended).

 
 To develop and run evaluations, you’ll also need access to a model, which typically requires installation of a Python package as well as ensuring that the appropriate API key is available in the environment.

 Assuming you had written an evaluation in a script named arc.py, here’s how you would set up and run the eval for a few different model providers:

 
 OpenAI:

   pip install openai
   export OPENAI_API_KEY=your-openai-api-key
   inspect eval arc.py --model openai/gpt-4o

 Anthropic:

   pip install anthropic
   export ANTHROPIC_API_KEY=your-anthropic-api-key
   inspect eval arc.py --model anthropic/claude-sonnet-4-0

 Google:

   pip install google-genai
   export GOOGLE_API_KEY=your-google-api-key
   inspect eval arc.py --model google/gemini-2.5-pro

 Grok:

   pip install openai
   export GROK_API_KEY=your-grok-api-key
   inspect eval arc.py --model grok/grok-3-mini

 Mistral:

   pip install mistralai
   export MISTRAL_API_KEY=your-mistral-api-key
   inspect eval arc.py --model mistral/mistral-large-latest

 HF:

   pip install torch transformers
   export HF_TOKEN=your-hf-token

... (truncated, 14 KB total)
Resource ID: fc3078f3c2ba5ebb | Stable ID: sid_0iP9XlmUtA