UK AI Safety Institute's Inspect framework
inspect.aisi.org.uk/
Inspect is a practical evaluation toolkit from the UK government's AI Safety Institute, relevant to researchers building safety benchmarks or conducting model evaluations; note that current tags like 'interpretability' and 'rlhf' appear mismatched to this resource's actual focus on evaluation infrastructure.
Metadata
Importance: 65/100 | tool page | tool
Summary
Inspect is an open-source framework developed by the UK AI Safety Institute (AISI) for evaluating large language models and AI systems. It provides standardized tools for running safety evaluations, benchmarks, and red-teaming tasks. The framework enables researchers and developers to assess AI model capabilities and safety properties in a reproducible and extensible way.
Key Points
- Open-source Python framework, developed by the UK AISI, for conducting rigorous AI model evaluations and benchmarks
- Supports a wide range of evaluation tasks including reasoning, coding, safety, and agentic capability assessments
- Designed for reproducibility and extensibility, allowing custom solvers, scorers, and datasets to be integrated
- Part of AISI's broader mission to provide public infrastructure for AI safety testing and frontier model evaluation
- Enables standardized comparisons across models and facilitates third-party safety auditing workflows
Cited by 6 pages
| Page | Type | Quality |
|---|---|---|
| UK AI Safety Institute | Organization | 52.0 |
| AI Safety Institutes (AISIs) | Policy | 69.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| International AI Safety Summit Series | Event | 63.0 |
| Scalable Eval Approaches | Approach | 65.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
HTTP 200 | Fetched Apr 10, 2026 | 14 KB
Inspect
Welcome
Welcome to Inspect, a framework for large language model evaluations created by the UK AI Security Institute.
Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Core features of Inspect include:
- A set of straightforward interfaces for implementing evaluations and re-using components across evaluations.
- A collection of over 100 pre-built evaluations ready to run on any model.
- Extensive tooling, including a web-based Inspect View tool for monitoring and visualizing evaluations and a VS Code Extension that assists with authoring and debugging.
- Flexible support for tool calling: custom and MCP tools, as well as built-in bash, python, text editing, web search, web browsing, and computer tools (see the sketch after this list).
- Support for agent evaluations, including flexible built-in agents, multi-agent primitives, and the ability to run arbitrary external agents like Claude Code, Codex CLI, and Gemini CLI.
- A sandboxing system that supports running untrusted model code in Docker, Kubernetes, Modal, Proxmox, and other systems via an extension API.
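As a hedged sketch of what tool calling looks like in a task definition (the task name, sample, scorer choice, and Docker sandbox setting below are assumptions drawn from Inspect's documented API, not code from this cached preview):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash

@task
def report_cwd():
    # Hypothetical agent-style task: the model may call the built-in bash tool,
    # and its tool calls execute inside a Docker sandbox rather than on the host.
    return Task(
        dataset=[Sample(input="Run pwd and report the current working directory.", target="/")],
        solver=[use_tools(bash()), generate()],
        scorer=includes(),
        sandbox="docker",
    )
```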
We’ll walk through a fairly trivial “Hello, Inspect” example below. Read on to learn the basics, then read the documentation on Datasets , Solvers , Scorers , Tools , and Agents to learn how to create more advanced evaluations.
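The "Hello, Inspect" listing itself falls outside this cached preview; a minimal sketch of what such a task typically looks like (the task name and exact-match scorer are assumptions, not the page's own code):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_world():
    # A one-sample dataset, a plain generate() solver, and an exact-match scorer.
    return Task(
        dataset=[Sample(input="Just reply with Hello World", target="Hello World")],
        solver=[generate()],
        scorer=exact(),
    )
```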
If you are primarily interested in running evaluations rather than developing new evaluations, Inspect Evals provides implementations for a large collection of popular benchmarks.
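For example (the package and task names here are assumptions about the separately published inspect_evals collection, shown for illustration rather than taken from this page):
pip install inspect-evals
inspect eval inspect_evals/gsm8k --model openai/gpt-4o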
Getting Started
To get started using Inspect:
Install Inspect from PyPI with:
pip install inspect-ai
If you are using VS Code, install the Inspect VS Code Extension (not required but highly recommended).
To develop and run evaluations, you’ll also need access to a model, which typically requires installation of a Python package as well as ensuring that the appropriate API key is available in the environment.
Assuming you have written an evaluation in a script named arc.py, here's how you would set up and run the eval for a few different model providers:
OpenAI
pip install openai
export OPENAI_API_KEY=your-openai-api-key
inspect eval arc.py --model openai/gpt-4o

Anthropic
pip install anthropic
export ANTHROPIC_API_KEY=your-anthropic-api-key
inspect eval arc.py --model anthropic/claude-sonnet-4-0

Google
pip install google-genai
export GOOGLE_API_KEY=your-google-api-key
inspect eval arc.py --model google/gemini-2.5-pro

Grok
pip install openai
export GROK_API_KEY=your-grok-api-key
inspect eval arc.py --model grok/grok-3-mini

Mistral
pip install mistralai
export MISTRAL_API_KEY=your-mistral-api-key
inspect eval arc.py --model mistral/mistral-large-latest

HF
pip install torch transformers
export HF_TOKEN=your-hf-token
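Alongside the CLI recipes above, evaluations can also be launched directly from Python via inspect_ai's eval() function; a minimal sketch, assuming arc.py defines an @task function named arc (a hypothetical name, since the page only names the script):

```python
from inspect_ai import eval
from arc import arc  # hypothetical import; the page only says the eval lives in arc.py

# Runs the task against the chosen provider/model and returns the evaluation logs.
logs = eval(arc(), model="openai/gpt-4o")
```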
... (truncated, 14 KB total)
Resource ID: fc3078f3c2ba5ebb | Stable ID: sid_0iP9XlmUtA