Pre-Deployment evaluation of OpenAI's o1 model
Government
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: UK AI Safety Institute
This is an official government evaluation report from AISI (UK) and its US counterpart, representing one of the first formal pre-deployment government safety assessments of a frontier model and a key case study in operationalizing AI governance frameworks.
Metadata
Summary
The US and UK AI Safety Institutes jointly conducted a pre-deployment safety evaluation of OpenAI's o1 reasoning model, assessing its capabilities in cyber, biological, and software development domains. The evaluation benchmarked o1 against reference models to identify potential risks before public release. This represents an early example of government-led pre-deployment AI safety testing through formal institute collaboration.
Key Points
- First major joint evaluation between the US and UK AI Safety Institutes, establishing a model for international government cooperation on AI safety assessments.
- Tested o1 across high-risk capability domains including cybersecurity, biological threats, and software development to identify uplift risks.
- Compared o1's performance against multiple reference models to contextualize capability levels and potential for misuse.
- Evaluation occurred pre-deployment, reflecting a proactive rather than reactive approach to frontier model risk assessment.
- Represents an operationalization of safety commitments made by frontier AI labs to engage with government safety bodies before release.
Review
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Evals-Based Deployment Gates | Approach | 66.0 |
| International AI Safety Summit Series | Event | 63.0 |
| Third-Party Model Auditing | Approach | 64.0 |
Cached Content Preview
Pre-Deployment evaluation of OpenAI’s o1 model | AISI Work
The UK Artificial Intelligence Safety Institute and the U.S. Artificial Intelligence Safety Institute conducted a joint pre-deployment evaluation of OpenAI's o1 model
Dec 18, 2024. Note to readers: we changed our name to the AI Security Institute on 14 February 2025.
Introduction
The UK Artificial Intelligence Safety Institute (UK AISI) and the U.S. Artificial Intelligence Safety Institute (US AISI) conducted a joint pre-deployment evaluation of OpenAI’s latest model, o1 (released December 5, 2024).
The following is a high-level overview of the evaluations conducted, as well as a snapshot of the findings from each domain tested. A more detailed technical report can be found here.
Overview of the Joint Safety Research & Testing Exercise
US AISI and UK AISI conducted testing during a limited period of pre-deployment access to the o1 model. Testing was conducted by expert engineers, scientists, and subject matter specialists from staff at both Institutes, and the findings were shared with OpenAI before the model was publicly released.
US AISI and UK AISI ran separate but complementary tests to assess the model’s capabilities across three domains: (1) cyber capabilities, (2) biological capabilities, and (3) software and AI development.
To assess the model’s relative capabilities and evaluate the potential real-world impacts of o1 across these areas, US AISI and UK AISI compared its performance to a series of similar reference models: OpenAI’s o1-preview, OpenAI’s GPT-4o, and both the upgraded and earlier versions of Anthropic’s Claude 3.5 Sonnet.
These comparisons are intended only to assess the relative capability improvements of o1, in order to improve scientific interpretation of evaluation results.
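As a rough illustration of this kind of reference-model comparison, the sketch below computes per-domain score deltas between a target model and a set of reference models. The domain names, score values, and scoring scheme are placeholders, not the Institutes' actual data or evaluation harness.

```python
# Hypothetical sketch: contextualising a target model's scores against reference models.
# Model names mirror those listed above; the scores and domains are illustrative only.

REFERENCE_MODELS = ["o1-preview", "gpt-4o", "claude-3.5-sonnet-old", "claude-3.5-sonnet-new"]

def relative_capability(scores: dict[str, dict[str, float]], target: str) -> dict[str, float]:
    """Return the target model's score minus the best reference score, per domain.

    `scores` maps domain -> {model name -> score}; each domain is assumed to
    include the target model and at least one reference model.
    """
    deltas = {}
    for domain, by_model in scores.items():
        best_reference = max(by_model[m] for m in REFERENCE_MODELS if m in by_model)
        deltas[domain] = by_model[target] - best_reference
    return deltas

# Example with made-up numbers: positive deltas indicate relative uplift over references.
example = {
    "cyber": {"o1": 0.42, "o1-preview": 0.38, "gpt-4o": 0.30},
    "bio":   {"o1": 0.55, "o1-preview": 0.51, "gpt-4o": 0.47},
}
print(relative_capability(example, "o1"))
```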
The version of o1 that was tested exhibited a number of performance issues related to tool-calling and output formatting. US AISI and UK AISI took steps to address these issues by adapting their agent designs, including adjusting prompts and introducing simple mechanisms to help the agent recover from errors. The results below reflect o1’s performance with this scaffolding in place. A version of o1 that was better optimized for tool use might exhibit better performance on many evaluations. This report makes no claims about the performance of other versions of o1.
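A minimal sketch of the kind of error-recovery scaffolding described above is given below, assuming a generic tool-calling agent loop. The function and exception names are hypothetical; the Institutes' actual agent harnesses are not described in this post.

```python
# Hypothetical sketch of agent scaffolding that reprompts the model when its
# tool call is malformed. `call_model` and `parse_tool_call` are placeholders.

MAX_RETRIES = 3

def run_agent_step(history, call_model, parse_tool_call):
    """Ask the model for its next tool call, reprompting on malformed output."""
    for _attempt in range(MAX_RETRIES):
        raw_output = call_model(history)
        try:
            return parse_tool_call(raw_output)  # e.g. validate the JSON arguments
        except ValueError as err:
            # Feed the formatting error back so the model can correct itself,
            # analogous to the "simple mechanisms to help the agent recover
            # from errors" mentioned above.
            history.append({
                "role": "user",
                "content": f"Your last tool call was malformed ({err}). "
                           "Please resend it in the expected format.",
            })
    raise RuntimeError("Agent failed to produce a well-formed tool call.")
```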
Methodology
US AISI and UK AISI tested the model by drawing on a range of techniques including:
Question Answering: The model was asked to correctly answer a series of questions that test knowledge or problem solving on a given topic. Answers were most often graded automatically by another model, then checked by a human with knowledge of the correct answers.
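The sketch below illustrates this model-graded question-answering flow, with each answer flagged for a subsequent human check. The callables and prompt wording are assumptions for illustration, not the actual grading harness.

```python
# Hypothetical sketch of automatic grading by another model, followed by human review.
# `answer_model` and `grader_model` are placeholder callables taking a prompt string.

def grade_answers(questions, answer_key, answer_model, grader_model):
    results = []
    for question, reference in zip(questions, answer_key):
        candidate = answer_model(question)
        verdict = grader_model(
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {candidate}\n"
            "Reply with 'correct' or 'incorrect'."
        )
        results.append({
            "question": question,
            "candidate": candidate,
            "auto_grade": verdict.strip().lower() == "correct",
            # A subject-matter expert re-checks the automatic grade,
            # as described in the methodology above.
            "needs_human_check": True,
        })
    return results
```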
Agent Tasks: The model opera
... (truncated, 9 KB total) | Stable ID: sid_peJnCBcxL9