Pre-Deployment Evaluation of OpenAI's o1 Model
Credibility Rating (government)
5/5
Gold (5): Gold standard. Rigorous peer review, high editorial standards, and strong institutional reputation.
Rating inherited from publication venue: NIST
This is a landmark government-led safety evaluation representing one of the first formal pre-deployment assessments of a frontier AI model by national safety institutes, relevant to discussions of AI governance frameworks and capability evaluations.
Metadata
Importance: 72/100 · organizational report · primary source
Summary
The US and UK AI Safety Institutes conducted a joint pre-deployment evaluation of OpenAI's o1 model, assessing its capabilities and potential for misuse across three domains: cyber capabilities, biological capabilities, and software and AI development. The evaluation compared o1's performance to a set of reference models and represents an early example of government-led frontier AI safety testing prior to public release.
Key Points
- Among the first joint evaluations by US AISI (NIST) and UK AISI of a frontier model before public deployment, marking a milestone in international AI safety cooperation.
- Testing spanned three domains assessing both capabilities and potential harms, including risks relevant to biosecurity and other high-stakes areas.
- Results compared o1 against reference models to contextualize capability jumps and identify whether new risks emerged with the more advanced reasoning model.
- Demonstrates emerging government infrastructure for pre-deployment safety evaluations as a policy mechanism for frontier AI oversight.
- Part of a broader framework stemming from the UK AI Safety Summit commitments where leading AI developers agreed to share models with safety institutes.
Review
The study represents a significant collaborative effort to systematically evaluate an advanced AI model's capabilities and potential safety implications before public deployment. By conducting rigorous testing across cyber capabilities, biological research tasks, and software development challenges, the institutes aimed to understand the model's performance, limitations, and potential dual-use risks. The methodology employed a multi-faceted approach, including question answering, agent tasks, and qualitative probing, with expert involvement from various government agencies. While the findings suggest o1's performance is largely comparable to reference models, notable advances were observed in cryptography-related cyber challenges. The research underscores the importance of pre-deployment safety assessments, acknowledging the preliminary nature of the findings and the rapidly evolving landscape of AI safety research.
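The comparative methodology described above (the same task set run against o1 and a series of reference models, with scores reported side by side) can be pictured with a toy harness. The sketch below is purely hypothetical: `query_model`, the `Task` structure, and exact-match grading are stand-ins invented for illustration, not the Institutes' actual tooling, tasks, or scoring.

```python
# Purely illustrative sketch of a comparative capability evaluation harness.
# `query_model` is a hypothetical placeholder for real model access; the actual
# US AISI / UK AISI tasks, tooling, and grading are not described on this page.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    expected: str  # reference answer, used here for simple exact-match grading


def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical model call; swap in a real inference client."""
    raise NotImplementedError("illustrative stub only")


def evaluate(model_name: str, tasks: list[Task]) -> float:
    """Fraction of tasks answered correctly under exact-match grading."""
    correct = sum(
        1
        for task in tasks
        if query_model(model_name, task.prompt).strip().lower()
        == task.expected.strip().lower()
    )
    return correct / len(tasks)


def compare(candidate: str, references: list[str], tasks: list[Task]) -> dict[str, float]:
    """Score the candidate and each reference model on the same task set."""
    return {name: evaluate(name, tasks) for name in (candidate, *references)}
```

Real evaluations of this kind also include agent tasks and qualitative probing, as the review notes, so any production harness would be considerably more elaborate than this exact-match example.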
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| US AI Safety Institute (now CAISI) | Organization | 91.0 |
| AI Safety Institutes (AISIs) | Policy | 69.0 |
| US Executive Order on Safe, Secure, and Trustworthy AI | Policy | 91.0 |
| AI Value Lock-in | Risk | 64.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 10 KB
Pre-Deployment Evaluation of OpenAI's o1 Model | NIST
https://www.nist.gov/news-events/news/2024/12/pre-deployment-evaluation-openais-o1-model
Pre-Deployment Evaluation of OpenAI's o1 Model
The U.S. AI Safety Institute and the UK AI Safety Institute conducted joint pre-deployment testing of OpenAI's o1 Model
December 18, 2024
The U.S. Artificial Intelligence Safety Institute (US AISI) and the UK Artificial Intelligence Safety Institute (UK AISI) conducted a joint pre-deployment evaluation of OpenAI’s latest model, o1 (released December 5, 2024).
The following is a high-level overview of the evaluations conducted, as well as a snapshot of the findings from each domain tested. A more detailed technical report can be found here.
Overview of the Joint Safety Research & Testing Exercise
US AISI and UK AISI conducted testing during a limited period of pre-deployment access to the o1 model. Testing was conducted by expert engineers, scientists, and subject matter specialists from staff at both Institutes, and the findings were shared with OpenAI before the model was publicly released.
US AISI and UK AISI ran separate but complementary tests to assess the model's capabilities across three domains: (1) cyber capabilities, (2) biological capabilities, and (3) software and AI development.
To assess the model’s relative capabilities and evaluate the potential real-world impacts of o1 across these areas, US AISI and UK AISI compared its performance to a series of similar reference models: OpenAI’s o1-preview, OpenAI’s GPT-4o, and both the upgraded and earlier version of Anthropic’s Claude 3.5 Sonnet.
These comparisons are intended only to assess the relative capability improvements of o1, in order to improve scientific interpretation of evaluation results.
The version of o1 that was tested exhibited a number of performance issues related to tool-calling and output formatting. US AISI and UK AISI took steps to address these issues by adapting their agent designs, including adjusting prompts and introducing simple mechanisms to help the agent recover from errors. The results below reflect o1’s performance with this scaffolding in place. A version of o1 that was better optimized for tool use might exhibit better performance on many evaluations. This report m
... (truncated, 10 KB total)
Resource ID: be88ea80f559e453 | Stable ID: sid_QQnAi98DNH