Skip to content
Longterm Wiki
Back

OpenAI Deep Research February 2025 - FutureSearch

web

Relevant to AI safety discussions around deployment reliability, calibration, and the gap between perceived and actual AI research capabilities; useful as a concrete empirical evaluation of a widely-used frontier AI tool.

Metadata

Importance: 42/100 · blog post · analysis

Summary

FutureSearch conducted a detailed evaluation of OpenAI's Deep Research tool (OAIDR) shortly after its February 2025 launch, documenting six specific failure cases and identifying key failure modes, including overconfidence, poor source selection, and misinformation by omission. The evaluation concludes that OAIDR outperforms some competing systems but falls significantly short of human researchers, exhibiting a 'jagged frontier' of inconsistent performance.

Key Points

  • OAIDR demonstrates a 'jagged frontier' of performance—inconsistently reliable in ways that make failure hard to predict before deployment
  • Overconfidence is a critical flaw: the system confidently reports wrong answers (e.g., 17.5% vs. 34.5% on Cybench) rather than admitting uncertainty
  • Poor source selection: OAIDR frequently prioritizes company blogs and SEO-spam over authoritative sources, undermining research quality
  • Misinformation by omission is especially dangerous—incomplete research that appears comprehensive may mislead users more than obvious errors
  • Recommended only for non-critical synthesis tasks; risky for topic introductions or time-sensitive factual queries

Cited by 1 page

Page | Type | Quality
FutureSearch | Organization | 50.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 6 KB
OpenAI Deep Research: Honest Analysis and Real Limitations in 2025
The Wayback Machine - http://web.archive.org/web/20260310120909/https://futuresearch.ai/oaidr-feb-2025/


February 19, 2025 • Updated January 7, 2026

By Dan Schwarz

The first release of a Deep Research tool failed to live up to expectations

Key Takeaways

As later evidence from Deep Research Bench shows, OpenAI's Deep Research tool, while initially impressive, actually underperformed the later release of ChatGPT-o3+search

OpenAI Deep Research (OAIDR) shows a "jagged frontier" of performance—better than some competing systems but significantly worse than intelligent humans

Overconfidence is a major issue: OAIDR often reports wrong answers confidently when it should admit uncertainty

Peculiar source selection: The system frequently chooses company blogs or SEO-spam sites over authoritative sources

Risk of misinformation by omission: Incomplete research that appears comprehensive is particularly dangerous

Mixed Reactions to OpenAI Deep Research

On February 3, OpenAI launched Deep Research, its long-form research tool, prompting a flurry of interest. Many were seriously impressed, while others warned of frequent inaccuracies.

FutureSearch conducted a detailed evaluation to understand OAIDR's true capabilities and limitations.

The Verdict: Better Than Some, Worse Than Humans

Our evaluation found that OAIDR is better than some competing systems but still significantly worse than human researchers. The system exhibits what researchers call a "jagged frontier" of performance—inconsistent quality that makes it difficult to predict when it will succeed or fail.

Key Failure Modes

Overconfidence: Reports wrong answers when it should admit uncertainty

Peculiar source selection: Prioritizes unreliable sources over authoritative ones

Difficulty reading complex webpages: Struggles with PDFs, images, and certain website formats

Misinformation by omission: Produces incomplete research that appears comprehensive

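The "peculiar source selection" failure mode can be surfaced in practice by auditing the domains a research tool cites. The sketch below is illustrative only — the domain lists and the `audit_sources` helper are hypothetical examples, not FutureSearch's methodology:

```python
from urllib.parse import urlparse

# Hypothetical tier lists -- a real audit would need a curated taxonomy
# of authoritative vs. SEO-spam domains.
AUTHORITATIVE = {"arxiv.org", "nature.com", "sec.gov"}
LOW_QUALITY = {"example-seo-blog.com", "companyblog.example.com"}

def audit_sources(cited_urls):
    """Tally a report's cited domains by rough quality tier."""
    tally = {"authoritative": 0, "low_quality": 0, "unknown": 0}
    for url in cited_urls:
        domain = urlparse(url).netloc.removeprefix("www.")
        if domain in AUTHORITATIVE:
            tally["authoritative"] += 1
        elif domain in LOW_QUALITY:
            tally["low_quality"] += 1
        else:
            tally["unknown"] += 1
    return tally

report = audit_sources([
    "https://arxiv.org/abs/0000.00000",
    "https://example-seo-blog.com/top-10-benchmarks",
])
```

A tally skewed toward low-quality or unknown domains is a cheap warning sign that the underlying research may need manual verification.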
Recommended Usage

Good for: Synthesizing information where completeness isn't critical

Risky for: Topic introductions (high risk of missing key information)

Potentially useful: Niche, qualitative explorations

Six Strange Failures: Detailed Examples

We tested OAIDR on six research queries where we knew the correct answers. Here's what we found:

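The evaluation pattern just described — queries with known answers, graded for both correctness and admitted uncertainty — can be sketched roughly as follows. The grading function and tolerance are illustrative assumptions, not FutureSearch's actual harness:

```python
def grade(reported, truth, tolerance=0.5):
    """Grade a numeric answer against ground truth.

    Returns 'correct', 'abstained', or 'confident-wrong' -- the last
    being the overconfidence failure mode: a wrong answer stated as fact.
    """
    if reported is None:  # the tool admitted uncertainty
        return "abstained"
    if abs(reported - truth) <= tolerance:
        return "correct"
    return "confident-wrong"

# The Cybench case (Failure #1 below): OAIDR reported 17.5% where the
# highest published score was 34.5%.
result = grade(reported=17.5, truth=34.5)
```

Under this scheme, an abstention is strictly better than a confidently wrong answer — which is precisely the distinction OAIDR's overconfidence collapses.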
Failure #1: Cybench Benchmark Performance

Query: Find the highest reported agent performance on the Cybench benchmark.

OAIDR

... (truncated, 6 KB total)
Resource ID: 386bc4dbf25d7d34 | Stable ID: sid_649AeWcmmh