OpenAI Deep Research February 2025 - FutureSearch
futuresearch.ai/oaidr-feb-2025/
Relevant to AI safety discussions around deployment reliability, calibration, and the gap between perceived and actual AI research capabilities; useful as a concrete empirical evaluation of a widely-used frontier AI tool.
Metadata
Importance: 42/100 · blog post · analysis
Summary
FutureSearch conducted a detailed evaluation of OpenAI's Deep Research tool shortly after its February 2025 launch, documenting six specific failure cases and identifying key failure modes including overconfidence, poor source selection, and misinformation by omission. The evaluation concludes OAIDR outperforms some competing systems but falls significantly short of human researchers, exhibiting a 'jagged frontier' of inconsistent performance.
Key Points
- OAIDR demonstrates a 'jagged frontier' of performance—inconsistently reliable in ways that make failure hard to predict before deployment
- Overconfidence is a critical flaw: the system confidently reports wrong answers (e.g., 17.5% vs. 34.5% on Cybench) rather than admitting uncertainty
- Poor source selection: OAIDR frequently prioritizes company blogs and SEO-spam over authoritative sources, undermining research quality
- Misinformation by omission is especially dangerous—incomplete research that appears comprehensive may mislead users more than obvious errors
- Recommended only for non-critical synthesis tasks; risky for topic introductions or time-sensitive factual queries
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| FutureSearch | Organization | 50.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 6 KB
Archived: http://web.archive.org/web/20260310120909/https://futuresearch.ai/oaidr-feb-2025/
OpenAI Deep Research: Honest Analysis and Real Limitations in 2025
February 19, 2025 • Updated January 7, 2026
By Dan Schwarz
The first release of a Deep Research tool failed to live up to expectations
Key Takeaways
As later evidence from Deep Research Bench shows, OpenAI's Deep Research tool, while initially impressive, actually underperformed the later release of ChatGPT-o3+search
OpenAI Deep Research (OAIDR) shows a "jagged frontier" of performance—better than some competing systems but significantly worse than intelligent humans
Overconfidence is a major issue: OAIDR often reports wrong answers confidently when it should admit uncertainty
Peculiar source selection: The system frequently chooses company blogs or SEO-spam sites over authoritative sources
Risk of misinformation by omission: Incomplete research that appears comprehensive is particularly dangerous
Mixed Reactions to OpenAI Deep Research
On February 3, OpenAI launched Deep Research, its long-form research tool, provoking a flurry of intrigued reactions. Many were seriously impressed, while others warned of frequent inaccuracies.
FutureSearch conducted a detailed evaluation to understand OAIDR's true capabilities and limitations.
The Verdict: Better Than Some, Worse Than Humans
Our evaluation found that OAIDR is better than some competing systems but still significantly worse than human researchers. The system exhibits what researchers call a "jagged frontier" of performance—inconsistent quality that makes it difficult to predict when it will succeed or fail.
Key Failure Modes
Overconfidence: Reports wrong answers when it should admit uncertainty (a way to quantify this is sketched after this list)
Peculiar source selection: Prioritizes unreliable sources over authoritative ones
Difficulty reading complex webpages: Struggles with PDFs, images, and certain website formats
Misinformation by omission: Produces incomplete research that appears comprehensive
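Overconfidence of this kind is measurable: if each evaluated answer is recorded with both the tone of the tool's claim and whether the claim turned out to be correct, a simple calibration check shows whether confident wording actually tracks accuracy. Below is a minimal sketch in Python; the records are invented for illustration and are not FutureSearch's data.

```python
# Minimal calibration check: are confidently-worded answers actually
# more accurate than hedged ones? The records below are made up for
# illustration; a real check would use graded evaluation outputs.
from collections import defaultdict

# (expressed tone of the answer, whether the answer was correct)
records = [
    ("confident", False),
    ("confident", False),
    ("confident", True),
    ("hedged", True),
    ("hedged", False),
    ("hedged", True),
]

buckets = defaultdict(list)
for tone, correct in records:
    buckets[tone].append(correct)

for tone, outcomes in buckets.items():
    accuracy = sum(outcomes) / len(outcomes)
    print(f"{tone}: {accuracy:.0%} accurate over {len(outcomes)} answers")
```

A well-calibrated researcher's confident answers should be right far more often than its hedged ones; the failure pattern described above is confident wording with no corresponding gain in accuracy.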
Recommended Usage
Good for: Synthesizing information where completeness isn't critical
Risky for: Topic introductions (high risk of missing key information)
Potentially useful: Niche, qualitative explorations
Six Strange Failures: Detailed Examples
We tested OAIDR on six research queries where we knew the correct answers; a sketch of this kind of ground-truth check follows, and then the failures themselves.
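FutureSearch has not published its test harness, so the following is only a sketch of the general pattern, assuming a hypothetical `run_deep_research` client and a crude substring-match scoring rule; a real grading pass would also judge sourcing and completeness.

```python
# Illustrative only: FutureSearch's actual harness is not public.
# `run_deep_research` is a hypothetical stand-in for whatever client
# submits a query to OAIDR and returns its final answer as text.
from dataclasses import dataclass

@dataclass
class Case:
    query: str
    ground_truth: str  # the answer the evaluators verified by hand

CASES = [
    Case(
        query="Find the highest reported agent performance on the Cybench benchmark.",
        ground_truth="<hand-verified figure>",  # elided here; see Failure #1
    ),
    # ...the remaining five queries would go here
]

def run_deep_research(query: str) -> str:
    """Hypothetical stand-in; replace with a real call to the tool."""
    raise NotImplementedError

def score(answer: str, truth: str) -> bool:
    """Crude check: does the verified figure appear in the answer?"""
    return truth.lower() in answer.lower()

for case in CASES:
    answer = run_deep_research(case.query)
    verdict = "success" if score(answer, case.ground_truth) else "fail"
    print(f"{verdict}: {case.query}")
```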
Failure #1: Cybench Benchmark Performance
Query: Find the highest reported agent performance on the Cybench benchmark.
OAIDR…
... (truncated, 6 KB total)
Resource ID: 386bc4dbf25d7d34 | Stable ID: sid_649AeWcmmh