OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
web · os-world.github.io/
OSWorld is a prominent benchmark for autonomous computer-use agents; relevant to AI safety discussions around agentic systems, capability evaluation, and the gap between current AI performance and human-level task completion in real-world environments.
Metadata
Importance: 72/100 · tool page · dataset
Summary
OSWorld is a benchmark for evaluating multimodal AI agents performing open-ended tasks in real computer environments across multiple operating systems. It tests agents' ability to use GUIs, execute code, and interact with real applications such as browsers, file systems, and productivity tools. The benchmark shows that state-of-the-art models succeed on far fewer tasks than humans (roughly 12% versus 72%), highlighting a significant capability gap.
Key Points
- Provides a scalable, real-computer environment supporting Windows, macOS, and Linux for evaluating multimodal agents on complex tasks
- Includes 369 real-world computer tasks spanning web browsing, file management, coding, and multi-app workflows
- The best models evaluated achieve only a ~12% success rate versus ~72% for humans, revealing a large performance gap
- Evaluates agents using both GUI-based (screenshot) and accessibility-tree observation modes
- Serves as a key benchmark for measuring progress toward autonomous computer-use AI agents
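The success-rate figures in the key points are fractions of tasks completed. A minimal sketch of that arithmetic follows, assuming each task yields a single pass/fail outcome. The `TaskResult` type, the task IDs, and the toy data are hypothetical and are not part of OSWorld's tooling.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str     # hypothetical identifier, not a real OSWorld task ID
    succeeded: bool  # outcome of the task's pass/fail check


def success_rate(results: list[TaskResult]) -> float:
    """Return the fraction of tasks that passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.succeeded for r in results) / len(results)


# Toy data (an assumption for illustration): 3 of 25 tasks pass, i.e. 12%.
toy_results = [TaskResult(f"task-{i}", i < 3) for i in range(25)]
model_rate = success_rate(toy_results)

human_rate = 0.72  # approximate human figure quoted in the key points
gap_points = (human_rate - model_rate) * 100  # gap in percentage points

print(f"model: {model_rate:.0%}, human: {human_rate:.0%}, gap: {gap_points:.0f} points")
```

With the approximate figures quoted above, the gap is about 60 percentage points.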
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Tool Use and Computer Use | Capability | 67.0 |
| Eval Saturation & The Evals Gap | Approach | 65.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 17 KB
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie 1, Danyang Zhang 1, Jixuan Chen 1, Xiaochuan Li 1, Siheng Zhao 1, Ruisheng Cao 1, Toh Jing Hua 1, Zhoujun Cheng 1, Dongchan Shin 1, Fangyu Lei 1, Yitao Liu 1, Yiheng Xu 1, Shuyan Zhou 3, Silvio Savarese 2, Caiming Xiong 2, Victor Zhong 4, Tao Yu 1
1 The University of Hong Kong,
2 Salesforce Research,
3 Carnegie Mellon University,
4 University of Waterloo
2025-07-28: Major Upgrade! OSWorld has been enhanced and is now OSWorld-Verified, with comprehensive improvements: fixed community-reported examples, AWS support reducing evaluation time to within 1 hour, and updated benchmark results. See the verified benchmark results in the Benchmark section below. Please compare your OSWorld results with the new benchmark results when running the latest version.
**OSWorld** is a first-of-its-kind scalable, real computer environment for multimodal agents,
supporting task setup, execution-based evaluation, and interactive learning across operating systems.
It can serve as a unified environment for evaluating open-ended computer tasks that involve arbitrary
apps (e.g., the task examples in the figure above). We also create a benchmark of 369 real-world computer
tasks in **OSWorld** with reliable, reproducible setup and evaluation scripts. *Note: 8 Google Drive tasks may require manual configuration or can be excluded (361 tasks) due to network dependencies.*
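To make "task setup" and "execution-based evaluation" concrete, here is a minimal sketch in which a temporary directory stands in for the computer environment. Every name in it (`Task`, `run_task`, `runnable_tasks`, the `needs_google_drive` flag, and the toy tasks and agents) is a hypothetical illustration and not OSWorld's actual API or task format.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    task_id: str                    # hypothetical identifier
    needs_google_drive: bool        # hypothetical flag for network-dependent tasks
    setup: Callable[[Path], None]   # prepares the initial state
    check: Callable[[Path], bool]   # inspects the final state after the agent acts


def run_task(task: Task, agent: Callable[[Path], None]) -> bool:
    """Set up a fresh workspace, let the agent act, then check the resulting state."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        task.setup(workspace)
        agent(workspace)
        return task.check(workspace)


def runnable_tasks(tasks: list[Task]) -> list[Task]:
    """Drop tasks that depend on Google Drive access."""
    return [t for t in tasks if not t.needs_google_drive]


# Toy task (an assumption for illustration): rename notes.txt to todo.txt.
def setup_rename(ws: Path) -> None:
    (ws / "notes.txt").write_text("buy milk")


def check_rename(ws: Path) -> bool:
    return (ws / "todo.txt").exists() and not (ws / "notes.txt").exists()


rename_task = Task("rename-file", False, setup_rename, check_rename)
drive_task = Task("drive-upload", True, lambda ws: None, lambda ws: False)


def toy_agent(ws: Path) -> None:
    """Stand-in agent that performs the rename."""
    (ws / "notes.txt").rename(ws / "todo.txt")


def idle_agent(ws: Path) -> None:
    """Stand-in agent that does nothing, so the check should fail."""


print(run_task(rename_task, toy_agent))
print(run_task(rename_task, idle_agent))
print([t.task_id for t in runnable_tasks([rename_task, drive_task])])
```

The point of the design is that the checker inspects the state left behind rather than the agent's sequence of actions, so any route to the correct end state counts as success. Filtering out the 8 Google Drive tasks in the same way would leave 369 - 8 = 361 tasks.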
Abstract
Autonomous agents that accomplish complex computer tasks with minimal human
interventions have the potential to transform human-computer interaction, significantly enhancing
accessibility and productivity. However, existing benchmarks
either lack an interactive environment or are limited to environments specific to
certain applications or domains, failing to reflect the diverse and complex nature of real-world computer
use, thereby limiting the scope of tasks and agent
scalability. To address this issue, we introduce **OSWorld**, the first-of-its-kind
scalable, real computer environment for multimodal agents, supporting task setup,
execution-based evaluation, and interactive learning across various operating systems such as Ubuntu,
Windows, and macOS. **OSWorld** can serve as a unified,
integrated computer environment for as
... (truncated, 17 KB total)
Resource ID: c819ef71cbf34802 | Stable ID: sid_45L3hYpyno