Longterm Wiki

OSWorld

Agentic

A benchmark that evaluates multimodal agents on real-world computer tasks across operating systems, measuring GUI interaction and task completion.

Models Tested: 2
Best Score: 72.7%
Median Score: 67.05%
Scoring: percentage
Introduced: 2024-04

Leaderboard (2 models)

# | Model | Developer | Score
🥇 | Claude Opus 4.6 | Anthropic | 72.7%
🥈 | Claude Sonnet 4.5 | Anthropic | 61.4%
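
The summary figures on this page can be recomputed from the leaderboard rows. The following is a minimal sketch, not the wiki's actual implementation: the `LEADERBOARD` list and the `summarize` helper are hypothetical names introduced for illustration, and the sketch assumes the median is the standard statistical median (with two entries, the mean of the two scores).

```python
from statistics import median

# Leaderboard rows copied from this page: (model, developer, score in percent).
LEADERBOARD = [
    ("Claude Opus 4.6", "Anthropic", 72.7),
    ("Claude Sonnet 4.5", "Anthropic", 61.4),
]


def summarize(rows):
    """Return the number of models, the best score, and the median score."""
    scores = [score for _model, _developer, score in rows]
    return {
        "models_tested": len(scores),
        "best_score": max(scores),
        # For an even number of entries, median() averages the two middle values.
        "median_score": median(scores),
    }


summary = summarize(LEADERBOARD)
print(f"Models Tested: {summary['models_tested']}")
print(f"Best Score: {summary['best_score']:.1f}%")
print(f"Median Score: {summary['median_score']:.2f}%")
```

With only two models listed, the median falls halfway between the two scores, which is why it sits below the best score by half the gap between them.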