OSWorld
Agentic
A benchmark for multimodal agents on real-world computer tasks across operating systems, testing GUI interaction and task completion.
Models Tested: 2
Best Score: 72.7%
Median Score: 67.05%
Scoring: percentage
Introduced: 2024-04
Leaderboard (2 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude Opus 4.6 | Anthropic | 72.7% |
| 🥈 | Claude Sonnet 4.5 | Anthropic | 61.4% |
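
The summary figures (models tested, best score, median score) appear to follow directly from the leaderboard rows. The sketch below shows one way to recompute them, assuming the median is the ordinary statistical median, which for two entries is the mean of the two scores. The variable names are illustrative and not part of any OSWorld tooling.

```python
from statistics import median

# Scores copied from the leaderboard table (percentages).
scores = {
    "Claude Opus 4.6": 72.7,
    "Claude Sonnet 4.5": 61.4,
}

# Number of leaderboard entries.
models_tested = len(scores)

# Highest-scoring model and its score.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# With an even number of entries, median() averages the two middle values.
# Rounded to two decimals to avoid floating-point noise in the display.
median_score = round(median(scores.values()), 2)

print(f"Models Tested: {models_tested}")
print(f"Best Score: {best_score}% ({best_model})")
print(f"Median Score: {median_score}%")
```

With only two models listed, the median sits midway between the two scores, (72.7 + 61.4) / 2 = 67.05, so it will shift noticeably as further entries are added.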