Terminal-Bench Hard
Agentic
A benchmark evaluating AI agents on complex terminal-based tasks requiring multi-step reasoning and system administration skills.
Models Tested: 1
Best Score: 44%
Median Score: 44%
Scoring: percentage
Introduced: 2025-01
Leaderboard: 1 model
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude Opus 4.5 | Anthropic | 44% |