Longterm Wiki

Terminal-Bench Hard

Agentic

A benchmark evaluating AI agents on complex terminal-based tasks requiring multi-step reasoning and system administration skills.

Models Tested: 1
Best Score: 44%
Median Score: 44%
Scoring: percentage
Introduced: 2025-01

Leaderboard (1 model)

# | Model | Developer | Score
🥇 | Claude Opus 4.5 | Anthropic | 44%
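
With only one model on the leaderboard, the best and median scores are the same number. The sketch below shows one way the summary figures could be derived from the leaderboard rows. The `summarize` helper and the row layout are hypothetical, chosen for illustration; the page does not describe how the site actually computes these values.

```python
from statistics import median

# Leaderboard rows as listed on this page: (model, developer, score in percent).
leaderboard = [
    ("Claude Opus 4.5", "Anthropic", 44.0),
]


def summarize(rows):
    """Hypothetical helper: derive summary statistics from leaderboard rows."""
    scores = [score for _, _, score in rows]
    return {
        "models_tested": len(scores),
        "best_score": max(scores),
        "median_score": median(scores),
    }


summary = summarize(leaderboard)
print(summary)
```

Once more models are added, the two figures would generally diverge: the best score tracks the top entry, while the median tracks the middle of the field.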