Terminal-Bench Hard
Agentic
A benchmark evaluating AI agents on complex terminal-based tasks that require multi-step reasoning and system administration skills.
Models Tested: 0
Scoring: percentage
Introduced: 2025-01
No model scores recorded for this benchmark yet.