Terminal-Bench Hard

Agentic

A benchmark that evaluates AI agents on complex terminal-based tasks that require multi-step reasoning and system administration skills.

Models Tested

Scoring: percentage

Introduced: 2025-01

No model scores recorded for this benchmark yet.