Longterm Wiki

SWE-bench Verified

Coding

A 500-instance, human-verified subset of SWE-bench for evaluating AI systems on real-world software-engineering tasks drawn from GitHub issues.

Models tested: 0
Scoring: percentage
Introduced: 2024-08
Maintainer: OpenAI / Princeton NLP
No model scores recorded for this benchmark yet.
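The "percentage" scoring above is the share of task instances whose generated patch makes the previously failing tests pass ("resolved"). A minimal sketch of that calculation, using hypothetical per-instance results rather than the official SWE-bench evaluation harness:

```python
# Hypothetical per-instance outcomes; real runs come from the SWE-bench harness,
# which applies each model-generated patch and re-runs the repo's test suite.
results = {
    "django__django-11099": True,    # patch resolved the issue
    "sympy__sympy-13480": False,     # patch did not make the tests pass
    "astropy__astropy-6938": True,
}

resolved = sum(results.values())
score = 100.0 * resolved / len(results)
print(f"{score:.1f}% resolved")  # → 66.7% resolved
```

A leaderboard entry for this benchmark would record that single percentage per model.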