Longterm Wiki

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Data Status

Full text fetched (Dec 28, 2025)

Summary

OpenAI collaborated with software developers to improve the SWE-bench benchmark by identifying and filtering out problematic test samples. The resulting SWE-bench Verified provides a more reliable evaluation of AI models' software engineering skills.

Key Points

  • Human-validated benchmark that addresses limitations in the original SWE-bench dataset
  • 68.3% of the 1,699 screened samples filtered out for underspecified issues, unfair unit tests, or unreliable environments
  • Higher scores on the verified set suggest the original benchmark underestimated AI capabilities

Review

OpenAI's SWE-bench Verified is a human-validated revision of the SWE-bench benchmark for evaluating AI models on software engineering tasks. Working with 93 professional software developers, OpenAI screened 1,699 samples and identified issues in the original benchmark that could cause it to systematically underestimate models' capabilities. The key problems were underspecified issue descriptions, overly specific or unrelated unit tests, and unreliable development environment setups.

Each sample was labeled three times across multiple criteria, including problem specification clarity, test validity, and task difficulty. This annotation process filtered out 68.3% of the screened samples and produced a 500-sample dataset. GPT-4o's score rose from 16% on the original benchmark to 33.2% on the verified dataset, which suggests that a substantial share of earlier failures came from flawed samples rather than from limits in model capability.

The work highlights the need to keep auditing and improving AI evaluation benchmarks so that reported scores reflect actual capability.
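The figures quoted in this review can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the retained count is derived from the percentages above rather than stated in the source. This summary also does not describe how the final 500 samples were chosen from those that passed screening.

```python
# Figures taken from the review text above.
SCREENED_SAMPLES = 1699      # samples annotated by the developers
FILTERED_FRACTION = 0.683    # share of screened samples filtered out
VERIFIED_SIZE = 500          # size of the released verified dataset
SCORE_ORIGINAL = 16.0        # GPT-4o, original SWE-bench (%)
SCORE_VERIFIED = 33.2        # GPT-4o, SWE-bench Verified (%)


def retained_after_filtering(total: int, filtered_fraction: float) -> int:
    """Approximate number of samples that survive the filter (derived, not quoted)."""
    return round(total * (1 - filtered_fraction))


def relative_gain(before: float, after: float) -> float:
    """Ratio of the new score to the old score."""
    return after / before


retained = retained_after_filtering(SCREENED_SAMPLES, FILTERED_FRACTION)
gain = relative_gain(SCORE_ORIGINAL, SCORE_VERIFIED)

print(f"Samples passing the screen (approx.): {retained}")
print(f"Score ratio, verified vs. original: {gain:.3f}x")
```

Under these numbers, roughly 539 screened samples pass the filter, which is consistent with a 500-sample release, and the reported GPT-4o score slightly more than doubles.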

Cited by 4 pages

Page                           Type        Quality
Agentic AI                     Capability  68.0
Long-Horizon Autonomous Tasks  Capability  65.0
Tool Use and Computer Use      Capability  67.0
Minimal Scaffolding            Capability  52.0
Resource ID: e1f512a932def9e2 | Stable ID: Nzg3MThkY2