Arena-Hard-Auto
General
An automated benchmark that uses GPT-4 as a judge for pairwise model comparisons, designed to approximate Chatbot Arena rankings without human evaluation.
Models Tested: 0
Scoring: percentage
Introduced: 2024-04
Maintainer: LMArena
No model scores recorded for this benchmark yet.
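
As a rough illustration of how pairwise judge verdicts could be turned into a percentage score, here is a minimal sketch. It assumes each comparison yields one of three hypothetical labels ("win", "tie", "loss") from the candidate model's point of view and counts a tie as half a win. The function name and labels are made up for this example, and the benchmark's actual aggregation may differ from this simple win rate.

```python
from collections import Counter

# Hypothetical verdict labels, seen from the candidate model's perspective.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate_percentage(verdicts):
    """Return the candidate's win rate as a percentage in [0, 100].

    A tie counts as half a win. This is a simplified stand-in for
    illustration, not the benchmark's own scoring procedure.
    """
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one verdict is required")
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / total


if __name__ == "__main__":
    # Made-up verdicts for four prompts, purely for demonstration.
    sample = ["win", "win", "tie", "loss"]
    print(f"{win_rate_percentage(sample):.1f}%")
```

A score of 50% under this scheme would mean the candidate is judged on par with the model it is compared against.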