Longterm Wiki

Arena-Hard-Auto

General
An automated benchmark that uses GPT-4 as a judge for pairwise model comparisons, designed to approximate Chatbot Arena rankings without human evaluation.
Models Tested: 0
Scoring: percentage
Introduced: 2024-04
Maintainer: LMArena
No model scores recorded for this benchmark yet.
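
As a rough illustration of how a percentage score could be derived from pairwise judge verdicts, the sketch below computes a simple win rate for a candidate model against a baseline. It is a simplified assumption, not the benchmark's actual scoring pipeline: the function name `win_rate_percentage`, the verdict labels, and the choice to count a tie as half a win are all hypothetical.

```python
from collections import Counter

# Hypothetical verdict labels, from the candidate model's perspective.
VALID_VERDICTS = {"win", "loss", "tie"}


def win_rate_percentage(verdicts):
    """Return the candidate's win rate as a percentage in [0, 100].

    Assumption for this sketch: each tie counts as half a win.
    """
    verdicts = list(verdicts)
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    if not verdicts:
        raise ValueError("at least one verdict is required")

    counts = Counter(verdicts)
    # Wins score 1, ties score 0.5, losses score 0.
    points = counts["win"] + 0.5 * counts["tie"]
    return 100.0 * points / len(verdicts)


if __name__ == "__main__":
    # Two wins, one loss, one tie out of four comparisons.
    print(win_rate_percentage(["win", "win", "loss", "tie"]))  # 62.5
```

A score of 50 under this scheme would mean the judge rates the candidate as evenly matched with the baseline.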