Arena-Hard-Auto

General

An automated benchmark that uses GPT-4-Turbo as a judge to compare each model's responses pairwise against those of a baseline model, designed to approximate Chatbot Arena rankings without human evaluation.
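
As a rough illustration of how pairwise judge verdicts can be reduced to a percentage score, the sketch below computes a simple win rate against a baseline. It is a simplification, not the benchmark's actual scoring pipeline: the function name `win_rate_percent`, the verdict labels, and the choice to count a tie as half a win are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical verdict labels for this sketch; the real judge output format may differ.
VALID_VERDICTS = {"win", "loss", "tie"}


def win_rate_percent(verdicts):
    """Return the candidate model's win rate against the baseline, in percent.

    verdicts: iterable of "win", "loss", or "tie", one per judged prompt.
    A tie is counted as half a win (an assumption of this sketch).
    """
    verdicts = list(verdicts)
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    if not verdicts:
        raise ValueError("no verdicts to score")

    counts = Counter(verdicts)
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


if __name__ == "__main__":
    # Two wins, one loss, one tie over four prompts.
    print(win_rate_percent(["win", "win", "loss", "tie"]))  # 62.5
```

A model that matches the baseline on every prompt would score 50 under this convention, which is why scores on pairwise benchmarks are usually read relative to that midpoint.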

Models Tested

Scoring: percentage (win rate against the baseline model)

Introduced: 2024-04

Maintainer: LMArena

No model scores recorded for this benchmark yet.