Longterm Wiki

AgentBench

paper

Authors

Xiao Liu·Hao Yu·Hanchen Zhang·Yifan Xu·Xuanyu Lei·Hanyu Lai·Yu Gu·Hangliang Ding·Kaiwen Men·Kejuan Yang·Shudan Zhang·Xiang Deng·Aohan Zeng·Zhengxiao Du·Chenhui Zhang·Sheng Shen·Tianjun Zhang·Yu Su·Huan Sun·Minlie Huang·Yuxiao Dong·Jie Tang

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

AgentBench is a multi-dimensional benchmark for evaluating large language models as autonomous agents across 8 interactive environments, directly relevant to AI safety research on agent reliability, reasoning robustness, and decision-making capabilities in complex tasks.

Paper Details

Citations
652
43 influential
Year
2023
Methodology
peer-reviewed
Categories
Proceedings of the AAAI Conference on Artificial Intelligence

Metadata

arXiv preprint · primary source

Abstract

The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Improving instruction following and training on high-quality multi-round alignment data could improve agent performance. Contrary to existing assumptions, training on code presents ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

Summary

AgentBench is a comprehensive multi-dimensional benchmark designed to evaluate Large Language Models (LLMs) as autonomous agents across 8 distinct interactive environments. The study evaluates both API-based and open-source LLMs, revealing significant performance gaps between top commercial models and open-source alternatives up to 70B parameters. The research identifies key failure modes—poor long-term reasoning, weak decision-making, and inadequate instruction following—and proposes that improvements in instruction following and high-quality multi-round alignment training could enhance agent performance. Notably, the findings challenge conventional assumptions about code training's universal benefits for agent tasks.
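The multi-turn, open-ended agent evaluation that such benchmarks formalize boils down to a loop in which a model observes an environment state, emits an action, and either completes the task or runs out of turns. The sketch below illustrates that loop with a toy environment; all names here (`CountdownEnv`, `run_episode`, `hesitant_agent`) are hypothetical illustrations, not the actual AgentBench API.

```python
# Minimal sketch of a multi-turn agent-environment evaluation loop.
# All class and function names are illustrative, not the AgentBench API.
from dataclasses import dataclass, field


@dataclass
class CountdownEnv:
    """Toy environment: the agent succeeds by outputting 'done' within max_turns."""
    max_turns: int = 4
    history: list = field(default_factory=list)

    def observe(self) -> str:
        # The observation the agent sees at each turn.
        return f"turn {len(self.history)}: say 'done' to finish"

    def step(self, action: str) -> bool:
        # Record the action; return True when the task is solved.
        self.history.append(action)
        return action.strip().lower() == "done"


def run_episode(env, agent) -> dict:
    """Drive one episode; report success and the number of turns used."""
    for turn in range(env.max_turns):
        action = agent(env.observe())
        if env.step(action):
            return {"success": True, "turns": turn + 1}
    return {"success": False, "turns": env.max_turns}


def hesitant_agent():
    """An 'agent' that stalls once before complying, mimicking the
    instruction-following failures the paper highlights."""
    state = {"calls": 0}

    def act(observation: str) -> str:
        state["calls"] += 1
        return "thinking..." if state["calls"] == 1 else "done"

    return act


result = run_episode(CountdownEnv(), hesitant_agent())
print(result)  # {'success': True, 'turns': 2}
```

Scoring an agent then reduces to aggregating `success` and `turns` over many episodes per environment, which is essentially how per-environment scores like those in the table below can be produced.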

Cited by 2 pages

Page                           Type        Quality
Long-Horizon Autonomous Tasks  Capability  65.0
Minimal Scaffolding            Capability  52.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
[2308.03688] AgentBench: Evaluating LLMs as Agents 
AgentBench: Evaluating LLMs as Agents

 
 
Xiao Liu (Tsinghua University) · Hao Yu (Tsinghua University) · Hanchen Zhang (Tsinghua University) · Yifan Xu (Tsinghua University) · Xuanyu Lei (Tsinghua University) · Hanyu Lai (Tsinghua University) · Yu Gu (The Ohio State University) · Hangliang Ding (Tsinghua University) · Kaiwen Men (Tsinghua University) · Kejuan Yang (Tsinghua University) · Shudan Zhang (Tsinghua University) · Xiang Deng (The Ohio State University) · Aohan Zeng (Tsinghua University) · Zhengxiao Du (Tsinghua University) · Chenhui Zhang (Tsinghua University) · Sheng Shen (UC Berkeley) · Tianjun Zhang (UC Berkeley) · Yu Su (The Ohio State University) · Huan Sun (The Ohio State University) · Minlie Huang (Tsinghua University) · Yuxiao Dong (Tsinghua University) · Jie Tang (Tsinghua University)

 
 Abstract

 Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments.
We present AgentBench , a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities in a multi-turn open-ended generation setting.
Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and OSS competitors.
We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents.
Training on code and high quality multi-turn alignment data could improve agent performance.
Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

 
Footnotes: (1) XL and HY are lead authors that contributed equally. Email: {shawliu9,longinyh}@gmail.com. (2) Work partially done when HY, YG visited Tsinghua University. (3) Website for AgentBench leaderboard & demos: https://llmbench.ai/agent
 
 Figure 1: An overview of LLMs on AgentBench . While LLMs begin to manifest their proficiency in LLM-as-Agent, gaps between models and the distance toward practical usability are significant. 
 
 
 
 1 Introduction

 
 Intelligent agents and autonomous entities  (Searle, 1970 ; Maes, 1994 ; Wooldridge & Jennings, 1995 ) that are capable of decision-making and action execution in particular environments have been key concepts of artificia

... (truncated, 98 KB total)
Resource ID: d234ade2718a748e | Stable ID: N2M4MzIyMj