Longterm Wiki

Vectara, "Hallucination Leaderboard" (https://github.com/vectara/hallucination-leaderboard)

web

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: GitHub

Relevant to AI safety discussions around reliability and trustworthiness of LLMs; hallucination is a key failure mode affecting safe deployment in high-stakes contexts.

Metadata

Importance: 62/100 · Type: tool page

Summary

A public leaderboard that benchmarks large language models on their tendency to hallucinate or introduce factual inconsistencies when summarizing documents. It provides a standardized evaluation framework comparing models on 'groundedness' — how faithfully they summarize source material without fabricating information. The leaderboard is regularly updated as new models are released.

Key Points

  • Evaluates LLMs on hallucination rate using a summarization task, measuring how often models introduce facts not present in the source document.
  • Uses a dataset of news articles to test whether model outputs are faithful to source content, providing a consistent cross-model benchmark.
  • Ranks major commercial and open-source models (e.g., GPT-4, Claude, Llama) by hallucination frequency, enabling direct comparison.
  • Highlights that even top-tier models hallucinate at non-trivial rates, underscoring reliability concerns for real-world deployment.
  • Serves as a practical tool for practitioners selecting models where factual accuracy and trustworthiness are critical requirements.
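To make the leaderboard's three headline metrics concrete, here is a minimal sketch of how they could be derived from per-summary factual-consistency scores produced by a judge model such as HHEM. The 0.5 consistency threshold, the function name, and the handling of refusals are illustrative assumptions, not Vectara's published implementation; the only relationship taken from the leaderboard itself is that hallucination rate and factual consistency rate are complements.

```python
# Illustrative sketch (not Vectara's actual code): derive leaderboard-style
# metrics from per-document consistency scores in [0, 1]. A score of None
# stands for a prompt the model declined to summarize.

def leaderboard_metrics(scores, threshold=0.5):
    """Compute answer rate, factual consistency rate, and hallucination rate.

    threshold: assumed cutoff below which a summary counts as hallucinated.
    """
    answered = [s for s in scores if s is not None]
    if not answered:
        return {"answer_rate": 0.0,
                "factual_consistency_rate": None,
                "hallucination_rate": None}
    consistent = sum(1 for s in answered if s >= threshold)
    fcr = consistent / len(answered)
    return {
        # Fraction of prompts the model actually answered.
        "answer_rate": len(answered) / len(scores),
        # Percent of answered summaries judged faithful to the source.
        "factual_consistency_rate": round(100 * fcr, 1),
        # By construction, the complement of the consistency rate.
        "hallucination_rate": round(100 * (1 - fcr), 1),
    }

print(leaderboard_metrics([0.9, 0.8, 0.3, None, 0.7]))
```

Note that hallucination rates are computed only over answered prompts, which is why a model with a low answer rate (e.g. snowflake-arctic-instruct at 62.7 %) can still post a competitive hallucination rate.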

Cited by 1 page

Page | Type | Quality
Large Language Models | Capability | 60.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 24 KB
GitHub - vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
vectara / hallucination-leaderboard · Public · 3.2k stars · 100 forks
main branch · 442 commits · Files: img/, CITATION.cff, LICENSE, README.md

 Hallucination Leaderboard

 
 Public LLM leaderboard computed using Vectara's Hallucination Evaluation Model, also known as HHEM. This evaluates how often an LLM introduces hallucinations when summarizing a document. We plan to update this regularly as our model and the LLMs get updated over time.

 Feel free to check out the interactive hallucination leaderboard on Hugging Face.

If you are interested in previous versions of this leaderboard:

 • The first version, based on HHEM-1.0, is available here.
 • The most recent version, based on the previous dataset, is available here.

In loving memory of Simon Mark Hughes ...
 Last updated on March 20, 2026

| Model | Hallucination Rate | Factual Consistency Rate | Answer Rate | Average Summary Length (Words) |
|---|---|---|---|---|
| antgroup/finix_s1_32b | 1.8 % | 98.2 % | 99.5 % | 172.4 |
| openai/gpt-5.4-nano-2026-03-17 | 3.1 % | 96.9 % | 100.0 % | 144.4 |
| google/gemini-2.5-flash-lite | 3.3 % | 96.7 % | 99.5 % | 95.7 |
| microsoft/Phi-4 | 3.7 % | 96.3 % | 80.7 % | 120.9 |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | 4.1 % | 95.9 % | 99.5 % | 64.6 |
| snowflake/snowflake-arctic-instruct | 4.3 % | 95.7 % | 62.7 % | 81.4 |
| google/gemma-3-12b-it | 4.4 % | 95.6 % | 97.4 % | 89.7 |
| mistralai/mistral-large-2411 | 4.5 % | 95.5 % | 99.9 % | 85.0 |
| qwen/qwen3-8b | 4.8 % | 95.2 % | 99.9 % | 83.6 |
| amazon/nova-pro-v1:0 | 5.1 % | 94.9 % | 99.3 % | 66.2 |
| amazon/nova-2-lite-v1:0 | 5.1 % | 94.9 % | 99.6 % | 94.1 |
| mistralai/mistral-small-2501 | 5.1 % | 94.9 % | 97.9 % | 98.8 |
| ibm-granite/granite-4.0-h-small | 5.2 % | 94.8 % | 100.0 % | 107.4 |
| ai21labs/jamba-mini-2 | 5.3 % | 94.7 % | 99.6 % | 109 |

... (truncated, 24 KB total)
Resource ID: b44f883dc65dd0e9 | Stable ID: OWJmYTEwNj