Longterm Wiki

Deepfake-Eval-2024 benchmark

paper

Authors

Nuria Alina Chandra·Ryan Murtfeldt·Lin Qiu·Arnab Karmakar·Hannah Lee·Emmanuel Tanumihardja·Kevin Farhat·Ben Caffee·Sejin Paik·Changyeon Lee·Jongwook Choi·Aerin Kim·Oren Etzioni

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI safety practitioners concerned with misuse of generative AI for disinformation and fraud; highlights how rapidly deepfake technology outpaces detection, with implications for trust, verification, and governance of synthetic media.

Paper Details

Citations: 40 (8 influential)
Year: 2025

Metadata

Importance: 62/100 · arXiv preprint · dataset

Abstract

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.

Summary

Deepfake-Eval-2024 introduces a large-scale benchmark of in-the-wild deepfakes collected from social media and deepfake detection platforms in 2024, revealing that state-of-the-art detectors suffer dramatic performance drops (45-50% AUC decreases) relative to academic benchmarks. The dataset spans 88 websites and 52 languages, and includes video, audio, and images produced with the latest manipulation technologies. Commercial and fine-tuned models improve over open-source baselines but still fall short of human forensic analysts.

Key Points

  • State-of-the-art open-source deepfake detectors drop ~50% AUC on real-world data vs. academic benchmarks, exposing a critical generalization gap.
  • Dataset includes 45 hours of video, 56.5 hours of audio, and 1,975 images collected from 88 websites in 52 languages in 2024.
  • Commercial and fine-tuned models outperform off-the-shelf open-source solutions but still cannot match the accuracy of human forensic analysts.
  • Academic benchmarks are shown to be significantly out of date and non-representative of modern deepfake generation technologies.
  • The benchmark is publicly available, enabling reproducible evaluation of deepfake detection systems against current real-world threats.
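The key points above are stated in terms of ROC AUC. As a generic illustration (not code from the paper), the following sketch computes AUC from scratch for a binary deepfake detector, using made-up scores to show what "near-chance AUC on in-the-wild data" looks like:

```python
# Minimal sketch of ROC AUC for a binary deepfake detector.
# Labels: 1 = fake, 0 = real; scores = detector's fake-probability.
# Scores below are hypothetical, for illustration only.

def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen fake is scored higher than a randomly chosen real
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A detector that separates perfectly on a clean academic benchmark...
academic = ([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1, 0.3])
# ...but whose scores overlap heavily on in-the-wild media.
in_the_wild = ([1, 1, 1, 0, 0, 0], [0.6, 0.4, 0.5, 0.5, 0.3, 0.7])

print(roc_auc(*academic))     # 1.0: perfect separation
print(roc_auc(*in_the_wild))  # 0.5: no better than chance
```

A drop from 1.0 toward 0.5 is the kind of collapse the benchmark reports: 0.5 means the detector's scores carry no information about which samples are fake.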

Cited by 2 pages

Page | Type | Quality
Epistemic Collapse | Risk | 49.0
AI-Powered Fraud | Risk | 69.0

Cached Content Preview

HTTP 200 · Fetched Apr 10, 2026 · 67 KB
Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

 
 
Nuria Alina Chandra (TrueMedia.org); Ryan Murtfeldt (TrueMedia.org; University of Washington, Seattle); Lin Qiu (TrueMedia.org; University of Washington, Seattle); Arnab Karmakar (TrueMedia.org; University of Washington, Seattle); Hannah Lee (TrueMedia.org); Emmanuel Tanumihardja (TrueMedia.org; University of Washington, Seattle); Kevin Farhat (TrueMedia.org; University of Washington, Seattle); Ben Caffee (TrueMedia.org; University of Washington, Seattle); Sejin Paik (TrueMedia.org; Georgetown University, Washington D.C.); Changyeon Lee (Miraflow AI; Yonsei University, Seoul); Jongwook Choi (TrueMedia.org; Chung-Ang University, Seoul); Aerin Kim (TrueMedia.org; Miraflow AI); Oren Etzioni (TrueMedia.org; University of Washington, Seattle)
 

 
 
Figure 1: Examples of Deepfake-Eval-2024 video and audio (rows 1–2), and images (rows 3–4), demonstrating a diversity of content styles and generation techniques, including lipsync, faceswap, and diffusion. Images have been resized for presentation.
 
 
 
 1 Introduction

 
Advances in generative AI models have precipitated a surge of highly realistic deepfakes, which have been used to fabricate messages from politicians [1], create non-consensual pornographic content [2], spread misinformation [3], and damage reputations [4], harming lives, businesses, and nations [5]. Between 2023 and 2024, there was

... (truncated, 67 KB total)
Resource ID: f39c2cc4c0f303cc | Stable ID: sid_bmdkuj7uLa