Deepfakes research by the University of Washington
grail.cs.washington.edu/projects/AudioToObama/
This early deepfake research paper is foundational for understanding the risks of AI-generated synthetic media. It predates widespread public awareness of deepfakes and illustrates how academic capability research can carry significant dual-use implications for information integrity and democratic processes.
Metadata
Importance: 72/100 · tool page · primary source
Summary
This SIGGRAPH 2017 paper from the University of Washington demonstrates a technique for synthesizing photorealistic video of a person speaking by mapping audio features to mouth shapes using a recurrent neural network, trained on hours of Obama's weekly address footage. The system produces high-quality lip-synced video composited with accurate 3D pose matching, representing an early landmark in what became known as deepfake technology.
Key Points
- Uses an RNN to learn mappings from raw audio features to mouth shapes, enabling realistic lip-sync video synthesis of a specific individual.
- Trained on many hours of publicly available video footage, demonstrating that large amounts of single-subject data can enable convincing face manipulation.
- Produces photorealistic results by synthesizing mouth texture and compositing it into target video clips with 3D pose matching.
- Represents a foundational academic contribution to deepfake/synthetic media technology, with significant dual-use implications for disinformation.
- Compared favorably to contemporaneous face-reenactment methods such as Face2Face (Thies et al., 2016), establishing a new quality benchmark.
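The pipeline the key points describe (per-frame audio features → recurrent network → mouth-shape coefficients → texture synthesis and compositing) can be sketched as a toy single-layer RNN regressor. This is a minimal illustration, not the paper's actual architecture: all names, dimensions, and the placeholder compositing step are assumptions.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's values):
AUDIO_DIM = 13    # e.g. 13 MFCC coefficients per audio frame
HIDDEN_DIM = 32   # recurrent hidden state size
MOUTH_DIM = 18    # e.g. PCA coefficients of mouth landmarks

rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (HIDDEN_DIM, AUDIO_DIM))   # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (HIDDEN_DIM, HIDDEN_DIM))  # hidden-to-hidden weights
W_hy = rng.normal(0, 0.1, (MOUTH_DIM, HIDDEN_DIM))   # hidden-to-output weights

def audio_to_mouth_shapes(audio_frames: np.ndarray) -> np.ndarray:
    """Map a (T, AUDIO_DIM) sequence of audio features to a (T, MOUTH_DIM)
    sequence of mouth-shape coefficients via a simple tanh recurrence."""
    h = np.zeros(HIDDEN_DIM)
    outputs = []
    for x in audio_frames:
        h = np.tanh(W_xh @ x + W_hh @ h)  # recurrent state update
        outputs.append(W_hy @ h)          # per-frame mouth-shape prediction
    return np.stack(outputs)

def composite_frame(mouth_shape: np.ndarray, target_frame: np.ndarray) -> np.ndarray:
    # Placeholder for the downstream stages (mouth texture synthesis and
    # 3D-pose-matched compositing into the target frame); a no-op here.
    return target_frame

# Run the sketch on 100 frames of synthetic "audio features".
audio = rng.normal(size=(100, AUDIO_DIM))
shapes = audio_to_mouth_shapes(audio)
print(shapes.shape)  # (100, 18)
```

The recurrence is the key design point the paper's abstract emphasizes: mouth shape at each time instant depends on audio context, not just the current frame, which is why a recurrent (rather than purely frame-wise) model is used.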
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Disinformation | Risk | 54.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 2 KB
Synthesizing Obama: Learning Lip Sync from Audio
SIGGRAPH 2017 — Supasorn Suwajanakorn, Steven M. Seitz, Ira Kemelmacher-Shlizerman

Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.

Supplementary Video · Publication: SIGGRAPH 2017 Paper

Training Videos — a list of YouTube videos used for training our recurrent neural network: obama_addresses.txt

- Video A - Teaser: Input Audio: nIxM8rL5GVE (0:10 - 1:16); Target Video: 3vPdtajOJfw
- Video B - Comparison to Face2Face [Thies et al. 2016]: Input Audio: nIxM8rL5GVE (0:10 - 0:25); Target Video: k4OZOTaf3lk
- Video C - Method Pipeline: Input Audio: deF-f0OqvQ4 (1:37 - 2:14); Target Video: 25GOnaY8ZCY
- Video D - Target Video Retiming: Input Audio: nIxM8rL5GVE (2:02 - 2:23); Target Video: 25GOnaY8ZCY
- Video E - Weekly Address Speech (4-Obama): Input Audio 1: nIxM8rL5GVE (3:53 - 4:20); Input Audio 2: WtOhZ--YeFY (0:58 - 1:31); Target Videos: Top-Left: k4OZOTaf3lk, Top-Right: E3gfMumXCjI, Bottom-Left: 3vPdtajOJfw, Bottom-Right: 25GOnaY8ZCY
- Video F - Non-address Speech: Input Audio 1: Steve Harvey's (0:45 - 1:14), Target Video 1: k4OZOTaf3lk; Input Audio 2: 60 Minutes Interview (1:12 - 1:31), Target Video 2: 3vPdtajOJfw; Input Audio 3: The View (15:48 - 16:08), Target Video 3: k4OZOTaf3lk; Input Audio 4: Obama in 1990 (0:00 - 0:21), Target Video 4: 3vPdtajOJfw; Input Audio 5: Impressionist, Ryan Goldsher (1:18 - 1:36), Target Video 5: EAZIHIiuhrc
- Video G - Speech Summarization: Input Audio: deF-f0OqvQ4; Target Video: 25GOnaY8ZCY

Last modified:
Resource ID:
62e052ee54819423 | Stable ID: sid_zFVXCDpJVn