Deepfakes research by the University of Washington
grail.cs.washington.edu/projects/AudioToObama/
This early deepfake research paper is foundational for understanding the risks of AI-generated synthetic media. It predates widespread public awareness of deepfakes and illustrates how academic capability research can carry significant dual-use implications for information integrity and democratic processes.
Metadata
Importance: 72/100 · tool page · primary source
Summary
This SIGGRAPH 2017 paper from the University of Washington demonstrates a technique for synthesizing photorealistic video of a person speaking by mapping audio features to mouth shapes using a recurrent neural network, trained on hours of Obama's weekly address footage. The system produces high-quality lip-synced video composited with accurate 3D pose matching, representing an early landmark in what became known as deepfake technology.
Key Points
- Uses an RNN to learn mappings from raw audio features to mouth shapes, enabling realistic lip-sync video synthesis of a specific individual.
- Trained on many hours of publicly available video footage, demonstrating that large amounts of single-subject data can enable convincing face manipulation.
- Produces photorealistic results by synthesizing mouth texture and compositing it into target video clips with 3D pose matching.
- Represents a foundational academic contribution to deepfake/synthetic media technology, with significant dual-use implications for disinformation.
- Compared favorably to contemporaneous face-reenactment methods such as Face2Face (Thies et al., 2016), establishing a new quality benchmark.
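The pipeline the key points describe (per-frame audio features → recurrent network → mouth-shape coefficients → texture synthesis and compositing) can be sketched as a toy single-layer RNN regressor. This is a minimal illustration, not the paper's actual architecture: all names, dimensions, and the placeholder compositing step are assumptions.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's values):
AUDIO_DIM = 13    # e.g. 13 MFCC coefficients per audio frame
HIDDEN_DIM = 32   # recurrent hidden state size
MOUTH_DIM = 18    # e.g. PCA coefficients of mouth landmarks

rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (HIDDEN_DIM, AUDIO_DIM))   # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (HIDDEN_DIM, HIDDEN_DIM))  # hidden-to-hidden weights
W_hy = rng.normal(0, 0.1, (MOUTH_DIM, HIDDEN_DIM))   # hidden-to-output weights

def audio_to_mouth_shapes(audio_frames: np.ndarray) -> np.ndarray:
    """Map a (T, AUDIO_DIM) sequence of audio features to a (T, MOUTH_DIM)
    sequence of mouth-shape coefficients via a simple tanh recurrence."""
    h = np.zeros(HIDDEN_DIM)
    outputs = []
    for x in audio_frames:
        h = np.tanh(W_xh @ x + W_hh @ h)  # recurrent state update
        outputs.append(W_hy @ h)          # per-frame mouth-shape prediction
    return np.stack(outputs)

def composite_frame(mouth_shape: np.ndarray, target_frame: np.ndarray) -> np.ndarray:
    # Placeholder for the downstream stages (mouth texture synthesis and
    # 3D-pose-matched compositing into the target frame); a no-op here.
    return target_frame

# Run the sketch on 100 frames of synthetic "audio features".
audio = rng.normal(size=(100, AUDIO_DIM))
shapes = audio_to_mouth_shapes(audio)
print(shapes.shape)  # (100, 18)
```

The recurrence is the key design point the paper's abstract emphasizes: mouth shape at each time instant depends on audio context, not just the current frame, which is why a recurrent (rather than purely frame-wise) model is used.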
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Disinformation | Risk | 54.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 2 KB
Synthesizing Obama: Learning Lip Sync from Audio
SIGGRAPH 2017 — Supasorn Suwajanakorn, Steven M. Seitz, Ira Kemelmacher-Shlizerman

Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.

Supplementary Video · Publication: SIGGRAPH 2017 Paper

Training Videos — a list of YouTube videos used for training our recurrent neural network: obama_addresses.txt

- Video A - Teaser: Input Audio: nIxM8rL5GVE (0:10 - 1:16); Target Video: 3vPdtajOJfw
- Video B - Comparison to Face2Face [Thies et al. 2016]: Input Audio: nIxM8rL5GVE (0:10 - 0:25); Target Video: k4OZOTaf3lk
- Video C - Method Pipeline: Input Audio: deF-f0OqvQ4 (1:37 - 2:14); Target Video: 25GOnaY8ZCY
- Video D - Target Video Retiming: Input Audio: nIxM8rL5GVE (2:02 - 2:23); Target Video: 25GOnaY8ZCY
- Video E - Weekly Address Speech (4-Obama): Input Audio 1: nIxM8rL5GVE (3:53 - 4:20); Input Audio 2: WtOhZ--YeFY (0:58 - 1:31); Target Videos: Top-Left: k4OZOTaf3lk, Top-Right: E3gfMumXCjI, Bottom-Left: 3vPdtajOJfw, Bottom-Right: 25GOnaY8ZCY
- Video F - Non-address Speech: Input Audio 1: Steve Harvey's (0:45 - 1:14), Target Video 1: k4OZOTaf3lk; Input Audio 2: 60 Minutes Interview (1:12 - 1:31), Target Video 2: 3vPdtajOJfw; Input Audio 3: The View (15:48 - 16:08), Target Video 3: k4OZOTaf3lk; Input Audio 4: Obama in 1990 (0:00 - 0:21), Target Video 4: 3vPdtajOJfw; Input Audio 5: Impressionist, Ryan Goldsher (1:18 - 1:36), Target Video 5: EAZIHIiuhrc
- Video G - Speech Summarization: Input Audio: deF-f0OqvQ4; Target Video: 25GOnaY8ZCY

Last modified:
Resource ID:
62e052ee54819423 | Stable ID: sid_zFVXCDpJVn