Emergent capability detection
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces DataComp, a large-scale benchmark for dataset design with 12.8B image-text pairs, addressing how dataset curation impacts model capabilities and safety—relevant for understanding emergent abilities and data's role in AI system behavior.
Paper Details
Metadata
Abstract
Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
Summary
DataComp is a new benchmark testbed for dataset design and curation in multimodal machine learning, addressing the lack of research attention on datasets compared to model architectures. The benchmark provides a 12.8 billion image-text pair candidate pool from Common Crawl and enables researchers to design filtering techniques or curate data sources, then evaluate results using standardized CLIP training across 38 downstream tasks. Spanning four orders of magnitude in compute scales, DataComp makes dataset research accessible to researchers with varying resources. The authors demonstrate that their best baseline (DataComp-1B) achieves 79.2% zero-shot ImageNet accuracy with CLIP ViT-L/14, outperforming OpenAI's CLIP by 3.7 percentage points using identical training procedures.
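One of the filtering baselines explored in the paper is thresholding candidate pairs by CLIP image-text similarity. The sketch below illustrates that idea on precomputed embeddings; it is a minimal illustration, not the authors' released code, and the function name, tensor shapes, and threshold value are illustrative assumptions.

```python
import torch


def clip_score_filter(image_embs: torch.Tensor,
                      text_embs: torch.Tensor,
                      threshold: float = 0.3) -> torch.Tensor:
    """Keep image-text pairs whose CLIP cosine similarity exceeds `threshold`.

    image_embs, text_embs: (N, D) embeddings of the same N candidate pairs,
    produced by a pretrained CLIP model. The default threshold is a placeholder.
    """
    # L2-normalize so the row-wise dot product equals cosine similarity.
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    text_embs = torch.nn.functional.normalize(text_embs, dim=-1)
    scores = (image_embs * text_embs).sum(dim=-1)
    return scores > threshold


# Toy usage: random embeddings stand in for a real candidate pool.
mask = clip_score_filter(torch.randn(1000, 512), torch.randn(1000, 512), threshold=0.0)
print(f"kept {int(mask.sum())} of {mask.numel()} pairs")
```

In the DataComp workflow, the subset surviving such a filter would then be used to train a model with the standardized CLIP training code and scored on the 38 downstream test sets.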
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Evaluation | Approach | 72.0 |
Cached Content Preview
[2304.14108] DataComp: In search of the next generation of multimodal datasets
* Equal contribution, randomly ordered. Correspondence to contact@datacomp.ai.
1 University of Washington
2 Columbia University
3 Tel Aviv University
4 Apple
5 UT Austin
6 LAION
7 AI2
8 Juelich Supercomputing Center, Research Center Juelich
9 University of Illinois Urbana-Champaign
10 Graz University of Technology
11 Hebrew University
12 Google Research
13 Snorkel AI
DataComp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre* 2 , Gabriel Ilharco* 1 , Alex Fang* 1 , Jonathan Hayase 1 ,
Georgios Smyrnis 5 , Thao Nguyen 1 , Ryan Marten 7,9 , Mitchell Wortsman 1 ,
Dhruba Ghosh 1 , Jieyu Zhang 1 , Eyal Orgad 3 , Rahim Entezari 10 , Giannis Daras 5 ,
Sarah Pratt 1 , Vivek Ramanujan 1 , Yonatan Bitton 11 , Kalyani Marathe 1 ,
Stephen Mussmann 1 , Richard Vencu 6 , Mehdi Cherti 6,8 , Ranjay Krishna 1 ,
Pang Wei Koh 1,12 , Olga Saukh 10 , Alexander Ratner 1,13 , Shuran Song 2 ,
Hannaneh Hajishirzi 1,7 , Ali Farhadi 1 , Romain Beaumont 6 ,
Sewoong Oh 1 , Alex Dimakis 5 , Jenia Jitsev 6,8 ,
Yair Carmon 3 , Vaishaal Shankar 4 , Ludwig Schmidt 1,6,7
Abstract
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms.
To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets.
Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
Our baseline experiments show that the DataComp workflow leads to better training sets.
Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI’s CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.
We release DataComp and all accompanying code at www.datacomp.ai.
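For readers unfamiliar with how zero-shot ImageNet accuracy is obtained from a trained CLIP model, here is a minimal sketch of prompt-based zero-shot classification over precomputed embeddings. It reflects the standard CLIP evaluation recipe rather than DataComp's evaluation harness; the function name and shapes are assumptions.

```python
import torch


def zero_shot_accuracy(image_embs: torch.Tensor,
                       class_text_embs: torch.Tensor,
                       labels: torch.Tensor) -> float:
    """Assign each image the class whose prompt embedding is most similar.

    image_embs:      (N, D) CLIP image embeddings for the test set
    class_text_embs: (C, D) CLIP text embeddings, one per class prompt
                     (e.g. "a photo of a {class name}")
    labels:          (N,) ground-truth class indices
    """
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    class_text_embs = torch.nn.functional.normalize(class_text_embs, dim=-1)
    # (N, C) similarity matrix; the predicted class is the argmax per row.
    preds = (image_embs @ class_text_embs.T).argmax(dim=-1)
    return (preds == labels).float().mean().item()
```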
1 Introduction
Recent advances in multimodal learning such as CLIP [111], DALL-E [115, 116], Stable Diffusion [123], Flamingo [8], and GPT-4 [103] offer unprecedented generalization capabilities in zero-shot classification, image generation, and in-context learning.
While these advances use different algorithmic techniques, e.g., contrastive learning, diffusion, or auto-regressive modeling, they all rest on a common foundation: large d
... (truncated, 98 KB total)