The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time.
jalammar.github.io/illustrated-bert
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
Discussions:
Hacker News (98 points, 19 comments), Reddit r/MachineLearning (164 points, 20 comments)
Translations: Chinese (Simplified), French 1, French 2, Japanese, Korean, Persian, Russian, Spanish
2021 Update: I created this brief and highly accessible video intro to BERT
The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing, or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that captures their underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines (this has been referred to as NLP's ImageNet moment, referencing how, years ago, similar developments accelerated machine learning in Computer Vision tasks).
(ULM-FiT has nothing to do with Cookie Monster. But I couldn't think of anything else…)
One of the latest milestones in this development is the release of BERT , an event described as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks. Soon after the release of the paper describing the model, the team also open-sourced the code of the model, and made available for download versions of the model that were already pre-trained on massive datasets. This is a momentous development since it enables anyone building a machine learning model involving language processing to use this powerhouse as a readily-available component – saving the time, energy, knowledge, and resources that would have gone to training a language-processing model from scratch.
The two steps of how BERT is developed. You can download the model pre-trained in step 1 (trained on un-annotated data), and only worry about fine-tuning it for step 2. [ Source for book icon].
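To make the two-step recipe concrete, here is a minimal sketch (not the article's code, and not actual BERT) of the pattern it describes: step 1 produces an encoder whose weights you reuse frozen, and step 2 trains only a small task-specific head on top of its features. The "pre-trained" encoder below is a stand-in random embedding table with mean pooling; in practice you would download real BERT weights instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 stand-in: a frozen "pre-trained" encoder. Here it is just a random
# embedding table plus mean pooling; real BERT weights would come pre-trained
# on massive un-annotated text.
EMBED = rng.normal(size=(1000, 16))  # pretend vocabulary of 1000 tokens

def encode(token_ids):
    """Frozen encoder: mean-pool the embeddings of a token-id sequence."""
    return EMBED[token_ids].mean(axis=0)

# Step 2: fine-tune only a small classification head (logistic regression)
# on labeled data for the downstream task. Data here is synthetic.
X = np.stack([encode(rng.integers(0, 1000, size=12)) for _ in range(64)])
y = rng.integers(0, 2, size=64).astype(float)

w, b = np.zeros(16), 0.0
for _ in range(500):                        # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    grad = p - y                            # dLoss/dlogits for cross-entropy
    w -= 0.1 * (X.T @ grad) / len(y)        # only the head's weights move;
    b -= 0.1 * grad.mean()                  # EMBED stays frozen throughout

acc = ((p > 0.5) == y).mean()
print(f"training accuracy of the fine-tuned head: {acc:.2f}")
```

The key point the figure makes survives even in this toy: the expensive part (the encoder) is trained once by someone else, and your task only pays for the cheap head.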
BERT builds on top of a number of clever ideas that have been bubbling up in the NLP community recently – including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI Transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al).
There are a number of concepts one needs to be aware of to properly wrap one's head around what BERT is. So let's start by looking at ways you can use BERT before diving into the concepts involved.
... (truncated, 19 KB total)