[1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
BERT is a foundational large language model architecture that established bidirectional transformer pre-training, relevant to AI safety for understanding capabilities, alignment challenges, and behavioral properties of modern LLMs.
Paper Details
Metadata
Summary
BERT (Bidirectional Encoder Representations from Transformers) introduces a novel pre-training approach for language models that conditions on both left and right context across all layers, enabling deep bidirectional representations from unlabeled text. Unlike previous language models, BERT can be fine-tuned with minimal task-specific modifications to achieve state-of-the-art results across diverse NLP tasks. The model demonstrates significant empirical improvements on eleven benchmark tasks, including GLUE (80.5%), MultiNLI (86.7%), and SQuAD question answering (Test F1 of 93.2 on v1.1 and 83.1 on v2.0).
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Deep Learning Revolution Era | Historical | 44.0 |
Cached Content Preview
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
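The abstract's central claim, jointly conditioning on both left and right context, can be pictured as self-attention with no causal mask: every position attends to the full sequence, whereas a unidirectional model masks out future positions. The toy sketch below (not from the paper; shapes and values are illustrative assumptions) shows only this masking difference, in a single attention head.

```python
# Minimal sketch of bidirectional vs. unidirectional (causal) self-attention.
# Not the paper's code; a single head with toy shapes, numpy only.
import numpy as np

def self_attention(x, causal=False):
    """Scaled dot-product self-attention over x of shape (T, d)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (T, T) attention logits
    if causal:
        # Unidirectional: position t may only attend to positions <= t.
        keep = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    # Softmax over the last axis; masked positions get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
bi = self_attention(x, causal=False)   # every token sees full context
uni = self_attention(x, causal=True)   # token 0 can attend only to itself
```

In the causal variant the first token's output equals its input (it attends only to itself), while in the bidirectional variant every token's representation mixes in right-hand context as well, which is what makes the masked-prediction pre-training objective necessary.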
1 Introduction
Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
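The distinction between the two strategies comes down to which parameters are updated on the downstream task. A schematic sketch (hypothetical names, not the paper's code):

```python
# Schematic contrast of the two transfer strategies described above.
# The encoder, its weights, and the strategy names here are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
pretrained_W = rng.normal(size=(16, 8))  # stand-in for pre-trained encoder weights

def encode(x, W):
    """Toy 'pre-trained' encoder: a single linear layer with tanh."""
    return np.tanh(x @ W)

def trainable_params(strategy):
    """Which parameter groups receive gradient updates under each strategy."""
    if strategy == "feature-based":
        # ELMo-style: encoder outputs are used as fixed features;
        # only the task-specific architecture on top is trained.
        return {"task_head"}
    if strategy == "fine-tuning":
        # GPT/BERT-style: a minimal task head is added and *all*
        # pre-trained encoder parameters are updated as well.
        return {"task_head", "encoder"}
    raise ValueError(f"unknown strategy: {strategy}")
```

Under both strategies the pre-training objective is the same; the fine-tuning route simply lets the downstream loss flow back into the pre-trained weights rather than treating them as frozen features.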
We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and thi
... (truncated, 69 KB total)