[1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
BERT is a foundational large language model architecture that established bidirectional transformer pre-training, relevant to AI safety for understanding capabilities, alignment challenges, and behavioral properties of modern LLMs.
Paper Details
Metadata
Summary
BERT (Bidirectional Encoder Representations from Transformers) introduces a novel pre-training approach for language models that conditions on both left and right context across all layers, enabling deep bidirectional representations from unlabeled text. Unlike previous language models, BERT can be fine-tuned with minimal task-specific modifications to achieve state-of-the-art results across diverse NLP tasks. The model demonstrates significant empirical improvements on eleven benchmark tasks, including GLUE (80.5%), MultiNLI (86.7%), and SQuAD question answering (Test F1 of 93.2 on v1.1 and 83.1 on v2.0).
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Deep Learning Revolution Era | Historical | 44.0 |
Cached Content Preview
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
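The abstract's central claim, jointly conditioning on both left and right context, can be pictured as self-attention with no causal mask: every position attends to the full sequence, whereas a unidirectional model masks out future positions. The toy sketch below (not from the paper; shapes and values are illustrative assumptions) shows only this masking difference, in a single attention head.

```python
# Minimal sketch of bidirectional vs. unidirectional (causal) self-attention.
# Not the paper's code; a single head with toy shapes, numpy only.
import numpy as np

def self_attention(x, causal=False):
    """Scaled dot-product self-attention over x of shape (T, d)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (T, T) attention logits
    if causal:
        # Unidirectional: position t may only attend to positions <= t.
        keep = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    # Softmax over the last axis; masked positions get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
bi = self_attention(x, causal=False)   # every token sees full context
uni = self_attention(x, causal=True)   # token 0 can attend only to itself
```

In the causal variant the first token's output equals its input (it attends only to itself), while in the bidirectional variant every token's representation mixes in right-hand context as well, which is what makes the masked-prediction pre-training objective necessary.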
1 Introduction
Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
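The distinction between the two strategies comes down to which parameters are updated on the downstream task. A schematic sketch (hypothetical names, not the paper's code):

```python
# Schematic contrast of the two transfer strategies described above.
# The encoder, its weights, and the strategy names here are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
pretrained_W = rng.normal(size=(16, 8))  # stand-in for pre-trained encoder weights

def encode(x, W):
    """Toy 'pre-trained' encoder: a single linear layer with tanh."""
    return np.tanh(x @ W)

def trainable_params(strategy):
    """Which parameter groups receive gradient updates under each strategy."""
    if strategy == "feature-based":
        # ELMo-style: encoder outputs are used as fixed features;
        # only the task-specific architecture on top is trained.
        return {"task_head"}
    if strategy == "fine-tuning":
        # GPT/BERT-style: a minimal task head is added and *all*
        # pre-trained encoder parameters are updated as well.
        return {"task_head", "encoder"}
    raise ValueError(f"unknown strategy: {strategy}")
```

Under both strategies the pre-training objective is the same; the fine-tuning route simply lets the downstream loss flow back into the pre-trained weights rather than treating them as frozen features.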
We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and thi
... (truncated, 69 KB total)