Longterm Wiki

Low-Rank Adaptation (LoRA)

paper

Authors

Edward J. Hu·Yelong Shen·Phillip Wallis·Zeyuan Allen-Zhu·Yuanzhi Li·Shean Wang·Lu Wang·Weizhu Chen

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

LoRA is a foundational technique for efficiently fine-tuning large language models by training only low-rank weight updates. It is relevant to AI safety because it reduces the computational barriers to model alignment and enables safer, more accessible model customization.

Paper Details

Citations
2549 influential
Year
2021

Metadata

arXiv preprint · primary source

Abstract

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

Summary

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that freezes pre-trained model weights and injects trainable low-rank decomposition matrices into Transformer layers, dramatically reducing the number of trainable parameters needed for task adaptation. The approach reduces trainable parameters by 10,000x and GPU memory by 3x compared to full fine-tuning of GPT-3 175B, while maintaining or exceeding model quality across multiple benchmarks (RoBERTa, DeBERTa, GPT-2, GPT-3). LoRA achieves these efficiency gains without introducing additional inference latency, making it practical for deploying adapted versions of large language models.
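The mechanism described above can be sketched in a few lines. The following is an illustrative NumPy sketch, not the authors' released package: a frozen weight W is augmented with a trainable low-rank update B·A scaled by α/r, and because B is initialized to zero, the adapted model starts out identical to the pre-trained one. The dimensions and hyperparameters here (d = 64, r = 4, α = 8) are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 64, 4, 8
W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight, never updated
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init => no change at start

def lora_forward(x):
    # Frozen path plus the scaled low-rank adapter path: (W + (alpha/r) B A) x.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the LoRA output equals the frozen model's output.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
trainable = A.size + B.size
print(trainable, W.size)  # 512 4096
```

Because the update is a plain matrix product, B·A can be merged into W after training, which is why LoRA adds no inference latency compared with adapter layers.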

Cited by 1 page

Page | Type | Quality
AI Proliferation | Risk | 60.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
[2106.09685] LoRA: Low-Rank Adaptation of Large Language Models 
 LoRA: Low-Rank Adaptation of Large Language Models

 
 
Edward Hu*, Yelong Shen*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
Microsoft Corporation
{edwardhu, yeshe, phwallis, zeyuana, yuanzhil, swang, luw, wzchen}@microsoft.com
yuanzhil@andrew.cmu.edu
(Version 2)
*Equal contribution.

 
 Abstract

 An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains.
As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible.
Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive.
We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.
We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA.
We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

 
Footnote: Compared to V1, this draft includes better baselines, experiments on GLUE, and more on adapter latency.
 
 
 1 Introduction

 
Figure 1: Our reparametrization. We only train A and B.
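The parameter savings behind training only A and B can be checked with back-of-envelope arithmetic. The sketch below assumes a single square projection matrix at GPT-3's hidden size of d = 12288 and a small rank of r = 4; the exact 10,000x reduction cited in the abstract depends on which matrices are adapted across all layers, so this per-matrix figure is only indicative.

```python
# Back-of-envelope parameter count for one square projection matrix,
# assuming GPT-3's hidden size d = 12288 and LoRA rank r = 4.
d, r = 12288, 4
full = d * d              # parameters updated by full fine-tuning
lora = r * d + d * r      # trainable parameters in A (r x d) and B (d x r)
print(full, lora, full // lora)  # 150994944 98304 1536
```

Even at this single-matrix granularity, the trainable-parameter count drops by three orders of magnitude, which is what makes storing one adapter per downstream task practical.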
 
 
 Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications.
Such adaptation is usually done via fine-tuning , which updates all the parameters of the pre-trained model.
The major downside of fine-tuning is that the new model contains as many parameters as in the original model.
As larger models are trained every few months, this changes from a mere “inconvenience” for GPT-2 (Radford et al., b) or RoBERTa large (Liu et al., 2019) to a critical deployment challenge for GPT-3 (Brown et al., 2020) with 175 billion trainable parameters.[1]

[1] While GPT-3 175B achieves non-trivial performance with few-shot learning, fine-tuning boosts its performance significantly as shown in Appendix A.

 
 
 Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks.
T

... (truncated, 98 KB total)
Resource ID: cae140a2c5e76d68 | Stable ID: sid_LsNmvaz3I8