Longterm Wiki

Watermarking language models

paper

Authors

Kirchenbauer, John · Geiping, Jonas · Wen, Yuxin · Katz, Jonathan · Miers, Ian · Goldstein, Tom

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Proposes a watermarking technique for detecting machine-generated text from language models, addressing AI safety concerns around detecting synthetic content and maintaining transparency about AI-generated outputs.

Paper Details

Citations
809
195 influential
Year
2023

Metadata

arXiv preprint · primary source

Summary

Researchers propose a watermarking framework that can embed signals into language model outputs to detect machine-generated text. The watermark is computationally detectable but invisible to humans.

Key Points

  • Watermark can be embedded without noticeable impact on text quality
  • Detection is possible from as few as 25 tokens with high statistical confidence (see the worked example after this list)
  • Works across different language model architectures and sampling strategies
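
For intuition on the 25-token figure, here is a back-of-envelope illustration using the one-proportion z-test described in the paper, where T is the number of tokens inspected, γ is the green-list fraction, and |s|_G counts how many tokens fall on the green list. The numbers are illustrative, assuming γ = 0.5 and the extreme case where every token in a 25-token span is green:

```latex
z = \frac{|s|_G - \gamma T}{\sqrt{T\,\gamma(1-\gamma)}},
\qquad T = 25,\ \gamma = 0.5,\ |s|_G = 25
\;\Rightarrow\;
z = \frac{25 - 12.5}{\sqrt{25 \cdot 0.25}} = 5,
\quad p \approx 2.9 \times 10^{-7}.
```

The soft watermark promotes rather than forces green tokens, so real spans typically need somewhat more tokens, but the same test applies.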

Review

This groundbreaking paper introduces a sophisticated watermarking method for large language models that addresses critical challenges in AI-generated text detection. The core innovation is a 'soft' watermarking technique that probabilistically promotes certain tokens during text generation, creating a statistically detectable signature without significantly degrading text quality. The methodology involves selecting a randomized set of 'green' tokens and subtly biasing the language model's sampling towards these tokens. This approach is particularly powerful because it works across different sampling strategies like multinomial sampling and beam search, and can be implemented with minimal impact on text perplexity. The authors provide rigorous theoretical analysis, demonstrating how the watermark's detectability relates to the entropy of generated text, and present comprehensive empirical validation using the OPT model family.
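
To make the mechanism concrete, here is a minimal generation-side sketch in Python. It illustrates the green-list idea rather than reproducing the authors' released implementation: the hash constant, the seed-by-previous-token scheme, and the parameter names are assumptions, and γ = 0.5, δ = 2.0 are only plausible defaults.

```python
import torch

def apply_soft_watermark(logits: torch.Tensor, prev_token_id: int,
                         gamma: float = 0.5, delta: float = 2.0,
                         hash_key: int = 15485863) -> torch.Tensor:
    """Bias next-token logits toward a pseudorandom 'green list'.

    Sketch of the soft watermark idea: the previous token seeds a RNG,
    the RNG partitions the vocabulary into a green list (a gamma fraction)
    and a red list, and delta is added to green-token logits before sampling.
    The hash constant and seeding scheme here are illustrative assumptions.
    """
    vocab_size = logits.shape[-1]
    generator = torch.Generator()
    generator.manual_seed(hash_key * prev_token_id)
    # Randomly permute the vocabulary and take the first gamma fraction as green.
    perm = torch.randperm(vocab_size, generator=generator)
    green_ids = perm[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[..., green_ids] += delta
    return biased

# Usage: bias the logits, then decode as usual (multinomial sampling, beam search, ...).
logits = torch.randn(50_000)  # stand-in for a model's next-token logits
watermarked = apply_soft_watermark(logits, prev_token_id=1234)
next_token = torch.multinomial(torch.softmax(watermarked, dim=-1), 1)
```

Because the bias is applied to the logits before decoding, the same hook composes with multinomial sampling, top-k, or beam search, which is what keeps the scheme decoder-agnostic.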

Cited by 1 page

Page: Authentication Collapse · Type: Risk · Quality: 57.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB

arXiv:2301.10226 · A Watermark for Large Language Models

John Kirchenbauer*, Jonas Geiping*, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

University of Maryland

*Equal contribution. Code and demo are available at github.com/jwkirchenbauer/lm-watermarking. Correspondence to: John Kirchenbauer <jkirchen@umd.edu>.
 
 Abstract

 Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens.
We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of “green” tokens before a word is generated, and then softly promoting use of green tokens during sampling.
We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
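
Detection only needs the tokenizer and the green-list rule, not the model or its API. The sketch below re-derives each position's green list and runs the one-proportion z-test; it mirrors the illustrative hashing scheme from the generation sketch above rather than the authors' released detector, and the vocabulary size and hash key must match the ones assumed at generation time.

```python
import math
import torch

def count_green(token_ids, gamma=0.5, hash_key=15485863, vocab_size=50_000):
    """Count how many tokens fall on their position's green list.

    Re-derives each green list from the preceding token, exactly as the
    (assumed) generation-side scheme would.
    """
    hits = 0
    for prev_id, cur_id in zip(token_ids[:-1], token_ids[1:]):
        g = torch.Generator()
        g.manual_seed(hash_key * prev_id)
        perm = torch.randperm(vocab_size, generator=g)
        green_ids = set(perm[: int(gamma * vocab_size)].tolist())
        hits += cur_id in green_ids
    return hits, len(token_ids) - 1

def watermark_z_score(token_ids, gamma=0.5):
    """One-proportion z-test: are green tokens over-represented?"""
    green, total = count_green(token_ids, gamma)
    z = (green - gamma * total) / math.sqrt(total * gamma * (1 - gamma))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided
    return z, p_value
```

A large z (equivalently, a small one-sided p-value) indicates many more green tokens than the γ fraction expected from unwatermarked text.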

 
 

 
 
 
 1 Introduction

 
 Large language models (LLMs), such as the recently developed ChatGPT, can write documents, create executable code, and answer questions, often with human-like capabilities (Schulman et al., 2022). As these systems become more pervasive, there is increasing risk that they may be used for malicious purposes (Bergman et al., 2022; Mirsky et al., 2023). These include social engineering and election manipulation campaigns that exploit automated bots on social media platforms, creation of fake news and web content, and use of AI systems for cheating on academic writing and coding assignments. Furthermore, the proliferation of synthetic data on the web complicates future dataset creation efforts, as synthetic data is often inferior to human content and must be detected and excluded before model training (Radford et al., 2022).
For many reasons, the

... (truncated, 98 KB total)
Resource ID: b35324fe10a56f49 | Stable ID: sid_bh41wJKnTQ