
University of Maryland: Detecting AI-generated text unreliable

paper

Authors

Sadasivan, Vinu Sankar·Kumar, Aounon·Balasubramanian, Sriram·Wang, Wenxiao·Feizi, Soheil

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

University of Maryland research demonstrating vulnerabilities in AI-generated-text detection systems through adversarial paraphrasing attacks, highlighting safety concerns around detection reliability for harmful AI outputs.

Paper Details

Citations
0
49 influential
Year
2023
Methodology
empirical attack experiments with theoretical analysis

Metadata

arXiv preprint · primary source

Summary

This University of Maryland study explores the vulnerabilities of AI-text detection techniques by developing recursive paraphrasing attacks that sharply reduce detection accuracy across multiple detection methods, with minimal degradation of text quality.
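The attack is simple to sketch. Below is a minimal illustration of recursive paraphrasing in Python using the Hugging Face transformers library; it is not the authors' pipeline (their experiments use dedicated paraphrasers such as DIPPER), and the model name, prompt prefix, and sampling parameters are illustrative assumptions.

```python
# Minimal sketch of a recursive paraphrasing attack (illustrative only;
# not the paper's pipeline). Assumes a public T5-style paraphraser on the
# Hugging Face Hub; swap MODEL_NAME for any seq2seq paraphrasing model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "humarin/chatgpt_paraphraser_on_T5_base"  # assumption, not the paper's model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase_once(text: str) -> str:
    """One paraphrasing pass: rewrite `text` while preserving its meaning."""
    inputs = tokenizer("paraphrase: " + text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def recursive_paraphrase(text: str, rounds: int = 3) -> str:
    """Apply the paraphraser repeatedly. Each round further washes out any
    statistical signature (watermark, logit pattern) left by the source LLM,
    which is what degrades the detectors studied in the paper."""
    for _ in range(rounds):
        text = paraphrase_once(text)
    return text

ai_text = "The rapid progress of large language models has transformed NLP."
print(recursive_paraphrase(ai_text, rounds=3))
```

The paper's empirical finding is that even a handful of such rounds suffices: detection rates drop sharply while text quality degrades only minimally.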

Key Points

  • Recursive paraphrasing can dramatically reduce AI text detection accuracy across multiple detection methods
  • Current AI text detection techniques have significant vulnerabilities that can be exploited by motivated attackers
  • Theoretical analysis suggests detection will become increasingly difficult as AI models advance

Review

This groundbreaking research systematically exposes critical weaknesses in current AI-generated-text detection systems. The authors develop a recursive paraphrasing attack that evades watermarking, neural network-based, zero-shot, and retrieval-based detectors alike. By repeatedly paraphrasing AI-generated text with another language model, they demonstrate dramatic drops in detection rates: watermark detection, for instance, falls from 99.8% to as low as 9.7%.

The study's most significant contribution is its account of why reliably distinguishing human from AI-generated text is fundamentally hard. Combining empirical experiments with theoretical analysis, the researchers show that as language models become more sophisticated, the total variation distance between human and AI text distributions shrinks, and this distance bounds the performance of even the best possible detector. The framework thus points to an inherent limitation: as models improve, detection drifts toward the accuracy of a random classifier.
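That limitation can be made precise. The paper's impossibility result bounds the area under the ROC curve of any detector D by AUROC(D) ≤ 1/2 + TV(M, H) − TV(M, H)²/2, where TV(M, H) is the total variation distance between the model's and humans' text distributions. A few lines of Python (illustrative, not from the paper's released code) show how the best achievable AUROC collapses toward a coin flip's 0.5 as that distance shrinks:

```python
# Best-case detector AUROC as a function of the total variation (TV)
# distance between AI-text and human-text distributions, per the paper's
# impossibility result: AUROC <= 1/2 + TV - TV^2 / 2.
def best_possible_auroc(tv: float) -> float:
    assert 0.0 <= tv <= 1.0, "TV distance lies in [0, 1]"
    return 0.5 + tv - tv ** 2 / 2

for tv in (1.0, 0.5, 0.2, 0.05):
    print(f"TV = {tv:.2f}  ->  best AUROC <= {best_possible_auroc(tv):.3f}")
# TV = 1.00  ->  best AUROC <= 1.000  (distributions disjoint: trivially detectable)
# TV = 0.05  ->  best AUROC <= 0.549  (near-human model: barely beats chance)
```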

Cited by 2 pages

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
[2303.11156] Can AI-Generated Text be Reliably Detected?

 
 
Vinu Sankar Sadasivan (vinu@umd.edu), Aounon Kumar (aounon@umd.edu), Sriram Balasubramanian (sriramb@umd.edu), Wenxiao Wang (wwx@umd.edu), Soheil Feizi (sfeizi@umd.edu)
Department of Computer Science, University of Maryland

 
 Abstract

 The rapid progress of large language models (LLMs) has made them capable of performing astonishingly well on various tasks including document completion and question answering. The unregulated use of these models, however, can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Therefore, reliable detection of AI-generated text can be critical to ensure the responsible use of LLMs. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them.

In this paper, both empirically and theoretically, we show that these detectors are not reliable in practical scenarios. Empirically, we show that paraphrasing attacks, where a light paraphraser is applied on top of the generative text model, can break a whole range of detectors, including the ones using the watermarking schemes as well as neural network-based detectors and zero-shot classifiers.
Our experiments demonstrate that retrieval-based detectors, designed to evade paraphrasing attacks, are still vulnerable against recursive paraphrasing. We then provide a theoretical impossibility result indicating that as language models become more sophisticated and better at emulating human text, the performance of even the best-possible detector decreases.
For a sufficiently advanced language model seeking to imitate human text, even the best-possible detector may only perform marginally better than a random classifier.
Our result is general enough to capture specific scenarios such as particular writing styles, clever prompt design, or text paraphrasing.
We also extend the impossibility result to include the case where pseudorandom number generators are used for AI-text generation instead of true randomness.
We show that the same result holds with a negligible correction term for all polynomial-time computable detectors.
Finally, we show that even LLMs protected by watermarking schemes can be vulnerable against spoofing attacks where adversarial humans can infer hidden LLM text signatures and add them to human-generated text to be detected as text generated by the LLMs, potentially causing reputational damage to their developers. We believe these results can open an honest conversation in the community regarding the ethical and reliable use of AI-generated text. Our code is publicly available at
 https://github.com/vinusankars/Reliability-of-AI-text-detectors .
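For context on the watermark detectors named above: schemes like the soft watermark of Kirchenbauer et al. (2023), one of the detectors this paper attacks, flag text via a one-proportion z-test on the count of pseudorandomly chosen "green-list" tokens. A simplified sketch, assuming the green-token count has already been computed from the hashed key:

```python
import math

def watermark_z_score(num_green: int, num_tokens: int, gamma: float = 0.25) -> float:
    """One-proportion z-test used by green-list watermark detectors
    (Kirchenbauer et al., 2023). Under unwatermarked text each token is
    green with probability gamma; watermarked sampling biases toward green."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std

# Freshly watermarked text: roughly half the tokens are green vs. the 25% baseline.
print(watermark_z_score(num_green=100, num_tokens=200))  # z ~ 8.2: flagged
# After recursive paraphrasing, the green fraction regresses toward gamma and
# the score falls below common detection thresholds (around z = 4).
print(watermark_z_score(num_green=55, num_tokens=200))   # z ~ 0.8: evades detection
```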

 1 Introduction

 
Figure 1: An illustration of vul

... (truncated, 98 KB total)
Resource ID: 786286889baca739 | Stable ID: sid_QAbfcD4650