Longterm Wiki

Representation Engineering: A Top-Down Approach to AI Transparency

paper

Authors

Andy Zou·Long Phan·Sarah Chen·James Campbell·Phillip Guo·Richard Ren·Alexander Pan·Xuwang Yin·Mantas Mazeika·Ann-Kathrin Dombrowski·Shashwat Goel·Nathaniel Li·Michael J. Byun·Zifan Wang·Alex Mallen·Steven Basart·Sanmi Koyejo·Dawn Song·Matt Fredrikson·J. Zico Kolter·Dan Hendrycks

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper introduces representation engineering, a method for enhancing AI transparency by analyzing and manipulating population-level representations in deep neural networks, directly addressing the interpretability and control challenges central to AI safety.

Paper Details

Citations
831 (116 influential)
Year
2023

Metadata

arXiv preprint · primary source

Abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Summary

This paper introduces representation engineering (RepE), a top-down approach to AI transparency that analyzes population-level representations in deep neural networks rather than individual neurons. Drawing from cognitive neuroscience, RepE provides methods for monitoring and manipulating high-level cognitive phenomena in large language models. The authors demonstrate that RepE techniques can effectively address safety-relevant problems including honesty, harmlessness, and power-seeking behavior, offering a promising direction for improving AI system transparency and control.
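The manipulation ("control") side can be sketched just as simply: once a concept direction is in hand, adding a scaled copy of it to a hidden state nudges downstream computation along that concept. The toy linear readout below is a hypothetical stand-in for the later layers of a model, not the paper's implementation:

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift a hidden state along a (unit-normalized) concept direction.
    Positive alpha amplifies the concept; negative alpha suppresses it."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return hidden + alpha * d

# Hypothetical 4-dim hidden state and a linear readout standing in for
# the rest of the network.
hidden = np.array([0.2, -0.5, 0.1, 0.3])
readout = np.array([1.0, 0.0, 0.0, 0.0])
concept_dir = np.array([2.0, 0.0, 0.0, 0.0])  # unnormalized on purpose

score_before = hidden @ readout
score_after = steer(hidden, concept_dir, 3.0) @ readout  # shifted by +3.0
```

In practice the intervention is applied to activations at chosen layers during the forward pass (e.g. via hooks), with the direction obtained from contrastive reading as in the paper.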

Cited by 5 pages

1 FactBase fact citing this source

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB
[2310.01405] Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou (Center for AI Safety; Carnegie Mellon University), Long Phan* (Center for AI Safety), Sarah Chen* (Center for AI Safety; Stanford University), James Campbell* (Cornell University), Phillip Guo* (University of Maryland), Richard Ren* (University of Pennsylvania), Alexander Pan (UC Berkeley), Xuwang Yin (Center for AI Safety), Mantas Mazeika (Center for AI Safety; University of Illinois Urbana-Champaign), Ann-Kathrin Dombrowski (Center for AI Safety), Shashwat Goel (Center for AI Safety), Nathaniel Li (Center for AI Safety; UC Berkeley), Michael J. Byun (Stanford University), Zifan Wang (Center for AI Safety), Alex Mallen (EleutherAI), Steven Basart (Center for AI Safety), Sanmi Koyejo (Stanford University), Dawn Song (UC Berkeley), Matt Fredrikson (Carnegie Mellon University), Zico Kolter (Carnegie Mellon University), Dan Hendrycks (Center for AI Safety)
 
 

 
Code is available at github.com/andyzoujm/representation-engineering.

 
∗ Equal contribution. Correspondence to: andyzou@cmu.edu
 
Figure 1: Overview of topics in the paper. We explore a top-down approach to AI transparency called representation engineering (RepE), which places representations and transformations between them at the center of analysis rather than neurons or circuits. Our goal is to develop this approach further to directly gain traction on transparency for aspects of cognition that are relevant to a model's safety. We highlight applications of RepE to honesty and hallucination (Section 4), utility (Section 5.1), power-aversion (Section 5.2), probability and risk (Section 5.3), emotion (Section 6.1), harmlessness (Section 6.2), fairness and bias (Section 6.3), knowledge edit

... (truncated, 98 KB total)
Resource ID: 5d708a72c3af8ad9 | Stable ID: sid_ud3GLuUZwl