# Sparse Autoencoders for Interpretability in Reinforcement Learning Models
Coleman DuPlessie
# Abstract
Recent work has shown that sparse autoencoders (SAEs) are able to effectively discover human-interpretable features in language models, at scales ranging from toy models to state-of-the-art large language models. This work explores whether the use of SAEs can be generalized to other varieties of machine learning, specifically reinforcement learning, and what modifications, if any, are necessary to adapt SAEs to this substantially different task. This research investigates both qualitative and quantitative measures of SAEs’ ability to represent reinforcement learning models’ activations as interpretable features, using a toy reinforcement learning environment to conduct empirical experiments. It finds that SAEs can successfully break down deep Q networks’ internal activations into human-interpretable features, and, furthermore, that some of these features represent an internal understanding of the underlying task that could not have been discovered from a deep Q network’s output alone.
# 1 Introduction
Recently, there has been great progress in the field of mechanistic interpretability, which studies methods for making trained machine learning models’ decision-making processes understandable to humans. The vast majority of this work has centered on transformers, which provide an ideal testing ground for three reasons: first, their inputs and outputs (natural language) are very easily understood and manipulated by humans; second, natural language tends to be extremely conceptually sparse (i.e. for any concept, the vast majority of text is unrelated to that concept); and third, there are immediate practical uses for interpretability in transformers today (e.g. as a method to fine-tune the outputs of Large Language Models (LLMs) without requiring large amounts of human feedback [12]).
This paper focuses on the use of sparse autoencoders (SAEs). SAEs have been successfully applied to transformers to decompose their activations into human-interpretable features [4], which can then be manipulated to change the transformer’s output in meaningful, interpretable ways [12].
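To make the "manipulate features" step concrete, the sketch below shows one common form of feature steering with an SAE: scale one decoder column (a learned feature direction) and add it back into a model activation. This is an illustrative sketch, not the paper's procedure; the decoder matrix, feature index, and scale are placeholders.

```python
import torch

d_model, d_features = 64, 256
# Stand-in SAE decoder: each column is one learned feature direction in activation space.
decoder_weight = torch.randn(d_model, d_features)

def steer(activation: torch.Tensor, feature_idx: int, alpha: float) -> torch.Tensor:
    """Add (alpha > 0) or suppress (alpha < 0) one SAE feature direction in an activation."""
    direction = decoder_weight[:, feature_idx]
    return activation + alpha * direction

activation = torch.randn(d_model)                        # a single model activation vector
steered = steer(activation, feature_idx=17, alpha=3.0)   # hypothetical feature and strength
```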
In this paper, a variety of sparse autoencoders are trained on the activations of deep Q networks (DQNs). Many, though not all, features of these SAEs appear interpretable, and some features represent phenomena that do not improve model performance but are learned regardless.
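The preview does not include the paper's training pipeline. As a rough illustration of the setup, the sketch below collects a DQN's hidden-layer activations with a PyTorch forward hook; the network shape, hooked layer, and observation/action sizes are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

# Hypothetical DQN; the paper's actual architecture and environment are not shown in this preview.
dqn = nn.Sequential(
    nn.Linear(8, 64),    # observation -> hidden
    nn.ReLU(),
    nn.Linear(64, 64),   # second hidden layer
    nn.ReLU(),           # index 3: post-activation output we record
    nn.Linear(64, 4),    # hidden -> one Q-value per action
)

activations = []

def record(module, inputs, output):
    # Detach so stored activations are plain tensors, not part of the autograd graph.
    activations.append(output.detach())

handle = dqn[3].register_forward_hook(record)

with torch.no_grad():
    for _ in range(100):
        obs = torch.randn(32, 8)   # stand-in batch of observations from rollouts or a replay buffer
        _ = dqn(obs)

handle.remove()
dataset = torch.cat(activations)   # (3200, 64) matrix of hidden activations for SAE training
```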
# 1.1 Sparse Autoencoders
Sparse autoencoders are a variety of autoencoders useful for taking features out of superposition. Superposition refers to the theory (proven to exist in toy models and conjectured to hold in many large models) that “features,” independent concepts represented by a machine learning model, are not actually represented independently, one in each neuron. Instead, features are each represented by a linear combination of neuron activations.
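As a point of reference (a minimal sketch, not the paper's exact architecture or hyperparameters), an SAE of the kind described here is an overcomplete linear encoder/decoder pair trained to reconstruct activations under an L1 penalty on the latent features; the dimensions and penalty weight below are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete dictionary trained with an L1 sparsity penalty."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_hidden, d_model)   # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))        # non-negative, encouraged to be sparse
        reconstruction = self.decoder(features)
        return reconstruction, features

# d_hidden > d_model: more dictionary directions than neurons, so superposed
# features can each claim their own latent.
sae = SparseAutoencoder(d_model=64, d_hidden=256)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3   # illustrative sparsity-penalty weight

x = torch.randn(32, 64)                # stand-in batch of network activations
reconstruction, features = sae(x)
loss = ((reconstruction - x) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
optimizer.step()
```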