Longterm Wiki

Representation Engineering

Interpretability | active

Reading and intervening on a model's internal representations to monitor and steer its behavior (e.g., representation reading, activation addition).
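As a rough illustration of the two techniques named above, the sketch below works on toy NumPy arrays standing in for hidden activations. It assumes a difference-of-means direction, which is one simple way to obtain a reading vector and not necessarily the method of Zou et al.; the function names (`reading_vector`, `read_score`, `add_activation`) and the synthetic data are hypothetical and chosen only for this example.

```python
import numpy as np

def reading_vector(pos_acts, neg_acts):
    """Unit-norm difference-of-means direction between two activation sets.

    Assumption: difference of means is used here for simplicity; other
    estimators (e.g., PCA over contrast pairs) are also possible.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def read_score(hidden, direction):
    """Representation reading: project activations onto the direction."""
    return hidden @ direction

def add_activation(hidden, direction, alpha):
    """Activation addition: shift activations by alpha along the direction."""
    return hidden + alpha * direction

# Toy stand-in for hidden states: the "concept" lives along axis 0.
rng = np.random.default_rng(0)
dim = 16
concept = np.zeros(dim)
concept[0] = 1.0
pos_acts = rng.normal(size=(200, dim)) * 0.1 + 2.0 * concept
neg_acts = rng.normal(size=(200, dim)) * 0.1 - 2.0 * concept

direction = reading_vector(pos_acts, neg_acts)

# Steer a few "negative" activations toward the positive side.
hidden = neg_acts[:5]
steered = add_activation(hidden, direction, alpha=4.0)

before = read_score(hidden, direction)
after = read_score(steered, direction)
```

Because `direction` has unit norm, adding `alpha * direction` raises each reading score by exactly `alpha`, so the same vector serves both to monitor a concept and to steer along it. In a real model the arrays would come from a chosen layer's activations, and the steering step would be applied during the forward pass.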

Key Papers: 1
First Proposed: 2023 (Zou et al.)
Cluster: Interpretability
Parent Area: Interpretability

Tags

function:assurance, scope:technique

Key Papers & Resources (1)