anthropic.com/research/team/interpretability
Credibility Rating: High (4/5)
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Data Status: Not fetched
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| AI Safety Technical Pathway Decomposition | Analysis | 62.0 |
| Interpretability | Safety Agenda | 66.0 |
| Mechanistic Interpretability | Approach | 59.0 |
| Probing / Linear Probes | Approach | 55.0 |
Cached Content Preview
HTTP 200 | Fetched Feb 26, 2026 | 5 KB
# Interpretability
The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.
Research teams: [Alignment](https://www.anthropic.com/research/team/alignment), [Economic Research](https://www.anthropic.com/research/team/economic-research), [Interpretability](https://www.anthropic.com/research/team/interpretability), [Societal Impacts](https://www.anthropic.com/research/team/societal-impacts)

### Safety through understanding
It's very challenging to reason about the safety of neural networks without understanding them. The Interpretability team's goal is to explain large language models' behaviors in detail, and then use that understanding to address problems ranging from bias to misuse to autonomous harmful behavior.
### Multidisciplinary approach
Some Interpretability researchers have deep backgrounds in machine learning – one member of the team is often credited with founding mechanistic interpretability, while another co-authored the well-known scaling laws paper. Other members joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.

- **[Tracing the thoughts of a large language model](https://www.anthropic.com/research/tracing-thoughts-language-model)** (Interpretability, Mar 27, 2025): Circuit tracing lets us watch Claude think, uncovering a shared conceptual space where reasoning happens before being translated into language, suggesting the model can learn something in one language and apply it in another.
- **[Signs of introspection in large language models](https://www.anthropic.com/research/introspection)** (Interpretability, Oct 29, 2025): Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect, a step toward understanding what's actually happening inside these models.
- **[Persona vectors: Monitoring and controlling character traits in language models](https://www.anthropic.com/research/persona-vectors)** (Interpretability, Aug 1, 2025): AI models represent character traits as patterns of activations within their neural networks. By extracting "persona vectors" for traits like sycophancy or hallucination, we can monitor personality shifts and mitigate undesirable behaviors (see the extraction sketch after this list).
- **Toy Models of Superposition** (Interpretability, Sep 14, 2022): Neural networks pack many concepts into single neurons. This paper shows how and when models represent more features than they have dimensions (a toy reproduction sketch follows below).
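The persona-vectors summary above compresses a concrete recipe: record activations while the model does and does not exhibit a trait, take the difference of the means as the trait direction, and project later activations onto it to monitor the trait. Here is a minimal NumPy sketch of that idea; the function names, the single-fixed-layer assumption, and the synthetic data standing in for real residual-stream activations are all illustrative assumptions, not details from the linked post.

```python
import numpy as np

def extract_persona_vector(trait_acts, baseline_acts):
    """Difference-of-means direction between trait-exhibiting and baseline runs.

    Both arrays are (n_samples, d_model) activations collected at one fixed
    layer while the model does / does not exhibit the trait (e.g. sycophancy).
    """
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # unit norm so projections are comparable

def trait_score(acts, persona_vector):
    """Project activations onto the persona vector to monitor trait expression."""
    return acts @ persona_vector

# Toy demo: synthetic activations standing in for a real model's residual stream.
rng = np.random.default_rng(0)
d_model = 64
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
baseline = rng.normal(size=(100, d_model))
trait = rng.normal(size=(100, d_model)) + 2.0 * true_direction  # shifted along the trait axis

v = extract_persona_vector(trait, baseline)
print(trait_score(trait, v).mean())     # noticeably positive
print(trait_score(baseline, v).mean())  # near zero
```

The same projection, applied to activations streamed during generation, is what "monitoring personality shifts" amounts to in this simplified picture.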
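The superposition result is simple enough to reproduce at toy scale. Below is a hedged PyTorch sketch in the spirit of the paper's ReLU(WᵀWx + b) reconstruction setup; the uniform feature importance and the specific sizes and hyperparameters are our simplifications, not the paper's exact experiment.

```python
import torch

# n_features sparse features squeezed into m_hidden < n_features dimensions,
# reconstructed through the transpose: x_hat = ReLU(W^T W x + b).
n_features, m_hidden, sparsity = 5, 2, 0.9
W = torch.randn(m_hidden, n_features, requires_grad=True)
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(3000):
    # Each feature is active with probability 1 - sparsity, value in [0, 1).
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Off-diagonal entries of the Gram matrix W^T W show features sharing
# (and interfering in) the 2-dimensional hidden space: superposition.
print((W.T @ W).detach())
```

With sparsity this high, the reconstruction loss favors representing more than two of the five features, so the columns of W crowd into the two hidden dimensions and the Gram matrix picks up the off-diagonal interference terms the paper characterizes.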
... (truncated, 5 KB total)
Resource ID: dfc21a319f95a75d | Stable ID: OTgyYjhlYj