Chris Olah

Person

Biographical overview of Chris Olah's career trajectory from self-taught researcher to Google Brain, OpenAI, and co-founding Anthropic, focusing on his work in mechanistic interpretability, including feature visualization, circuit analysis, and sparse autoencoder research (Towards Monosemanticity 2023, Scaling Monosemanticity 2024). Documents his unconventional educational background, his contributions to science communication through the Distill journal, and the state of external debate over mechanistic interpretability as a research program.

Affiliation: Anthropic
Role: Co-founder, Head of Interpretability
Known For: Mechanistic interpretability, neural network visualization, clarity of research communication
Related Organizations: Anthropic
Related Safety Agendas: Interpretability
Related People: Dario Amodei

Quick Assessment

| Dimension | Assessment |
|---|---|
| Primary Role | Co-founder and interpretability research lead at Anthropic |
| Key Contributions | Feature visualization techniques, circuit analysis methodology, sparse autoencoder applications for interpretability, co-founding Distill journal |
| Key Publications | "Towards Monosemanticity" (2023), "Scaling Monosemanticity" (2024), "Toy Models of Superposition" (2022), "Feature Visualization" (2017), "The Building Blocks of Interpretability" (2018) |
| Institutional Affiliation | Anthropic (2021–present); previously OpenAI (2018–2021), Google Brain (2015–2018) |
| Recognition | Named to TIME's 100 Most Influential People in AI (2024); 2012 Thiel Fellow |
| Influence on AI Safety | Contributed to establishing Mechanistic Interpretability as a research direction within AI safety; applied transparency and verification approaches to Large Language Models |

Overview

Chris Olah is a Canadian machine learning researcher specializing in neural network interpretability and a co-founder of Anthropic. He is known primarily for developing and advancing the research program now called mechanistic interpretability, which aims to reverse-engineer the internal algorithms and representations of neural networks.1 His career has spanned Google Brain, OpenAI, and Anthropic, where he currently leads interpretability research.2

Olah followed an unconventional path into research: he has no undergraduate degree, left university as a teenager, and built his early reputation through independent blog posts at colah.github.io and a 2012 Thiel Fellowship.3 His blog posts on topics such as LSTM networks and neural network representations attracted significant readership in the machine learning community before he joined Google Brain in 2015.4

In 2016, Olah co-founded Distill, a peer-reviewed journal emphasizing interactive visualizations and web-native presentation of machine learning research, which operated until it entered an indefinite hiatus in July 2021.5 At Anthropic, he leads a team — which had grown to 17 researchers by April 2024 — focused on understanding the internal mechanisms of frontier AI systems including Claude.6 TIME magazine named him to its 2024 list of 100 Most Influential People in AI, describing him as "one of the pioneers of an entirely new scientific field, mechanistic interpretability."7

Background

Early Life and Education

Olah is Canadian and grew up in Toronto, where he developed an early interest in technology through participation in the local hacker community.8 As a teenager, he joined hacklab.to, a Toronto hackerspace, in 2009, and later served as a director from 2012 to 2014, teaching workshops on topics including integral transforms and LaTeX.4

He graduated from The Abelard School in Toronto in 2010 as an AP National Scholar, having completed six Advanced Placement courses.4 He briefly attended the University of Toronto but left without completing a degree — according to Wired, at approximately age 18.9 His departure was partly connected to his support for Byron Sonne, a security researcher who faced criminal charges related to legitimate security research; Olah provided court support and documentation for the "Free Byron" campaign from 2010 to 2012.3

After leaving university, Olah did not return to formal education. He engaged in a range of self-directed technical projects, including open-source 3D printing work (the ImplicitCAD project and the Toronto 3D Printers group) and DIY biology meetups.4 In July 2012, he was selected as a Thiel Fellow, receiving a $100,000 grant from the Thiel Foundation to support independent research.10 The fellowship recognized his work on 3D printing and self-directed technical exploration. Vitalik Buterin, co-founder of Ethereum, is another alumnus of The Abelard School who also received a Thiel Fellowship.11

Career at Google Brain

Olah first joined Google Brain as an intern in summer 2014, hosted by Jeff Dean, where he worked on visualizing neural network representations.4 He returned for a second internship in 2015 before transitioning to full-time roles: Research Associate from October 2015 to October 2016, then Research Scientist from October 2016 to October 2018.4

During this period he co-authored the "Inceptionism: Going Deeper into Neural Networks" blog post in June 2015, which described techniques for generating visualizations by maximizing neural network activations — work associated with what became known as DeepDream.4 He was also a co-author on the TensorFlow whitepaper published in November 2015.4 His blog posts at colah.github.io on topics including LSTM networks (2015) and attention mechanisms (2016) attracted substantial readership in the machine learning community.

Career at OpenAI

In 2018, Olah joined OpenAI as a Member of Technical Staff and founded the "Clarity team" within OpenAI's safety division, serving as its technical lead.4 In his own description, he "previously led interpretability research at OpenAI, worked at Google Brain, and co-founded Distill."2

The Clarity team, from 2018 to 2021, developed the foundational work on circuit-based interpretability that would define the field. This included the Circuits thread on Distill, which launched in March 2020, and papers including "Zoom In: An Introduction to Circuits." A 2020 CVPR presentation was explicitly attributed to "Chris Olah, OpenAI Clarity Team."12

Co-founding Anthropic

In 2021, Olah co-founded Anthropic with Dario Amodei and other former OpenAI researchers. At Anthropic, he continues to lead interpretability research, now focused on production-scale models.1

Distill Journal

Olah co-founded Distill with Shan Carter and Arvind Satyanarayan. The journal formally launched in March 2017 with institutional backing from Google, OpenAI, DeepMind, and Y Combinator Research.13 Olah served as editor-in-chief; Carter was then at Google Brain and Satyanarayan at MIT CSAIL.13

Distill operated as a peer-reviewed scientific journal with a distinctive emphasis on interactive graphics and web-native explanations, arguing that "traditional academic publishing remains focused on the PDF" despite the web's capacity for richer communication.14 Articles underwent review for both correctness and clarity of presentation.

The journal published research on neural network interpretability and visualization, attention mechanisms, optimization dynamics, and feature learning. One notable experiment was the Circuits thread, launched March 10, 2020, which invited short articles on features and circuits in neural networks, interspersed with commentary from researchers in adjacent fields — an attempt at a more continuous, faster publication format.15

On July 2, 2021, the editorial team announced an indefinite hiatus.5 The announcement cited three reasons: volunteer burnout from running the journal; structural friction that made it difficult to focus on the most exciting aspects of publishing; and a loss of confidence in their original theory of impact — they had concluded that publishing in a journal like Distill did not significantly affect how seriously institutions treat non-traditional publications.5 Papers under active review at the time were not affected, and published threads could continue to receive additions. The journal's open-source template remains publicly available.14

When Distill entered hiatus, Olah's team at Anthropic created transformer-circuits.pub as a successor venue, noting that previously the team would have submitted such work to Distill, but with Distill on hiatus they "took a page from David Ha's approach of simply creating websites for research projects."16

Mechanistic Interpretability Research

Olah's research program aims to understand neural networks by reverse-engineering their internal algorithms and representations. This approach, termed mechanistic interpretability, treats neural networks as systems that can be understood at the level of individual features and circuits — rather than solely through input-output behavior.17 In a 2022 essay, Olah described the goal as analogous to reverse-engineering a compiled binary computer program: recovering human-readable structure from a system whose internal representation was not designed for human comprehension.18

Feature Visualization

Feature visualization techniques synthesize inputs that maximally activate specific neurons or layers in a neural network. Olah's 2017 work on feature visualization established methods for generating these visualizations and interpreting what features neural networks learn. The approach involves optimizing input images to maximize activation of target neurons, revealing the visual patterns those neurons respond to.

The "Feature Visualization" (2017) paper introduced optimization-based activation maximization and methods for visualizing intermediate layers to understand hierarchical feature learning. This work involved collaboration with researchers at Google Brain including Alexander Mordvintsev and Ludwig Schubert.

Circuit Analysis

Circuit analysis extends feature visualization by tracing how features connect and process information. The 2018 paper "The Building Blocks of Interpretability" demonstrated that individual features can be identified and visualized, that connections between features form interpretable circuits, and that these circuits implement specific algorithms or computations. Co-authors included Shan Carter, Ludwig Schubert, and other Google Brain researchers.

The 2020 paper "Zoom In: An Introduction to Circuits" further developed this framework, putting forward three speculative claims: (1) Features — neural network neurons represent understandable features; (2) Circuits — connections between neurons form meaningful algorithms; (3) Universality — analogous features and circuits form across different models and tasks.19 The paper documented early-layer features such as curve detectors and edge detectors, and proposed that circuits are falsifiable: if a circuit is understood, changes to weights should produce predictable behavioral changes.19 The ideas had been previously presented as a keynote at the VISxAI workshop in 2019.19 Co-authors included Nick Cammarata and Gabriel Goh.

Superposition and Sparse Autoencoders

"Toy Models of Superposition" (2022) provided a mathematical framework for understanding a core difficulty in interpretability. The paper demonstrated that neural networks can represent more features than they have dimensions by storing features in superposition — allowing multiple features to interfere in the same neurons. Key findings included that networks learn to represent sparse features in superposition, that the number of representable features scales with sparsity, and that this explains polysemanticity (neurons responding to multiple unrelated concepts). Co-authors included Anthropic researchers Nelson Elhage, Tom Henighan, and others.

"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023) addressed the superposition problem more directly by applying sparse autoencoders — a dictionary learning technique — to decompose a one-layer transformer's MLP activations into monosemantic features.20 Where individual neurons are polysemantic (responding to multiple unrelated concepts), the paper argued that "features" — patterns in linear combinations of neuron activations — are a better unit of analysis.20 A layer with 512 neurons was decomposed into more than 4,000 features representing distinct concepts including DNA sequences, legal language, HTTP requests, Hebrew text, and nutrition statements.20 The paper also introduced the concept of "feature splitting": as the autoencoder is made larger, features split into more specific sub-features.20 The work was published on transformer-circuits.pub in October 2023, with Trenton Bricken, Adly Templeton, and Joshua Batson as core contributors alongside Olah.20

"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024) extended the sparse autoencoder approach to Claude 3 Sonnet, a production-scale large language model.21 The team trained autoencoders with approximately 1 million, 4 million, and 34 million features, discovering features corresponding to concepts such as "The Golden Gate Bridge," code bugs, bias recognition, and scam email recognition.21 Feature steering — forcing specific features to high values — was found to alter the model's demeanor, preferences, stated goals, biases, and in some cases its ability to circumvent safeguards.21 The paper also noted limitations: even the largest 34-million-feature model covered only approximately 60% of London boroughs, suggesting the full model's knowledge substantially exceeds what current sparse autoencoders can capture.21

Work at Anthropic

At Anthropic, Olah leads interpretability research with a focus on understanding frontier AI systems. By April 2024, the team had grown to 17 researchers — having hired 10 new people during 2023 alone — drawn from backgrounds including astrophysics, condensed matter physics, mathematics, and neuroscience.6 This team represented a substantial fraction of the roughly 50 full-time researchers estimated to be working on mechanistic interpretability globally at that time.6

The research program aims to:

  1. Scale interpretability to production models: Develop techniques that work on models the size of Claude rather than only small research models
  2. Connect interpretability to safety: Use understanding of model internals to detect potentially dangerous capabilities or behaviors
  3. Automate interpretability: Use AI systems to help interpret other AI systems, enabling analysis at scale
  4. Develop verification methods: Create techniques that can verify properties of AI systems through understanding their internals

Interpretability for AI Safety

The interpretability program at Anthropic aims to support safety through several approaches:

Capability detection: Identifying when models possess specific capabilities by examining internal representations and features, potentially enabling detection of dangerous capabilities before they manifest in behavior.

Behavior verification: Understanding the mechanisms behind model outputs to assess whether models are reporting their actual internal states, relevant to concerns about Deceptive Alignment.

Debugging: Using mechanistic understanding to identify and potentially modify problematic model behaviors or learned heuristics.

Monitoring: Developing methods to detect anomalous internal activations that might indicate Scheming or other concerning behaviors.

Transformer Circuits Thread

Following Distill's hiatus, Olah's team created transformer-circuits.pub as a venue for publishing mechanistic interpretability research in a similar web-native format.16 Key papers hosted on this platform include "A Mathematical Framework for Transformer Circuits," "In-context Learning and Induction Heads," "Toy Models of Superposition," "Towards Monosemanticity," and "Scaling Monosemanticity."16

Research Philosophy and Communication

Olah's research approach emphasizes several recurring themes:

Visual communication: Using diagrams, interactive visualizations, and carefully designed figures to convey technical concepts. His blog posts and papers typically include extensive visualizations. His 2015 blog post on LSTM networks became a frequently-cited reference for readers learning about recurrent architectures, combining technical explanations with interactive visualizations.3

Accessibility with technical precision: Explaining complex topics clearly while maintaining technical rigor. His blog at colah.github.io covers topics including LSTM networks, neural network representations, and attention mechanisms in this style.

Infrastructure investment: Building tools and frameworks for interpretability research, including visualization libraries and analysis frameworks.

Long-term research: Pursuing research directions over multiple years, with superposition research spanning from initial theoretical work in 2022 to scaled demonstrations in 2024.

Olah has also described the interpretability program in explicitly strategic terms, characterizing it as "deliberately targeted at trying to fill in holes in our portfolio for pessimistic scenarios" — a "high-risk, high-reward bet" that "may not succeed in time but could be a powerful tool if it does."22 He has emphasized concern about understanding model safety off-distribution as a key motivation for the mechanistic approach over correlational interpretability methods.22

Key Publications

Blog Posts (colah.github.io):

  • "Understanding LSTM Networks" (2015)
  • "Visualizing Representations: Deep Learning and Human Beings" (2015)
  • "Attention and Augmented Recurrent Neural Networks" (2016, with Shan Carter)

Research Papers:

  • "Feature Visualization" (2017, with Alexander Mordvintsev, Ludwig Schubert, and others)
  • "The Building Blocks of Interpretability" (2018, with Shan Carter, Ludwig Schubert, and others)
  • "Zoom In: An Introduction to Circuits" (2020, with Nick Cammarata, Gabriel Goh, and others) — distill.pub
  • "Toy Models of Superposition" (2022, with Nelson Elhage, Tristan Hume, Tom Henighan, and others) — transformer-circuits.pub
  • "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023, with Trenton Bricken, Adly Templeton, Joshua Batson, and others) — transformer-circuits.pub
  • "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024, with Adly Templeton, Tom Conerly, and others) — transformer-circuits.pub

Essays and Informal Notes:

  • "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases" (2022) — transformer-circuits.pub

Views on AI Safety

Olah has written and spoken about interpretability research as one component of AI safety rather than a complete solution. Positions he has articulated include:

Necessity of understanding: Deployment of powerful AI systems requires understanding their internal operations, not just observing input-output behavior. In a 2023 80,000 Hours interview, he described the central challenge as: "How is it that these models are doing things that we don't know how to do?" and explained that understanding individual neurons in principle allows researchers to "read algorithms off of the weights."23

Conditional tractability: Olah has argued that neural networks can be understood mechanistically through sustained research effort, against the view that they are inherently inscrutable — while also acknowledging that this is a bet that may not succeed in time.22 He has noted that even if full interpretability is not achievable, understanding "small slices" of model behavior might allow detection of manipulative behavior in the moment.23

Complementarity: Interpretability is framed as working alongside other safety approaches including RLHF, Scalable Oversight, and Constitutional AI.

Automation necessity: Fully understanding large models requires using AI to assist in interpretation, as human analysis alone cannot scale to billions of parameters.

Access requirements: Interpretability research on frontier models requires working with those models, a consideration that influenced the decision to conduct research at Anthropic rather than academia.

Challenges, Limitations, and External Criticism

Several challenges to interpretability research have been identified, both from within the field and by external researchers.

Internal Acknowledgments

Scaling limitations: While sparse autoencoder approaches have been applied to Claude 3 Sonnet, it remains an open question whether interpretability techniques can keep pace with capability improvements in future systems. The "Scaling Monosemanticity" paper itself noted that even its largest autoencoder captured only a fraction of the model's representations.21

Verification gaps: Understanding model internals does not automatically provide verification that models lack dangerous capabilities, as understanding is necessarily incomplete and features may be missed.

Deceptive models: Models exhibiting Deceptive Alignment might develop internal representations specifically designed to appear benign under interpretability analysis.

Resource requirements: Interpretability research on frontier models requires substantial computational resources and access to those models.

External Criticism

The mechanistic interpretability research program has attracted substantive criticism from researchers outside the program.

Scalability concerns: Critics have argued that mechanistic interpretability has "failed to scale to challenging problems, and might always fail to scale" because current methods depend on human-generated mechanistic hypotheses — sidestepping the hard problem of automated hypothesis generation. This critique holds that most work in the field relies on "intuition-based or weak ad-hoc evaluation."24

Safety relevance: Stanford NLP professor Christopher Potts has argued that interpretability research has not yet come close to making AI meaningfully safer in practice, observing that "gains in safety seem mostly to stem from behavioral evaluations, heuristic adjustments to training regimes, and robust software system design." As a concrete example, he notes that the GPT-4o sycophancy problem "was detected behaviorally and fixed by improving post-training — no circuit was discovered, no particular weights or activations were held responsible, and no mechanistic analysis sounded a warning bell or informed the solutions."25

Theory of impact: Some researchers have questioned whether the theory of impact behind mechanistic interpretability is well-specified, arguing that even a complete solution to the superposition problem would not address "enumerative safety" for large-scale models.26

Complexity mismatch: A broader critique holds that the reductionist framing of "mechanistic" interpretability is misapplied to complex systems, which exhibit emergent properties that cannot easily be understood by tracing fundamental interactions. Proponents of this view note that even Google DeepMind deprioritized work on sparse autoencoders in early 2025, around the same time Anthropic CEO Dario Amodei published an essay advocating for greater focus on the field — indicating substantive disagreement among leading labs.27

Philosophical limitations: A 2024 peer-reviewed philosophical analysis identified conceptual limitations in mechanistic interpretability, noting that "obvious structural components like neurons, attention heads, and parameters often fail to map cleanly onto functionally meaningful roles," and that mechanistic explanation as an approach has critics who favor causal-interventionist but non-mechanistic alternatives.28

Olah has acknowledged skeptical arguments in public discussions, describing the research program as a high-risk bet while maintaining that even partial success could be valuable for AI safety.22 23

Recognition

Olah has received several forms of recognition for his work:

  • TIME 100 Most Influential People in AI (2024): Named to TIME magazine's list, described as a pioneer of mechanistic interpretability as a scientific field.7
  • Thiel Fellowship (2012): Received a $100,000 grant from the Thiel Foundation supporting independent research outside of university.10
  • AP National Scholar (2010): Recognized for completing six Advanced Placement courses upon high school graduation.4

No academic appointments, ACM prizes, or MIT Technology Review Innovators Under 35 recognition were identified in available sources.

Influence on the Field

Interpretability research has grown as a subfield within AI safety and machine learning since the mid-2010s:

Research groups: Multiple organizations now have dedicated interpretability teams, including Anthropic, OpenAI, Google DeepMind, and others. A dedicated mechanistic interpretability workshop was held at NeurIPS 2023, reflecting the subfield's growth.29

Methods adoption: Feature visualization and circuit analysis techniques developed through Olah's work are used by researchers studying neural networks across domains.

Communication practices: Distill's emphasis on interactive visualizations and web-native explanations influenced how some machine learning researchers approach research communication, though the journal itself ceased accepting new submissions in 2021.

Community formation: The Circuits thread on Distill (launched 2020) and its successor transformer-circuits.pub served as organizing venues for the mechanistic interpretability research community, providing a shared publication venue and common research agenda.

The extent to which current interpretability techniques will scale to future AI systems, and whether they will provide actionable safety benefits, remain actively debated questions within the research community.

Footnotes

  1. "Chris Olah: The 100 Most Influential People in AI 2024." — TIME Magazine. "Chris Olah: The 100 Most Influential People in AI 2024." September 2024. 2

  2. "About Me." — Chris Olah. "About Me." colah.github.io. Accessed 2024. 2

  3. "Chris Olah on Working at Top AI Labs Without an Undergrad Degree." — 80,000 Hours. "Chris Olah on Working at Top AI Labs Without an Undergrad Degree." 80000hours.org. Episode 108. 2 3

  4. "Chris Olah." — Grokipedia. "Chris Olah." 2024. 2 3 4 5 6 7 8 9 10

  5. "Distill Hiatus." — Distill Editorial Team. "Distill Hiatus." distill.pub. July 2, 2021. 2 3

  6. "Circuits Updates — April 2024." — Anthropic Interpretability Team. "Circuits Updates — April 2024." transformer-circuits.pub. April 2024. 2 3

  7. "Chris Olah: The 100 Most Influential People in AI 2024." — TIME Magazine. "Chris Olah: The 100 Most Influential People in AI 2024." September 2024. 2

  8. "Chris Olah." — Wikipedia contributors. "Chris Olah." Wikipedia. 2024.

  9. "Chris Olah." — Wikipedia contributors. "Chris Olah." Wikipedia. 2024. Citing Wired.

  10. "Chris Olah." — Grokipedia. "Chris Olah." 2024. Thiel Fellowship, July 2012. 2

  11. "Tiny High School in Toronto Produces Two Thiel Fellowship Winners." — The Abelard School. "Tiny High School in Toronto Produces Two Thiel Fellowship Winners." October 3, 2014.

  12. "An Introduction to Circuits in CNNs." — Chris Olah. "An Introduction to Circuits in CNNs." CVPR 2020 slide deck. Attributed to "Chris Olah, OpenAI Clarity Team."

  13. ["Distill (journal)."](https://en.wikipedia.org/wiki/Distill_(journal) — Wikipedia contributors. "Distill (journal)." Wikipedia. 2024. 2

  14. Citation rc-1270 (data unavailable — rebuild with wiki-server access) 2

  15. "Thread: Circuits." — Distill / Chris Olah et al. "Thread: Circuits." distill.pub. March 10, 2020.

  16. "Transformer Circuits Thread." — Anthropic Interpretability Team. "Transformer Circuits Thread." transformer-circuits.pub. 2021–present. 2 3

  17. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." — Chris Olah. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." transformer-circuits.pub. June 27, 2022.

  18. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." — Chris Olah. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." transformer-circuits.pub. June 27, 2022.

  19. "Zoom In: An Introduction to Circuits." — Chris Olah et al. "Zoom In: An Introduction to Circuits." Distill. March 10, 2020. 2 3

  20. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." — Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, et al. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." transformer-circuits.pub. October 4, 2023. 2 3 4 5

  21. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." — Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." transformer-circuits.pub. May 24, 2024. 2 3 4 5

  22. AI Alignment Forum profile and comments. — Chris Olah. AI Alignment Forum profile and comments. alignmentforum.org. 2022–2024. 2 3 4

  23. "Chris Olah on What the Hell Is Going On Inside Neural Networks." — 80,000 Hours. "Chris Olah on What the Hell Is Going On Inside Neural Networks." 80000hours.org. Episode 107. 2023. 2 3

  24. "EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety." — LessWrong. "EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety." lesswrong.com. 2023.

  25. "Assessing Skeptical Views of Interpretability Research." — Christopher Potts. "Assessing Skeptical Views of Interpretability Research." web.stanford.edu. August 2025.

  26. "Against Almost Every Theory of Impact of Interpretability." — LessWrong. "Against Almost Every Theory of Impact of Interpretability." lesswrong.com. 2023.

  27. "The Misguided Quest for Mechanistic AI Interpretability." — Dan Hendrycks and Laura Hiscott. "The Misguided Quest for Mechanistic AI Interpretability." AI Frontiers. 2024.

  28. "Mechanistic Interpretability Needs Philosophy." — PhilArchive / arXiv. "Mechanistic Interpretability Needs Philosophy." 2024.

  29. "NeurIPS 2023 Mechanistic Interpretability Workshop." — NeurIPS 2023 Workshop Organizers. "NeurIPS 2023 Mechanistic Interpretability Workshop." December 2023.


Structured Data

Employed By: Anthropic (as of Jan 2021); previously OpenAI (2018) and Google Brain (2015)
Role / Title: Co-founder, Interpretability (as of Jan 2021)
Education: Attended University of Toronto (did not complete degree); Thiel Fellow
Notable For: Pioneer of neural network interpretability and visualization; co-founder of Anthropic; creator of Distill.pub and the Circuits thread at Transformer Circuits
Social Media: @ch402
GitHub: https://github.com/colah
Google Scholar: https://scholar.google.com/citations?user=vKAKE1gAAAAJ
Website: https://colah.github.io

Career History

| Organization | Title | Start | End |
|---|---|---|---|
| Google Brain | Research Scientist | 2015 | 2018 |
| OpenAI | Research Scientist | 2018 | Jan 2021 |
| Anthropic | Co-founder; Research Lead, Mechanistic Interpretability | Jan 2021 | present |
