Chris Olah

Person

Biographical overview of Chris Olah's career trajectory from self-taught researcher to Google Brain, OpenAI, and co-founding Anthropic, focusing on his work in mechanistic interpretability, including feature visualization, circuit analysis, and sparse autoencoder research (Towards Monosemanticity 2023, Scaling Monosemanticity 2024). Documents his unconventional educational background, his contributions to science communication through the Distill journal, and the state of external debate over mechanistic interpretability as a research program.

Affiliation: Anthropic
Role: Co-founder, Head of Interpretability
Known For: Mechanistic interpretability, neural network visualization, clarity of research communication
Related Organizations: Anthropic
Related Safety Agendas: Interpretability
Related People: Dario Amodei

Quick Assessment

| Dimension | Assessment |
|---|---|
| Primary Role | Co-founder and interpretability research lead at Anthropic |
| Key Contributions | Feature visualization techniques, circuit analysis methodology, sparse autoencoder applications for interpretability, co-founding Distill journal |
| Key Publications | "Towards Monosemanticity" (2023), "Scaling Monosemanticity" (2024), "Toy Models of Superposition" (2022), "Feature Visualization" (2017), "The Building Blocks of Interpretability" (2018) |
| Institutional Affiliation | Anthropic (2021–present); previously OpenAI (2018–2021), Google Brain (2015–2018) |
| Recognition | Named to TIME's 100 Most Influential People in AI (2024); 2012 Thiel Fellow |
| Influence on AI Safety | Contributed to establishing Mechanistic Interpretability as a research direction within AI safety; applied transparency and verification approaches to Large Language Models |

Overview

Chris Olah is a Canadian machine learning researcher specializing in neural network interpretability and a co-founder of Anthropic. He is known primarily for developing and advancing the research program now called mechanistic interpretability, which aims to reverse-engineer the internal algorithms and representations of neural networks.1 His career has spanned Google Brain, OpenAI, and Anthropic, where he currently leads interpretability research.2

Olah followed an unconventional path into research: he has no undergraduate degree, left university as a teenager, and built his early reputation through independent blog posts at colah.github.io and a 2012 Thiel Fellowship.3 His blog posts on topics such as LSTM networks and neural network representations attracted significant readership in the machine learning community before he joined Google Brain in 2015.4

In 2016, Olah co-founded Distill, a peer-reviewed journal emphasizing interactive visualizations and web-native presentation of machine learning research, which operated until it entered an indefinite hiatus in July 2021.5 At Anthropic, he leads a team — which had grown to 17 researchers by April 2024 — focused on understanding the internal mechanisms of frontier AI systems including Claude.6 TIME magazine named him to its 2024 list of 100 Most Influential People in AI, describing him as "one of the pioneers of an entirely new scientific field, mechanistic interpretability."7

Background

Early Life and Education

Olah is Canadian and grew up in Toronto, where he developed an early interest in technology through participation in the local hacker community.8 As a teenager, he joined hacklab.to, a Toronto hackerspace, in 2009, and later served as a director from 2012 to 2014, teaching workshops on topics including integral transforms and LaTeX.4

He graduated from The Abelard School in Toronto in 2010 as an AP National Scholar, having completed six Advanced Placement courses.4 He briefly attended the University of Toronto but left without completing a degree — according to Wired, at approximately age 18.9 His departure was partly connected to his support for Byron Sonne, a security researcher who faced criminal charges related to legitimate security research; Olah provided court support and documentation for the "Free Byron" campaign from 2010 to 2012.3

After leaving university, Olah did not return to formal education. He engaged in a range of self-directed technical projects, including open-source 3D printing work (the ImplicitCAD project and the Toronto 3D Printers group) and DIY biology meetups.4 In July 2012, he was selected as a Thiel Fellow, receiving a $100,000 grant from the Thiel Foundation to support independent research.10 The fellowship recognized his work on 3D printing and self-directed technical exploration. Vitalik Buterin, co-founder of Ethereum, is another alumnus of The Abelard School who also received a Thiel Fellowship.11

Career at Google Brain

Olah first joined Google Brain as an intern in summer 2014, hosted by Jeff Dean, where he worked on visualizing neural network representations.4 He returned for a second internship in 2015 before transitioning to full-time roles: Research Associate from October 2015 to October 2016, then Research Scientist from October 2016 to October 2018.4

During this period he co-authored the "Inceptionism: Going Deeper into Neural Networks" blog post in June 2015, which described techniques for generating visualizations by maximizing neural network activations — work associated with what became known as DeepDream.4 He was also a co-author on the TensorFlow whitepaper published in November 2015.4 His blog posts at colah.github.io on topics including LSTM networks (2015) and attention mechanisms (2016) attracted substantial readership in the machine learning community.

Career at OpenAI

In 2018, Olah joined OpenAI as a Member of Technical Staff and founded the "Clarity team" within OpenAI's safety division, serving as its technical lead.4 In his own description, he "previously led interpretability research at OpenAI, worked at Google Brain, and co-founded Distill."2

The Clarity team, from 2018 to 2021, developed the foundational work on circuit-based interpretability that would define the field. This included the Circuits thread on Distill, which launched in March 2020, and papers including "Zoom In: An Introduction to Circuits." A 2020 CVPR presentation was explicitly attributed to "Chris Olah, OpenAI Clarity Team."12

Co-founding Anthropic

In 2021, Olah co-founded Anthropic with Dario Amodei and other former OpenAI researchers. At Anthropic, he continues to lead interpretability research, now focused on production-scale models.1

Distill Journal

Olah co-founded Distill with Shan Carter and Arvind Satyanarayan. The journal formally launched in March 2017 with institutional backing from Google, OpenAI, DeepMind, and Y Combinator Research.13 Olah served as editor-in-chief; Carter was then at Google Brain and Satyanarayan at MIT CSAIL.13

Distill operated as a peer-reviewed scientific journal with a distinctive emphasis on interactive graphics and web-native explanations, arguing that "traditional academic publishing remains focused on the PDF" despite the web's capacity for richer communication.14 Articles underwent review for both correctness and clarity of presentation.

The journal published research on neural network interpretability and visualization, attention mechanisms, optimization dynamics, and feature learning. One notable experiment was the Circuits thread, launched March 10, 2020, which invited short articles on features and circuits in neural networks, interspersed with commentary from researchers in adjacent fields — an attempt at a more continuous, faster publication format.15

On July 2, 2021, the editorial team announced an indefinite hiatus.5 The announcement cited three reasons: volunteer burnout from running the journal; structural friction that made it difficult to focus on the most exciting aspects of publishing; and a loss of confidence in their original theory of impact — they had concluded that publishing in a journal like Distill did not significantly affect how seriously institutions treat non-traditional publications.5 Papers under active review at the time were not affected, and published threads could continue to receive additions. The journal's open-source template remains publicly available.14

When Distill entered hiatus, Olah's team at Anthropic created transformer-circuits.pub as a successor venue, noting that previously the team would have submitted such work to Distill, but with Distill on hiatus they "took a page from David Ha's approach of simply creating websites for research projects."16

Mechanistic Interpretability Research

Olah's research program aims to understand neural networks by reverse-engineering their internal algorithms and representations. This approach, termed mechanistic interpretability, treats neural networks as systems that can be understood at the level of individual features and circuits — rather than solely through input-output behavior.17 In a 2022 essay, Olah described the goal as analogous to reverse-engineering a compiled binary computer program: recovering human-readable structure from a system whose internal representation was not designed for human comprehension.18

Feature Visualization

Feature visualization techniques synthesize inputs that maximally activate specific neurons or layers in a neural network. Olah's 2017 work on feature visualization established methods for generating these visualizations and interpreting what features neural networks learn. The approach involves optimizing input images to maximize activation of target neurons, revealing the visual patterns those neurons respond to.

The "Feature Visualization" (2017) paper introduced optimization-based activation maximization and methods for visualizing intermediate layers to understand hierarchical feature learning. This work involved collaboration with researchers at Google Brain including Alexander Mordvintsev and Ludwig Schubert.

Circuit Analysis

Circuit analysis extends feature visualization by tracing how features connect and process information. The 2018 paper "The Building Blocks of Interpretability" demonstrated that individual features can be identified and visualized, that connections between features form interpretable circuits, and that these circuits implement specific algorithms or computations. Co-authors included Shan Carter, Ludwig Schubert, and other Google Brain researchers.

The 2020 paper "Zoom In: An Introduction to Circuits" further developed this framework, putting forward three speculative claims: (1) Features — neural network neurons represent understandable features; (2) Circuits — connections between neurons form meaningful algorithms; (3) Universality — analogous features and circuits form across different models and tasks.19 The paper documented early-layer features such as curve detectors and edge detectors, and proposed that circuits are falsifiable: if a circuit is understood, changes to weights should produce predictable behavioral changes.19 The ideas had been previously presented as a keynote at the VISxAI workshop in 2019.19 Co-authors included Nick Cammarata and Gabriel Goh.

Superposition and Sparse Autoencoders

"Toy Models of Superposition" (2022) provided a mathematical framework for understanding a core difficulty in interpretability. The paper demonstrated that neural networks can represent more features than they have dimensions by storing features in superposition — allowing multiple features to interfere in the same neurons. Key findings included that networks learn to represent sparse features in superposition, that the number of representable features scales with sparsity, and that this explains polysemanticity (neurons responding to multiple unrelated concepts). Co-authors included Anthropic researchers Nelson Elhage, Tom Henighan, and others.

"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023) addressed the superposition problem more directly by applying sparse autoencoders — a dictionary learning technique — to decompose a one-layer transformer's MLP activations into monosemantic features.20 Where individual neurons are polysemantic (responding to multiple unrelated concepts), the paper argued that "features" — patterns in linear combinations of neuron activations — are a better unit of analysis.20 A layer with 512 neurons was decomposed into more than 4,000 features representing distinct concepts including DNA sequences, legal language, HTTP requests, Hebrew text, and nutrition statements.20 The paper also introduced the concept of "feature splitting": as the autoencoder is made larger, features split into more specific sub-features.20 The work was published on transformer-circuits.pub in October 2023, with Trenton Bricken, Adly Templeton, and Joshua Batson as core contributors alongside Olah.20

"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024) extended the sparse autoencoder approach to Claude 3 Sonnet, a production-scale large language model.21 The team trained autoencoders with approximately 1 million, 4 million, and 34 million features, discovering features corresponding to concepts such as "The Golden Gate Bridge," code bugs, bias recognition, and scam email recognition.21 Feature steering — forcing specific features to high values — was found to alter the model's demeanor, preferences, stated goals, biases, and in some cases its ability to circumvent safeguards.21 The paper also noted limitations: even the largest 34-million-feature model covered only approximately 60% of London boroughs, suggesting the full model's knowledge substantially exceeds what current sparse autoencoders can capture.21

Work at Anthropic

At Anthropic, Olah leads interpretability research with a focus on understanding frontier AI systems. By April 2024, the team had grown to 17 researchers — having hired 10 new people during 2023 alone — drawn from backgrounds including astrophysics, condensed matter physics, mathematics, and neuroscience.6 This team represented a substantial fraction of the roughly 50 full-time researchers estimated to be working on mechanistic interpretability globally at that time.6

The research program aims to:

  1. Scale interpretability to production models: Develop techniques that work on models the size of Claude rather than only small research models
  2. Connect interpretability to safety: Use understanding of model internals to detect potentially dangerous capabilities or behaviors
  3. Automate interpretability: Use AI systems to help interpret other AI systems, enabling analysis at scale
  4. Develop verification methods: Create techniques that can verify properties of AI systems through understanding their internals

Interpretability for AI Safety

The interpretability program at Anthropic aims to support safety through several approaches:

Capability detection: Identifying when models possess specific capabilities by examining internal representations and features, potentially enabling detection of dangerous capabilities before they manifest in behavior.

Behavior verification: Understanding the mechanisms behind model outputs to assess whether models are reporting their actual internal states, relevant to concerns about Deceptive Alignment.

Debugging: Using mechanistic understanding to identify and potentially modify problematic model behaviors or learned heuristics.

Monitoring: Developing methods to detect anomalous internal activations that might indicate Scheming or other concerning behaviors.

Transformer Circuits Thread

Following Distill's hiatus, Olah's team created transformer-circuits.pub as a venue for publishing mechanistic interpretability research in a similar web-native format.16 Key papers hosted on this platform include "A Mathematical Framework for Transformer Circuits," "In-context Learning and Induction Heads," "Toy Models of Superposition," "Towards Monosemanticity," and "Scaling Monosemanticity."16

Research Philosophy and Communication

Olah's research approach emphasizes several recurring themes:

Visual communication: Using diagrams, interactive visualizations, and carefully designed figures to convey technical concepts. His blog posts and papers typically include extensive visualizations. His 2015 blog post on LSTM networks became a frequently-cited reference for readers learning about recurrent architectures, combining technical explanations with interactive visualizations.3

Accessibility with technical precision: Explaining complex topics clearly while maintaining technical rigor. His blog at colah.github.io covers topics including LSTM networks, neural network representations, and attention mechanisms in this style.

Infrastructure investment: Building tools and frameworks for interpretability research, including visualization libraries and analysis frameworks.

Long-term research: Pursuing research directions over multiple years, with superposition research spanning from initial theoretical work in 2022 to scaled demonstrations in 2024.

Olah has also described the interpretability program in explicitly strategic terms, characterizing it as "deliberately targeted at trying to fill in holes in our portfolio for pessimistic scenarios" — a "high-risk, high-reward bet" that "may not succeed in time but could be a powerful tool if it does."22 He has emphasized concern about understanding model safety off-distribution as a key motivation for the mechanistic approach over correlational interpretability methods.22

Key Publications

Blog Posts (colah.github.io):

  • "Understanding LSTM Networks" (2015)
  • "Visualizing Representations: Deep Learning and Human Beings" (2015)
  • "Attention and Augmented Recurrent Neural Networks" (2016, with Shan Carter)

Research Papers:

  • "Feature Visualization" (2017, with Alexander Mordvintsev, Ludwig Schubert, and others)
  • "The Building Blocks of Interpretability" (2018, with Shan Carter, Ludwig Schubert, and others)
  • "Zoom In: An Introduction to Circuits" (2020, with Nick Cammarata, Gabriel Goh, and others) — distill.pub
  • "Toy Models of Superposition" (2022, with Nelson Elhage, Tristan Hume, Tom Henighan, and others) — transformer-circuits.pub
  • "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023, with Trenton Bricken, Adly Templeton, Joshua Batson, and others) — transformer-circuits.pub
  • "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024, with Adly Templeton, Tom Conerly, and others) — transformer-circuits.pub

Essays and Informal Notes:

  • "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases" (2022) — transformer-circuits.pub

Views on AI Safety

Olah has written and spoken about interpretability research as one component of AI safety rather than a complete solution. Positions he has articulated include:

Necessity of understanding: Deployment of powerful AI systems requires understanding their internal operations, not just observing input-output behavior. In a 2023 80,000 Hours interview, he described the central challenge as: "How is it that these models are doing things that we don't know how to do?" and explained that understanding individual neurons in principle allows researchers to "read algorithms off of the weights."23

Conditional tractability: Olah has argued that neural networks can be understood mechanistically through sustained research effort, against the view that they are inherently inscrutable — while also acknowledging that this is a bet that may not succeed in time.22 He has noted that even if full interpretability is not achievable, understanding "small slices" of model behavior might allow detection of manipulative behavior in the moment.23

Complementarity: Interpretability is framed as working alongside other safety approaches including RLHF, Scalable Oversight, and Constitutional AI.

Automation necessity: Fully understanding large models requires using AI to assist in interpretation, as human analysis alone cannot scale to billions of parameters.

Access requirements: Interpretability research on frontier models requires working with those models, a consideration that influenced the decision to conduct research at Anthropic rather than academia.

Challenges, Limitations, and External Criticism

Several challenges to interpretability research have been identified, both from within the field and by external researchers.

Internal Acknowledgments

Scaling limitations: While sparse autoencoder approaches have been applied to Claude 3 Sonnet, it remains an open question whether interpretability techniques can keep pace with capability improvements in future systems. The "Scaling Monosemanticity" paper itself noted that even its largest autoencoder captured only a fraction of the model's representations.21

Verification gaps: Understanding model internals does not automatically provide verification that models lack dangerous capabilities, as understanding is necessarily incomplete and features may be missed.

Deceptive models: Models exhibiting Deceptive Alignment might develop internal representations specifically designed to appear benign under interpretability analysis.

Resource requirements: Interpretability research on frontier models requires substantial computational resources and access to those models.

External Criticism

The mechanistic interpretability research program has attracted substantive criticism from researchers outside the program.

Scalability concerns: Critics have argued that mechanistic interpretability has "failed to scale to challenging problems, and might always fail to scale" because current methods depend on human-generated mechanistic hypotheses — sidestepping the hard problem of automated hypothesis generation. This critique holds that most work in the field relies on "intuition-based or weak ad-hoc evaluation."24

Safety relevance: Stanford NLP professor Christopher Potts has argued that interpretability research has not yet come close to making AI meaningfully safer in practice, observing that "gains in safety seem mostly to stem from behavioral evaluations, heuristic adjustments to training regimes, and robust software system design." As a concrete example, he notes that the GPT-4o sycophancy problem "was detected behaviorally and fixed by improving post-training — no circuit was discovered, no particular weights or activations were held responsible, and no mechanistic analysis sounded a warning bell or informed the solutions."25

Theory of impact: Some researchers have questioned whether the theory of impact behind mechanistic interpretability is well-specified, arguing that even a complete solution to the superposition problem would not address "enumerative safety" for large-scale models.26

Complexity mismatch: A broader critique holds that the reductionist framing of "mechanistic" interpretability is misapplied to complex systems, which exhibit emergent properties that cannot easily be understood by tracing fundamental interactions. Proponents of this view note that even Google DeepMind deprioritized work on sparse autoencoders in early 2025, around the same time Anthropic CEO Dario Amodei published an essay advocating for greater focus on the field — indicating substantive disagreement among leading labs.27

Philosophical limitations: A 2024 peer-reviewed philosophical analysis identified conceptual limitations in mechanistic interpretability, noting that "obvious structural components like neurons, attention heads, and parameters often fail to map cleanly onto functionally meaningful roles," and that mechanistic explanation as an approach has critics who favor causal-interventionist but non-mechanistic alternatives.28

Olah has acknowledged skeptical arguments in public discussions, describing the research program as a high-risk bet while maintaining that even partial success could be valuable for AI safety.22 23

Recognition

Olah has received several forms of recognition for his work:

  • TIME 100 Most Influential People in AI (2024): Named to TIME magazine's list, described as a pioneer of mechanistic interpretability as a scientific field.7
  • Thiel Fellowship (2012): Received a $100,000 grant from the Thiel Foundation supporting independent research outside of university.10
  • AP National Scholar (2010): Recognized for completing six Advanced Placement courses upon high school graduation.4

No academic appointments, ACM prizes, or MIT Technology Review Innovators Under 35 recognition were identified in available sources.

Influence on the Field

Interpretability research has grown as a subfield within AI safety and machine learning since the mid-2010s:

Research groups: Multiple organizations now have dedicated interpretability teams, including Anthropic, OpenAI, Google DeepMind, and others. A dedicated mechanistic interpretability workshop was held at NeurIPS 2023, reflecting the subfield's growth.29

Methods adoption: Feature visualization and circuit analysis techniques developed through Olah's work are used by researchers studying neural networks across domains.

Communication practices: Distill's emphasis on interactive visualizations and web-native explanations influenced how some machine learning researchers approach research communication, though the journal itself ceased accepting new submissions in 2021.

Community formation: The Circuits thread on Distill (launched 2020) and its successor transformer-circuits.pub served as organizing venues for the mechanistic interpretability research community, providing a shared publication venue and common research agenda.

The extent to which current interpretability techniques will scale to future AI systems, and whether they will provide actionable safety benefits, remain actively debated questions within the research community.

Footnotes

  1. "Chris Olah: The 100 Most Influential People in AI 2024." — TIME Magazine. "Chris Olah: The 100 Most Influential People in AI 2024." September 2024. 2

  2. "About Me." — Chris Olah. "About Me." colah.github.io. Accessed 2024. 2

  3. "Chris Olah on Working at Top AI Labs Without an Undergrad Degree." — 80,000 Hours. "Chris Olah on Working at Top AI Labs Without an Undergrad Degree." 80000hours.org. Episode 108. 2 3

  4. "Chris Olah." — Grokipedia. "Chris Olah." 2024. 2 3 4 5 6 7 8 9 10

  5. "Distill Hiatus." — Distill Editorial Team. "Distill Hiatus." distill.pub. July 2, 2021. 2 3

  6. "Circuits Updates — April 2024." — Anthropic Interpretability Team. "Circuits Updates — April 2024." transformer-circuits.pub. April 2024. 2 3

  7. "Chris Olah: The 100 Most Influential People in AI 2024." — TIME Magazine. "Chris Olah: The 100 Most Influential People in AI 2024." September 2024. 2

  8. "Chris Olah." — Wikipedia contributors. "Chris Olah." Wikipedia. 2024.

  9. "Chris Olah." — Wikipedia contributors. "Chris Olah." Wikipedia. 2024. Citing Wired.

  10. "Chris Olah." — Grokipedia. "Chris Olah." 2024. Thiel Fellowship, July 2012. 2

  11. "Tiny High School in Toronto Produces Two Thiel Fellowship Winners." — The Abelard School. "Tiny High School in Toronto Produces Two Thiel Fellowship Winners." October 3, 2014.

  12. "An Introduction to Circuits in CNNs." — Chris Olah. "An Introduction to Circuits in CNNs." CVPR 2020 slide deck. Attributed to "Chris Olah, OpenAI Clarity Team."

  13. ["Distill (journal)."](https://en.wikipedia.org/wiki/Distill_(journal) — Wikipedia contributors. "Distill (journal)." Wikipedia. 2024. 2

  14. Citation rc-1270 (data unavailable — rebuild with wiki-server access) 2

  15. "Thread: Circuits." — Distill / Chris Olah et al. "Thread: Circuits." distill.pub. March 10, 2020.

  16. "Transformer Circuits Thread." — Anthropic Interpretability Team. "Transformer Circuits Thread." transformer-circuits.pub. 2021–present. 2 3

  17. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." — Chris Olah. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." transformer-circuits.pub. June 27, 2022.

  18. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." — Chris Olah. "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases." transformer-circuits.pub. June 27, 2022.

  19. "Zoom In: An Introduction to Circuits." — Chris Olah et al. "Zoom In: An Introduction to Circuits." Distill. March 10, 2020. 2 3

  20. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." — Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, et al. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." transformer-circuits.pub. October 4, 2023. 2 3 4 5

  21. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." — Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." transformer-circuits.pub. May 24, 2024. 2 3 4 5

  22. AI Alignment Forum profile and comments. — Chris Olah. AI Alignment Forum profile and comments. alignmentforum.org. 2022–2024. 2 3 4

  23. "Chris Olah on What the Hell Is Going On Inside Neural Networks." — 80,000 Hours. "Chris Olah on What the Hell Is Going On Inside Neural Networks." 80000hours.org. Episode 107. 2023. 2 3

  24. "EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety." — LessWrong. "EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety." lesswrong.com. 2023.

  25. "Assessing Skeptical Views of Interpretability Research." — Christopher Potts. "Assessing Skeptical Views of Interpretability Research." web.stanford.edu. August 2025.

  26. "Against Almost Every Theory of Impact of Interpretability." — LessWrong. "Against Almost Every Theory of Impact of Interpretability." lesswrong.com. 2023.

  27. "The Misguided Quest for Mechanistic AI Interpretability." — Dan Hendrycks and Laura Hiscott. "The Misguided Quest for Mechanistic AI Interpretability." AI Frontiers. 2024.

  28. "Mechanistic Interpretability Needs Philosophy." — PhilArchive / arXiv. "Mechanistic Interpretability Needs Philosophy." 2024.

  29. "NeurIPS 2023 Mechanistic Interpretability Workshop." — NeurIPS 2023 Workshop Organizers. "NeurIPS 2023 Mechanistic Interpretability Workshop." December 2023.


Structured Data

Employed By: Anthropic (as of Jan 2021); previously OpenAI (2018) and Google Brain (2015)
Role / Title: Co-founder, Interpretability (as of Jan 2021)
Education: Attended University of Toronto (did not complete degree); Thiel Fellow
Notable For: Pioneer of neural network interpretability and visualization; co-founder of Anthropic; creator of Distill.pub and the Circuits thread at Transformer Circuits
Social Media: @ch402
GitHub: https://github.com/colah
Google Scholar: https://scholar.google.com/citations?user=vKAKE1gAAAAJ
Website: https://colah.github.io

Career History

| Organization | Title | Start | End |
|---|---|---|---|
| Google Brain | Research Scientist | 2015 | 2018 |
| OpenAI | Research Scientist | 2018 | Jan 2021 |
| Anthropic | Co-founder; Research Lead, Mechanistic Interpretability | Jan 2021 | present |
