Neel Nanda
Summary
Overview of Neel Nanda's contributions to mechanistic interpretability, including the TransformerLens library and research on transformer circuits. Covers his educational content and role in making interpretability research more accessible to newcomers in the field.
Background
Neel Nanda is a mechanistic interpretability researcher at Google DeepMind who focuses on reverse-engineering neural networks to understand how they implement algorithms. He studied Mathematics at Trinity College, Cambridge, and previously worked at Anthropic before joining DeepMind's alignment team.
Nanda maintains an active presence in the AI alignment research community through LessWrong and the Alignment Forum, where he publishes tutorials and research updates.
Research Contributions
TransformerLens Library
Nanda created TransformerLens, an open-source Python library for interpretability research on transformer models.1 The library provides:
Programmatic access to model activations at each layer
Hook functions for intervention experiments
Integration with pretrained models from Hugging Face
Visualization utilities for attention patterns
The library is hosted on GitHub and documented at transformerlensorg.github.io/TransformerLens.2
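A minimal usage sketch of these features follows (based on the library's documented interface; exact hook names can vary between versions, so treat this as illustrative rather than canonical):

```python
# Minimal TransformerLens sketch: load a pretrained model, cache activations,
# and intervene on an activation via a hook. Hook names follow the
# "blocks.<layer>.<module>.hook_<name>" convention; exact names may differ by version.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")    # weights pulled from Hugging Face
tokens = model.to_tokens("Interpretability research reverse-engineers neural networks")

# Programmatic access to activations at every layer.
logits, cache = model.run_with_cache(tokens)
attn_pattern = cache["blocks.0.attn.hook_pattern"]   # layer-0 attention pattern
print(attn_pattern.shape)                            # (batch, heads, query_pos, key_pos)

# Hook-based intervention: zero-ablate the layer-0 attention head outputs.
def zero_ablate(activation, hook):
    return torch.zeros_like(activation)

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.0.attn.hook_z", zero_ablate)],
)
```

Comparing `ablated_logits` against the original `logits` is the basic pattern behind many intervention experiments built on the library.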
Transformer Circuits Research
Nanda co-authored "A Mathematical Framework for Transformer Circuits" (2021), which analyzed how transformer language models implement interpretable algorithms.3 The research:
Identified "induction heads" as circuits that enable in-context learning
Demonstrated that attention mechanisms compose to perform multi-step reasoning
Provided mathematical descriptions of how models track positional and semantic information
The paper built on earlier work from Anthropic's interpretability team examining circuits in vision models.
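One common diagnostic for induction heads, sketched below with TransformerLens, is to feed the model a repeated random token sequence and measure how strongly each head attends from a token back to the position just after that token's previous occurrence. This is an illustrative reconstruction of the idea, not the paper's code:

```python
# Illustrative induction-head diagnostic: repeat a random token sequence and
# measure attention paid seq_len - 1 positions back, which for second-half
# queries is the token that followed the previous occurrence of the current token.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len = 50
rand = torch.randint(100, model.cfg.d_vocab, (1, seq_len))
tokens = torch.cat([rand, rand], dim=1)               # [x_1 ... x_n x_1 ... x_n]

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0]   # (heads, query, key)
    # Diagonal seq_len - 1 below the main diagonal: query t attends to key t - (seq_len - 1).
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    induction_score = stripe[:, 1:].mean(dim=-1)               # rough per-head score
    print(layer, induction_score.tolist())
```

Heads with consistently high scores on this stripe are candidate induction heads; confirming the mechanism still requires the kind of causal intervention described elsewhere on this page.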
Additional Research Areas
Nanda has published work on:
Indirect Object Identification: Analyzing how language models parse syntactic relationships in sentences
Grokking: Studying the phase transitions that occur when models suddenly generalize during training
Modular addition circuits: Reverse-engineering the algorithms small transformers learn for arithmetic tasks
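As a concrete illustration of the modular addition setup, the task can be constructed as below. The modulus p = 113 matches the published experiments, but the exact tokenization here is an assumption for illustration:

```python
# Illustrative modular-addition task setup used in grokking-style experiments:
# the model sees (a, b, "=") and must predict (a + b) mod p.
import torch

p = 113
a = torch.arange(p).repeat_interleave(p)        # 0,0,...,1,1,...
b = torch.arange(p).repeat(p)                   # 0,1,...,0,1,...
equals = torch.full_like(a, p)                  # reserve token id p for "="
inputs = torch.stack([a, b, equals], dim=1)     # (p*p, 3) token sequences
labels = (a + b) % p                            # target token for each pair

# Grokking-style runs train a small transformer on a fraction of the p*p examples
# well past train-set memorization and watch test accuracy jump much later.
train_frac = 0.3
perm = torch.randperm(p * p)
n_train = int(train_frac * p * p)
train_idx, test_idx = perm[:n_train], perm[n_train:]
```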
Educational Content
Nanda publishes interpretability tutorials and explanations on his blog at neelnanda.io and through video content. His "200 Concrete Open Problems in Mechanistic Interpretability" post outlines research directions for the field.4
He has written guides for newcomers to interpretability research, including walkthroughs of TransformerLens usage and explanations of foundational papers. These materials are distributed as blog posts, Jupyter notebooks, and video tutorials.
Research Approach
Nanda's work emphasizes:
Circuit discovery: Identifying specific subnetworks responsible for model behaviors
Mechanistic explanations: Describing algorithms implemented by neural networks in mathematical or computational terms
Open-source tooling: Building software infrastructure to enable interpretability experiments
Reproducible examples: Providing code and notebooks that others can run and modify
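Activation patching is one widely used technique that combines these elements. The sketch below restores clean-run activations into a corrupted run to localize which layers carry the task-relevant information; the prompts and patch location are assumptions chosen for demonstration, not drawn from a specific paper:

```python
# Illustrative activation-patching sketch: patch the clean run's residual stream
# into a corrupted run, layer by layer, and see how much of the correct-answer
# logit each patch restores.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

_, clean_cache = model.run_with_cache(clean)
answer = model.to_single_token(" Mary")        # correct completion for the clean prompt

def patch_final_resid(activation, hook):
    # Overwrite the residual stream at the final position with the clean run's value.
    activation[:, -1, :] = clean_cache[hook.name][:, -1, :]
    return activation

for layer in range(model.cfg.n_layers):
    patched_logits = model.run_with_hooks(
        corrupt,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_final_resid)],
    )
    # A higher " Mary" logit means patching this layer recovers more clean behavior.
    print(layer, patched_logits[0, -1, answer].item())
```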
His stated view is that understanding neural network internals is necessary for evaluating AI safety properties, though he acknowledges that current interpretability techniques may not directly transfer to more capable future systems.5
Perspectives on Interpretability and Alignment
In posts and presentations, Nanda has outlined reasons why mechanistic interpretability research may contribute to AI alignment:
Failure diagnosis: Identifying the mechanisms behind unexpected model behaviors
Capability evaluation: Determining what tasks models can perform by examining their internal algorithms
Deception detection: Searching for representations that indicate models are optimizing for objectives different from their training signal
Verification: Checking whether specific safety properties hold in a model's learned algorithms
He notes that these applications remain speculative and that the field is in early stages of developing techniques that scale to frontier models.6
Current Work
At Google DeepMind, Nanda works on the alignment team. His stated focus areas include scaling interpretability techniques to larger models and exploring automated methods for circuit discovery.
Limitations and Open Questions
Nanda has identified challenges for mechanistic interpretability research:7
Current techniques are labor-intensive and may not scale to models with hundreds of billions of parameters
Many discovered circuits are descriptive rather than predictive, explaining behavior post-hoc without enabling intervention
It remains unclear whether interpretability of current models will provide insights applicable to future AI systems with different architectures or training methods
Critical perspectives on interpretability research more broadly include questions about whether understanding model internals is tractable for highly capable systems, and whether interpretability work should be prioritized relative to other alignment approaches like scalable oversight or Constitutional AI.
Resources
Personal website: neelnanda.io (blog posts and tutorials)
Nanda's presentations on interpretability and alignment discuss these potential applications as research directions rather than demonstrated capabilities.
Discussed in "200 Concrete Open Problems in Mechanistic Interpretability" and related posts.