Awesome Mechanistic Interpretability Papers
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: GitHub
A GitHub reading list aggregating mechanistic interpretability papers; useful as a literature survey starting point for researchers studying how language models implement computations internally, though last updated in late 2024.
Metadata
Importance: 62/100 · wiki page · reference
Summary
A curated GitHub repository collecting and organizing influential research papers on mechanistic interpretability of language models. It serves as a community reference for researchers studying how neural networks implement computations internally, covering topics like circuits, features, attention heads, and sparse autoencoders.
Key Points
- Curated list of 100+ papers on mechanistic interpretability, specifically focused on language models
- Organized by topic areas including circuits, features, attention mechanisms, and sparse autoencoders
- Community-maintained resource with 238 stars, serving as a starting point for researchers entering the field
- Covers both foundational works and recent advances in understanding internal model representations
- Useful for tracking the breadth of mechanistic interpretability research across different model behaviors
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Interpretability | Research Area | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 28 KB
GitHub - Dakingrai/awesome-mechanistic-interpretability-lm-papers · GitHub
Dakingrai / awesome-mechanistic-interpretability-lm-papers · Public · 16 forks · 238 stars
awesome-mechanistic-interpretability-LM-papers
This is a collection of awesome papers about Mechanistic Interpretability (MI) for Transformer-based Language Models (LMs), organized following our survey paper: A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models.
Papers are organized following our taxonomy (Figure 1).
We have also curated a Beginner's Roadmap (Figure 2) with actionable items for people interested in using MI for their purposes.
Figure 1: Taxonomy
Figure 2: Beginner's Roadmap
How to Contribute: We welcome contributions from everyone! If you find any relevant papers that are not included in the list, please categorize them following our taxonomy and submit a request for an update.
Questions/Comments/Suggestions: If you have any questions/comments/suggestions to share with us, you are welcome to report an issue here or reach out to us through drai2@gmu.edu and ziyuyao@gmu.edu .
How to Cite: If you find our survey useful for your research, please cite our paper:
@article{rai2024practical,
title={A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models},
author={Rai, Daking and Zhou, Yilun and Feng, Shi and Saparov, Abulhair and Yao, Ziyu},
journal={arXiv preprint arXiv:2407.02646},
year={2024}
}
Updates
July 2024: We have finished the first iteration of the paper collection. Contributions welcomed!
June 2024: GitHub repository launched! Still under construction.
Table of Contents
Techniques
Evaluation
Findings and Applications
Findings on Features
Findings on Circuits
Interpreting LM Behaviors
Interpreting Transformer Components
Findings on Universality
Findings on Model Capabilities
Findings on Learning Dynamics
Applicati
... (truncated, 28 KB total)
Resource ID: 75ae5fb36bf37cea | Stable ID: sid_pzC7TIkpmk