Longterm Wiki

Awesome Mechanistic Interpretability Papers

web

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: GitHub

A GitHub reading list aggregating mechanistic interpretability papers; useful as a literature survey starting point for researchers studying how language models implement computations internally, though last updated in late 2024.

Metadata

Importance: 62/100 · wiki page · reference

Summary

A curated GitHub repository collecting and organizing influential research papers on mechanistic interpretability of language models. It serves as a community reference for researchers studying how neural networks implement computations internally, covering topics like circuits, features, attention heads, and sparse autoencoders.

Key Points

  • Curated list of over 100 papers on mechanistic interpretability, specifically focused on language models
  • Organized by topic areas including circuits, features, attention mechanisms, and sparse autoencoders
  • Community-maintained resource with 238 stars (as of the cached April 2026 fetch), serving as a starting point for researchers entering the field
  • Covers both foundational works and recent advances in understanding internal model representations
  • Useful for tracking the breadth of mechanistic interpretability research across different model behaviors
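As one concrete illustration of a technique the list covers, a sparse autoencoder (SAE) decomposes a model's internal activations into many sparsely firing features. The following is a minimal sketch only; the sizes, the random stand-in "activations", the L1 penalty, and the function name `train_sae` are all illustrative assumptions, not taken from any listed paper:

```python
import numpy as np

def train_sae(X, d_hidden=64, l1=1e-3, lr=1e-2, steps=200, seed=0):
    """Toy one-layer sparse autoencoder trained by plain gradient descent.

    X: (n, d_model) array of stand-in activations (assumption: in real use
    these would be residual-stream activations collected from a model).
    Returns the learned parameters and the per-step reconstruction MSE.
    """
    rng = np.random.default_rng(seed)
    n, d_model = X.shape
    W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
    W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
    b_enc, b_dec = np.zeros(d_hidden), np.zeros(d_model)
    history = []
    for _ in range(steps):
        f = np.maximum(X @ W_enc + b_enc, 0.0)   # sparse feature activations (ReLU)
        X_hat = f @ W_dec + b_dec                # reconstruction of the input
        err = X_hat - X
        history.append(float((err ** 2).mean()))
        # Gradient w.r.t. f: reconstruction term plus L1 sparsity term,
        # masked by the ReLU derivative (f > 0).
        g_f = (err @ W_dec.T + l1 * np.sign(f)) * (f > 0)
        W_dec -= lr * f.T @ err / n
        b_dec -= lr * err.mean(axis=0)
        W_enc -= lr * X.T @ g_f / n
        b_enc -= lr * g_f.mean(axis=0)
    return (W_enc, b_enc, W_dec, b_dec), history
```

The L1 penalty on the feature activations is what pushes most features to zero on any given input, which is the property SAE papers exploit to find human-interpretable directions.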

Cited by 1 page

Page              Type           Quality
Interpretability  Research Area  66.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 28 KB
GitHub - Dakingrai/awesome-mechanistic-interpretability-lm-papers

Dakingrai / awesome-mechanistic-interpretability-lm-papers (Public) · 238 stars · 16 forks · 55 commits

 awesome-mechanistic-interpretability-LM-papers

 
This is a collection of awesome papers about Mechanistic Interpretability (MI) for Transformer-based Language Models (LMs), organized following our survey paper: A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models.

Papers are organized following our taxonomy (Figure 1). We have also curated a Beginner's Roadmap (Figure 2) with actionable items for people interested in using MI for their purposes.

Figure 1: Taxonomy

Figure 2: Beginner's Roadmap
 
 How to Contribute: We welcome contributions from everyone! If you find any relevant papers that are not included in the list, please categorize them following our taxonomy and submit a request to update the list.

 Questions/Comments/Suggestions: If you have any questions, comments, or suggestions to share with us, you are welcome to report an issue here or reach out to us at drai2@gmu.edu and ziyuyao@gmu.edu.

 How to Cite: If you find our survey useful for your research, please cite our paper:

 @article{rai2024practical,
   title={A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models},
   author={Rai, Daking and Zhou, Yilun and Feng, Shi and Saparov, Abulhair and Yao, Ziyu},
   journal={arXiv preprint arXiv:2407.02646},
   year={2024}
 }
 
 Updates

 
 
 July 2024: We have finished the first iteration of the paper collection. Contributions welcome!

 June 2024: GitHub repository launched! Still under construction.

 
 Table of Contents

  • Techniques
  • Evaluation
  • Findings and Applications
    • Findings on Features
    • Findings on Circuits
      • Interpreting LM Behaviors
      • Interpreting Transformer Components
    • Findings on Universality
    • Findings on Model Capabilities
    • Findings on Learning Dynamics
    • Applicati

... (truncated, 28 KB total)
Resource ID: 75ae5fb36bf37cea | Stable ID: sid_pzC7TIkpmk