Scoping Developmental Interpretability - Manifund Grant Proposal
A Manifund grant proposal for a 6-month scoping project on Developmental Interpretability, combining Singular Learning Theory and Mechanistic Interpretability to study phase transitions in neural network training as a potential scalable alignment research program.
Metadata
Importance: 52/100 · other · primary source
Summary
This is a $144,650 grant proposal for a 6-month research project to assess the viability of Developmental Interpretability (DevInterp), which studies how phase transitions give rise to computational structure in neural networks. The project combines Singular Learning Theory and Mechanistic Interpretability techniques to empirically validate whether phase transitions dominate training dynamics. Success would establish DevInterp as a major branch of technical AI alignment research.
Key Points
- •DevInterp studies phase transitions in neural network training as a path to scalable interpretability tools for AI alignment.
- •The project will analyze phase transitions in Toy Models of Superposition and Induction Heads using RLCT and singular fluctuation metrics.
- •Team includes Jesse Hoogland, Alexander Gietelink Oldenziel, Prof. Daniel Murfet, and Stan van Wingerden, organizers of the 2023 SLT & Alignment Summit.
- •Budget of ~$144k covers researchers, compute, travel, and fiscal sponsorship over 6 months.
- •A confidential capability risk assessment of DevInterp is included as a project deliverable.
Cached Content Preview
HTTP 200 · Fetched Apr 11, 2026 · 19 KB
Scoping Developmental Interpretability | Manifund
About this capture: Collection: Common Crawl (web crawl data from Common Crawl).
Archived by the Wayback Machine: https://web.archive.org/web/20260207074808/https://manifund.org/projects/scoping-developmental-interpretability-xg55b33wsfc
Scoping Developmental Interpretability
Technical AI safety
Jesse Hoogland
Complete
Grant
$144,650 raised
Project summary
We propose a 6-month research project to assess the viability of Developmental Interpretability, a new AI alignment research agenda. “DevInterp” studies how phase transitions give rise to computational structure in neural networks, and offers a possible path to scalable interpretability tools.
Though we have both empirical and theoretical reasons to believe that phase transitions dominate the training process, the details remain unclear. We plan to clarify the role of phase transitions by studying them in a variety of models combining techniques from Singular Learning Theory and Mechanistic Interpretability. In six months, we expect to have gathered enough evidence to confirm that DevInterp is a viable research program.
If successful, we expect Developmental Interpretability to become one of the main branches of technical alignment research over the next few years.
(This funding proposal consists of Phase 1 from the research plan described in this LessWrong post.)
Project goals
We will assess the viability of Developmental Interpretability (DevInterp) as an alignment research program over the next 6 months.
Our immediate priority is to gather empirical evidence for the role of phase transitions in neural network training dynamics. To do so, we will examine a variety of models for signals that indicate the presence or absence of phase transitions.
Concretely, this means:
Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment Summit’s SLT High 4 lecture).
Performing a similar analysis for the Induction Heads paper.
For diverse models that are known to contain structure/circuits, we will attempt to:
detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),
classify weights at each transition into state & control variables,
perform mechanistic interpretability analyses at these transitions,
compare these analyses to MechInterp structur
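The detection step listed above could be sketched, under illustrative assumptions, as a simple change-point scan over a recorded training loss curve. The window size and drop threshold below are hypothetical choices for demonstration; the proposal's actual metrics (RLCT and singular fluctuation) require SLT-specific estimators not shown here:

```python
# Illustrative sketch (not the proposal's method): flag candidate phase
# transitions in a training loss curve as sudden, sustained drops.
# `window` and `threshold` are hypothetical tuning parameters.

def detect_transitions(losses, window=5, threshold=0.5):
    """Return step indices where the mean loss over the next `window`
    steps drops by more than `threshold` relative to the previous window."""
    candidates = []
    for t in range(window, len(losses) - window + 1):
        before = sum(losses[t - window:t]) / window
        after = sum(losses[t:t + window]) / window
        if before - after > threshold:
            candidates.append(t)
    return candidates

# Toy loss curve: a plateau, a sharp drop around step 10, a new plateau.
losses = [2.0] * 10 + [0.5] * 10
print(detect_transitions(losses, window=3, threshold=1.0))  # → [10]
```

In practice one would apply the same change-point idea to RLCT estimates over checkpoints rather than raw loss, since the proposal argues loss alone may miss transitions in internal structure.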
... (truncated, 19 KB total)
Resource ID: f6a4f09ab46a43ee | Stable ID: sid_QWxrRv6uNp