Scoping Developmental Interpretability - Manifund Grant Proposal
A Manifund grant proposal for a 6-month scoping project on Developmental Interpretability, combining Singular Learning Theory and Mechanistic Interpretability to study phase transitions in neural network training as a potential scalable alignment research program.
Metadata
Importance: 52/100 · other · primary source
Summary
This is a $144,650 grant proposal for a 6-month research project to assess the viability of Developmental Interpretability (DevInterp), which studies how phase transitions give rise to computational structure in neural networks. The project combines Singular Learning Theory and Mechanistic Interpretability techniques to empirically validate whether phase transitions dominate training dynamics. Success would establish DevInterp as a major branch of technical AI alignment research.
Key Points
- •DevInterp studies phase transitions in neural network training as a path to scalable interpretability tools for AI alignment.
- •The project will analyze phase transitions in Toy Models of Superposition and Induction Heads using RLCT and singular fluctuation metrics.
- •Team includes Jesse Hoogland, Alexander Gietelink Oldenziel, Prof. Daniel Murfet, and Stan van Wingerden, organizers of the 2023 SLT & Alignment Summit.
- •Budget of ~$144k covers researchers, compute, travel, and fiscal sponsorship over 6 months.
- •A confidential capability risk assessment of DevInterp is included as a project deliverable.
Cached Content Preview
HTTP 200 · Fetched Apr 11, 2026 · 19 KB
Scoping Developmental Interpretability | Manifund
About this capture: Collection: Common Crawl (web crawl data from Common Crawl).
Archived by the Wayback Machine: https://web.archive.org/web/20260207074808/https://manifund.org/projects/scoping-developmental-interpretability-xg55b33wsfc
Scoping Developmental Interpretability
Technical AI safety
Jesse Hoogland
Complete
Grant
$144,650 raised
Project summary
We propose a 6-month research project to assess the viability of Developmental Interpretability, a new AI alignment research agenda. “DevInterp” studies how phase transitions give rise to computational structure in neural networks, and offers a possible path to scalable interpretability tools.
Though we have both empirical and theoretical reasons to believe that phase transitions dominate the training process, the details remain unclear. We plan to clarify the role of phase transitions by studying them in a variety of models combining techniques from Singular Learning Theory and Mechanistic Interpretability. In six months, we expect to have gathered enough evidence to confirm that DevInterp is a viable research program.
If successful, we expect Developmental Interpretability to become one of the main branches of technical alignment research over the next few years.
(This funding proposal consists of Phase 1 from the research plan described in this LessWrong post.)
Project goals
We will assess the viability of Developmental Interpretability (DevInterp) as an alignment research program over the next 6 months.
Our immediate priority is to gather empirical evidence for the role of phase transitions in neural network training dynamics. To do so, we will examine a variety of models for signals that indicate the presence or absence of phase transitions.
Concretely, this means:
Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment Summit’s SLT High 4 lecture).
Performing a similar analysis for the Induction Heads paper.
For diverse models that are known to contain structure/circuits, we will attempt to:
detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),
classify weights at each transition into state & control variables,
perform mechanistic interpretability analyses at these transitions,
compare these analyses to MechInterp structur
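The detection step listed above could be sketched, under illustrative assumptions, as a simple change-point scan over a recorded training loss curve. The window size and drop threshold below are hypothetical choices for demonstration; the proposal's actual metrics (RLCT and singular fluctuation) require SLT-specific estimators not shown here:

```python
# Illustrative sketch (not the proposal's method): flag candidate phase
# transitions in a training loss curve as sudden, sustained drops.
# `window` and `threshold` are hypothetical tuning parameters.

def detect_transitions(losses, window=5, threshold=0.5):
    """Return step indices where the mean loss over the next `window`
    steps drops by more than `threshold` relative to the previous window."""
    candidates = []
    for t in range(window, len(losses) - window + 1):
        before = sum(losses[t - window:t]) / window
        after = sum(losses[t:t + window]) / window
        if before - after > threshold:
            candidates.append(t)
    return candidates

# Toy loss curve: a plateau, a sharp drop around step 10, a new plateau.
losses = [2.0] * 10 + [0.5] * 10
print(detect_transitions(losses, window=3, threshold=1.0))  # → [10]
```

In practice one would apply the same change-point idea to RLCT estimates over checkpoints rather than raw loss, since the proposal argues loss alone may miss transitions in internal structure.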
... (truncated, 19 KB total)
Resource ID: f6a4f09ab46a43ee | Stable ID: sid_QWxrRv6uNp