Discovering Latent Goals via Mechanistic Interpretability (PhD Salary Grant)

web

manifund.org·manifund.org/projects/discovering-latent-goals-mechanisti...

A Manifund grant project page for a PhD researcher seeking salary funding to study mechanistic interpretability of LLMs, specifically focusing on how language models represent and implement goals and agentic behavior.

Metadata

Importance: 28/100 · other · primary source

Summary

Lucy Farnik is pursuing an alignment PhD focused on mechanistic interpretability of LLMs, specifically investigating how models represent and implement goals when prompted with agentic tasks. The project aims to develop methods for probing and editing model goals via activation analysis. Funding was sought for salary and travel over 6 months, though the project was already fully funded elsewhere.

Key Points

• Research goal: understand how LLMs represent and implement goals/agentic behavior by examining internal activations when given explicit tasks.
• Methodology involves probing for 'agency' in language models and potentially editing goal representations via activation steering.
• Researcher has strong background: senior developer at 18, NeurIPS submission with FHI/CHAI/FAR AI collaborators, winner of Apart AI hackathon for circuit discovery.
• Acknowledges dual-use risk: robustly formalizing 'goals' mathematically could enable optimization toward dangerous objectives.
• Plans to complement technical work with AI governance upskilling to inform policymakers.

Cached Content Preview

HTTP 200 · Fetched Apr 12, 2026 · 8 KB

The Wayback Machine - https://web.archive.org/web/20260217110739/https://manifund.org/projects/discovering-latent-goals-mechanistic-interpretability-phd-salary

 

Discovering latent goals (mechanistic interpretability PhD salary)

Technical AI safety

Lucy Farnik

Complete

Grant

$1,590 raised

ALREADY FULLY FUNDED ELSEWHERE

Project summary

I'm working on interpreting LLMs, specifically trying to move us closer to a world where we can understand and edit what a model's goals are by examining or changing its activations. I plan to study this by explicitly giving a language model a task (e.g. "Write Linux terminal commands to do [X]") and then trying to understand how this is implemented in the model.

Project goals

The broad goal is to understand how language models represent goals, and also try to understand whether or not we can "probe for agency" within them. More narrowly, I want to understand how language models implement simple "agentic" behavior when you directly prompt them to do a specific task.

If directly discovering goals turns out to be methodologically out of reach, I may pivot to interpreting easier LLM behaviors to help develop the methodology we need.

I'll be doing a portion of this within my alignment PhD. I also plan to spend some time in the next 6 months upskilling in AI governance so I can better understand how my work could contribute to informing policymakers (since there seems to be a lot of value in technical researchers bringing their expertise into public policy).

How will this funding be used?

The funding will go towards my salary for the next 6 months as well as a travel budget (attending conferences, spending some time in the Bay Area, etc.).

Here's a rough breakdown: I need around $64k/y to live comfortably and maximize my productivity (this is roughly half of what I was earning as a senior developer), plus an $8k/y travel budget. I'll be receiving a $28k/y PhD stipend along with a $2k/y travel budget from the UK government starting in October. I'm asking for 6 months' worth of funding, meaning funding for the next 3 months before my PhD starts ((64+8)/4=$18k) plus the first 3 months of my PhD ((64+8-28-2)/4=$10.5k). I'm adding an extra 20% to cover the income tax I'll have to pay on this, and an extra 10% as an "unex

... (truncated, 8 KB total)
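The quarterly arithmetic in the funding breakdown can be checked with a short sketch. The variable and function names below are illustrative, not from the page, and the figures are in thousands of USD per year as quoted. The 20% tax markup and the further 10% markup are deliberately left out, because the captured text is cut off before it says how they are applied.

```python
# Annual figures in thousands of USD, as quoted in the funding breakdown.
LIVING_COST = 64     # stated annual living cost
TRAVEL_BUDGET = 8    # stated annual travel budget
PHD_STIPEND = 28     # annual PhD stipend (starts in October)
PHD_TRAVEL = 2       # annual travel budget that comes with the stipend


def quarter_need(annual_need, annual_income=0):
    """Funding gap for one quarter (3 months), given annual figures."""
    return (annual_need - annual_income) / 4


# 3 months before the PhD starts: no stipend yet.
before_phd = quarter_need(LIVING_COST + TRAVEL_BUDGET)

# First 3 months of the PhD: the stipend and its travel budget offset the need.
during_phd = quarter_need(LIVING_COST + TRAVEL_BUDGET, PHD_STIPEND + PHD_TRAVEL)

# Subtotal before any markups (markups omitted: the source text is truncated).
subtotal = before_phd + during_phd

print(before_phd, during_phd, subtotal)  # prints: 18.0 10.5 28.5
```

This reproduces the $18k and $10.5k figures in the text and gives a pre-markup subtotal of $28.5k for the 6 months.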

Resource ID: 18cea0af72b2b8f4