Discovering Latent Goals via Mechanistic Interpretability (PhD Salary Grant)
A Manifund grant project page for a PhD researcher seeking salary funding to study mechanistic interpretability of LLMs, specifically focusing on how language models represent and implement goals and agentic behavior.
Metadata
Importance: 28/100 · other · primary source
Summary
Lucy Farnik is pursuing an alignment PhD focused on mechanistic interpretability of LLMs, specifically investigating how models represent and implement goals when prompted with agentic tasks. The project aims to develop methods for probing and editing model goals via activation analysis. Funding was sought for salary and travel over 6 months, though the project was already fully funded elsewhere.
Key Points
- Research goal: understand how LLMs represent and implement goals/agentic behavior by examining internal activations when given explicit tasks.
- Methodology involves probing for 'agency' in language models and potentially editing goal representations via activation steering.
- Researcher has strong background: senior developer at 18, NeurIPS submission with FHI/CHAI/FAR AI collaborators, winner of Apart AI hackathon for circuit discovery.
- Acknowledges dual-use risk: robustly formalizing 'goals' mathematically could enable optimization toward dangerous objectives.
- Plans to complement technical work with AI governance upskilling to inform policymakers.
Cached Content Preview
HTTP 200 · Fetched Apr 12, 2026 · 8 KB
The Wayback Machine - https://web.archive.org/web/20260217110739/https://manifund.org/projects/discovering-latent-goals-mechanistic-interpretability-phd-salary
Discovering latent goals (mechanistic interpretability PhD salary) | Manifund
Technical AI safety
Lucy Farnik
Complete
Grant
$1,590 raised
ALREADY FULLY FUNDED ELSEWHERE
Project summary
I'm working on interpreting LLMs, specifically trying to move us closer to a world where we can understand and edit what a model's goals are by examining/changing its activations. I plan to study this by explicitly giving a language model a task (e.g. "Write Linux terminal commands to do [X]") and then working out how this is implemented in the model.
Project goals
The broad goal is to understand how language models represent goals, and also try to understand whether or not we can "probe for agency" within them. More narrowly, I want to understand how language models implement simple "agentic" behavior when you directly prompt them to do a specific task.
If directly discovering goals turns out to be methodologically out of reach, I may pivot to interpreting easier LLM behaviors to help develop the methodology we need.
I'll be doing a portion of this within my alignment PhD. I also plan to spend some time in the next 6 months upskilling in AI governance so I can better understand how my work could contribute to informing policymakers (since there seems to be a lot of value in technical researchers bringing their expertise into public policy).
How will this funding be used?
The funding will go towards my salary for the next 6 months as well as a travel budget (attending conferences, spending some time in the Bay, etc.).
Here's a rough breakdown: I need around $64k/y to live comfortably and maximize my productivity (this is roughly half of what I was earning as a senior developer), plus an $8k/y travel budget. Starting in October, I'll be receiving a $28k/y PhD stipend along with a $2k/y travel budget from the UK government. I'm asking for 6 months' worth of funding: the 3 months before my PhD starts ((64+8)/4=$18k) plus the first 3 months of my PhD ((64+8-28-2)/4=$10.5k). I'm adding an extra 20% to cover the income tax I'll have to pay on this, and an extra 10% as an "unex
... (truncated, 8 KB total)
Resource ID: 18cea0af72b2b8f4