Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic Alignment

Data Status

Not fetched

Cited by 4 pages

Cached Content Preview

HTTP 200 | Fetched Mar 8, 2026 | 6 KB
Alignment Science Blog

Anthropic Fellows Program for AI safety research: applications open for May & July 2026
"This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems." — Jan Leike

The Anthropic Fellows program provides funding and mentorship from Anthropic researchers for engineers and researchers to investigate some of Anthropic’s highest-priority AI safety research questions.

In our first cohort, over 80% of fellows produced papers, including on agentic misalignment, subliminal learning, rapid response to new ASL3 jailbreaks, and open-source circuits. Over 40% of the fellows subsequently joined Anthropic full-time.

 We’re now opening applications for our next two cohorts, beginning in May and July 2026.

 This year, we plan to work with more fellows across a wider range of safety research areas—including scalable oversight, adversarial robustness and AI control, model organisms, mechanistic interpretability, AI security, and model welfare. 

 Below, we share more about what the program looks like in practice, and how interested candidates can apply.

 What fellows work on 

Fellows work for 4 months on empirical research questions aligned with Anthropic’s overall research priorities, with the aim of producing public outputs, like a paper. Anthropic mentors ‘pitch’ their project ideas to fellows, who choose and shape their projects in close collaboration with their mentors.

 Here are a few examples from previous cohorts:

 Security

Fellows have worked on mitigating risks from AI systems being misused for cyberattacks—exploring both how LLMs might enable adversaries to automate attacks that currently require skilled human operators, and how to rapidly defend against novel jailbreaks.

Our fellows developed agents that identified 4.6M USD in blockchain smart contract vulnerabilities and discovered two novel zero-day vulnerabilities, demonstrating that profitable autonomous exploitation is now technically feasible.

A year prior, an Anthropic fellow developed a method for rapid response to new ASL3 jailbreaks: techniques that block entire classes of high-risk jailbreaks after observing only a handful of attacks. This work was a key component of Anthropic’s ASL3 deployment safeguards.

 Interpretability

 The mission of our interpretability research is to advance our understanding of the internal workings of large language models to enable more targeted interventions and safety measures. 

Fellows have introduced a new method to trace the thoughts of a large language model—and open-sourced it. Their approach was to generate attribution graphs, which (partially) reveal the steps a model took internally to decide on a particular output. The public release of this research has enabled researchers to trace circuits on supporte

... (truncated, 6 KB total)
Resource ID: e65e76531931acc2 | Stable ID: MzNkMmEwOD