Longterm Wiki

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic Alignment

Data Status

Not fetched

Cited by 4 pages

Cached Content Preview

HTTP 200 | Fetched Mar 8, 2026 | 6 KB
Alignment Science Blog

Anthropic Fellows Program for AI safety research: applications open for May & July 2026
"This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems." — Jan Leike

The Anthropic Fellows program provides funding and mentorship from Anthropic researchers for engineers and researchers to investigate some of Anthropic’s highest-priority AI safety research questions.

In our first cohort, over 80% of fellows produced papers, including on agentic misalignment, subliminal learning, rapid response to new ASL3 jailbreaks, and open-source circuits. Over 40% of the fellows subsequently joined Anthropic full-time.

 We’re now opening applications for our next two cohorts, beginning in May and July 2026.

 This year, we plan to work with more fellows across a wider range of safety research areas—including scalable oversight, adversarial robustness and AI control, model organisms, mechanistic interpretability, AI security, and model welfare. 

 Below, we share more about what the program looks like in practice, and how interested candidates can apply.

 What fellows work on 

Fellows work for 4 months on empirical research questions aligned with Anthropic’s overall research priorities, with the aim of producing public outputs, like a paper. Anthropic mentors ‘pitch’ their project ideas to fellows, who choose and shape their projects in close collaboration with their mentors.

 Here are a few examples from previous cohorts:

 Security

Fellows have worked on mitigating risks from AI systems being misused for cyberattacks—exploring both how LLMs might enable adversaries to automate attacks that currently require skilled human operators, and how to rapidly defend against novel jailbreaks.

Our fellows developed agents that identified 4.6M USD in blockchain smart contract vulnerabilities and discovered two novel zero-day vulnerabilities, demonstrating that profitable autonomous exploitation is now technically feasible.

A year prior, an Anthropic fellow developed a method for rapid response to new ASL3 jailbreaks: techniques that block entire classes of high-risk jailbreaks after observing only a handful of attacks. This work was a key component of Anthropic’s ASL3 deployment safeguards.

 Interpretability

 The mission of our interpretability research is to advance our understanding of the internal workings of large language models to enable more targeted interventions and safety measures. 

Fellows have introduced a new method to trace the thoughts of a large language model—and open-sourced it. Their approach was to generate attribution graphs, which (partially) reveal the steps a model took internally to decide on a particular output. The public release of this research has enabled researchers to trace circuits on supporte

... (truncated, 6 KB total)
Resource ID: e65e76531931acc2 | Stable ID: MzNkMmEwOD