Nick Joseph on Anthropic's safety approach
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: 80,000 Hours
An 80,000 Hours podcast episode offering an insider perspective on how Anthropic operationalizes safety commitments, particularly relevant for understanding Responsible Scaling Policies and frontier lab safety culture.
Metadata
Summary
Nick Joseph, Anthropic's head of training, discusses the company's approach to AI safety, including its Responsible Scaling Policy, how it evaluates model capabilities and risks, and the internal culture around safety at Anthropic. The conversation covers practical mechanisms for slowing or pausing AI development if safety thresholds are breached.
Key Points
- Anthropic's Responsible Scaling Policy sets capability thresholds that trigger mandatory safety evaluations before further model deployment or training
- Nick discusses how Anthropic balances pushing capabilities forward with maintaining genuine safety commitments internally
- The interview covers evaluation methods for detecting dangerous capabilities in frontier models, including biological and cyber risks
- Joseph explains the organizational structure and culture at Anthropic that aim to take safety seriously rather than treating it as PR
- The conversation addresses honest uncertainty about whether current safety approaches are sufficient given rapid capability gains
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Corporate Influence on AI Policy | Crux | 66.0 |
Cached Content Preview
Nick Joseph on whether Anthropic's AI safety policy is up to the task | 80,000 Hours
On this page:
Introduction
1 Highlights
2 Articles, books, and other media discussed in the show
3 Transcript
3.1 Cold open [00:00:00]
3.2 Rob's intro [00:01:00]
3.3 The interview begins [00:03:44]
3.4 Scaling laws [00:04:12]
3.5 Bottlenecks to further progress in making AIs helpful [00:08:36]
3.6 Anthropic's responsible scaling policies [00:14:21]
3.7 Pros and cons of the RSP approach for AI safety [00:34:09]
3.8 Alternatives to RSPs [00:46:44]
3.9 Is an internal audit really the best approach? [00:51:56]
3.10 Making promises about things that are currently technically impossible [01:07:54]
3.11 Nick's biggest reservations about the RSP approach [01:16:05]
3.12 Communicating "acceptable" risk [01:19:27]
3.13 Should Anthropic's RSP have wider safety buffers? [01:26:13]
3.14 Other impacts on society and future work on RSPs [01:34:01]
3.15 Working at Anthropic [01:36:28]
3.16 Engineering vs research [01:41:04]
3.17 AI safety roles at Anthropic [01:48:31]
3.18 Should concerned people be willing to take capabilities roles? [01:58:20]
3.19 Recent safety work at Anthropic [02:10:05]
3.20 Anthropic culture [02:14:35]
3.21 Overrated and underrated AI applications [02:22:06]
3.22 Rob's outro [02:26:36]
4 Learn more
5 Related episodes
Fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid, and that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.
Having whistleblower-type protections could help, such that people can say if a company is breaking from the RSP or not trying hard enough to elicit capabilities or to interpret it in a good way, and then public discussion can add some pressure. But ultimately, I think you do need regulation to have these very strict requirements.
— Nick Joseph
The three biggest AI companies — Anthropic, OpenAI, and DeepMind — have now all released policies designed to make their AI models less likely to go rogue or cause catastrophic damage as they approach, and eventually exceed, human capabilities. Are they good enough?
That’s what host Rob Wiblin tries to hash out in this interview (recorded May 30) with Nick Joseph — one of the 11 people who left OpenAI to launch Anthropic, its current head of training, and a big fan of Anthropic’s “responsible scaling policy” (or “RSP”). Anthropic is the most safety-focused of the AI companies, known for a culture that treats the risks of its work as deadly serious.
As Nick explains ...
(truncated, 98 KB total)