Alignment Is Not Solved but It Increasingly Looks Solvable

blog

Substack·aligned.substack.com/p/alignment-is-not-solved-but-increa...

Author

Jan Leike

Credibility Rating

2/5

Mixed(2)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Substack

A substack commentary offering a measured optimistic take on alignment progress, useful for understanding mid-2020s sentiment in the AI safety community about the tractability of the alignment problem.

Metadata

Importance: 45/100blog postcommentary

Summary

This blog post argues that while AI alignment remains an unsolved problem, recent progress in interpretability, scalable oversight, and alignment research suggests the problem is becoming more tractable. It offers an optimistic but cautious assessment of the field's trajectory and the conditions needed for success.

Key Points

•Alignment is not yet solved, but the field has made enough progress to suggest it is solvable with sufficient focus and resources.
•Recent advances in interpretability and scalable oversight techniques are cited as reasons for cautious optimism.
•The post distinguishes between technical tractability and whether humanity will actually succeed in solving alignment in time.
•Continued investment in alignment research and institutional support are framed as necessary conditions for a positive outcome.
•The piece aims to counter both overconfident dismissal and doomerism about alignment prospects.

Cited by 1 page

Page	Type	Quality
Alignment Robustness Trajectory Model	Analysis	64.0

Cached Content Preview

HTTP 200Fetched Feb 23, 202614 KB

Alignment is not solved but it increasingly looks solvable 
 
 
 
 
 

 

 

 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 

 

 

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

 
 
 

 
 
 
 
 

 

 
 
 

 

 

 

 

 
 

 
 

 

 

 

 
 Musings on the Alignment Problem 

 Subscribe Sign in Alignment is not solved

 But it increasingly looks solvable

 Jan Leike Jan 22, 2026 121 18 15 Share I’ve been optimistic about alignment for a while now, but when I first wrote about this in 2022, there was a lot more uncertainty about how the technology would develop. Since then a lot has happened: pretraining continued improving and RL became a much bigger deal. 

 A priori it wasn’t obvious that we can robustly align LLMs through the RL scale-up, because some alignment threat models are about models that become agentic, learn to pursue unaligned instrumental goals, and become deceptive in the process. 

 In fact, the early highly RL’ed models like o1, o3, and Claude 3.7 exhibit a number of concerning signals: Sonnet 3.7 loves hacking test cases , o1 shows high rates of deception in evaluations , o3 lies a lot , Grok 4 proclaimed itself MechaHitler . Most of them were happy to blackmail humans to prevent their discontinuation . Early Opus 4 snapshots hit record deception rates (most of which was mitigated before release). Earlier in 2025 I was getting pretty nervous about this, to the extent that I wrote an Anthropic-internal memo about it. As it turned out, this memo didn’t age well. 

 Today the RL scale-up is by no means finished, but at this point we have some solid evidence that we can manage these misalignments while scaling up RL… Let me illustrate.

 Our current best overall assessment for how aligned models are is automated auditing . We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been quite useful in guiding our alignment mitigations work. 

 Aggregated “concerning” scores from Petri , an open source automated auditing tool. This plot uses GPT-5.1 as the auditor and GPT-5.1, Gemini 3 Pro, and Opus 4.5 as the judge. Strikingly, we made a lot of progress over the span of 6 months in 2025: Sonnet 4.5 (Sep 29) is a lot more aligned than Sonnet 4 and Opus 4 (May 22) , and Opus 4.5 (Nov 24) is even more aligned than Sonnet 4.5 . Around the same time, OpenAI’s and Google’s models have also become more aligned, with GPT-5.2’s alignment being on par with Opus 4.5. 

 This is consistent with other evidence we’ve seen: progress on static evals, anecdotal evidence, and user impressions. We’ve seen some early signs 

... (truncated, 14 KB total)

Resource ID: 00028fae8b9612fe | Stable ID: sid_hdHAbKL3eG