Back
*Alignment is not solved but it increasingly looks solvable* (https://aligned.substack.com/p/alignment-is-not-solved-...
blogaligned.substack.com·aligned.substack.com/p/alignment-is-not-solved-but-increa...
Data Status
Not fetched
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
Cached Content Preview
HTTP 200Fetched Feb 23, 202614 KB
Alignment is not solved but it increasingly looks solvable
Musings on the Alignment Problem
Subscribe Sign in Alignment is not solved
But it increasingly looks solvable
Jan Leike Jan 22, 2026 121 18 15 Share I’ve been optimistic about alignment for a while now, but when I first wrote about this in 2022, there was a lot more uncertainty about how the technology would develop. Since then a lot has happened: pretraining continued improving and RL became a much bigger deal.
A priori it wasn’t obvious that we can robustly align LLMs through the RL scale-up, because some alignment threat models are about models that become agentic, learn to pursue unaligned instrumental goals, and become deceptive in the process.
In fact, the early highly RL’ed models like o1, o3, and Claude 3.7 exhibit a number of concerning signals: Sonnet 3.7 loves hacking test cases , o1 shows high rates of deception in evaluations , o3 lies a lot , Grok 4 proclaimed itself MechaHitler . Most of them were happy to blackmail humans to prevent their discontinuation . Early Opus 4 snapshots hit record deception rates (most of which was mitigated before release). Earlier in 2025 I was getting pretty nervous about this, to the extent that I wrote an Anthropic-internal memo about it. As it turned out, this memo didn’t age well.
Today the RL scale-up is by no means finished, but at this point we have some solid evidence that we can manage these misalignments while scaling up RL… Let me illustrate.
Our current best overall assessment for how aligned models are is automated auditing . We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been quite useful in guiding our alignment mitigations work.
Aggregated “concerning” scores from Petri , an open source automated auditing tool. This plot uses GPT-5.1 as the auditor and GPT-5.1, Gemini 3 Pro, and Opus 4.5 as the judge. Strikingly, we made a lot of progress over the span of 6 months in 2025: Sonnet 4.5 (Sep 29) is a lot more aligned than Sonnet 4 and Opus 4 (May 22) , and Opus 4.5 (Nov 24) is even more aligned than Sonnet 4.5 . Around the same time, OpenAI’s and Google’s models have also become more aligned, with GPT-5.2’s alignment being on par with Opus 4.5.
This is consistent with other evidence we’ve seen: progress on static evals, anecdotal evidence, and user impressions. We’ve seen some early signs
... (truncated, 14 KB total)Resource ID:
00028fae8b9612fe | Stable ID: NWM5MGRkYT