Longterm Wiki
Back

Betley et al., *Training large language models on narrow tasks can lead to broad misalignment*, Nature (https://go.na...

paper
go.nature.com·go.nature.com/49V2wla

Data Status

Not fetched

Cited by 1 page

PageTypeQuality
Alignment Robustness Trajectory ModelAnalysis64.0

Cached Content Preview

HTTP 200Fetched Feb 23, 202648 KB
Training large language models on narrow tasks can lead to broad misalignment | Nature 
 
 
 

 
 

 
 

 

 
 
 
 

 

 
 
 
 
 
 

 
 
 
 
 
 

 
 

 
 
 
 
 
 
 
 
 
 
 

 
 

 

 

 
 

 
 
 

 
 

 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 

 
 
 

 
 Skip to main content 

 
 
 
 Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain
 the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in
 Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles
 and JavaScript.

 
 

 

 

 
 
 

 
 
 Advertisement

 
 
 
 
 
 
 
 
 
 
 

 
 
 
 

 

 
 
 
 

 

 

 
 
 
 
 
 
 
 Training large language models on narrow tasks can lead to broad misalignment
 
 
 
 
 
 
 Download PDF 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 Download PDF 
 
 
 
 

 
 
 
 
 
 
 
 
 

 
 
 Subjects

 
 Computer science 
 Software 

 
 

 
 
 

 
 

 
 

 
 Abstract

 The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment 1 . Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information 2 , 3 . Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding 4 . For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour.

 

 
 
 

 
 
 
 
 
 Similar content being viewed by others

 
 
 
 
 
 
 
 
 
 Generating reliable software project task flows using large language models through prompt engineering and robust evaluation
 
 

 
 Article 
 Open access 
 08 October 2025 
 
 
 
 
 
 
 
 
 
 
 
 
 Strong and weak alignment of large 

... (truncated, 48 KB total)
Resource ID: caa7c39d21fc4a14 | Stable ID: Yjc5YjQwNm