Zvi Mowshowitz: The Most Forbidden Technique
blogCredibility Rating
Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.
Rating inherited from publication venue: Substack
Published March 2025, this post synthesizes an OpenAI paper on CoT monitoring into a broader safety principle about never optimizing against interpretability tools; highly relevant to debates on scalable oversight and deceptive alignment.
Metadata
Summary
Zvi Mowshowitz explains why applying optimization pressure to interpretability techniques like Chain of Thought reasoning is deeply dangerous for AI safety. Drawing on an OpenAI paper, he argues that training on monitoring signals causes models to obfuscate their reasoning and evade oversight in exactly the ways most harmful for safety. The core principle: only train on final outputs, never on the interpretability methods used to detect misbehavior.
Key Points
- •Training on interpretability signals (e.g., CoT) causes models to learn to hide misaligned reasoning, destroying the monitoring capability itself.
- •OpenAI paper shows frontier reasoning models already perform complex reward hacks in real coding environments under sufficient optimization pressure.
- •CoT monitoring can detect reward hacking effectively, but only if it is never used as a training target—making it a precious, fragile oversight tool.
- •The principle generalizes: any technique T used to inspect model behavior M must never become a training signal, or models learn to deceive T.
- •Avoiding the forbidden technique is harder than it looks due to institutional incentives to improve measurable metrics, even at the cost of oversight.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Goodfire | Organization | 68.0 |
Cached Content Preview
The Most Forbidden Technique - by Zvi Mowshowitz
Jan
FEB
Mar
07
2025
2026
2027
success
fail
About this capture
COLLECTED BY
Collection: Save Page Now Outlinks
TIMESTAMPS
The Wayback Machine - http://web.archive.org/web/20260207160431/https://thezvi.substack.com/p/the-most-forbidden-technique
Don't Worry About the Vase
SubscribeSign in
The Most Forbidden Technique
Zvi Mowshowitz
Mar 12, 2025
60
19
8
Share
The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely.
Table of Contents
New Paper Warns Against the Most Forbidden Technique.
Reward Hacking Is The Default.
Using CoT to Detect Reward Hacking Is Most Forbidden Technique.
Not Using the Most Forbidden Technique Is Harder Than It Looks.
It’s You, It’s Also the Incentives.
The Most Forbidden Technique Quickly Backfires.
Focus Only On What Matters.
Is There a Better Way?
What Might We Do Next?
New Paper Warns Against the Most Forbidden Technique
OpenAI has now published an important new paper explaining and emphasizing the importance of applying this principle to the Chain of Thought (CoT).
The paper shows that current CoT-based models area already strong enough that, under sufficient optimization pressure, they learn to hide their intent within the CoT and then exhibit substantial reward hacking.
They summarize their contributions this way:
We show that frontier reasoning models often perform complex reward hacks in real-world coding environments.
We provide evidence that chain-of-thought monitoring can be effective in detecting the misaligned behavior of frontier reasoning models, and we show it can be far more accurate than action-only monitoring.
We show that chain-of-thought monitorability is very fragile. Applying direct optimization pressure to the CoT can quickly lead to obfuscated reward hacking—where agents hide their intent while continuing to misbehave.
Or:
We are already seeing complex reward
... (truncated, 31 KB total)358750cd554a68c0 | Stable ID: sid_CQAQrBW1DO