Longterm Wiki

Anthropic Alignment Science Blog

web

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic Alignment

Data Status

Not fetched

Cited by 8 pages

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 19 KB
# Alignment Science Blog

## Articles

February 2026

[**The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?**](https://alignment.anthropic.com/2026/hot-mess-of-ai/)

When AI systems fail, will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess—taking nonsensical actions that do not further any goal?

January 2026

[**Pre-deployment auditing can catch an overt saboteur**](https://alignment.anthropic.com/2026/auditing-overt-saboteur/)

We test whether our pre-deployment alignment auditing methods can catch models trained to overtly sabotage Anthropic.

[**Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations**](https://alignment.anthropic.com/2026/petri-v2/)

We've upgraded our Petri automated behavioral auditing tool with improved realism mitigations to counter eval-awareness, an expanded seed library with 70 new scenarios, and evaluation results for more recent frontier models.

December 2025

[**Bloom: an open source tool for automated behavioral evaluations**](https://alignment.anthropic.com/2025/bloom-auto-evals/)

Bloom is an open-source automated pipeline that generates configurable evaluation suites to measure arbitrary behavioral traits in frontier LLMs without requiring ground-truth labels.

[**Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers**](https://alignment.anthropic.com/2025/activation-oracles/)

We train language models to answer questions about their own activations in natural language and evaluate how well they generalize to settings very unlike their training, such as uncovering misalignment introduced during fine-tuning.

[**Towards training-time mitigations for alignment faking in RL**](https://alignment.anthropic.com/2025/alignment-faking-mitigations/)

We construct a diverse array of model organisms of alignment faking and study mitigations that could be used during RL to decrease alignment faking and compliance gaps.

[**Open Source Replication of the Auditing Game Model Organism**](https://alignment.anthropic.com/2025/auditing-mo-replication/)

We release an open-source replication of the model organism from our previous auditing game paper.

[**Anthropic Fellows Program 2026**](https://alignment.anthropic.com/2025/anthropic-fellows-program-2026/)

Apply now for our AI safety research fellowship.

[**Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs**](https://alignment.anthropic.com/2025/selective-gradient-masking)

We localize dangerous knowledge to a subset of the model's parameters, so it can be easily removed after training.

November 2025

[**Evaluating honesty and lie detection techniques on a diverse suite of dishonest models**

We explore techniques for honesty elicitation and lie detection on a diverse testbed of dishonest model organisms.](https://alignment.anthropic.com/2025

... (truncated, 19 KB total)
Resource ID: 5a651b8ed18ffeb1 | Stable ID: ZDE0NmU0NT