Longterm Wiki

Cost-Effective Constitutional Classifiers

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic Alignment

Published on Anthropic's alignment research site, this work addresses the practical challenge of making safety classifiers computationally cheap enough to deploy at scale, bridging the gap between safety research and real-world deployment constraints.

Metadata

Importance: 62/100 | blog post | primary source

Summary

This Anthropic alignment research explores methods to reduce computational overhead in AI safety classifiers by repurposing existing model computations rather than running separate models. Techniques like linear probing and fine-tuning small sections of models demonstrate strong safety classification performance at minimal additional cost. This work is relevant to making scalable oversight and safety monitoring more practical in deployed systems.

Key Points

  • Linear probing on existing model representations can serve as a lightweight safety classifier without full inference overhead (see the sketch after this list).
  • Fine-tuning small sections of models (rather than full models) achieves competitive classification performance at reduced compute cost.
  • Repurposing model-internal computations for safety monitoring avoids the expense of running entirely separate classifier models.
  • Findings suggest cost-effective monitoring is viable, which could enable broader deployment of safety checks in production AI systems.
  • Relevant to scalable oversight: cheaper monitors may allow more comprehensive coverage of model outputs during inference.
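The linear-probing idea above can be made concrete with a minimal sketch: a logistic-regression head trained on activations cached from the policy model's forward pass, so classification adds essentially no extra compute. The tensor shapes, hyperparameters, and names below are illustrative assumptions, not details taken from the post.

```python
# Minimal sketch of a linear probe over cached intermediate activations.
# Hypothetical shapes and hyperparameters; not Anthropic's implementation.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Logistic-regression head on a pooled hidden state."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, hidden_dim), pooled from an intermediate layer
        return torch.sigmoid(self.score(activations)).squeeze(-1)

# Training-step sketch: activations are assumed to be cached from the policy
# model's ordinary forward pass, so the probe adds negligible extra compute.
probe = LinearProbe(hidden_dim=4096)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

def train_step(cached_acts: torch.Tensor, harmful_labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    preds = probe(cached_acts)
    loss = loss_fn(preds, harmful_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```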

Review

This research addresses a critical challenge in AI safety: developing efficient methods for detecting potentially harmful model outputs without incurring significant computational overhead. By exploring techniques like linear probing of model activations and partially fine-tuning model layers, the authors demonstrate that it's possible to create effective safety classifiers with a fraction of the computational resources typically required. The methodology leverages the rich internal representations of large language models, using techniques like exponential moving average (EMA) probes and single-layer retraining to achieve performance comparable to much larger dedicated classifiers. The research is particularly significant because it offers a practical approach to implementing robust safety monitoring systems, potentially making advanced AI safety techniques more accessible and cost-effective. However, the authors appropriately caution that their methods have not yet been tested against adaptive adversarial attacks, which represents an important avenue for future research.
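As one illustration of the probe-aggregation idea the review mentions, the sketch below smooths per-token probe scores with an exponential moving average (EMA) and flags a sequence on the peak smoothed score. The decay value, threshold, and toy scores are assumptions made for illustration, not values reported by the authors.

```python
# Sketch: turn per-token probe scores into a sequence-level decision via an
# exponential moving average. Decay and threshold are illustrative assumptions.
import torch

def ema_aggregate(token_scores: torch.Tensor, decay: float = 0.5) -> torch.Tensor:
    """Smooth per-token harmfulness scores and return the peak smoothed value."""
    ema = torch.zeros(())
    peak = torch.zeros(())
    for score in token_scores:
        ema = decay * ema + (1.0 - decay) * score
        peak = torch.maximum(peak, ema)
    return peak

token_scores = torch.tensor([0.05, 0.10, 0.85, 0.90, 0.20])  # toy probe outputs
flagged = ema_aggregate(token_scores) > 0.5                  # hypothetical threshold
```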

Cached Content Preview

HTTP 200 | Fetched Apr 7, 2026 | 32 KB
Alignment Science Blog

Cost-Effective Constitutional Classifiers via Representation Re-use

Authors: Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma

TL;DR: We study cost-effective jailbreak detection. Instead of using a dedicated jailbreak classifier, we repurpose the computations that AI models already perform by fine-tuning just the final layer or using linear probes on intermediate activations. Our fine-tuned final layer detectors outperform standalone classifiers a quarter the size of the base model, while linear probes achieve performance comparable to a classifier 2% of the size of the policy model with virtually no additional computational cost. Probe classifiers also function effectively as first stages in two-stage classification pipelines, further improving the cost-performance tradeoff. These methods could dramatically reduce the computational overhead of jailbreak detection, though further testing with adaptive adversarial attacks is needed.
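The two-stage pipeline mentioned in the TL;DR can be sketched as a simple cascade: a near-free probe screens every request, and only requests it finds suspicious are escalated to a larger standalone classifier. The thresholds and classifier interfaces below are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of a two-stage classification cascade: cheap probe first, expensive
# classifier only on escalation. Thresholds and callables are placeholders.
from typing import Callable

def two_stage_classify(
    prompt: str,
    probe_score: Callable[[str], float],      # cheap: reuses policy-model activations
    full_classifier: Callable[[str], float],  # expensive: dedicated classifier model
    escalation_threshold: float = 0.2,
    block_threshold: float = 0.5,
) -> bool:
    """Return True if the prompt should be blocked as a likely jailbreak."""
    cheap = probe_score(prompt)
    if cheap < escalation_threshold:
        return False  # clearly benign: no extra compute spent
    return full_classifier(prompt) >= block_threshold
```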

 Introduction

Sufficiently advanced AI systems could pose catastrophic risks, for example by enabling the production of chemical, biological, radiological and nuclear (CBRN) weapons, as detailed in recent safety frameworks from, among others, OpenAI, Google DeepMind, and Anthropic. Developers therefore need to ensure that the risks from deploying their models are acceptably low. Anthropic's recent Constitutional Classifiers paper (Sharma et al., 2025) demonstrated that using a separately trained model to identify dangerous inputs provides increased robustness. However, this can introduce computational overhead: for instance, using Claude 3.5 Haiku as a safety filter for Claude 3.5 Sonnet increases inference costs by approximately 25%.

 To make these methods more practical, we explore repurposing computations the AI model is already performing. We find that significant cost reductions may be achievable with minimal regressions in performance on held-out test sets, including on prompts generated by external red-teamers when attacking other classifiers. Our best method, which adds only one layer’s worth of compute, slightly exceeds the detection performance of a standalone classifier with ≈25% FLOP overhead, though best practice calls for adaptive red-teaming before taking our performance evaluations at face value. 
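A rough back-of-envelope comparison, assuming a policy model with a round number of layers (the post does not report one), illustrates why one extra layer of compute or a probe is so much cheaper than a standalone classifier with ≈25% FLOP overhead:

```python
# Back-of-envelope overhead comparison. The layer count is an assumed round
# number for illustration; only the 25% figure comes from the post.
policy_layers = 80                        # hypothetical transformer depth
standalone_overhead = 0.25                # separate classifier: ~25% extra FLOPs
final_layer_overhead = 1 / policy_layers  # fine-tuned final layer: ~one layer of compute
probe_overhead = 0.0                      # linear probe reuses activations: ~free

print(f"standalone classifier: {standalone_overhead:.1%} extra FLOPs")
print(f"final-layer detector:  {final_layer_overhead:.1%} extra FLOPs")
print(f"linear probe:          {probe_overhead:.1%} extra FLOPs")
```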

 In particular, we use the synthetic data generation pipeline from Constitutional Classifiers to train classifiers with different cost and speed profiles. We evaluate two methods to reduce the computational overhead of safety classifiers:

Linear probing of model activations - Using lightweight classifiers on the model's internal representations to efficiently detect harmful content.
Partially fine-tuned models - Retraining only the final layers of existing models for safety classification, so 

... (truncated, 32 KB total)
Resource ID: 59e8b7680b0b0519 | Stable ID: sid_sTtUNRbnb3