Longterm Wiki

AI Output Filtering

Comprehensive analysis of AI output filtering showing detection rates of 70-98% depending on content type, with 100% of models vulnerable to jailbreaks per UK AISI testing, though Anthropic's Constitutional Classifiers blocked 95.6% of attacks. Concludes filtering provides marginal safety benefits f

Related Pages

Top Related Pages

Approaches

Refusal TrainingTool-Use RestrictionsAlignment Evaluations

Organizations

OpenAI

Concepts

Alignment Deployment OverviewSituational Awareness