AI Output Filtering
Comprehensive analysis of AI output filtering showing detection rates of 70-98% depending on content type, with 100% of models vulnerable to jailbreaks per UK AISI testing, though Anthropic's Constitutional Classifiers blocked 95.6% of attacks. Concludes that filtering provides marginal safety benefits.
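The mechanism behind the detection rates above can be sketched as a score-and-threshold gate over model outputs. This is a minimal illustration, not any vendor's actual implementation: the `harm_score` keyword scorer and the 0.5 threshold are hypothetical stand-ins for a trained classifier.

```python
# Minimal sketch of an output filter: score the candidate output,
# withhold it if the score crosses a threshold.
REFUSAL = "[output withheld by safety filter]"

def harm_score(text: str) -> float:
    """Stand-in scorer returning a value in [0, 1].
    A real system would use a trained classifier, not a keyword list."""
    blocklist = {"synthesize", "exploit"}  # illustrative only
    hits = sum(word in text.lower() for word in blocklist)
    return min(1.0, hits / len(blocklist))

def filter_output(text: str, threshold: float = 0.5) -> str:
    """Return the text unchanged, or a refusal if it scores too high."""
    return REFUSAL if harm_score(text) >= threshold else text

print(filter_output("Here is a poem about spring."))
print(filter_output("Steps to exploit and synthesize the agent..."))
```

The 70-98% detection range cited above corresponds to how well the scorer separates harmful from benign content at the chosen threshold; keyword matching sits at the weak end, learned classifiers at the strong end.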
Related Pages
Circuit Breakers / Inference Interventions
Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference.
Structured Access / API-Only
Structured access provides AI capabilities through controlled APIs rather than releasing model weights, maintaining developer control over deployment.
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude models.
Goodfire
AI interpretability research lab developing tools to decode and control neural network internals for safer AI systems
AI Safety Defense in Depth Model
Mathematical framework analyzing how layered AI safety measures combine, showing independent layers with 20-60% failure rates can achieve 1-3% combined failure rates.
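The arithmetic behind the defense-in-depth figures is just a product of per-layer failure probabilities under an independence assumption. The layer rates below are illustrative values drawn from the 20-60% range cited above, not measurements from any specific system.

```python
# Combined failure rate of stacked safety layers, assuming a harmful
# output slips through only if EVERY layer fails independently.
def combined_failure(rates):
    """Product of per-layer failure probabilities."""
    p = 1.0
    for r in rates:
        p *= r
    return p

layers = [0.5, 0.3, 0.2]  # three hypothetical layers, each 20-60%
print(f"{combined_failure(layers):.0%}")  # three weak layers -> 3%
```

Independence is the load-bearing assumption: jailbreaks that defeat multiple layers at once (as in the UK AISI results above) correlate the failures and push the combined rate back up toward the weakest layer's rate.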