Longterm Wiki

AI-Assisted Alignment

Using current AI systems to assist with alignment research tasks, including red-teaming, interpretability, and recursive oversight. AI-assisted red-teaming has reduced jailbreak success rates from 86% to 4.4%, and weak-to-strong generalization can recover near-GPT-3.5-level performance from GPT-2-level supervision.
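The weak-to-strong setup can be illustrated with a toy sketch (a hypothetical, simplified setup, not the actual GPT-2/GPT-4 experiment): a "weak supervisor" labels data noisily, and a "strong student" trained only on those noisy labels can still exceed the supervisor's own accuracy against ground truth.

```python
import math
import random

random.seed(0)

def true_label(x):
    # Ground truth: sign of a simple linear rule over 2 features.
    return 1 if x[0] + 2 * x[1] > 0 else 0

def weak_label(x):
    # Weak supervisor: knows the rule but flips 30% of labels at random.
    y = true_label(x)
    return 1 - y if random.random() < 0.30 else y

# Training data labeled only by the weak supervisor.
train = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2000)]
weak_labels = [weak_label(x) for x in train]

# Strong student: logistic regression fit by full-batch gradient descent
# on the weak supervisor's noisy labels.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):
    gw, gb, n = [0.0, 0.0], 0.0, len(train)
    for x, y in zip(train, weak_labels):
        p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        err = p - y
        gw[0] += err * x[0]
        gw[1] += err * x[1]
        gb += err
    w[0] -= lr * gw[0] / n
    w[1] -= lr * gw[1] / n
    b -= lr * gb / n

def student_pred(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Evaluate supervisor and student against held-out ground truth.
test = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2000)]
weak_acc = sum(weak_label(x) == true_label(x) for x in test) / len(test)
student_acc = sum(student_pred(x) == true_label(x) for x in test) / len(test)
print(f"weak supervisor accuracy: {weak_acc:.2f}")
print(f"strong student accuracy:  {student_acc:.2f}")
```

Because the label noise here is symmetric and label-independent, the student averages it out and recovers close to the true decision boundary; the actual research question is whether an analogous gap-recovery holds when the "noise" is a weaker model's systematic errors.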

Related Pages

Safety Research

Anthropic Core Views

Risks

AI Capability Sandbagging
Deceptive Alignment

Analysis

Model Organisms of Misalignment

Approaches

Sleeper Agent Detection

Concepts

AI-Assisted Knowledge Management
AI Doomer Worldview

Other

Jan Leike
Ilya Sutskever
Connor Leahy
Red Teaming
Interpretability

Organizations

Redwood Research
METR
Palisade Research

Policy

Voluntary AI Safety Commitments

Key Debates

Technical AI Safety Research
Is Interpretability Sufficient for Safety?

Historical

Deep Learning Revolution Era
Mainstream Era

Tags

ai-assisted-research, red-teaming, interpretability, recursive-oversight, scalable-alignment