Training Methods
Training methods for alignment shape model behavior during the learning process itself.
Core Approaches:
- RLHF: Reinforcement Learning from Human Feedback - the foundation of modern alignment training
- Constitutional AI: Self-critique and revision against explicit principles, using AI-generated feedback (RLAIF)
- Preference Optimization: Direct preference learning (DPO, IPO); see the sketch after this list
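To ground the preference-optimization entry, here is a minimal DPO sketch in PyTorch. It assumes the summed per-response log-probabilities under the trainable policy and a frozen reference model have already been computed; the function name, tensor shapes, and random toy inputs are illustrative rather than drawn from any particular codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-response summed log-probabilities.

    Each argument is a 1-D tensor: log p(response | prompt) under the
    trainable policy or the frozen reference model, for the human-chosen
    and human-rejected responses in each preference pair.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random stand-ins for real model log-probabilities.
batch = [torch.randn(8) for _ in range(4)]
print(dpo_loss(*batch).item())
```

Because the loss needs only these log-probabilities and a preference dataset, DPO avoids training a separate reward model or running a reinforcement-learning loop, which is where its cost savings over RLHF come from.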
Specialized Techniques:
- Process Supervision: Rewarding correct reasoning steps, not just final outcomes
- Reward Modeling: Learning human preferences from pairwise comparisons (see the sketch after this list)
- Refusal Training: Teaching models to decline harmful requests
- Adversarial Training: Building robustness by training on adversarial examples
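Reward modeling in the list above reduces to pairwise ranking: a network should score the human-chosen response above the rejected one. Below is a minimal Bradley-Terry-style sketch in PyTorch under the simplifying assumption that responses are already encoded as fixed-size embeddings; the RewardModel class and its dimensions are hypothetical stand-ins for a real encoder plus scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward head: maps a fixed-size response embedding to a scalar score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def preference_loss(model: nn.Module, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the chosen response should score higher
    # than the rejected response for the same prompt.
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy usage: random embeddings stand in for an encoded preference dataset.
rm = RewardModel()
chosen, rejected = torch.randn(16, 128), torch.randn(16, 128)
loss = preference_loss(rm, chosen, rejected)
loss.backward()
print(loss.item())
```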
Advanced Methods:
- Weak-to-Strong Generalization: Can weak supervisors train strong models? (see the sketch after this list)
- Capability Unlearning: Removing dangerous knowledge and capabilities from trained models
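Weak-to-strong results in the list above are typically reported as Performance Gap Recovery (PGR): the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that is closed when the strong model is trained only on weak labels. A small sketch of that calculation, with illustrative accuracies rather than reported results:

```python
def performance_gap_recovered(weak_acc: float,
                              strong_ceiling_acc: float,
                              weak_to_strong_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap that is recovered
    when the strong model is trained only on the weak supervisor's labels."""
    return (weak_to_strong_acc - weak_acc) / (strong_ceiling_acc - weak_acc)

# Illustrative numbers only, not reported results: a weak supervisor at 60%
# accuracy, a strong-model ceiling of 90%, and a weak-to-strong model at 84%
# gives PGR = 0.8, i.e. 80% of the gap recovered.
print(round(performance_gap_recovered(0.60, 0.90, 0.84), 3))
```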