Training Methods

Training methods for alignment shape model behavior during the learning process itself.

Core Approaches:

  • RLHF: Reinforcement Learning from Human Feedback, the foundation of modern alignment training
  • Constitutional AI: Self-critique and revision guided by a written set of principles
  • Preference Optimization: Learning directly from preference comparisons without a separate reward model (DPO, IPO); see the sketch after this list
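
As a concrete illustration of direct preference learning, here is a minimal PyTorch sketch of the DPO objective. It assumes the caller supplies summed log-probabilities for the chosen and rejected response in each preference pair, from both the policy being trained and a frozen reference model; the function and argument names are illustrative, not a specific library's API.

  import torch
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      # Implicit rewards: beta-scaled log-ratios of policy to reference model
      chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
      rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
      # Maximize the margin between chosen and rejected responses
      return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

  # Toy usage: a batch of 4 preference pairs with random log-probabilities
  loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))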

Specialized Techniques:

  • Process Supervision: Rewarding reasoning steps, not just outcomes
  • Reward Modeling: Learning a scalar reward function from human pairwise comparisons (see the sketch after this list)
  • Refusal Training: Teaching models to decline harmful requests
  • Adversarial Training: Improving robustness by training on adversarial examples
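
As a sketch of reward modeling, a common choice is a Bradley-Terry style pairwise loss over the scalar scores the reward model assigns to the preferred and dispreferred response in each comparison. The names below are assumptions for illustration.

  import torch
  import torch.nn.functional as F

  def reward_model_loss(rewards_chosen, rewards_rejected):
      # Bradley-Terry pairwise objective: the human-preferred response
      # should receive a higher scalar score than the dispreferred one.
      return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

  # Toy usage: scores for a batch of 4 comparisons
  loss = reward_model_loss(torch.randn(4), torch.randn(4))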

Advanced Methods:

  • Weak-to-Strong Generalization: Can weak supervisors train stronger models? (see the sketch after this list)
  • Capability Unlearning: Removing dangerous knowledge or capabilities from a trained model
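
To make weak-to-strong supervision concrete, here is a minimal training-step sketch in the spirit of the weak-to-strong generalization setup: a strong student is fit to a frozen weak supervisor's soft labels, plus an auxiliary term that pulls the student toward its own confident predictions. The models, the alpha weighting, and the exact auxiliary term are assumptions for illustration, not the canonical recipe.

  import torch
  import torch.nn.functional as F

  def weak_to_strong_step(strong_model, weak_model, inputs, optimizer, alpha=0.5):
      # Frozen weak supervisor provides soft pseudo-labels (assumed classifier logits)
      with torch.no_grad():
          weak_probs = weak_model(inputs).softmax(dim=-1)

      strong_logits = strong_model(inputs)
      strong_logprobs = strong_logits.log_softmax(dim=-1)

      # Fit the strong student to the weak supervisor's labels ...
      loss_weak = -(weak_probs * strong_logprobs).sum(dim=-1).mean()
      # ... plus an auxiliary confidence term toward the student's own hard predictions
      loss_self = F.cross_entropy(strong_logits, strong_logits.argmax(dim=-1))
      loss = (1 - alpha) * loss_weak + alpha * loss_self

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      return loss.item()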