Training Methods (Overview)

Training methods for alignment shape a model's behavior during the learning process itself, rather than through controls applied at deployment time.

Core Approaches:

  • RLHF: Reinforcement Learning from Human Feedback - the foundation of modern alignment training
  • Constitutional AI: Self-critique and revision guided by an explicit set of written principles
  • Preference Optimization: Learning directly from preference pairs, without a separate reward model (DPO, IPO) - see the sketch after this list
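
A minimal PyTorch sketch of the DPO objective, assuming summed per-response log-probabilities have already been computed under the policy and a frozen reference model; the function and argument names are illustrative, not taken from any page linked here.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities that the
    policy or reference model assigns to the chosen/rejected responses;
    `beta` controls how far the policy may drift from the reference.
    """
    # Implicit reward of each response: beta * log(pi_policy / pi_ref).
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: push chosen above rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```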

Specialized Techniques:

  • Process Supervision: Rewarding reasoning steps, not just outcomes
  • Reward Modeling: Learning a scalar model of human preferences from pairwise comparisons - see the sketch after this list
  • Refusal Training: Teaching models to decline harmful requests
  • Adversarial Training: Robustness through adversarial examples
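
A sketch of the pairwise (Bradley-Terry) loss typically used to fit a reward model to human comparison data; the names are illustrative, and the scores are assumed to come from a scalar reward head on top of a language model.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward model training.

    `chosen_scores` / `rejected_scores` are scalar rewards the model
    assigns to the human-preferred and dispreferred response in each pair.
    """
    # Maximize P(chosen ranked above rejected) = sigmoid(score margin).
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

The trained scalar reward then serves as the optimization target for the RLHF step listed above.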

Advanced Methods:

  • Weak-to-Strong Generalization: Can weak supervisors elicit the capabilities of stronger models? See the sketch after this list
  • Capability Unlearning: Removing dangerous knowledge
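
A sketch of the auxiliary-confidence loss studied in the weak-to-strong generalization work (Burns et al., 2023): the strong student partly imitates the weak supervisor's labels and partly reinforces its own confident predictions. The fixed `alpha` and the argmax hardening are simplifications of the paper's setup, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Auxiliary-confidence loss for training a strong student on weak labels.

    `strong_logits`: [batch, classes] logits from the strong model.
    `weak_labels`:   [batch] hard labels produced by a weaker supervisor.
    """
    # Term 1: imitate the (noisy) weak supervisor.
    imitation = F.cross_entropy(strong_logits, weak_labels)
    # Term 2: reinforce the student's own hardened predictions, letting it
    # overrule weak-label errors it is confident about.
    hardened = strong_logits.argmax(dim=-1).detach()
    self_confidence = F.cross_entropy(strong_logits, hardened)
    return (1 - alpha) * imitation + alpha * self_confidence
```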

Related Pages

Concepts

RLHF

Approaches

Refusal Training
Adversarial Training
Reward Modeling