Training Methods (Overview)
Training methods for alignment focus on shaping model behavior during the learning process.
Core Approaches:
- **RLHF**: Reinforcement Learning from Human Feedback - the foundation of modern alignment training
- **Constitutional AI**: Self-critique based on principles
- **Preference Optimization**: Direct preference learning (DPO, IPO)
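DPO's appeal is that it replaces RLHF's separate reward model and RL loop with a single supervised loss over preference pairs. A minimal sketch of that loss, assuming scalar sequence log-probabilities (real implementations sum token-level log-probs from the policy and a frozen reference model; the function name is illustrative):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin measures how much more the policy (relative to the
    frozen reference) favors the chosen response over the rejected one."""
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-logits))  # -log sigmoid(logits)
```

If the policy has not moved from the reference at all, the loss is log(2); it falls as the policy learns to favor chosen over rejected responses more strongly than the reference does.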
Specialized Techniques:
- **Process Supervision**: Rewarding reasoning steps, not just outcomes
- **Reward Modeling**: Learning human preferences from comparisons
- **Refusal Training**: Teaching models to decline harmful requests
- **Adversarial Training**: Robustness through adversarial examples
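Several of these techniques rest on a learned reward model trained from pairwise human comparisons. The standard training objective is a Bradley-Terry loss over response pairs; a minimal sketch, with scalar scores standing in for a neural reward head (the function name is illustrative):

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log P(chosen beats rejected) under the Bradley-Terry model,
    i.e. -log sigmoid(reward_chosen - reward_rejected)."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))
```

The loss shrinks as the model scores the preferred response further above the rejected one, so gradient descent pushes the reward head toward reproducing the human ranking.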
Advanced Methods:
- **Weak-to-Strong Generalization**: Can weak supervisors train strong models?
- **Capability Unlearning**: Removing dangerous knowledge
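Weak-to-strong results are typically reported as Performance Gap Recovered (PGR): the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that weak supervision manages to recover. A minimal sketch of the metric (argument names illustrative):

```python
def performance_gap_recovered(weak, strong_weakly_supervised, strong_ceiling):
    """PGR = (weakly supervised strong model - weak supervisor)
           / (strong ceiling - weak supervisor).
    1.0 means weak supervision fully elicited the strong model's ability;
    0.0 means the student did no better than its weak supervisor."""
    return (strong_weakly_supervised - weak) / (strong_ceiling - weak)

# e.g. weak supervisor at 60%, weakly supervised strong model at 76%,
# strong ceiling at 80%  ->  PGR = 0.16 / 0.20 = 0.8
```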