Capability Unlearning / Removal
Capability unlearning removes dangerous capabilities (e.g., bioweapon synthesis) from AI models through gradient-based methods, representation engineering, and fine-tuning, achieving 60-80% reduction on WMDP benchmarks with combined approaches. However, verifying complete removal is extremely difficult, capabilities can be recovered through fine-tuning, and knowledge entanglement limits what can be safely removed, making this a defense-in-depth layer rather than a complete solution.
Center for AI Safety
Approaches
Representation Engineering
Policies
Responsible Scaling Policies (RSPs)
Overview
Capability unlearning represents a direct approach to AI safety: rather than preventing misuse through behavioral constraints that might be circumvented, remove the dangerous capabilities themselves from the model. If a model genuinely doesn't know how to synthesize dangerous pathogens or construct cyberattacks, it cannot be misused for these purposes regardless of jailbreaks, fine-tuning attacks, or other elicitation techniques.
The approach has gained significant research attention following the development of benchmarks like WMDP (Weapons of Mass Destruction Proxy), released in March 2024 by the Center for AI Safety in collaboration with over twenty academic institutions and industry partners. WMDP contains 3,668 multiple-choice questions measuring dangerous knowledge in biosecurity, cybersecurity, and chemical security. Researchers have demonstrated that various techniques, including gradient-based unlearning, representation engineering, and fine-tuning, can reduce model performance on these benchmarks while preserving general capabilities.
However, the field faces fundamental challenges that may limit its effectiveness. First, verifying complete capability removal is extremely difficult, as capabilities may be recoverable through fine-tuning, prompt engineering, or other elicitation methods. Second, dangerous and beneficial knowledge are often entangled, meaning removal may degrade useful capabilities. Third, for advanced AI systems, the model might understand what capabilities are being removed and resist or hide the remaining knowledge. These limitations suggest capability unlearning is best viewed as one layer in a defense-in-depth strategy rather than a complete solution.
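
To make the benchmark-based measurement described above concrete, here is a minimal sketch of scoring a multiple-choice benchmark before and after unlearning. It assumes four answer options per question (so random guessing scores 25%), and the `above_chance_removed` metric is an illustrative summary chosen for this example, not an official WMDP metric. The answer keys and predictions are invented toy data.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def above_chance_removed(acc_before, acc_after, chance=0.25):
    """Share of above-chance accuracy eliminated by unlearning.

    1.0 means the model fell to chance level; 0.0 means no change.
    `chance=0.25` assumes four answer options per question.
    """
    headroom = acc_before - chance
    if headroom <= 0:
        raise ValueError("baseline accuracy must exceed chance")
    return (acc_before - acc_after) / headroom


# Invented toy data: eight questions with hypothetical model answers.
answers = ["A", "B", "C", "D", "A", "B", "C", "D"]
before = ["A", "B", "C", "D", "A", "B", "A", "A"]  # 6 of 8 correct
after = ["A", "B", "C", "A", "B", "C", "D", "A"]   # 3 of 8 correct

acc_before = accuracy(before, answers)
acc_after = accuracy(after, answers)
removed = above_chance_removed(acc_before, acc_after)
print(acc_before, acc_after, removed)
```

Measuring against chance rather than against zero matters here: a model that drops to 25% on a four-option benchmark has lost all measurable advantage, even though its raw accuracy is far from zero. The same `accuracy` function would be run on a general-capability benchmark to check that useful knowledge was preserved.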
Risk Assessment & Impact
| Dimension | Assessment | Evidence | Timeline |
|---|---|---|---|
| Safety Uplift | High (if works) | Would directly remove dangerous capabilities | Near to medium-term |
| Capability Uplift | Negative | Explicitly removes capabilities | N/A |
| Net World Safety | Helpful | Would be valuable if reliably achievable | Near-term |
| Lab Incentive | Moderate | Useful for deployment compliance; may reduce utility | Current |
| Research Investment | $1-20M/yr | Academic research, some lab interest | Current |
| Current Adoption | Experimental | Research papers; not reliably deployed | Current |
Unlearning Approaches
Gradient-Based Unlearning
| Aspect | Description |
|---|---|
| Mechanism | Compute gradients to increase loss on dangerous capabilities |
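
The mechanism above can be sketched on a toy model. The example below is a minimal illustration, not any published unlearning method: it uses a two-feature logistic classifier in pure Python, takes gradient ascent steps on a "forget" set, and adds a gradient descent term on a "retain" set to preserve other behavior. The data, the starting weights, and the names `unlearn_step` and `retain_weight` are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, data):
    """Mean binary cross-entropy and its gradient for a linear logistic model."""
    n = len(data)
    loss = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi / n
    return loss, grad


def unlearn_step(w, forget, retain, lr=0.5, retain_weight=1.0):
    """One update: ascend the forget-set loss, descend the retain-set loss."""
    _, g_forget = loss_and_grad(w, forget)
    _, g_retain = loss_and_grad(w, retain)
    return [
        wi + lr * gf - lr * retain_weight * gr
        for wi, gf, gr in zip(w, g_forget, g_retain)
    ]


# Invented toy data: feature 0 drives the retained task, feature 1 the forgotten one.
retain = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
forget = [([0.0, 1.0], 1), ([0.0, -1.0], 0)]

w = [2.0, 2.0]  # stands in for a model that already performs both tasks
forget_before, _ = loss_and_grad(w, forget)
retain_before, _ = loss_and_grad(w, retain)

for _ in range(20):
    w = unlearn_step(w, forget, retain)

forget_after, _ = loss_and_grad(w, forget)
retain_after, _ = loss_and_grad(w, retain)
print(f"forget loss: {forget_before:.3f} -> {forget_after:.3f}")
print(f"retain loss: {retain_before:.3f} -> {retain_after:.3f}")
```

In this toy setup the two tasks depend on separate features, so raising the forget loss costs nothing on the retain set. Real models rarely separate this cleanly, which is the entanglement problem discussed in the overview. Note also that plain gradient ascent has no natural stopping point, so the number of steps here is an arbitrary choice.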
WMDP questions were designed as proxies for hazardous knowledge rather than containing sensitive information directly. The benchmark is publicly available, with the most dangerous questions withheld.
Unlearning Effectiveness
The TOFU benchmark (published at COLM 2024) evaluates unlearning on synthetic author profiles, measuring both forgetting quality and model utility retention.
| Risk | Relevance | Rationale | Limitation |
|---|---|---|---|
| Bioweapons Risk | High | Removes pathogen synthesis, enhancement knowledge | Dual-use biology knowledge entangled |
| Cyberattacks | High | Removes exploit development, attack techniques | Security knowledge widely distributed |
| | High | Directly reduces dangerous capability surface | Recovery via fine-tuning possible |
| Open Sourcing Risk | High | Critical for open-weight releases where runtime controls absent | |
The Center for AI Safety conducts technical and conceptual research to mitigate potential catastrophic risks from advanced AI systems. They take a comprehensive approach spanning technical research, philosophy, and societal implications.
AI Uplift Assessment Model
Bioweapons Attack Chain Model
AI-Bioweapons Timeline Model
Approaches
Refusal Training
Dangerous Capability Evaluations
Eliciting Latent Knowledge (ELK)
Key Debates
AI Misuse Risk Cruxes