Natural Abstractions
Concept
The hypothesis that many different learning processes, trained on the same world, converge on the same "natural" abstractions. If true, advanced AI systems would internally use concepts recognizable to humans, which would make interpretability and alignment substantially easier.
Related
Safety Agendas
Interpretability
Approaches
Representation Engineering
Sleeper Agent Detection
AI-Assisted Alignment
Mechanistic Interpretability
Risks
Deceptive Alignment
Analysis
Model Organisms of Misalignment
Capability-Alignment Race Model
Safety Research
Anthropic Core Views
Organizations
Anthropic
Conjecture
Key Debates
AI Alignment Research Agendas
Technical AI Safety Research
Is Interpretability Sufficient for Safety?
Concepts
Dense Transformers
Historical
Deep Learning Revolution Era
Mainstream Era
Other
Dario Amodei
Yoshua Bengio
Chris Olah
Neel Nanda