Longterm Wiki
Updated 2026-03-13
Summary

Architecture proposal for next-generation wiki page generation. Analyzes current single-pipeline limitations, surveys state-of-the-art (Stanford STORM, GraphRAG, CrewAI, Self-Refine), and proposes a multi-agent multi-pass architecture with 8 specialist agents, 12+ composable passes, knowledge graph-driven linking, and dynamic computation embedding. Includes concrete implementation plan, composable module architecture, and academic foundations drawing on proposition-level retrieval (Dense X Retrieval), FActScore verification, nanopublications, and argumentation frameworks.

Change History
Route internal pages through /wiki/E<id> (#182, 3 weeks ago)

Migrated internal pages from `/internal/` to `/wiki/E<id>` URLs so they render with full wiki infrastructure (breadcrumbs, metadata, quality indicators, sidebar). Internal MDX pages now redirect from `/internal/slug` to `/wiki/E<id>`, while React dashboard pages (suggested-pages, updates, page-changes, etc.) remain at `/internal/`. Follow-up review: cleaned up dead code, hid wiki-specific UI on internal pages, fixed breadcrumbs, updated all bare-text `/internal/` references.

Wiki generation architecture research & proposal (#173, 4 weeks ago)

Researched state-of-the-art approaches to scalable wiki generation (Stanford STORM, Microsoft GraphRAG, CrewAI, Self-Refine, SemanticCite, Anthropic multi-agent systems, KARMA) and wrote a comprehensive architecture proposal for multi-agent, multi-pass wiki page generation. The proposal covers 8 specialist agents, 12+ composable passes, knowledge graph-driven content planning, dynamic computation embedding, and iterative refinement loops.


Wiki Generation Architecture: Multi-Agent Multi-Pass Design

Status: Vision Document

This is an architecture proposal, not a description of the current system. Some elements have been partially implemented (section-level rewriting, KB fact system, basic citation verification), but the full multi-agent orchestrator has not been built. The KB system (packages/kb/) now provides the structured data layer that this architecture assumes as a prerequisite — structured facts in YAML, <KBF> for inline values, and <Calc> for computed values.

Executive Summary

See also: The Claim-First Wiki Architecture proposal (removed) was a companion proposal that inverted the data model, making verified atomic claims the primary artifact. That proposal's structured-data ideas have been partially superseded by the KB system (packages/kb/), which provides entity-level structured facts in YAML, the <KBF> component for inline fact values, and <Calc> for computed values.

Our current page generation pipeline (Crux content create/improve) is a single-pipeline, single-agent system. It works, but produces pages that are adequate rather than excellent. The best wiki pages require depth that a single LLM pass cannot achieve: dense cross-linking, verified citations, complex diagrams, embedded calculations, and knowledge graph coherence.

This document proposes a multi-agent, multi-pass architecture inspired by Stanford's STORM, Microsoft's GraphRAG, CrewAI's specialist agent patterns, and the Self-Refine iterative paradigm. The core idea: decompose page generation into composable passes, each executed by a specialist agent optimized for one concern.

| Current System | Proposed System |
|---|---|
| Single synthesis prompt | 12+ composable passes |
| One LLM does everything | 8 specialist agents |
| Research then write (2 phases) | Research, structure, write, link, verify, compute, diagram, review (8+ phases) |
| Knowledge graph consulted at link time | Knowledge graph drives content planning |
| Static calculations | Dynamic Squiggle models derived from wiki data |
| Post-hoc validation | Validation integrated into each pass |
| $4-15 per page | $8-25 per page (higher quality ceiling) |

Part 1: Problems with the Current System

What We Have

The current pipeline (crux/authoring/page-creator.ts) follows this flow:

canonical-links -> research -> source-fetching -> synthesis -> verification -> validation -> grade

This produces pages scoring 70-80/100 on our grading rubric. The pipeline has been iterated significantly (see the Page Creator Pipeline report) and represents solid work. But it has structural limitations:

Limitation 1: Single-Agent Synthesis Bottleneck

One Claude call synthesizes the entire article from research. This means the model must simultaneously:

  • Write coherent prose
  • Place citations correctly
  • Decide which EntityLinks to use
  • Structure sections per template
  • Include appropriate tables and diagrams
  • Maintain balanced perspective

No single prompt can optimize all of these. The result: pages that are structurally correct but lack depth in cross-linking, calculations, and visual elements.

Limitation 2: Knowledge Graph is Read-Only

The current system consults the entity database to resolve EntityLinks, but doesn't use the knowledge graph to plan content. A page about "deceptive alignment" should proactively cover its graph neighbors (situational awareness, mesa-optimization, sleeper agents) with appropriate depth. Currently, this happens only if the LLM independently decides to mention them.

Limitation 3: No Iterative Deepening

The pipeline runs once. If the synthesis phase produces a page with weak sections, those sections stay weak. The review phase in the improver can identify gaps, but the fix is another monolithic LLM call. There's no mechanism for targeted, section-level improvement.

Update (Feb 2026): The --section-level flag (pnpm crux content improve <id> --section-level) now implements per-section rewriting: the page is split on ## headings, each section rewritten independently via rewriteSection(), then reassembled with renumbered footnotes. See crux/lib/section-splitter.ts and crux/authoring/page-improver/phases/improve-sections.ts. This addresses the "targeted improvement" limitation above; the deeper limitations (graph-aware planning, diagram agents) remain future work.

Limitation 4: Diagrams and Calculations are Afterthoughts

Mermaid diagrams and Squiggle models are included only if the synthesis prompt happens to produce them. There's no dedicated agent reasoning about what visual or computational elements would add value, and no agent that specializes in producing high-quality versions of these.

Limitation 5: Cross-Linking is Shallow

EntityLinks are added during synthesis, then validated. But the system doesn't reason about the topology of links: which inbound links should this page attract? Which pages should link to this one? A new page about "compute governance" should trigger updates to pages about "compute thresholds," "chip export controls," and "training run monitoring."


Part 2: State of the Art

Stanford STORM (2024)

STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking) is the closest academic system to what we need. Key innovations:

| Innovation | How It Works | Relevance to Us |
|---|---|---|
| Perspective-guided research | Discovers multiple perspectives by surveying similar articles, then simulates conversations from each perspective | We could mine perspectives from our existing 625 pages on related topics |
| Simulated expert conversations | A "writer" agent asks questions to a "topic expert" agent grounded in search results. Follow-up questions arise naturally | Better than our "dump all research into one synthesis prompt" approach |
| Two-stage pipeline | Pre-writing (research + outline) is separated from writing. Outline quality correlates with article quality | We already do this loosely; could formalize it |
| Co-STORM mind map | Organizes collected information into a hierarchical concept structure updated throughout the process | Maps to our entity graph, but dynamically maintained during authoring |

Key finding: STORM articles were rated 25% better organized and 10% broader in coverage than baseline RAG approaches by Wikipedia editors.

Limitation: STORM produces Wikipedia-style articles but doesn't handle our specific requirements: EntityLinks, Squiggle models, Mermaid diagrams, YAML entity synchronization, or the frontmatter/grading system.

Microsoft GraphRAG (2024)

GraphRAG extends RAG with knowledge graph structure. Instead of retrieving text chunks, it retrieves subgraphs -- entities, relationships, and community summaries.

| Innovation | How It Works | Relevance to Us |
|---|---|---|
| Community detection | Clusters related entities and generates hierarchical summaries | We could use this to identify which entities a new page should cover |
| Global search via map-reduce | Pre-generates community summaries, then runs map-reduce across them for corpus-wide questions | Useful for "what's the relationship between X and all its neighbors?" |
| Entity extraction pipeline | Extracts entities and relationships from text, builds graph | We already have this (YAML entities + content scanning), but could improve |

Key finding: GraphRAG dramatically outperforms naive RAG on multi-hop reasoning and synthesis questions. Exactly the kind of reasoning needed for wiki cross-linking.

CrewAI Specialist Agent Pattern (2025)

CrewAI demonstrates that splitting work across specialist agents with clear handoff contracts produces better results than one mega-agent.

The pattern: Researcher -> Writer -> Editor -> Specialist, with each agent optimized for its role (different system prompts, different tools, potentially different models).

Key insight from CrewAI: "Squeezing too much into one agent causes context windows to blow up, too many tools confuse it, and hallucinations increase." This directly explains our synthesis bottleneck.

Self-Refine (Madaan et al., 2023)

Self-Refine demonstrates that iterative generate -> feedback -> refine loops improve LLM output by ~20% on average. The key: the same model generates, critiques, and refines, but with different prompts for each role.

Key finding: The refine loop works best when feedback is specific and actionable (not "make it better" but "paragraph 3 lacks a citation for the 40% claim"). This maps to our validation rules, which already produce specific, fixable issues.

SemanticCite (2025)

SemanticCite proposes a pipeline for citation verification: extract claims, retrieve source passages via hybrid search, classify support level (SUPPORTED / PARTIALLY SUPPORTED / UNSUPPORTED / UNCERTAIN). Their fine-tuned models achieve competitive performance with commercial systems.

Relevance: We already have a verify-sources phase, but it's coarse-grained. Per-claim verification with confidence scoring would significantly improve citation quality.

Anthropic Multi-Agent Research System (2025)

Anthropic's own research system uses an orchestrator-worker pattern: a lead agent analyzes a query, develops strategy, and spawns subagents to explore different aspects in parallel. Multi-agent Opus + Sonnet outperformed single-agent Opus by 90.2% on their research eval.

Key insight: Use expensive models (Opus) for orchestration and synthesis, cheap models (Sonnet/Haiku) for parallel research and extraction. This is exactly the cost structure we should adopt.


Part 3: Proposed Architecture

Core Principle: Composable Passes

Instead of a monolithic pipeline, we define passes that can be composed in different orders depending on the page type, tier, and goals. Each pass:

  1. Takes a well-defined input (page draft + metadata)
  2. Produces a well-defined output (modified draft + metadata)
  3. Is idempotent (running it twice produces the same result)
  4. Has a cost estimate
  5. Can be run independently for debugging
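The scheduling implied by the `requires` contract is a plain topological sort over pass dependencies. A minimal sketch, assuming a `PassSpec` shape and `orderPasses` helper that are illustrative rather than the actual Crux API:

```typescript
// Dependency-ordered pass scheduling (illustrative, not the real Crux API).
type PassSpec = { id: string; requires: string[] };

function orderPasses(passes: PassSpec[]): string[] {
  const done = new Set<string>();
  const ordered: string[] = [];
  let remaining = [...passes];
  while (remaining.length > 0) {
    // A pass is ready once everything it requires has already run.
    const ready = remaining.filter((p) => p.requires.every((r) => done.has(r)));
    if (ready.length === 0) throw new Error("cycle or missing dependency");
    for (const p of ready) {
      done.add(p.id);
      ordered.push(p.id);
    }
    remaining = remaining.filter((p) => !done.has(p.id));
  }
  return ordered;
}

const plan = orderPasses([
  { id: "C1", requires: ["S1"] },
  { id: "S1", requires: ["R2"] },
  { id: "R2", requires: [] },
]);
// plan: ["R2", "S1", "C1"]
```

Because each pass is idempotent, re-running a prefix of this order (e.g. during refinement) is safe.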
[Diagram]

The 8 Specialist Agents

Each agent has a focused role, specific tools, and an optimal model choice:

| # | Agent | Role | Model | Tools | Cost/Page |
|---|---|---|---|---|---|
| 1 | Orchestrator | Plans strategy, schedules passes, checks quality gates | Opus | All agents, quality scorer | $1-2 |
| 2 | Researcher | Web search, academic search, source fetching | Sonnet | Perplexity, SCRY, Firecrawl | $1-3 |
| 3 | Graph Analyst | Analyzes knowledge graph neighbors, plans cross-links | Sonnet | Entity DB, backlinks, graph data | $0.50-1 |
| 4 | Structurer | Generates outlines, ensures template compliance | Sonnet | Page templates, existing page analysis | $0.50-1 |
| 5 | Writer | Section-by-section prose synthesis from research | Opus | Research output, entity lookup | $2-4 |
| 6 | Enricher | Creates diagrams, Squiggle models, tables | Sonnet | Mermaid validator, Squiggle runtime | $1-2 |
| 7 | Verifier | Citation checking, EntityLink resolution, fact validation | Haiku | Source DB, validation engine | $0.25-0.50 |
| 8 | Reviewer | Identifies gaps, bias, weak sections; triggers re-passes | Opus | Quality rubric, template checker | $1-2 |

Total estimated cost: $8-16 for standard tier (vs $4-6 currently). The quality ceiling is substantially higher.

The 12+ Composable Passes

Research Passes

Pass R1: Perspective Discovery

  • Input: Topic title + entity type
  • Process: Survey our existing pages on related topics. What perspectives do they cover? What's missing? (Inspired by STORM's perspective mining)
  • Output: List of 5-10 perspectives to investigate (e.g., for "compute governance": technical feasibility, political economy, international coordination, industry self-regulation, civil liberties)
  • Agent: Graph Analyst
  • Cost: $0.25

Pass R2: Multi-Source Research

  • Input: Topic + perspectives list
  • Process: For each perspective, run targeted Perplexity queries. Fetch and register sources.
  • Output: research.json with categorized findings per perspective
  • Agent: Researcher
  • Cost: $1-3

Pass R3: Graph Neighbor Analysis

  • Input: Topic + entity database
  • Process: Identify all entities within 2 hops in the knowledge graph. Analyze which are most relevant and what relationship labels apply. Determine which existing pages should link to this new page.
  • Output: graph-context.json with neighbor entities, relationship types, and suggested inbound link updates
  • Agent: Graph Analyst
  • Cost: $0.50
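The 2-hop walk itself is mechanically simple; the LLM work is in ranking and labeling what it finds. A sketch of the walk over a hypothetical adjacency map (entity IDs invented for illustration):

```typescript
// Breadth-first walk collecting every entity within maxHops of a topic.
type Graph = Map<string, string[]>;

function neighborsWithinHops(graph: Graph, start: string, maxHops: number): Set<string> {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const nb of graph.get(node) ?? []) {
        if (!seen.has(nb)) {
          seen.add(nb);
          next.push(nb);
        }
      }
    }
    frontier = next;
  }
  seen.delete(start); // the topic itself is not a neighbor
  return seen;
}

const g: Graph = new Map([
  ["compute-governance", ["compute-thresholds", "chip-export-controls"]],
  ["compute-thresholds", ["training-run-monitoring"]],
]);
const hood = neighborsWithinHops(g, "compute-governance", 2);
// hood: compute-thresholds, chip-export-controls, training-run-monitoring
```

The Graph Analyst would then pass this candidate set, plus relationship labels, to an LLM call for relevance scoring.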

Pass R4: Existing Content Analysis

  • Input: Topic + similar pages (from redundancy detection)
  • Process: Read the top 5 most similar existing pages. Identify what this page should cover that they don't, and what it can reference rather than repeat.
  • Output: content-gap.json with unique angles and cross-references
  • Agent: Graph Analyst
  • Cost: $0.50

Structure Passes

Pass S1: Outline Generation

  • Input: Research output + template + graph context
  • Process: Generate a detailed section-by-section outline with word count targets and required elements per section (tables, citations, diagrams)
  • Output: outline.json with sections, subsections, planned elements
  • Agent: Structurer
  • Cost: $0.50

Pass S2: Knowledge Graph Planning

  • Input: Outline + entity database
  • Process: For each section, identify which EntityLinks should appear. Plan where Squiggle models and diagrams will go. Identify facts to extract to YAML.
  • Output: Enriched outline with EntityLink targets, diagram specs, computation specs
  • Agent: Graph Analyst
  • Cost: $0.50

Content Passes

Pass C1: Section-by-Section Synthesis

  • Input: Outline + research + graph context (one section at a time)
  • Process: Write each section independently, using only the research relevant to that section. Enforce citation discipline per section.
  • Output: Draft page with all sections assembled
  • Agent: Writer
  • Cost: $2-4

This is the biggest departure from the current system. Instead of one synthesis call, we write section by section. Each section gets a focused context window with only the relevant research, entity lookups, and template requirements. This prevents context window overload and ensures each section gets full attention.
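The split that makes this possible can be sketched as below, in the spirit of crux/lib/section-splitter.ts (the real module's API may differ):

```typescript
// Split a draft on ## headings so each section can be written or rewritten
// independently with a focused context window. Illustrative sketch only.
function splitSections(markdown: string): { heading: string; body: string }[] {
  const parts = markdown.split(/^## /m).filter(Boolean);
  return parts.map((part) => {
    const nl = part.indexOf("\n");
    return { heading: part.slice(0, nl).trim(), body: part.slice(nl + 1).trim() };
  });
}

const doc = "## Background\nSome prose.\n## Risks\nMore prose.\n";
const sections = splitSections(doc);
// sections[0].heading === "Background"; sections[1].heading === "Risks"
```

After each section is synthesized, the assembler concatenates the pieces and renumbers footnotes across section boundaries.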

Pass C2: Citation Placement

  • Input: Draft page + source database
  • Process: Verify every factual claim has a citation. Add missing citations from the source database. Convert inline URLs to <R> components where sources exist.
  • Output: Fully cited draft
  • Agent: Verifier
  • Cost: $0.25

Pass C3: EntityLink Enrichment

  • Input: Draft + entity database
  • Process: Scan for entity mentions that lack EntityLinks. Add <EntityLink> components for all resolvable entities. Ensure link density meets template requirements.
  • Output: Cross-linked draft
  • Agent: Graph Analyst
  • Cost: $0.25

Enrichment Passes

Pass E1: Diagram Generation

  • Input: Draft + outline diagram specs
  • Process: For each planned diagram location, generate a Mermaid diagram that visualizes the concept. Validate syntax. Follow Mermaid style guide (max 15-20 nodes, flowchart TD, proper colors).
  • Output: Draft with embedded diagrams
  • Agent: Enricher
  • Cost: $0.50-1

Pass E2: Computation Embedding

  • Input: Draft + facts database + graph data
  • Process: Identify quantitative claims that could be dynamic. Create Squiggle models that compute from wiki data (KB facts in packages/kb/data/things/, entity metrics). Embed <SquiggleEstimate> components.
  • Output: Draft with dynamic computations
  • Agent: Enricher
  • Cost: $0.50-1

Pass E3: Table Structuring

  • Input: Draft
  • Process: Identify data that's better presented as tables. Ensure tables have proper headers, sourced data, and comparative structure. Enforce the "max 4 tables, tables are for genuinely comparative data" rule.
  • Output: Draft with optimized tables
  • Agent: Enricher
  • Cost: $0.25

Pass E4: Fact Extraction

  • Input: Draft + existing KB facts
  • Process: Extract key quantitative claims from the page and propose additions to KB facts in packages/kb/data/things/. Link computed facts to their source pages via <KBF> references.
  • Output: Proposed KB fact entries + draft with <KBF> references
  • Agent: Enricher
  • Cost: $0.25
  • Status: The KB system now partially serves this role. Structured facts for 360+ entities exist in packages/kb/data/things/*.yaml, with properties defined in packages/kb/data/properties.yaml. Pages can reference these via <KBF> components and [^1] footnotes. The automated extraction pipeline (proposing new facts from page content) has not been built.

Verification Passes

Pass V1: Citation Verification

  • Input: Draft + source database
  • Process: For each citation, verify the claim is actually supported by the source. Classify as SUPPORTED / PARTIALLY_SUPPORTED / UNSUPPORTED. Flag unsupported claims.
  • Output: Verification report + flagged claims
  • Agent: Verifier (inspired by SemanticCite)
  • Cost: $0.25-0.50
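For illustration, the support levels could be typed as below. The keyword-overlap heuristic is a deliberately naive stand-in; the actual pass would make one Haiku call per claim:

```typescript
// SemanticCite-style support labels; the matching logic here is a toy
// heuristic, not the real classifier.
type Support = "SUPPORTED" | "PARTIALLY_SUPPORTED" | "UNSUPPORTED";

function classifySupport(claimTerms: string[], sourceText: string): Support {
  const text = sourceText.toLowerCase();
  const hits = claimTerms.filter((t) => text.includes(t.toLowerCase())).length;
  if (hits === claimTerms.length) return "SUPPORTED";
  if (hits > 0) return "PARTIALLY_SUPPORTED";
  return "UNSUPPORTED";
}

const verdict = classifySupport(
  ["raised", "$4B", "2024"],
  "Amazon's investment brought Anthropic's total raised to $4B in 2024.",
);
// verdict: "SUPPORTED"
```

The value of the typed label is downstream: UNSUPPORTED claims become concrete work items for the refinement loop rather than free-text complaints.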

Pass V2: Validation Rules

  • Input: Draft
  • Process: Run the full validation engine (dollar signs, comparison operators, frontmatter schema, EntityLink IDs, etc.). Auto-fix where possible.
  • Output: Clean draft passing all blocking rules
  • Agent: Verifier
  • Cost: $0.10

Pass V3: Self-Review

  • Input: Draft + template + quality rubric
  • Process: Grade the page against the template rubric. Identify weak sections with specific, actionable feedback. Determine if another pass through content/enrichment is needed.
  • Output: Review report with per-section scores and improvement suggestions
  • Agent: Reviewer (inspired by Self-Refine)
  • Cost: $1-2

Iterative Refinement Loop

The Reviewer (V3) can trigger re-execution of specific passes:

[Diagram]

The orchestrator limits iterations (default: 2 refinement cycles) to control cost. Each cycle targets only the specific passes needed.
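The bounded loop can be sketched as follows (function names and the score threshold are hypothetical):

```typescript
// Bounded generate-feedback-refine loop: re-run only the passes the
// reviewer flagged, at most maxCycles times. Illustrative sketch.
type Review = { score: number; weakPasses: string[] };

function refine(
  review: () => Review,
  rerun: (passIds: string[]) => void,
  threshold = 85,
  maxCycles = 2,
): Review {
  let result = review();
  for (let cycle = 0; cycle < maxCycles && result.score < threshold; cycle++) {
    rerun(result.weakPasses); // targeted re-passes only
    result = review();
  }
  return result;
}

// Example: a page that needs exactly one cycle.
const scores = [70, 90];
let i = 0;
let reruns = 0;
const final = refine(() => ({ score: scores[i++], weakPasses: ["E1"] }), () => reruns++);
// final.score === 90, reruns === 1
```

Capping `maxCycles` is what keeps worst-case cost predictable even when the reviewer keeps finding issues.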


Part 4: Tier Configurations

Budget Tier ($5-8)

For drafts and low-importance pages:

R2(lite) -> S1 -> C1 -> V2 -> V3(single)

5 passes, no enrichment, no iterative refinement. Produces a well-structured, cited article without diagrams or calculations.

Standard Tier ($12-18)

For most pages:

R1 -> R2 -> R3 -> S1 -> S2 -> C1 -> C2 -> C3 -> E1 -> E3 -> V1 -> V2 -> V3 -> [1 refinement cycle]

13+ passes, including diagrams, cross-linking, and one refinement cycle; this is the expected default for most pages.

Premium Tier ($20-30)

For high-importance or controversial pages:

R1 -> R2(deep) -> R3 -> R4 -> S1 -> S2 -> C1 -> C2 -> C3 -> E1 -> E2 -> E3 -> E4 -> V1 -> V2 -> V3 -> [2 refinement cycles]

All passes including Squiggle models, fact extraction, content gap analysis, and two refinement cycles.

Polish Tier ($3-5)

For improving existing pages (replaces current crux content improve):

R3 -> C3 -> E1(if missing) -> V1 -> V2 -> V3

Focuses on cross-linking, enrichment gaps, and citation verification. Doesn't rewrite prose.
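Expressed as data, the four tiers above might look like this (the structure is illustrative, not the actual Crux config format):

```typescript
// Tier configurations as orchestrator input. Pass IDs match the tier
// sequences described above; the shape itself is a sketch.
interface TierConfig {
  name: string;
  passes: string[];
  refinementCycles: number;
}

const TIERS: Record<string, TierConfig> = {
  budget: { name: "Budget", passes: ["R2", "S1", "C1", "V2", "V3"], refinementCycles: 0 },
  standard: {
    name: "Standard",
    passes: ["R1", "R2", "R3", "S1", "S2", "C1", "C2", "C3", "E1", "E3", "V1", "V2", "V3"],
    refinementCycles: 1,
  },
  premium: {
    name: "Premium",
    passes: ["R1", "R2", "R3", "R4", "S1", "S2", "C1", "C2", "C3",
             "E1", "E2", "E3", "E4", "V1", "V2", "V3"],
    refinementCycles: 2,
  },
  polish: { name: "Polish", passes: ["R3", "C3", "E1", "V1", "V2", "V3"], refinementCycles: 0 },
};
```

Keeping tiers as data (rather than code paths) means new tiers are a config change, and the orchestrator can estimate cost by summing per-pass estimates before running anything.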


Part 5: Knowledge Graph Integration

Graph-Driven Content Planning

The biggest architectural shift: the knowledge graph drives content creation rather than being consulted as an afterthought.

[Diagram]

When a new page is created, the system should also propose updates to existing pages that should link to it. The Graph Analyst agent:

  1. Identifies entities in the graph that relate to the new page
  2. Reads existing pages for those entities
  3. Identifies natural insertion points for EntityLinks to the new page
  4. Produces a link-updates.json with proposed edits

This transforms page creation from an isolated act into a graph maintenance operation.

Community Summaries (GraphRAG-inspired)

For entity clusters (e.g., all "alignment approaches"), maintain pre-computed community summaries that:

  • Describe the cluster's theme
  • List key entities and their relationships
  • Identify gaps (entities that exist but lack pages)

These summaries can be used by the Writer agent as context when synthesizing content that touches on a cluster.


Part 6: Dynamic Computation Embedding

The Vision

Partial implementation note: The <Calc> and <KBF> (KB Fact) components now partially implement the "computation embedding" vision described here. <KBF> pulls live values from KB YAML files in packages/kb/data/things/, and <Calc> computes derived values from those facts. The Squiggle-based uncertainty modeling described below remains aspirational.

Wiki pages shouldn't just state numbers -- they should compute them from the knowledge base.

Example: A page about "AI lab safety spending" could include:

<SquiggleEstimate
  title="Estimated AI Safety Spending (2025)"
  code={`
    anthropicRevenue = 2B to 3.5B
    openaiRevenue = 3B to 5B
    deepmindBudget = 1.5B to 2.5B

    safetyFraction = {
      anthropic: 0.15 to 0.25,
      openai: 0.05 to 0.12,
      deepmind: 0.10 to 0.20
    }

    totalSafetySpending = anthropicRevenue * safetyFraction.anthropic
      + openaiRevenue * safetyFraction.openai
      + deepmindBudget * safetyFraction.deepmind
  `}
/>

How the Enricher Agent Creates These

  1. Identify computational opportunities: Scan the draft for quantitative claims that involve estimation or aggregation
  2. Pull from KB facts: Use existing KB fact values from packages/kb/data/things/ as inputs where available
  3. Create Squiggle models: Write distribution-based models (never point estimates) following the Squiggle style guide
  4. Validate: Run the Squiggle code to ensure it executes without errors
  5. Embed: Place <SquiggleEstimate> components at appropriate locations

Fact Feedback Loop

The Enricher can also propose new facts to the KB (packages/kb/data/things/) based on claims in the page, creating a feedback loop where page content enriches the data layer, which in turn feeds future computations.


Part 7: Implementation Plan

Phase 1: Pass Infrastructure (Foundation)

Build the composable pass system on top of existing Crux infrastructure:

interface Pass {
  id: string;
  name: string;
  agent: AgentType;

  // Input/output contract
  requires: string[];  // IDs of passes that must run first
  produces: string[];  // Artifact keys this pass creates

  // Execution
  execute(context: PassContext): Promise<PassResult>;

  // Cost estimation
  estimateCost(context: PassContext): number;
}

interface PassContext {
  topic: string;
  entityType: string;
  tier: TierConfig;

  // Accumulated artifacts from prior passes
  artifacts: Map<string, any>;

  // Shared resources
  entityDb: EntityDatabase;
  sourceDb: SourceDatabase;
  validationEngine: ValidationEngine;
}

This leverages the existing Crux validation engine, entity lookup, and source database. Each pass is a module in crux/authoring/passes/.
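A concrete pass module under this contract might look like the following sketch (the enrichment logic is stubbed; a real pass would call the Graph Analyst agent):

```typescript
// Illustrative pass module conforming to the Pass contract sketched above.
// The body is a stub: it copies the draft instead of actually linking it.
const entityLinkPass = {
  id: "C3",
  name: "EntityLink Enrichment",
  agent: "graph-analyst",
  requires: ["C1"],            // needs an assembled draft first
  produces: ["draft:linked"],  // artifact key written into the context
  async execute(ctx: { artifacts: Map<string, unknown> }) {
    const draft = ctx.artifacts.get("draft") as string;
    // Real implementation: scan for entity mentions, insert <EntityLink>
    // components, and check link density against the template.
    ctx.artifacts.set("draft:linked", draft);
    return { ok: true };
  },
  estimateCost: () => 0.25,
};
```

Because every pass reads and writes named artifacts in the shared context, any pass can be run in isolation against a saved artifact map when debugging.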

Phase 2: Specialist Agents

Implement agents as wrappers around Claude API calls with focused system prompts:

interface Agent {
  id: AgentType;
  model: 'opus' | 'sonnet' | 'haiku';
  systemPrompt: string;
  tools: Tool[];

  run(input: AgentInput): Promise<AgentOutput>;
}

Start with the Writer and Verifier agents (biggest impact), then add Graph Analyst and Enricher.

Phase 3: Orchestrator

Build the orchestrator that plans pass sequences and manages quality gates:

class Orchestrator {
  planPasses(topic: string, tier: Tier, entityType: string): Pass[];
  executePlan(passes: Pass[], context: PassContext): Promise<Page>;
  checkQualityGate(page: Page, template: Template): QualityResult;
  planRefinement(review: ReviewResult): Pass[];
}

Phase 4: Graph Integration

Add bidirectional link updates and community summaries. This requires changes to build-data.mjs to compute community clusters and maintain summary cache.

Phase 5: Computation Embedding

Add the Squiggle model generation pass. Requires integration with the Squiggle runtime for validation.

Migration Path

The new system can coexist with the current pipeline:

# Current system (preserved)
pnpm crux content create "Topic" --tier=standard

# New system (opt-in)
pnpm crux content create "Topic" --tier=standard --engine=v2

# Eventually
pnpm crux content create "Topic" --tier=standard  # defaults to v2

Part 8: Comparison with Alternatives

Alternative A: Keep Improving Current Pipeline

Pros: Lower risk, incremental improvement, already works.

Cons: Structural ceiling -- single-agent synthesis can't produce the cross-linking, computation, and diagram density we want. Diminishing returns on prompt engineering.

Alternative B: Adopt STORM Directly

Pros: Battle-tested, open source, good research quality.

Cons: Python-only (we're TypeScript), no support for EntityLinks/Squiggle/Mermaid/YAML entities, no knowledge graph integration, would require heavy forking. The research stage is valuable but the writing stage doesn't match our needs.

Alternative C: Full CrewAI/LangGraph Framework

Pros: Rich agent orchestration, built-in patterns for sequential/parallel execution.

Cons: Heavy framework dependency, Python ecosystem, abstractions don't map cleanly to our YAML-first data model. We'd spend more time fighting the framework than building features.

Alternative D: Claude Agent SDK Multi-Agent (Proposed Approach)

Pros: Same TypeScript ecosystem, direct Anthropic API integration, subagent orchestration built-in, proven at scale (90.2% improvement over single-agent in Anthropic's own eval). Builds on existing Crux infrastructure.

Cons: Higher implementation effort than Alternative A, less battle-tested than STORM for research quality.

Recommendation: Alternative D, borrowing specific ideas from STORM (perspective-guided research, simulated conversations) and GraphRAG (community summaries, graph-driven planning).


Part 9: Key Ideas Borrowed

| Source | Idea | How We Use It |
|---|---|---|
| STORM | Perspective-guided question asking | Pass R1 discovers perspectives from existing wiki pages |
| STORM | Simulated expert conversations | Writer agent can simulate researcher/expert dialogue |
| STORM | Pre-writing/writing separation | Passes R1-S2 are pre-writing; C1+ is writing |
| GraphRAG | Community summaries | Pre-computed cluster summaries for entity groups |
| GraphRAG | Subgraph retrieval | Graph Analyst retrieves relevant subgraph, not just entity list |
| CrewAI | Specialist agents with handoff contracts | 8 agents with typed input/output |
| CrewAI | Sequential pipeline with clear boundaries | Pass dependency graph |
| Self-Refine | Generate-feedback-refine loop | V3 reviewer triggers targeted re-passes |
| Self-Refine | Specific, actionable feedback | Reviewer produces per-section scores, not vague "improve" |
| SemanticCite | Per-claim citation verification | V1 classifies each claim's support level |
| Anthropic Research | Orchestrator + parallel subagents | Orchestrator plans, specialists execute (potentially in parallel) |
| Anthropic Research | Expensive orchestrator, cheap workers | Opus for orchestrator/writer/reviewer, Sonnet/Haiku for research/verification |

Part 10: Success Metrics

How we'll know the new architecture is working:

| Metric | Current | Target | How Measured |
|---|---|---|---|
| Average page quality score | 70-78 | 82-90 | Template grading rubric |
| EntityLinks per page | 5-10 | 15-25 | Metrics extractor |
| Citations per page | 35-42 | 40-60 | Footnote count |
| Diagrams per page | 0-1 | 1-3 | Metrics extractor |
| Squiggle models per page | 0 | 0-2 (where applicable) | Component count |
| Inbound links created | 0 | 3-5 per new page | Bidirectional link updates |
| Facts extracted to YAML | 0 | 2-5 per new page | Fact extraction pass |
| Cost per standard page | $4-6 | $12-18 | API cost tracking |

The cost increase is intentional. We're trading $8-12 more per page for substantially higher quality. At 625 pages, even regenerating the entire wiki would cost $7,500-11,000 -- a one-time investment.


Part 11: Composable Module Architecture (February 2026 Refinement)

The original proposal (Parts 3-7) frames the system as specialist agents with composable passes. A further refinement: the passes themselves should be reusable modules — independent tools that compose in the improve pipeline, auto-update, page creation, and standalone CLI commands.

Core Insight: Modules as Agent Tools

Instead of a fixed pipeline where passes run in a predetermined sequence, the orchestrator is an LLM agent that has modules available as tools and decides what to call based on what the page actually needs:

Agent reads page → analyzes gaps → calls tools → checks result → iterates

A page with good prose but no diagrams gets diagram tools. A page with bad sourcing gets research + citation tools. The agent adapts to the page rather than running a fixed sequence.
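A rule-table caricature of that gap analysis is sketched below; the thresholds and tool names are invented for illustration, and the real orchestrator is an LLM agent deciding from richer context, not a lookup table:

```typescript
// Gap-driven tool selection: map observable page state to the tools worth
// calling. Thresholds are illustrative placeholders.
interface PageState {
  qualityScore: number;
  citations: number;
  entityLinks: number;
  diagrams: number;
}

function planTools(page: PageState): string[] {
  const tools = new Set<string>();
  if (page.qualityScore < 60) {
    tools.add("research-agent");
    tools.add("rewrite-section");
  }
  if (page.entityLinks < 10) tools.add("add-entity-links");
  if (page.diagrams === 0) tools.add("add-diagram");
  if (page.citations > 0) tools.add("citation-auditor"); // verify whatever is cited
  return [...tools];
}

const toolPlan = planTools({ qualityScore: 70, citations: 30, entityLinks: 4, diagrams: 0 });
// toolPlan: ["add-entity-links", "add-diagram", "citation-auditor"]
```

The agent version of this logic also re-reads the page after each tool call, so later decisions see the effect of earlier ones.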

The Module Kit

Research & Grounding:

| Module | Purpose | Standalone CLI | Cost |
|---|---|---|---|
| source-fetcher | Fetch URL → clean markdown + relevant excerpts | crux citations verify (upgraded) | $0 (no LLM) |
| research-agent | Multi-source search → structured facts with quotes | crux research <topic> | $1-3 |
| source-cache | Persistent store of fetched sources + extracted facts | Reused across runs | $0 |
| claim-verifier | Check if source supports a specific claim | Per-citation in auditor | $0.01/claim (Haiku) |
| citation-auditor | Independent verification of all citations on a page | crux citations audit <id> | $0.10-0.30 |

Content Writing:

| Module | Purpose | Standalone CLI | Cost |
|---|---|---|---|
| rewrite-section | Rewrite ONE section, constrained to source cache | Used by orchestrator | $0.10-0.30/section |

Enrichment:

| Module | Purpose | Standalone CLI | Cost |
|---|---|---|---|
| add-entity-links | Insert EntityLink components for mentioned entities | crux enrich entity-links <id> | $0.05 (Haiku) |
| add-fact-refs | Wrap hardcoded numbers in <KBF> tags referencing KB facts | crux enrich fact-refs <id> | $0.05 (Haiku) |
| add-diagram | Generate Mermaid diagram for a section | crux enrich diagram <id> | $0.10-0.20 |
| add-squiggle | Add uncertainty modeling for a section | crux enrich squiggle <id> | $0.10-0.20 |

Why Section-Level Matters

The current pipeline rewrites 2,000-4,000 words in one LLM call. The prompt must simultaneously handle prose, citations, EntityLinks, Facts, Calc, escaping, and structure. Enrichments compete for attention.

With section-level rewrite-section:

  • Focused context: LLM handles 200-500 words at a time
  • Better grounding: only relevant sources for that section
  • Isolated enrichments: adding a diagram can't break citations
  • Partial progress: if budget runs out, 3 improved sections beats nothing

Source-Fetcher as Foundation

The single most important primitive. Currently, the pipeline never reads cited URLs — it gets search snippets and trusts the LLM to cite accurately. The source-fetcher creates ground truth by actually fetching and caching source content.

This unlocks:

  • Citation auditor: can verify claims against actual source text
  • Grounded writer: can be constrained to only cite from fetched+cached sources
  • Claim map: mechanical link from "claim in output" → "quote in fetched source"
  • Existing crux citations verify: upgrades from "is URL alive?" to "does URL support the claim?"
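A sketch of the primitive, with the real HTTP fetch and HTML-to-markdown cleanup abstracted behind a caller-supplied `fetch_fn`; the class and its interface are illustrative assumptions, not the actual implementation.

```python
import hashlib

class SourceCache:
    """Fetch-once cache of source documents: the ground truth that
    the citation auditor and grounded writer both read from."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn  # url -> cleaned markdown text
        self.store = {}           # url -> {"hash": ..., "markdown": ...}

    def get(self, url):
        if url not in self.store:  # fetch once, reuse across runs
            markdown = self.fetch_fn(url)
            self.store[url] = {
                "hash": hashlib.sha256(markdown.encode()).hexdigest(),
                "markdown": markdown,
            }
        return self.store[url]

    def supports(self, url, quote):
        """'Does URL support the claim?' reduced to its simplest form:
        does the fetched source actually contain the quoted text?"""
        return quote.lower() in self.get(url)["markdown"].lower()
```

A real `supports` check would use an LLM claim-verifier rather than substring matching, but the cache contract is the same.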

Claim Map: Mechanical Grounding Contract

The rewrite-section module outputs an explicit claim map alongside the content:

{
  "content": "...improved section MDX...",
  "claimMap": [
    { "claim": "Anthropic raised \$4B in 2024", "factId": "f-012", "sourceUrl": "https://..." },
    { "claim": "Founded in 2021 by Dario Amodei", "factId": "f-003", "sourceUrl": "https://..." }
  ],
  "ungroundedClaims": ["Anthropic is widely considered a leader in AI safety"]
}

This is mechanically verifiable: check that every factId exists in the source cache and the claim matches the extracted quote. Ungrounded claims are flagged for human review or removal.
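That mechanical check can be sketched in a few lines; the function name and cache shape below are illustrative assumptions, not the actual schema.

```python
def audit_claim_map(result, source_cache):
    """Mechanically verify a rewrite-section claim map:
    every factId must exist in the source cache, and the cached fact
    must point at the same source URL the claim cites.
    Ungrounded claims are passed through for human review."""
    failures = []
    for entry in result["claimMap"]:
        fact = source_cache.get(entry["factId"])
        if fact is None:
            failures.append((entry["claim"], "unknown factId"))
        elif fact["sourceUrl"] != entry["sourceUrl"]:
            failures.append((entry["claim"], "source mismatch"))
    return {"failures": failures,
            "flagged": list(result["ungroundedClaims"])}
```

No LLM is needed here: the expensive judgment (does the quote back the claim?) happened once at extraction time, and this pass only checks referential integrity.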

Budget as Tool-Call Limits

Instead of tiers selecting fixed phase sequences, the agent gets a budget and decides how to spend it:

| Budget | Max Tool Calls | Research Queries | Enabled Tools |
| --- | --- | --- | --- |
| Polish | 8 | 0 | rewrite-section, add-entity-links, add-fact-refs, validate |
| Standard | 20 | 5 | All tools |
| Deep | 50 | 15 | All tools |

The agent plans its approach based on page state (quality score, citation count, entity link density, diagram count) and budget constraints.
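The tiers reduce to plain configuration plus a gate the orchestrator consults before each tool call. A sketch, with hypothetical names:

```python
BUDGETS = {
    # "tools": None means every tool is enabled for that tier.
    "polish":   {"max_tool_calls": 8,  "research_queries": 0,
                 "tools": {"rewrite-section", "add-entity-links",
                           "add-fact-refs", "validate"}},
    "standard": {"max_tool_calls": 20, "research_queries": 5, "tools": None},
    "deep":     {"max_tool_calls": 50, "research_queries": 15, "tools": None},
}

def allowed(budget_name, tool):
    """Gate consulted before each tool call: is this tool enabled
    for the chosen budget tier?"""
    tools = BUDGETS[budget_name]["tools"]
    return tools is None or tool in tools
```

Because the budget is data rather than a hard-coded phase sequence, adding a new tier or tool is a config change, not a pipeline change.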

Cross-Context Reuse

The same modules compose in every context:

| Context | Modules Used |
| --- | --- |
| Improve pipeline | All modules via agent orchestrator |
| Auto-update | research-agent + rewrite-section + citation-auditor |
| Page creation | research-agent + rewrite-section (all sections) + all enrichment |
| Citation health check | source-fetcher + claim-verifier (no agent needed) |
| Batch entity-linking | add-entity-links across many pages (no agent) |
| Manual editing assist | research-agent → cache → human edits → citation-auditor |

Implementation: Foundation Issues

The first four modules to build (GitHub issues):

  1. Source Fetcher (#633) — fetch URLs, extract content, cache results. Foundation for everything else.
  2. Section-Level Grounded Writer (#634) — rewrite one section constrained to source cache, output claim map.
  3. Citation Auditor (#635) — independent per-citation verification using fetched source content.
  4. Standalone Enrichment Tools (#636) — entity-links and fact-refs as independent, idempotent tools.

Dependency order: #633 is the foundation. #634 and #635 depend on it. #636 is independent. The orchestrator agent (Part 3's Phase 3) is built last, after the tools exist.


Part 12: Academic Foundations for Claim-First Content

Recent academic work provides strong validation for the proposition-level and claim-first approaches described in this document and the companion Claim-First Architecture proposal (now superseded by the KB system).

Proposition-Level Retrieval (Dense X Retrieval)

Chen et al. (EMNLP 2024) define a proposition as an atomic, self-contained factual expression with three properties: minimal (cannot be further split), self-contained (includes resolved coreferences), and compositional (the union of all propositions reconstructs full semantics). They built a "Propositionizer" model that decomposes text into propositions and showed that indexing Wikipedia at proposition level significantly outperformed both passage-level and sentence-level indexing for retrieval and QA tasks.1

This is direct evidence for the claim-first thesis: atomic propositions are better retrieval units than paragraphs or documents.

Decompose-Then-Verify (FActScore)

FActScore (Min et al., EMNLP 2023) formalized the decompose-then-verify pipeline: decompose generated text into atomic facts, retrieve evidence, verify each fact, and compute the percentage supported. Applied to ChatGPT-generated biographies, it found that only 58% of atomic facts were supported by sources.2 Google's SAFE system extends this with multi-step search verification at roughly 1/20th the cost of human annotators.

Our pipeline applies the same decomposition proactively — producing atomic claims before writing rather than extracting them afterward. The Kalshi experiment (see the Claim-First Architecture proposal, Part 9b) confirmed this catches embellishments that post-hoc verification misses.
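The FActScore-style score itself is simple to state in code: the fraction of atomic facts a verifier supports. A sketch, where `verify` abstracts the retrieval-plus-verification step:

```python
def factscore(atomic_facts, verify):
    """Decompose-then-verify scoring in the FActScore style.
    `verify(fact) -> bool` abstracts evidence retrieval and
    per-fact verification; the score is the supported fraction."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if verify(fact))
    return supported / len(atomic_facts)
```

The proactive variant described above runs the same `verify` over candidate claims before writing, so unsupported material never reaches the page.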

Nanopublications: The Formal Precedent

The Semantic Web community's nanopublication framework is the academic formalization of our knowledge bundle idea. Each nanopublication has three parts: Assertion (the claim itself as RDF triples), Provenance (how the assertion came about — methods, evidence), and Publication Info (metadata). Nanopublications are immutable, cryptographically verifiable, and operate on a decentralized server network.3

The micropublication extension adds richer argumentation structure: a statement plus supporting evidence, interpretations, and challenges.4 This maps directly to our analytical claim type with its supportingClaims and reasoning fields.
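A knowledge bundle shaped like a nanopublication might look as follows. This is an illustrative in-memory shape only, not the RDF serialization the framework actually uses; all field contents are invented examples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen echoes nanopublications' immutability
class Nanopub:
    assertion: tuple   # the claim itself, as a (subject, predicate, object) triple
    provenance: dict   # how the assertion came about: method, evidence pointer
    pubinfo: dict      # publication metadata: author, timestamp, license
```

The three-part split is the useful idea: the claim, its evidence trail, and its metadata live together but stay distinguishable, which is exactly what the claim map and source cache need.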

Knowledge-Centric Templatic Views (SURe)

A January 2024 paper introduces Structure Unified Representation — capturing the most important knowledge from a document in a structured format, then generating multiple view types (slide decks, newsletters, reports) from that single representation with no supervision.5 This is the closest published validation of the "multiple presentation layers from one data layer" pattern that both the multi-agent architecture and claim-first architecture rely on.

GPTKB: Warning on LLM-Generated Knowledge

GPTKB (2024-2025) built a knowledge base entirely from LLM output: 105 million triples for 2.9 million entities. Their critical finding: accuracy falls well short of established knowledge-base projects, because the LLM generates many incorrect and unverifiable facts.6 This strongly validates our design decision that verification must precede storage. The claim-first architecture's insistence on per-claim verification before claims enter the store directly addresses GPTKB's accuracy problems.

Multi-Agent Verification (KARMA)

KARMA (2025) uses nine collaborative agents for entity discovery, relation extraction, schema alignment, and conflict resolution. Tested on 1,200 PubMed articles: 38,230 new entities with 83.1% verified correctness, reducing conflict edges by 18.6%.7 The modular, multi-agent design with cross-agent verification maps directly to our specialist agent architecture.

Block-Based Knowledge Tools

Production tools like Roam Research, Logseq, and Notion demonstrate that block-level (claim-level) architectures work at scale. Logseq's dual-database design (in-memory Datascript + persistent SQLite) with block-level content, typed properties, and Datalog queries is particularly instructive for the claim store's eventual database-backed implementation (Option C in the claim-first architecture).8

Argumentation Frameworks

For AI safety topics where many claims involve genuine disagreement (timelines, risk levels, alignment difficulty), claim-augmented argumentation frameworks provide formal tools for representing competing positions. These extend Dung's abstract argumentation by associating a claim to each argument, enabling re-interpretation at different evaluation stages.9 This maps to our consensus and analytical claim types and suggests the claim store should eventually support explicit argumentation structure.
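As a concrete instance, the grounded extension of a Dung-style framework (the set of arguments that survive under the most skeptical reading) can be computed by iterating the characteristic function. A sketch under the abstract, claim-free formulation; a claim-augmented version would carry a claim label alongside each argument.

```python
def grounded_extension(arguments, attacks):
    """Grounded extension of an abstract argumentation framework
    (Dung 1995): iterate the characteristic function from the empty
    set until a fixpoint. An argument is acceptable w.r.t. a set S
    if S attacks every attacker of that argument."""
    extension = set()
    while True:
        acceptable = {
            a for a in arguments
            if all(any((d, b) in attacks for d in extension)
                   for b in arguments if (b, a) in attacks)
        }
        if acceptable == extension:
            return extension
        extension = acceptable
```

For example, with attacks a→b and b→c, the grounded extension is {a, c}: a is unattacked, and a defends c by attacking c's only attacker b.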


Footnotes

  1. Dense X Retrieval: What Retrieval Granularity Should We Use? (Chen et al., EMNLP 2024). See also Factoid Wiki.

  2. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (Min et al., EMNLP 2023)

  3. Nanopublication-based semantic publishing and reviewing (PeerJ Computer Science, 2023). See also Nanopublication Guidelines.

  4. Micropublications: a semantic model for claims, evidence, arguments and annotations (Journal of Biomedical Semantics, 2014)

  5. Knowledge-Centric Templatic Views of Documents (arXiv, January 2024)

  6. GPTKB: Building Very Large Knowledge Bases from Language Models (arXiv, November 2024). See also gptkb.org.

  7. Citation rc-c558 (data unavailable — rebuild with wiki-server access)

  8. Logseq Architecture — Block-Level Database Design (DeepWiki)

  9. Claim-augmented argumentation frameworks (Artificial Intelligence, 2023)