Longterm Wiki
Updated 2026-03-13
Summary

Architecture proposal for next-generation wiki page generation. Analyzes current single-pipeline limitations, surveys state-of-the-art (Stanford STORM, GraphRAG, CrewAI, Self-Refine), and proposes a multi-agent multi-pass architecture with 8 specialist agents, 12+ composable passes, knowledge graph-driven linking, and dynamic computation embedding. Includes concrete implementation plan, composable module architecture, and academic foundations drawing on proposition-level retrieval (Dense X Retrieval), FActScore verification, nanopublications, and argumentation frameworks.

Change History
Route internal pages through /wiki/E<id> (#182, 3 weeks ago)

Migrated internal pages from `/internal/` to `/wiki/E<id>` URLs so they render with full wiki infrastructure (breadcrumbs, metadata, quality indicators, sidebar). Internal MDX pages now redirect from `/internal/slug` to `/wiki/E<id>`, while React dashboard pages (suggested-pages, updates, page-changes, etc.) remain at `/internal/`. Follow-up review: cleaned up dead code, hid wiki-specific UI on internal pages, fixed breadcrumbs, updated all bare-text `/internal/` references.

Wiki generation architecture research & proposal (#173, 4 weeks ago)

Researched state-of-the-art approaches to scalable wiki generation (Stanford STORM, Microsoft GraphRAG, CrewAI, Self-Refine, SemanticCite, Anthropic multi-agent systems, KARMA) and wrote a comprehensive architecture proposal for multi-agent, multi-pass wiki page generation. The proposal covers 8 specialist agents, 12+ composable passes, knowledge graph-driven content planning, dynamic computation embedding, and iterative refinement loops.


Wiki Generation Architecture: Multi-Agent Multi-Pass Design

Status: Vision Document

This is an architecture proposal, not a description of the current system. Some elements have been partially implemented (section-level rewriting, KB fact system, basic citation verification), but the full multi-agent orchestrator has not been built. The KB system (packages/kb/) now provides the structured data layer that this architecture assumes as a prerequisite — structured facts in YAML, <KBF> for inline values, and <Calc> for computed values.

Executive Summary

See also: The Claim-First Wiki Architecture proposal (removed) was a companion proposal that inverted the data model, making verified atomic claims the primary artifact. That proposal's structured-data ideas have been partially superseded by the KB system (packages/kb/), which provides entity-level structured facts in YAML, the <KBF> component for inline fact values, and <Calc> for computed values.

Our current page generation pipeline (Crux content create/improve) is a single-pipeline, single-agent system. It works, but produces pages that are adequate rather than excellent. The best wiki pages require depth that a single LLM pass cannot achieve: dense cross-linking, verified citations, complex diagrams, embedded calculations, and knowledge graph coherence.

This document proposes a multi-agent, multi-pass architecture inspired by Stanford's STORM, Microsoft's GraphRAG, CrewAI's specialist agent patterns, and the Self-Refine iterative paradigm. The core idea: decompose page generation into composable passes, each executed by a specialist agent optimized for one concern.

| Current System | Proposed System |
|---|---|
| Single synthesis prompt | 12+ composable passes |
| One LLM does everything | 8 specialist agents |
| Research then write (2 phases) | Research, structure, write, link, verify, compute, diagram, review (8+ phases) |
| Knowledge graph consulted at link time | Knowledge graph drives content planning |
| Static calculations | Dynamic Squiggle models derived from wiki data |
| Post-hoc validation | Validation integrated into each pass |
| $4-15 per page | $8-25 per page (higher quality ceiling) |

Part 1: Problems with the Current System

What We Have

The current pipeline (crux/authoring/page-creator.ts) follows this flow:

canonical-links -> research -> source-fetching -> synthesis -> verification -> validation -> grade

This produces pages scoring 70-80/100 on our grading rubric. The pipeline has been iterated significantly (see the Page Creator Pipeline report) and represents solid work. But it has structural limitations:

Limitation 1: Single-Agent Synthesis Bottleneck

One Claude call synthesizes the entire article from research. This means the model must simultaneously:

  • Write coherent prose
  • Place citations correctly
  • Decide which EntityLinks to use
  • Structure sections per template
  • Include appropriate tables and diagrams
  • Maintain balanced perspective

No single prompt can optimize all of these. The result: pages that are structurally correct but lack depth in cross-linking, calculations, and visual elements.

Limitation 2: Knowledge Graph is Read-Only

The current system consults the entity database to resolve EntityLinks, but doesn't use the knowledge graph to plan content. A page about "deceptive alignment" should proactively cover its graph neighbors (situational awareness, mesa-optimization, sleeper agents) with appropriate depth. Currently, this happens only if the LLM independently decides to mention them.

Limitation 3: No Iterative Deepening

The pipeline runs once. If the synthesis phase produces a page with weak sections, those sections stay weak. The review phase in the improver can identify gaps, but the fix is another monolithic LLM call. There's no mechanism for targeted, section-level improvement.

Update (Feb 2026): The --section-level flag (pnpm crux content improve <id> --section-level) now implements per-section rewriting: the page is split on ## headings, each section rewritten independently via rewriteSection(), then reassembled with renumbered footnotes. See crux/lib/section-splitter.ts and crux/authoring/page-improver/phases/improve-sections.ts. This addresses the "targeted improvement" limitation above; the deeper limitations (graph-aware planning, diagram agents) remain future work.

Limitation 4: Diagrams and Calculations are Afterthoughts

Mermaid diagrams and Squiggle models are included only if the synthesis prompt happens to produce them. There's no dedicated agent reasoning about what visual or computational elements would add value, and no agent that specializes in producing high-quality versions of these.

Limitation 5: Cross-Linking is Shallow

EntityLinks are added during synthesis, then validated. But the system doesn't reason about the topology of links: which inbound links should this page attract? Which pages should link to this one? A new page about "compute governance" should trigger updates to pages about "compute thresholds," "chip export controls," and "training run monitoring."


Part 2: State of the Art

Stanford STORM (2024)

STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking) is the closest academic system to what we need. Key innovations:

| Innovation | How It Works | Relevance to Us |
|---|---|---|
| Perspective-guided research | Discovers multiple perspectives by surveying similar articles, then simulates conversations from each perspective | We could mine perspectives from our existing 625 pages on related topics |
| Simulated expert conversations | A "writer" agent asks questions to a "topic expert" agent grounded in search results. Follow-up questions arise naturally | Better than our "dump all research into one synthesis prompt" approach |
| Two-stage pipeline | Pre-writing (research + outline) is separated from writing. Outline quality correlates with article quality | We already do this loosely; could formalize it |
| Co-STORM mind map | Organizes collected information into a hierarchical concept structure updated throughout the process | Maps to our entity graph, but dynamically maintained during authoring |

Key finding: STORM articles were rated 25% better organized and 10% broader in coverage than baseline RAG approaches by Wikipedia editors.

Limitation: STORM produces Wikipedia-style articles but doesn't handle our specific requirements: EntityLinks, Squiggle models, Mermaid diagrams, YAML entity synchronization, or the frontmatter/grading system.

Microsoft GraphRAG (2024)

GraphRAG extends RAG with knowledge graph structure. Instead of retrieving text chunks, it retrieves subgraphs -- entities, relationships, and community summaries.

| Innovation | How It Works | Relevance to Us |
|---|---|---|
| Community detection | Clusters related entities and generates hierarchical summaries | We could use this to identify which entities a new page should cover |
| Global search via map-reduce | Pre-generates community summaries, then runs map-reduce across them for corpus-wide questions | Useful for "what's the relationship between X and all its neighbors?" |
| Entity extraction pipeline | Extracts entities and relationships from text, builds graph | We already have this (YAML entities + content scanning), but could improve |

Key finding: GraphRAG dramatically outperforms naive RAG on multi-hop reasoning and synthesis questions. Exactly the kind of reasoning needed for wiki cross-linking.

CrewAI Specialist Agent Pattern (2025)

CrewAI demonstrates that splitting work across specialist agents with clear handoff contracts produces better results than one mega-agent.

The pattern: Researcher -> Writer -> Editor -> Specialist, with each agent optimized for its role (different system prompts, different tools, potentially different models).

Key insight from CrewAI: "Squeezing too much into one agent causes context windows to blow up, too many tools confuse it, and hallucinations increase." This directly explains our synthesis bottleneck.

Self-Refine (Madaan et al., 2023)

Self-Refine demonstrates that iterative generate -> feedback -> refine loops improve LLM output by ~20% on average. The key: the same model generates, critiques, and refines, but with different prompts for each role.

Key finding: The refine loop works best when feedback is specific and actionable (not "make it better" but "paragraph 3 lacks a citation for the 40% claim"). This maps to our validation rules, which already produce specific, fixable issues.

SemanticCite (2025)

SemanticCite proposes a pipeline for citation verification: extract claims, retrieve source passages via hybrid search, classify support level (SUPPORTED / PARTIALLY SUPPORTED / UNSUPPORTED / UNCERTAIN). Their fine-tuned models achieve competitive performance with commercial systems.

Relevance: We already have a verify-sources phase, but it's coarse-grained. Per-claim verification with confidence scoring would significantly improve citation quality.

Anthropic Multi-Agent Research System (2025)

Anthropic's own research system uses an orchestrator-worker pattern: a lead agent analyzes a query, develops strategy, and spawns subagents to explore different aspects in parallel. Multi-agent Opus + Sonnet outperformed single-agent Opus by 90.2% on their research eval.

Key insight: Use expensive models (Opus) for orchestration and synthesis, cheap models (Sonnet/Haiku) for parallel research and extraction. This is exactly the cost structure we should adopt.


Part 3: Proposed Architecture

Core Principle: Composable Passes

Instead of a monolithic pipeline, we define passes that can be composed in different orders depending on the page type, tier, and goals. Each pass:

  1. Takes a well-defined input (page draft + metadata)
  2. Produces a well-defined output (modified draft + metadata)
  3. Is idempotent (running it twice produces the same result)
  4. Has a cost estimate
  5. Can be run independently for debugging
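The scheduling implied by the `requires` contract is a plain topological sort over pass dependencies. A minimal sketch, assuming a `PassSpec` shape and `orderPasses` helper that are illustrative rather than the actual Crux API:

```typescript
// Dependency-ordered pass scheduling (illustrative, not the real Crux API).
type PassSpec = { id: string; requires: string[] };

function orderPasses(passes: PassSpec[]): string[] {
  const done = new Set<string>();
  const ordered: string[] = [];
  let remaining = [...passes];
  while (remaining.length > 0) {
    // A pass is ready once everything it requires has already run.
    const ready = remaining.filter((p) => p.requires.every((r) => done.has(r)));
    if (ready.length === 0) throw new Error("cycle or missing dependency");
    for (const p of ready) {
      done.add(p.id);
      ordered.push(p.id);
    }
    remaining = remaining.filter((p) => !done.has(p.id));
  }
  return ordered;
}

const plan = orderPasses([
  { id: "C1", requires: ["S1"] },
  { id: "S1", requires: ["R2"] },
  { id: "R2", requires: [] },
]);
// plan: ["R2", "S1", "C1"]
```

Because each pass is idempotent, re-running a prefix of this order (e.g. during refinement) is safe.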
[Diagram]

The 8 Specialist Agents

Each agent has a focused role, specific tools, and an optimal model choice:

| # | Agent | Role | Model | Tools | Cost/Page |
|---|---|---|---|---|---|
| 1 | Orchestrator | Plans strategy, schedules passes, checks quality gates | Opus | All agents, quality scorer | $1-2 |
| 2 | Researcher | Web search, academic search, source fetching | Sonnet | Perplexity, SCRY, Firecrawl | $1-3 |
| 3 | Graph Analyst | Analyzes knowledge graph neighbors, plans cross-links | Sonnet | Entity DB, backlinks, graph data | $0.50-1 |
| 4 | Structurer | Generates outlines, ensures template compliance | Sonnet | Page templates, existing page analysis | $0.50-1 |
| 5 | Writer | Section-by-section prose synthesis from research | Opus | Research output, entity lookup | $2-4 |
| 6 | Enricher | Creates diagrams, Squiggle models, tables | Sonnet | Mermaid validator, Squiggle runtime | $1-2 |
| 7 | Verifier | Citation checking, EntityLink resolution, fact validation | Haiku | Source DB, validation engine | $0.25-0.50 |
| 8 | Reviewer | Identifies gaps, bias, weak sections; triggers re-passes | Opus | Quality rubric, template checker | $1-2 |

Total estimated cost: $8-16 for standard tier (vs $4-6 currently). The quality ceiling is substantially higher.

The 12+ Composable Passes

Research Passes

Pass R1: Perspective Discovery

  • Input: Topic title + entity type
  • Process: Survey our existing pages on related topics. What perspectives do they cover? What's missing? (Inspired by STORM's perspective mining)
  • Output: List of 5-10 perspectives to investigate (e.g., for "compute governance": technical feasibility, political economy, international coordination, industry self-regulation, civil liberties)
  • Agent: Graph Analyst
  • Cost: $0.25

Pass R2: Multi-Source Research

  • Input: Topic + perspectives list
  • Process: For each perspective, run targeted Perplexity queries. Fetch and register sources.
  • Output: research.json with categorized findings per perspective
  • Agent: Researcher
  • Cost: $1-3

Pass R3: Graph Neighbor Analysis

  • Input: Topic + entity database
  • Process: Identify all entities within 2 hops in the knowledge graph. Analyze which are most relevant and what relationship labels apply. Determine which existing pages should link to this new page.
  • Output: graph-context.json with neighbor entities, relationship types, and suggested inbound link updates
  • Agent: Graph Analyst
  • Cost: $0.50
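The 2-hop walk itself is mechanically simple; the LLM work is in ranking and labeling what it finds. A sketch of the walk over a hypothetical adjacency map (entity IDs invented for illustration):

```typescript
// Breadth-first walk collecting every entity within maxHops of a topic.
type Graph = Map<string, string[]>;

function neighborsWithinHops(graph: Graph, start: string, maxHops: number): Set<string> {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const nb of graph.get(node) ?? []) {
        if (!seen.has(nb)) {
          seen.add(nb);
          next.push(nb);
        }
      }
    }
    frontier = next;
  }
  seen.delete(start); // the topic itself is not a neighbor
  return seen;
}

const g: Graph = new Map([
  ["compute-governance", ["compute-thresholds", "chip-export-controls"]],
  ["compute-thresholds", ["training-run-monitoring"]],
]);
const hood = neighborsWithinHops(g, "compute-governance", 2);
// hood: compute-thresholds, chip-export-controls, training-run-monitoring
```

The Graph Analyst would then pass this candidate set, plus relationship labels, to an LLM call for relevance scoring.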

Pass R4: Existing Content Analysis

  • Input: Topic + similar pages (from redundancy detection)
  • Process: Read the top 5 most similar existing pages. Identify what this page should cover that they don't, and what it can reference rather than repeat.
  • Output: content-gap.json with unique angles and cross-references
  • Agent: Graph Analyst
  • Cost: $0.50

Structure Passes

Pass S1: Outline Generation

  • Input: Research output + template + graph context
  • Process: Generate a detailed section-by-section outline with word count targets and required elements per section (tables, citations, diagrams)
  • Output: outline.json with sections, subsections, planned elements
  • Agent: Structurer
  • Cost: $0.50

Pass S2: Knowledge Graph Planning

  • Input: Outline + entity database
  • Process: For each section, identify which EntityLinks should appear. Plan where Squiggle models and diagrams will go. Identify facts to extract to YAML.
  • Output: Enriched outline with EntityLink targets, diagram specs, computation specs
  • Agent: Graph Analyst
  • Cost: $0.50

Content Passes

Pass C1: Section-by-Section Synthesis

  • Input: Outline + research + graph context (one section at a time)
  • Process: Write each section independently, using only the research relevant to that section. Enforce citation discipline per section.
  • Output: Draft page with all sections assembled
  • Agent: Writer
  • Cost: $2-4

This is the biggest departure from the current system. Instead of one synthesis call, we write section by section. Each section gets a focused context window with only the relevant research, entity lookups, and template requirements. This prevents context window overload and ensures each section gets full attention.
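The split that makes this possible can be sketched as below, in the spirit of crux/lib/section-splitter.ts (the real module's API may differ):

```typescript
// Split a draft on ## headings so each section can be written or rewritten
// independently with a focused context window. Illustrative sketch only.
function splitSections(markdown: string): { heading: string; body: string }[] {
  const parts = markdown.split(/^## /m).filter(Boolean);
  return parts.map((part) => {
    const nl = part.indexOf("\n");
    return { heading: part.slice(0, nl).trim(), body: part.slice(nl + 1).trim() };
  });
}

const doc = "## Background\nSome prose.\n## Risks\nMore prose.\n";
const sections = splitSections(doc);
// sections[0].heading === "Background"; sections[1].heading === "Risks"
```

After each section is synthesized, the assembler concatenates the pieces and renumbers footnotes across section boundaries.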

Pass C2: Citation Placement

  • Input: Draft page + source database
  • Process: Verify every factual claim has a citation. Add missing citations from the source database. Convert inline URLs to <R> components where sources exist.
  • Output: Fully cited draft
  • Agent: Verifier
  • Cost: $0.25

Pass C3: EntityLink Enrichment

  • Input: Draft + entity database
  • Process: Scan for entity mentions that lack EntityLinks. Add <EntityLink> components for all resolvable entities. Ensure link density meets template requirements.
  • Output: Cross-linked draft
  • Agent: Graph Analyst
  • Cost: $0.25

Enrichment Passes

Pass E1: Diagram Generation

  • Input: Draft + outline diagram specs
  • Process: For each planned diagram location, generate a Mermaid diagram that visualizes the concept. Validate syntax. Follow Mermaid style guide (max 15-20 nodes, flowchart TD, proper colors).
  • Output: Draft with embedded diagrams
  • Agent: Enricher
  • Cost: $0.50-1

Pass E2: Computation Embedding

  • Input: Draft + facts database + graph data
  • Process: Identify quantitative claims that could be dynamic. Create Squiggle models that compute from wiki data (KB facts in packages/kb/data/things/, entity metrics). Embed <SquiggleEstimate> components.
  • Output: Draft with dynamic computations
  • Agent: Enricher
  • Cost: $0.50-1

Pass E3: Table Structuring

  • Input: Draft
  • Process: Identify data that's better presented as tables. Ensure tables have proper headers, sourced data, and comparative structure. Enforce the "max 4 tables, tables are for genuinely comparative data" rule.
  • Output: Draft with optimized tables
  • Agent: Enricher
  • Cost: $0.25

Pass E4: Fact Extraction

  • Input: Draft + existing KB facts
  • Process: Extract key quantitative claims from the page and propose additions to KB facts in packages/kb/data/things/. Link computed facts to their source pages via <KBF> references.
  • Output: Proposed KB fact entries + draft with <KBF> references
  • Agent: Enricher
  • Cost: $0.25
  • Status: The KB system now partially serves this role. Structured facts for 360+ entities exist in packages/kb/data/things/*.yaml, with properties defined in packages/kb/data/properties.yaml. Pages can reference these via <KBF> components and [^1] footnotes. The automated extraction pipeline (proposing new facts from page content) has not been built.

Verification Passes

Pass V1: Citation Verification

  • Input: Draft + source database
  • Process: For each citation, verify the claim is actually supported by the source. Classify as SUPPORTED / PARTIALLY_SUPPORTED / UNSUPPORTED. Flag unsupported claims.
  • Output: Verification report + flagged claims
  • Agent: Verifier (inspired by SemanticCite)
  • Cost: $0.25-0.50
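For illustration, the support levels could be typed as below. The keyword-overlap heuristic is a deliberately naive stand-in; the actual pass would make one Haiku call per claim:

```typescript
// SemanticCite-style support labels; the matching logic here is a toy
// heuristic, not the real classifier.
type Support = "SUPPORTED" | "PARTIALLY_SUPPORTED" | "UNSUPPORTED";

function classifySupport(claimTerms: string[], sourceText: string): Support {
  const text = sourceText.toLowerCase();
  const hits = claimTerms.filter((t) => text.includes(t.toLowerCase())).length;
  if (hits === claimTerms.length) return "SUPPORTED";
  if (hits > 0) return "PARTIALLY_SUPPORTED";
  return "UNSUPPORTED";
}

const verdict = classifySupport(
  ["raised", "$4B", "2024"],
  "Amazon's investment brought Anthropic's total raised to $4B in 2024.",
);
// verdict: "SUPPORTED"
```

The value of the typed label is downstream: UNSUPPORTED claims become concrete work items for the refinement loop rather than free-text complaints.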

Pass V2: Validation Rules

  • Input: Draft
  • Process: Run the full validation engine (dollar signs, comparison operators, frontmatter schema, EntityLink IDs, etc.). Auto-fix where possible.
  • Output: Clean draft passing all blocking rules
  • Agent: Verifier
  • Cost: $0.10

Pass V3: Self-Review

  • Input: Draft + template + quality rubric
  • Process: Grade the page against the template rubric. Identify weak sections with specific, actionable feedback. Determine if another pass through content/enrichment is needed.
  • Output: Review report with per-section scores and improvement suggestions
  • Agent: Reviewer (inspired by Self-Refine)
  • Cost: $1-2

Iterative Refinement Loop

The Reviewer (V3) can trigger re-execution of specific passes:

[Diagram]

The orchestrator limits iterations (default: 2 refinement cycles) to control cost. Each cycle targets only the specific passes needed.
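The bounded loop can be sketched as follows (function names and the score threshold are hypothetical):

```typescript
// Bounded generate-feedback-refine loop: re-run only the passes the
// reviewer flagged, at most maxCycles times. Illustrative sketch.
type Review = { score: number; weakPasses: string[] };

function refine(
  review: () => Review,
  rerun: (passIds: string[]) => void,
  threshold = 85,
  maxCycles = 2,
): Review {
  let result = review();
  for (let cycle = 0; cycle < maxCycles && result.score < threshold; cycle++) {
    rerun(result.weakPasses); // targeted re-passes only
    result = review();
  }
  return result;
}

// Example: a page that needs exactly one cycle.
const scores = [70, 90];
let i = 0;
let reruns = 0;
const final = refine(() => ({ score: scores[i++], weakPasses: ["E1"] }), () => reruns++);
// final.score === 90, reruns === 1
```

Capping `maxCycles` is what keeps worst-case cost predictable even when the reviewer keeps finding issues.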


Part 4: Tier Configurations

Budget Tier ($5-8)

For drafts and low-importance pages:

R2(lite) -> S1 -> C1 -> V2 -> V3(single)

5 passes, no enrichment, no iterative refinement. Produces a well-structured, cited article without diagrams or calculations.

Standard Tier ($12-18)

For most pages:

R1 -> R2 -> R3 -> S1 -> S2 -> C1 -> C2 -> C3 -> E1 -> E3 -> V1 -> V2 -> V3 -> [1 refinement cycle]

13+ passes, including diagrams, cross-linking, and one refinement cycle; this is the expected default for most pages.

Premium Tier ($20-30)

For high-importance or controversial pages:

R1 -> R2(deep) -> R3 -> R4 -> S1 -> S2 -> C1 -> C2 -> C3 -> E1 -> E2 -> E3 -> E4 -> V1 -> V2 -> V3 -> [2 refinement cycles]

All passes including Squiggle models, fact extraction, content gap analysis, and two refinement cycles.

Polish Tier ($3-5)

For improving existing pages (replaces current crux content improve):

R3 -> C3 -> E1(if missing) -> V1 -> V2 -> V3

Focuses on cross-linking, enrichment gaps, and citation verification. Doesn't rewrite prose.
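Expressed as data, the four tiers above might look like this (the structure is illustrative, not the actual Crux config format):

```typescript
// Tier configurations as orchestrator input. Pass IDs match the tier
// sequences described above; the shape itself is a sketch.
interface TierConfig {
  name: string;
  passes: string[];
  refinementCycles: number;
}

const TIERS: Record<string, TierConfig> = {
  budget: { name: "Budget", passes: ["R2", "S1", "C1", "V2", "V3"], refinementCycles: 0 },
  standard: {
    name: "Standard",
    passes: ["R1", "R2", "R3", "S1", "S2", "C1", "C2", "C3", "E1", "E3", "V1", "V2", "V3"],
    refinementCycles: 1,
  },
  premium: {
    name: "Premium",
    passes: ["R1", "R2", "R3", "R4", "S1", "S2", "C1", "C2", "C3",
             "E1", "E2", "E3", "E4", "V1", "V2", "V3"],
    refinementCycles: 2,
  },
  polish: { name: "Polish", passes: ["R3", "C3", "E1", "V1", "V2", "V3"], refinementCycles: 0 },
};
```

Keeping tiers as data (rather than code paths) means new tiers are a config change, and the orchestrator can estimate cost by summing per-pass estimates before running anything.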


Part 5: Knowledge Graph Integration

Graph-Driven Content Planning

The biggest architectural shift: the knowledge graph drives content creation rather than being consulted as an afterthought.

[Diagram]

When a new page is created, the system should also propose updates to existing pages that should link to it. The Graph Analyst agent:

  1. Identifies entities in the graph that relate to the new page
  2. Reads existing pages for those entities
  3. Identifies natural insertion points for EntityLinks to the new page
  4. Produces a link-updates.json with proposed edits

This transforms page creation from an isolated act into a graph maintenance operation.

Community Summaries (GraphRAG-inspired)

For entity clusters (e.g., all "alignment approaches"), maintain pre-computed community summaries that:

  • Describe the cluster's theme
  • List key entities and their relationships
  • Identify gaps (entities that exist but lack pages)

These summaries can be used by the Writer agent as context when synthesizing content that touches on a cluster.


Part 6: Dynamic Computation Embedding

The Vision

Partial implementation note: The <Calc> and <KBF> (KB Fact) components now partially implement the "computation embedding" vision described here. <KBF> pulls live values from KB YAML files in packages/kb/data/things/, and <Calc> computes derived values from those facts. The Squiggle-based uncertainty modeling described below remains aspirational.

Wiki pages shouldn't just state numbers -- they should compute them from the knowledge base.

Example: A page about "AI lab safety spending" could include:

<SquiggleEstimate
  title="Estimated AI Safety Spending (2025)"
  code={`
    anthropicRevenue = 2B to 3.5B
    openaiRevenue = 3B to 5B
    deepmindBudget = 1.5B to 2.5B

    safetyFraction = {
      anthropic: 0.15 to 0.25,
      openai: 0.05 to 0.12,
      deepmind: 0.10 to 0.20
    }

    totalSafetySpending = anthropicRevenue * safetyFraction.anthropic
      + openaiRevenue * safetyFraction.openai
      + deepmindBudget * safetyFraction.deepmind
  `}
/>

How the Enricher Agent Creates These

  1. Identify computational opportunities: Scan the draft for quantitative claims that involve estimation or aggregation
  2. Pull from KB facts: Use existing KB fact values from packages/kb/data/things/ as inputs where available
  3. Create Squiggle models: Write distribution-based models (never point estimates) following the Squiggle style guide
  4. Validate: Run the Squiggle code to ensure it executes without errors
  5. Embed: Place <SquiggleEstimate> components at appropriate locations

Fact Feedback Loop

The Enricher can also propose new facts to the KB (packages/kb/data/things/) based on claims in the page, creating a feedback loop where page content enriches the data layer, which in turn feeds future computations.


Part 7: Implementation Plan

Phase 1: Pass Infrastructure (Foundation)

Build the composable pass system on top of existing Crux infrastructure:

interface Pass {
  id: string;
  name: string;
  agent: AgentType;

  // Input/output contract
  requires: string[];  // IDs of passes that must run first
  produces: string[];  // Artifact keys this pass creates

  // Execution
  execute(context: PassContext): Promise<PassResult>;

  // Cost estimation
  estimateCost(context: PassContext): number;
}

interface PassContext {
  topic: string;
  entityType: string;
  tier: TierConfig;

  // Accumulated artifacts from prior passes
  artifacts: Map<string, any>;

  // Shared resources
  entityDb: EntityDatabase;
  sourceDb: SourceDatabase;
  validationEngine: ValidationEngine;
}

This leverages the existing Crux validation engine, entity lookup, and source database. Each pass is a module in crux/authoring/passes/.
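A concrete pass module under this contract might look like the following sketch (the enrichment logic is stubbed; a real pass would call the Graph Analyst agent):

```typescript
// Illustrative pass module conforming to the Pass contract sketched above.
// The body is a stub: it copies the draft instead of actually linking it.
const entityLinkPass = {
  id: "C3",
  name: "EntityLink Enrichment",
  agent: "graph-analyst",
  requires: ["C1"],            // needs an assembled draft first
  produces: ["draft:linked"],  // artifact key written into the context
  async execute(ctx: { artifacts: Map<string, unknown> }) {
    const draft = ctx.artifacts.get("draft") as string;
    // Real implementation: scan for entity mentions, insert <EntityLink>
    // components, and check link density against the template.
    ctx.artifacts.set("draft:linked", draft);
    return { ok: true };
  },
  estimateCost: () => 0.25,
};
```

Because every pass reads and writes named artifacts in the shared context, any pass can be run in isolation against a saved artifact map when debugging.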

Phase 2: Specialist Agents

Implement agents as wrappers around Claude API calls with focused system prompts:

interface Agent {
  id: AgentType;
  model: 'opus' | 'sonnet' | 'haiku';
  systemPrompt: string;
  tools: Tool[];

  run(input: AgentInput): Promise<AgentOutput>;
}

Start with the Writer and Verifier agents (biggest impact), then add Graph Analyst and Enricher.

Phase 3: Orchestrator

Build the orchestrator that plans pass sequences and manages quality gates:

class Orchestrator {
  planPasses(topic: string, tier: Tier, entityType: string): Pass[];
  executePlan(passes: Pass[], context: PassContext): Promise<Page>;
  checkQualityGate(page: Page, template: Template): QualityResult;
  planRefinement(review: ReviewResult): Pass[];
}

Phase 4: Graph Integration

Add bidirectional link updates and community summaries. This requires changes to build-data.mjs to compute community clusters and maintain summary cache.

Phase 5: Computation Embedding

Add the Squiggle model generation pass. Requires integration with the Squiggle runtime for validation.

Migration Path

The new system can coexist with the current pipeline:

# Current system (preserved)
pnpm crux content create "Topic" --tier=standard

# New system (opt-in)
pnpm crux content create "Topic" --tier=standard --engine=v2

# Eventually
pnpm crux content create "Topic" --tier=standard  # defaults to v2

Part 8: Comparison with Alternatives

Alternative A: Keep Improving Current Pipeline

Pros: Lower risk, incremental improvement, already works.

Cons: Structural ceiling -- single-agent synthesis can't produce the cross-linking, computation, and diagram density we want. Diminishing returns on prompt engineering.

Alternative B: Adopt STORM Directly

Pros: Battle-tested, open source, good research quality.

Cons: Python-only (we're TypeScript), no support for EntityLinks/Squiggle/Mermaid/YAML entities, no knowledge graph integration, would require heavy forking. The research stage is valuable but the writing stage doesn't match our needs.

Alternative C: Full CrewAI/LangGraph Framework

Pros: Rich agent orchestration, built-in patterns for sequential/parallel execution.

Cons: Heavy framework dependency, Python ecosystem, abstractions don't map cleanly to our YAML-first data model. We'd spend more time fighting the framework than building features.

Alternative D: Claude Agent SDK Multi-Agent (Proposed Approach)

Pros: Same TypeScript ecosystem, direct Anthropic API integration, subagent orchestration built-in, proven at scale (90.2% improvement over single-agent in Anthropic's own eval). Builds on existing Crux infrastructure.

Cons: Higher implementation effort than Alternative A, less battle-tested than STORM for research quality.

Recommendation: Alternative D, borrowing specific ideas from STORM (perspective-guided research, simulated conversations) and GraphRAG (community summaries, graph-driven planning).


Part 9: Key Ideas Borrowed

| Source | Idea | How We Use It |
|---|---|---|
| STORM | Perspective-guided question asking | Pass R1 discovers perspectives from existing wiki pages |
| STORM | Simulated expert conversations | Writer agent can simulate researcher/expert dialogue |
| STORM | Pre-writing/writing separation | Passes R1-S2 are pre-writing; C1+ is writing |
| GraphRAG | Community summaries | Pre-computed cluster summaries for entity groups |
| GraphRAG | Subgraph retrieval | Graph Analyst retrieves relevant subgraph, not just entity list |
| CrewAI | Specialist agents with handoff contracts | 8 agents with typed input/output |
| CrewAI | Sequential pipeline with clear boundaries | Pass dependency graph |
| Self-Refine | Generate-feedback-refine loop | V3 reviewer triggers targeted re-passes |
| Self-Refine | Specific, actionable feedback | Reviewer produces per-section scores, not vague "improve" |
| SemanticCite | Per-claim citation verification | V1 classifies each claim's support level |
| Anthropic Research | Orchestrator + parallel subagents | Orchestrator plans, specialists execute (potentially in parallel) |
| Anthropic Research | Expensive orchestrator, cheap workers | Opus for orchestrator/writer/reviewer, Sonnet/Haiku for research/verification |

Part 10: Success Metrics

How we'll know the new architecture is working:

| Metric | Current | Target | How Measured |
|---|---|---|---|
| Average page quality score | 70-78 | 82-90 | Template grading rubric |
| EntityLinks per page | 5-10 | 15-25 | Metrics extractor |
| Citations per page | 35-42 | 40-60 | Footnote count |
| Diagrams per page | 0-1 | 1-3 | Metrics extractor |
| Squiggle models per page | 0 | 0-2 (where applicable) | Component count |
| Inbound links created | 0 | 3-5 per new page | Bidirectional link updates |
| Facts extracted to YAML | 0 | 2-5 per new page | Fact extraction pass |
| Cost per standard page | $4-6 | $12-18 | API cost tracking |

The cost increase is intentional. We're trading $8-12 more per page for substantially higher quality. At 625 pages, even regenerating the entire wiki would cost $7,500-11,000 -- a one-time investment.


Part 11: Composable Module Architecture (February 2026 Refinement)

The original proposal (Parts 3-7) frames the system as specialist agents with composable passes. A further refinement: the passes themselves should be reusable modules — independent tools that compose in the improve pipeline, auto-update, page creation, and standalone CLI commands.

Core Insight: Modules as Agent Tools

Instead of a fixed pipeline where passes run in a predetermined sequence, the orchestrator is an LLM agent that has modules available as tools and decides what to call based on what the page actually needs:

Agent reads page → analyzes gaps → calls tools → checks result → iterates

A page with good prose but no diagrams gets diagram tools. A page with bad sourcing gets research + citation tools. The agent adapts to the page rather than running a fixed sequence.
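A rule-table caricature of that gap analysis is sketched below; the thresholds and tool names are invented for illustration, and the real orchestrator is an LLM agent deciding from richer context, not a lookup table:

```typescript
// Gap-driven tool selection: map observable page state to the tools worth
// calling. Thresholds are illustrative placeholders.
interface PageState {
  qualityScore: number;
  citations: number;
  entityLinks: number;
  diagrams: number;
}

function planTools(page: PageState): string[] {
  const tools = new Set<string>();
  if (page.qualityScore < 60) {
    tools.add("research-agent");
    tools.add("rewrite-section");
  }
  if (page.entityLinks < 10) tools.add("add-entity-links");
  if (page.diagrams === 0) tools.add("add-diagram");
  if (page.citations > 0) tools.add("citation-auditor"); // verify whatever is cited
  return [...tools];
}

const toolPlan = planTools({ qualityScore: 70, citations: 30, entityLinks: 4, diagrams: 0 });
// toolPlan: ["add-entity-links", "add-diagram", "citation-auditor"]
```

The agent version of this logic also re-reads the page after each tool call, so later decisions see the effect of earlier ones.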

The Module Kit

Research & Grounding:

| Module | Purpose | Standalone CLI | Cost |
|---|---|---|---|
| source-fetcher | Fetch URL → clean markdown + relevant excerpts | crux citations verify (upgraded) | $0 (no LLM) |
| research-agent | Multi-source search → structured facts with quotes | crux research <topic> | $1-3 |
| source-cache | Persistent store of fetched sources + extracted facts | Reused across runs | $0 |
| claim-verifier | Check if source supports a specific claim | Per-citation in auditor | $0.01/claim (Haiku) |
| citation-auditor | Independent verification of all citations on a page | crux citations audit <id> | $0.10-0.30 |

Content Writing:

| Module | Purpose | Standalone CLI | Cost |
|---|---|---|---|
| rewrite-section | Rewrite ONE section, constrained to source cache | Used by orchestrator | $0.10-0.30/section |

Enrichment:

| Module | Purpose | Standalone CLI | Cost |
|---|---|---|---|
| add-entity-links | Insert EntityLink components for mentioned entities | crux enrich entity-links <id> | $0.05 (Haiku) |
| add-fact-refs | Wrap hardcoded numbers in <KBF> tags referencing KB facts | crux enrich fact-refs <id> | $0.05 (Haiku) |
| add-diagram | Generate Mermaid diagram for a section | crux enrich diagram <id> | $0.10-0.20 |
| add-squiggle | Add uncertainty modeling for a section | crux enrich squiggle <id> | $0.10-0.20 |

Why Section-Level Matters

The current pipeline rewrites 2,000-4,000 words in one LLM call. The prompt must simultaneously handle prose, citations, EntityLinks, Facts, Calc, escaping, and structure. Enrichments compete for attention.

With section-level rewrite-section:

  • Focused context: LLM handles 200-500 words at a time
  • Better grounding: only relevant sources for that section
  • Isolated enrichments: adding a diagram can't break citations
  • Partial progress: if budget runs out, 3 improved sections beats nothing

Source-Fetcher as Foundation

The single most important primitive. Currently, the pipeline never reads cited URLs — it gets search snippets and trusts the LLM to cite accurately. The source-fetcher creates ground truth by actually fetching and caching source content.

This unlocks:

  • Citation auditor: can verify claims against actual source text
  • Grounded writer: can be constrained to only cite from fetched+cached sources
  • Claim map: mechanical link from "claim in output" → "quote in fetched source"
  • Existing crux citations verify: upgrades from "is URL alive?" to "does URL support the claim?"
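A sketch of the primitive, with the real HTTP fetch and HTML-to-markdown cleanup abstracted behind a caller-supplied `fetch_fn`; the class and its interface are illustrative assumptions, not the actual implementation.

```python
import hashlib

class SourceCache:
    """Fetch-once cache of source documents: the ground truth that
    the citation auditor and grounded writer both read from."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn  # url -> cleaned markdown text
        self.store = {}           # url -> {"hash": ..., "markdown": ...}

    def get(self, url):
        if url not in self.store:  # fetch once, reuse across runs
            markdown = self.fetch_fn(url)
            self.store[url] = {
                "hash": hashlib.sha256(markdown.encode()).hexdigest(),
                "markdown": markdown,
            }
        return self.store[url]

    def supports(self, url, quote):
        """'Does URL support the claim?' reduced to its simplest form:
        does the fetched source actually contain the quoted text?"""
        return quote.lower() in self.get(url)["markdown"].lower()
```

A real `supports` check would use an LLM claim-verifier rather than substring matching, but the cache contract is the same.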

Claim Map: Mechanical Grounding Contract

The rewrite-section module outputs an explicit claim map alongside the content:

{
  "content": "...improved section MDX...",
  "claimMap": [
    { "claim": "Anthropic raised \$4B in 2024", "factId": "f-012", "sourceUrl": "https://..." },
    { "claim": "Founded in 2021 by Dario Amodei", "factId": "f-003", "sourceUrl": "https://..." }
  ],
  "ungroundedClaims": ["Anthropic is widely considered a leader in AI safety"]
}

This is mechanically verifiable: check that every factId exists in the source cache and the claim matches the extracted quote. Ungrounded claims are flagged for human review or removal.
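That mechanical check can be sketched in a few lines; the function name and cache shape below are illustrative assumptions, not the actual schema.

```python
def audit_claim_map(result, source_cache):
    """Mechanically verify a rewrite-section claim map:
    every factId must exist in the source cache, and the cached fact
    must point at the same source URL the claim cites.
    Ungrounded claims are passed through for human review."""
    failures = []
    for entry in result["claimMap"]:
        fact = source_cache.get(entry["factId"])
        if fact is None:
            failures.append((entry["claim"], "unknown factId"))
        elif fact["sourceUrl"] != entry["sourceUrl"]:
            failures.append((entry["claim"], "source mismatch"))
    return {"failures": failures,
            "flagged": list(result["ungroundedClaims"])}
```

No LLM is needed here: the expensive judgment (does the quote back the claim?) happened once at extraction time, and this pass only checks referential integrity.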

Budget as Tool-Call Limits

Instead of tiers selecting fixed phase sequences, the agent gets a budget and decides how to spend it:

| Budget | Max Tool Calls | Research Queries | Enabled Tools |
| --- | --- | --- | --- |
| Polish | 8 | 0 | rewrite-section, add-entity-links, add-fact-refs, validate |
| Standard | 20 | 5 | All tools |
| Deep | 50 | 15 | All tools |

The agent plans its approach based on page state (quality score, citation count, entity link density, diagram count) and budget constraints.
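The tiers reduce to plain configuration plus a gate the orchestrator consults before each tool call. A sketch, with hypothetical names:

```python
BUDGETS = {
    # "tools": None means every tool is enabled for that tier.
    "polish":   {"max_tool_calls": 8,  "research_queries": 0,
                 "tools": {"rewrite-section", "add-entity-links",
                           "add-fact-refs", "validate"}},
    "standard": {"max_tool_calls": 20, "research_queries": 5, "tools": None},
    "deep":     {"max_tool_calls": 50, "research_queries": 15, "tools": None},
}

def allowed(budget_name, tool):
    """Gate consulted before each tool call: is this tool enabled
    for the chosen budget tier?"""
    tools = BUDGETS[budget_name]["tools"]
    return tools is None or tool in tools
```

Because the budget is data rather than a hard-coded phase sequence, adding a new tier or tool is a config change, not a pipeline change.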

Cross-Context Reuse

The same modules compose in every context:

| Context | Modules Used |
| --- | --- |
| Improve pipeline | All modules via agent orchestrator |
| Auto-update | research-agent + rewrite-section + citation-auditor |
| Page creation | research-agent + rewrite-section (all sections) + all enrichment |
| Citation health check | source-fetcher + claim-verifier (no agent needed) |
| Batch entity-linking | add-entity-links across many pages (no agent) |
| Manual editing assist | research-agent → cache → human edits → citation-auditor |

Implementation: Foundation Issues

The first four modules to build (GitHub issues):

  1. Source Fetcher (#633) — fetch URLs, extract content, cache results. Foundation for everything else.
  2. Section-Level Grounded Writer (#634) — rewrite one section constrained to source cache, output claim map.
  3. Citation Auditor (#635) — independent per-citation verification using fetched source content.
  4. Standalone Enrichment Tools (#636) — entity-links and fact-refs as independent, idempotent tools.

Dependency order: #633 is the foundation. #634 and #635 depend on it. #636 is independent. The orchestrator agent (Part 3's Phase 3) is built last, after the tools exist.


Part 12: Academic Foundations for Claim-First Content

Recent academic work provides strong validation for the proposition-level and claim-first approaches described in this document and the companion Claim-First Architecture proposal (now superseded by the KB system).

Proposition-Level Retrieval (Dense X Retrieval)

Chen et al. (EMNLP 2024) define a proposition as an atomic, self-contained factual expression with three properties: minimal (cannot be further split), self-contained (includes resolved coreferences), and compositional (the union of all propositions reconstructs full semantics). They built a "Propositionizer" model that decomposes text into propositions and showed that indexing Wikipedia at proposition level significantly outperformed both passage-level and sentence-level indexing for retrieval and QA tasks.1

This is direct evidence for the claim-first thesis: atomic propositions are better retrieval units than paragraphs or documents.

Decompose-Then-Verify (FActScore)

FActScore (Min et al., EMNLP 2023) formalized the decompose-then-verify pipeline: decompose generated text into atomic facts, retrieve evidence, verify each fact, and compute the percentage supported. Applied to ChatGPT-generated biographies, it found that only 58% of atomic facts were supported by sources.2 Google's SAFE system extends this with multi-step search verification at roughly 1/20th the cost of human annotators.

Our pipeline applies the same decomposition proactively — producing atomic claims before writing rather than extracting them afterward. The Kalshi experiment (see the Claim-First Architecture proposal, Part 9b) confirmed this catches embellishments that post-hoc verification misses.
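The FActScore-style score itself is simple to state in code: the fraction of atomic facts a verifier supports. A sketch, where `verify` abstracts the retrieval-plus-verification step:

```python
def factscore(atomic_facts, verify):
    """Decompose-then-verify scoring in the FActScore style.
    `verify(fact) -> bool` abstracts evidence retrieval and
    per-fact verification; the score is the supported fraction."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if verify(fact))
    return supported / len(atomic_facts)
```

The proactive variant described above runs the same `verify` over candidate claims before writing, so unsupported material never reaches the page.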

Nanopublications: The Formal Precedent

The Semantic Web community's nanopublication framework is the academic formalization of our knowledge bundle idea. Each nanopublication has three parts: Assertion (the claim itself as RDF triples), Provenance (how the assertion came about — methods, evidence), and Publication Info (metadata). Nanopublications are immutable, cryptographically verifiable, and operate on a decentralized server network.3

The micropublication extension adds richer argumentation structure: a statement plus supporting evidence, interpretations, and challenges.4 This maps directly to our analytical claim type with its supportingClaims and reasoning fields.
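A knowledge bundle shaped like a nanopublication might look as follows. This is an illustrative in-memory shape only, not the RDF serialization the framework actually uses; all field contents are invented examples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen echoes nanopublications' immutability
class Nanopub:
    assertion: tuple   # the claim itself, as a (subject, predicate, object) triple
    provenance: dict   # how the assertion came about: method, evidence pointer
    pubinfo: dict      # publication metadata: author, timestamp, license
```

The three-part split is the useful idea: the claim, its evidence trail, and its metadata live together but stay distinguishable, which is exactly what the claim map and source cache need.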

Knowledge-Centric Templatic Views (SURe)

A January 2024 paper introduces Structure Unified Representation — capturing the most important knowledge from a document in a structured format, then generating multiple view types (slide decks, newsletters, reports) from that single representation with no supervision.5 This is the closest published validation of the "multiple presentation layers from one data layer" pattern that both the multi-agent architecture and claim-first architecture rely on.

GPTKB: Warning on LLM-Generated Knowledge

GPTKB (2024-2025) built a knowledge base entirely from LLM output: 105 million triples for 2.9 million entities. Their critical finding: accuracy falls well short of established knowledge-base projects, because the LLM generates many incorrect and unverifiable facts.6 This strongly validates our design decision that verification must precede storage. The claim-first architecture's insistence on per-claim verification before claims enter the store directly addresses GPTKB's accuracy problems.

Multi-Agent Verification (KARMA)

KARMA (2025) uses nine collaborative agents for entity discovery, relation extraction, schema alignment, and conflict resolution. Tested on 1,200 PubMed articles: 38,230 new entities with 83.1% verified correctness, reducing conflict edges by 18.6%.7 The modular, multi-agent design with cross-agent verification maps directly to our specialist agent architecture.

Block-Based Knowledge Tools

Production tools like Roam Research, Logseq, and Notion demonstrate that block-level (claim-level) architectures work at scale. Logseq's dual-database design (in-memory Datascript + persistent SQLite) with block-level content, typed properties, and Datalog queries is particularly instructive for the claim store's eventual database-backed implementation (Option C in the claim-first architecture).8

Argumentation Frameworks

For AI safety topics where many claims involve genuine disagreement (timelines, risk levels, alignment difficulty), claim-augmented argumentation frameworks provide formal tools for representing competing positions. These extend Dung's abstract argumentation by associating a claim to each argument, enabling re-interpretation at different evaluation stages.9 This maps to our consensus and analytical claim types and suggests the claim store should eventually support explicit argumentation structure.
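As a concrete instance, the grounded extension of a Dung-style framework (the set of arguments that survive under the most skeptical reading) can be computed by iterating the characteristic function. A sketch under the abstract, claim-free formulation; a claim-augmented version would carry a claim label alongside each argument.

```python
def grounded_extension(arguments, attacks):
    """Grounded extension of an abstract argumentation framework
    (Dung 1995): iterate the characteristic function from the empty
    set until a fixpoint. An argument is acceptable w.r.t. a set S
    if S attacks every attacker of that argument."""
    extension = set()
    while True:
        acceptable = {
            a for a in arguments
            if all(any((d, b) in attacks for d in extension)
                   for b in arguments if (b, a) in attacks)
        }
        if acceptable == extension:
            return extension
        extension = acceptable
```

For example, with attacks a→b and b→c, the grounded extension is {a, c}: a is unattacked, and a defends c by attacking c's only attacker b.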


Footnotes

  1. Dense X Retrieval: What Retrieval Granularity Should We Use? (Chen et al., EMNLP 2024). See also Factoid Wiki.

  2. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (Min et al., EMNLP 2023)

  3. Nanopublication-based semantic publishing and reviewing (PeerJ Computer Science, 2023). See also Nanopublication Guidelines.

  4. Micropublications: a semantic model for claims, evidence, arguments and annotations (Journal of Biomedical Semantics, 2014)

  5. Knowledge-Centric Templatic Views of Documents (arXiv, January 2024)

  6. GPTKB: Building Very Large Knowledge Bases from Language Models (arXiv, November 2024). See also gptkb.org.

  7. Citation rc-c558 (data unavailable — rebuild with wiki-server access)

  8. Logseq Architecture — Block-Level Database Design (DeepWiki)

  9. Claim-augmented argumentation frameworks (Artificial Intelligence, 2023)