System Architecture

This document provides a technical overview of how the Longterm Wiki is built, the novel design patterns it uses, and the rationale behind key architectural decisions. It's intended both as a reference for contributors and as a catalog of the ideas that make the system work.

Keeping This Updated

When making significant changes to pipelines or data flow, update the relevant sections here. See Documentation Maintenance for guidelines.

High-Level Architecture

The wiki is a Next.js 15 application with a YAML-first data layer, a CLI toolchain (Crux), and an AI-assisted content pipeline.

flowchart TB
  subgraph Sources["Data Sources"]
      KB[("KB YAML<br/>packages/kb/data/things/")]
      YAML[("YAML Files<br/>entities, resources")]
      MDX[("MDX Pages<br/>~700 articles")]
      GRAPHS[("Graph Data<br/>cause-effect YAML")]
  end

  subgraph Build["Build-Time Processing"]
      TRANSFORM["entity-transform.mjs<br/>Type mapping"]
      BUILDDATA["build-data.mjs<br/>Compilation pipeline"]
      REDUNDANCY["redundancy.mjs<br/>Similarity detection"]
      METRICS["metrics-extractor.ts<br/>Structural scoring"]
      FACTS["computed-facts.mjs<br/>Expression evaluator"]
  end

  subgraph Artifacts["Build Artifacts"]
      JSON[("database.json<br/>All entities + pages")]
  end

  subgraph Runtime["Next.js Runtime"]
      DATA["data/index.ts<br/>Lazy-loaded indexes"]
      COMPONENTS["React components<br/>EntityLink, InfoBox, etc."]
      PAGES["MDX rendering<br/>next-mdx-remote"]
  end

  KB --> BUILDDATA
  YAML --> TRANSFORM --> BUILDDATA
  MDX --> BUILDDATA
  GRAPHS --> BUILDDATA
  BUILDDATA --> REDUNDANCY
  BUILDDATA --> METRICS
  BUILDDATA --> FACTS
  BUILDDATA --> JSON
  BUILDDATA --> REGISTRY
  JSON --> DATA
  DATA --> COMPONENTS
  DATA --> PAGES

Tech Stack

Layer	Technology
Framework	Next.js 15 with App Router
Components	React 19 + next-mdx-remote
Styling	Tailwind CSS v4 + shadcn/ui
Type Safety	TypeScript + Zod schemas
Graphs	ReactFlow (XYFlow) + Dagre/ELK layout
Diagrams	Mermaid 11
Search	PostgreSQL full-text search (wiki-server)
CLI	Crux (custom multi-domain CLI)
Data	KB YAML + YAML sources → JSON build artifacts
Workspace	pnpm workspaces (app + crux)

Clever Ideas

This section catalogs the novel architectural patterns — the ideas that distinguish this system from a typical documentation site.

1. Multi-Signal Relationship Graph

Location: app/scripts/build-data.mjs (lines 230-424)

Instead of manually curating "related pages" links, the system computes a weighted relationship graph by combining five different signals:

Signal	Weight	Source
Explicit YAML `relatedEntries`	10	Human-authored
Name/prefix matching (e.g. "anthropic" ↔ "anthropic-ipo")	6	Structural
Content `<EntityLink>` references	5	Content-derived
N-gram content similarity	0–3 (scaled)	Computed
Shared tags (specificity-weighted)	varies	Computed

Each neighbor's score gets a quality boost based on the target page's quality and importance ratings: boost = 1 + quality/40 + importance/400 (max ~1.45x). Unrated pages default to average values so they aren't penalized.

The output uses type-diverse selection: at least 2 entries from each entity type are guaranteed before filling remaining slots by score. This prevents the "related" sidebar from being dominated by one type.

Directional relationship labels ("mitigates", "caused by") are preserved through an inverse-label mapping table, so both directions of a relationship get meaningful labels.
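The boost formula and the type-diverse selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in build-data.mjs: the `RelatedNeighbor` shape and the names `qualityBoost` and `selectDiverse` are assumptions made for the example.

```typescript
// Hypothetical neighbor shape; the real build script may store more fields.
interface RelatedNeighbor {
  id: string;
  type: string;
  score: number;
}

// boost = 1 + quality/40 + importance/400, as stated in the text above.
function qualityBoost(quality: number, importance: number): number {
  return 1 + quality / 40 + importance / 400;
}

// Reserve up to `minPerType` slots per entity type, then fill the rest by score.
function selectDiverse(
  neighbors: RelatedNeighbor[],
  limit: number,
  minPerType = 2
): RelatedNeighbor[] {
  const sorted = [...neighbors].sort((a, b) => b.score - a.score);
  const picked: RelatedNeighbor[] = [];
  const perType = new Map<string, number>();

  // Pass 1: take the best entries of each type until its quota is met.
  for (const n of sorted) {
    const count = perType.get(n.type) ?? 0;
    if (count < minPerType && picked.length < limit) {
      picked.push(n);
      perType.set(n.type, count + 1);
    }
  }
  // Pass 2: fill any remaining slots with the highest-scoring leftovers.
  for (const n of sorted) {
    if (picked.length >= limit) break;
    if (picked.indexOf(n) === -1) picked.push(n);
  }
  return picked.sort((a, b) => b.score - a.score);
}
```

With four high-scoring risks and one low-scoring person, a limit of 3 still leaves room for the person, which is the behavior the sidebar relies on.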

2. Stable Wiki ID System

Location: app/scripts/build-data.mjs (lines 704-776)

Every entity gets a stable numeric ID (E1, E42, E552) that enables canonical URLs surviving slug renames.

The key insight: IDs are allocated atomically by the wiki server and written back to source files. This means:

YAML entities and MDX frontmatter are the single source of truth
New entities get auto-assigned IDs from the server on their first build
The server's PostgreSQL database prevents race conditions and ID reassignment
ID conflicts are detected at build time and fail the build

Resolution at runtime supports both numeric IDs and slugs, so <EntityLink id="E521" name="coefficient-giving"> and path-based lookups both work.
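A resolver that accepts either form might look like the sketch below. The `ResolvableEntity` shape and the `buildResolver` helper are hypothetical names used only to illustrate the dual lookup; the actual data layer may be organized differently.

```typescript
// Hypothetical entity shape with both identifiers.
interface ResolvableEntity {
  numericId: string; // e.g. "E521"
  slug: string;      // e.g. "coefficient-giving"
  title: string;
}

// Build two indexes once, then route each reference to the matching index.
function buildResolver(entities: ResolvableEntity[]) {
  const byNumericId = new Map(entities.map(e => [e.numericId, e] as const));
  const bySlug = new Map(entities.map(e => [e.slug, e] as const));
  return (ref: string): ResolvableEntity | undefined =>
    /^E\d+$/.test(ref) ? byNumericId.get(ref) : bySlug.get(ref);
}
```

Because the numeric ID never changes, a link written as `E521` keeps resolving after the slug is renamed.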

3. Content-Derived Backlinks

Location: app/scripts/build-data.mjs (lines 190-225, 867-888)

The system merges two backlink sources:

Explicit YAML relatedEntries (semantic, directional)
MDX content scans for EntityLink components (implicit, extracted by regex)

Content scanning happens before raw MDX is stripped from the build output, creating an inbound-link map that's deduplicated by ID. This means every entity knows both who links to it and who it links to, without manual maintenance.
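As a rough sketch of the content scan, the function below extracts `EntityLink` ids with a regular expression and builds a deduplicated inbound-link map. The `ScannedPage` shape, the exact regex, and the self-link filter are assumptions for illustration; the build script's pattern may differ.

```typescript
// Hypothetical page shape: an id plus the raw MDX body.
interface ScannedPage {
  id: string;
  mdx: string;
}

// Map each target id to the set of page ids that link to it.
function buildBacklinks(pages: ScannedPage[]): Map<string, Set<string>> {
  const inbound = new Map<string, Set<string>>();
  for (const page of pages) {
    // A fresh global regex per page keeps lastIndex from leaking between pages.
    const pattern = /<EntityLink\s+[^>]*?\bid="([^"]+)"/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(page.mdx)) !== null) {
      const target = match[1];
      if (target === page.id) continue; // ignore self-links
      let sources = inbound.get(target);
      if (!sources) {
        sources = new Set<string>();
        inbound.set(target, sources);
      }
      sources.add(page.id); // the Set deduplicates repeated links
    }
  }
  return inbound;
}
```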

4. N-gram Redundancy Detection

Location: app/scripts/lib/redundancy.mjs

Pages are compared using 5-word n-gram shingling (Jaccard similarity) combined with word overlap. The system:

Extracts clean text (strips code blocks, JSX, tables, headers, markdown formatting)
Compares only within the same contentFormat (articles vs. tables vs. diagrams) to avoid false positives
Uses max(shingleSimilarity, wordSimilarity * 0.8) as combined metric
Stores top 5 similar pages per page at a 10% threshold

This feeds into the relationship graph (signal #4) and helps editors find pages that overlap.
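The similarity metric can be illustrated with a short sketch. It assumes "word overlap" means Jaccard similarity over single-word sets, which the text above does not confirm, and it omits the text-cleaning step. All function names are illustrative.

```typescript
// Build the set of n-word shingles for a text (n = 5 per the description above).
function shingles(text: string, n = 5): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(words.slice(i, i + n).join(" "));
  }
  return out;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach(item => {
    if (b.has(item)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Combined metric: max(shingleSimilarity, wordSimilarity * 0.8).
function combinedSimilarity(textA: string, textB: string): number {
  const shingleSim = jaccard(shingles(textA), shingles(textB));
  const wordSim = jaccard(shingles(textA, 1), shingles(textB, 1)); // assumed word-set Jaccard
  return Math.max(shingleSim, wordSim * 0.8);
}
```

The 0.8 factor discounts word overlap relative to shingle overlap, so shared vocabulary counts for less than shared five-word runs of equal Jaccard value.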

5. Safe Expression Evaluator for Computed Facts

Location: app/scripts/lib/computed-facts.mjs

Facts can reference other facts in expressions like {openai.revenue-2024} * {growth-rate}. Instead of using eval(), the system uses a hand-written recursive descent parser that supports:

Human-readable numeric parsing: "$350 billion" → 350_000_000_000, "40%" → 0.4
Arithmetic: +, -, *, /, parentheses
{entity.factId} references resolved in topological order
Format strings for display (currency prefixes, unit scaling)

Dependencies are resolved topologically, so a fact referencing another computed fact works correctly. Non-computable facts (qualitative values) are flagged with noCompute and skipped.
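To show what a recursive descent evaluator of this kind looks like, here is a minimal sketch covering arithmetic, parentheses, and `{entity.factId}` references. The grammar, the `evaluateExpression` name, and the pre-resolved `facts` map are assumptions; human-readable number parsing, format strings, and the topological ordering step are left out.

```typescript
// Assumed grammar for this sketch:
//   expr   := term (("+" | "-") term)*
//   term   := factor (("*" | "/") factor)*
//   factor := number | "{" ref "}" | "(" expr ")" | "-" factor
function evaluateExpression(src: string, facts: Map<string, number>): number {
  let pos = 0;

  const peek = (): string | undefined => {
    while (src[pos] === " ") pos++; // skip spaces before the next token
    return src[pos];
  };

  function factor(): number {
    const ch = peek();
    if (ch === "(") {
      pos++;
      const value = expr();
      if (peek() !== ")") throw new Error("Expected ')' at " + pos);
      pos++;
      return value;
    }
    if (ch === "-") {
      pos++;
      return -factor();
    }
    if (ch === "{") {
      const end = src.indexOf("}", pos);
      if (end === -1) throw new Error("Unclosed '{' at " + pos);
      const ref = src.slice(pos + 1, end);
      pos = end + 1;
      const value = facts.get(ref);
      if (value === undefined) throw new Error("Unknown fact: " + ref);
      return value;
    }
    const m = /^\d+(\.\d+)?/.exec(src.slice(pos));
    if (!m) throw new Error("Unexpected input at " + pos);
    pos += m[0].length;
    return parseFloat(m[0]);
  }

  function term(): number {
    let value = factor();
    for (let op = peek(); op === "*" || op === "/"; op = peek()) {
      pos++;
      const rhs = factor();
      value = op === "*" ? value * rhs : value / rhs;
    }
    return value;
  }

  function expr(): number {
    let value = term();
    for (let op = peek(); op === "+" || op === "-"; op = peek()) {
      pos++;
      const rhs = term();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  const result = expr();
  if (peek() !== undefined) throw new Error("Trailing input at " + pos);
  return result;
}
```

Since the parser only recognizes numbers, operators, parentheses, and braces, arbitrary code in a fact expression is rejected rather than executed, which is the point of avoiding eval().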

6. Build-Time Entity Transformation

Location: app/scripts/lib/entity-transform.mjs

Raw YAML entities are transformed into strictly typed entities at build time via a pure transformEntity() function. This handles:

Type migration: Old names (lab-frontier, researcher) map to canonical types (organization, person)
Subtype extraction: lab-frontier → organization + orgType: frontier-lab
CustomField extraction: Generic key-value customFields are promoted to typed fields (Role, Affiliation, Founded)
Risk categorization: Risk entities are auto-categorized (epistemic, misuse, structural) via a mapping table

By doing this at build time, the runtime never encounters legacy type names. Unknown entity types pass through unchanged — the system is forward-compatible.
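A pure transform of this shape could be sketched as follows. Only the `lab-frontier` and `researcher` mappings come from the text above; the `RawWikiEntity` shape and the `LEGACY_TYPES` table layout are assumptions, and customField extraction and risk categorization are omitted.

```typescript
// Hypothetical raw entity shape; real entities carry many more fields.
interface RawWikiEntity {
  id: string;
  type: string;
  [key: string]: unknown;
}

// Legacy type name -> canonical type plus any extracted subtype fields.
const LEGACY_TYPES = new Map<string, { type: string; extra?: Record<string, string> }>([
  ["lab-frontier", { type: "organization", extra: { orgType: "frontier-lab" } }],
  ["researcher", { type: "person" }],
]);

// Pure function: returns a new object and never mutates its input.
function transformEntity(raw: RawWikiEntity): RawWikiEntity {
  const mapped = LEGACY_TYPES.get(raw.type);
  if (!mapped) return raw; // unknown types pass through unchanged
  return { ...raw, ...mapped.extra, type: mapped.type };
}
```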

7. Format-Aware Quality Metrics

Location: crux/lib/metrics-extractor.ts

Content quality is measured structurally, but the scoring adapts to the content format:

Articles are scored on word count, section structure, presence of overview/conclusion
Tables aren't penalized for low word count
Diagrams don't need prose length or section counts

Metrics include raw counts (words, tables, diagrams, internal links, footnotes), ratios (bullet density), boolean checks (has overview?), and a composite structural score (0-15 raw, normalized to 0-50). A suggestQuality function proposes quality ratings based on structural scores, and getQualityDiscrepancy flags pages where the LLM-assigned quality disagrees with structural evidence.

8. Single-Pass Validation Engine

Location: crux/lib/validation-engine.ts

Instead of having 20+ separate validator scripts that each re-read every content file, the validation engine loads all content once and runs composable rules against it:

engine.load()  →  read all files once  →  run each Rule.check()  →  collect Issues

Each Rule has:

check(): returns issues (pure function)
Optional fix(): returns corrected content (declarative FixSpec with oldText/newText)
scope: 'file' (runs per file) or 'global' (runs once on all files)

Fixes are applied in bulk and logged to edit-logs. Four rules are CI-blocking (comparison-operators, dollar-signs, frontmatter-schema, wiki-id-integrity); the rest are advisory.
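The load-once, run-many structure can be sketched like this. The interfaces and the `no-todo` rule are hypothetical and exist only to show how file-scoped and global rules share one loaded file set; the real engine's types and its fix() handling are not reproduced here.

```typescript
interface LoadedFile {
  path: string;
  content: string;
}

interface ValidationIssue {
  rule: string;
  path: string;
  message: string;
}

interface ValidationRule {
  id: string;
  scope: "file" | "global";
  // Pure check: receives already-loaded files and returns issues.
  check(files: LoadedFile[]): ValidationIssue[];
}

// Files are loaded once by the caller; every rule reuses the same array.
function runRules(files: LoadedFile[], rules: ValidationRule[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const rule of rules) {
    if (rule.scope === "global") {
      issues.push(...rule.check(files)); // runs once on all files
    } else {
      for (const file of files) issues.push(...rule.check([file])); // runs per file
    }
  }
  return issues;
}

// Hypothetical file-scoped rule, for illustration only.
const noTodoRule: ValidationRule = {
  id: "no-todo",
  scope: "file",
  check: files =>
    files
      .filter(f => f.content.indexOf("TODO") !== -1)
      .map(f => ({ rule: "no-todo", path: f.path, message: "Contains TODO" })),
};
```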

9. YAML-First MDX Generation

Location: app/scripts/lib/mdx-generator.mjs

For entities whose content is defined entirely in YAML, minimal MDX stub files are auto-generated. The guard condition for regeneration is carefully conservative:

shouldGenerateMdx = !fileExists OR (isAutoGenerated AND no ## headings AND < 20 lines)

This means: generate if missing, regenerate if it's a short auto-generated stub, but never overwrite custom content (detected by presence of ## headings or significant length). This enables a YAML-first workflow where data authors edit YAML and MDX files appear automatically.
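The guard translates almost directly into code. In this sketch the `ExistingMdxFile` shape is an assumption, and so is the idea that auto-generation is already known as a boolean flag; how the generator actually detects an auto-generated stub is not stated above.

```typescript
// Hypothetical view of a file on disk; null means the file does not exist.
interface ExistingMdxFile {
  isAutoGenerated: boolean;
  content: string;
}

function shouldGenerateMdx(existing: ExistingMdxFile | null): boolean {
  if (existing === null) return true; // missing file: always generate
  const lines = existing.content.split("\n");
  const hasH2 = lines.some(line => /^##\s/.test(line)); // custom content marker
  return existing.isAutoGenerated && !hasH2 && lines.length < 20;
}
```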

10. Lazy-Loaded Index System

Location: app/src/data/index.ts

The database is loaded once at server startup, but indexes are built lazily on first access:

// Built on first access, then reused for the lifetime of the process.
let _index: Map<string, Entity> | null = null;
function getIndex(): Map<string, Entity> {
  if (!_index) _index = new Map(getEntities().map(e => [e.id, e]));
  return _index;
}

This avoids building indexes for entity types that are never queried in a given request. Combined with Zod validation at load time (with graceful fallback to GenericEntity for unknown types), it balances strictness with forward-compatibility.

11. Server-Side Search

Location: app/src/lib/search.ts, app/src/app/api/search/route.ts

Search uses PostgreSQL full-text search via the wiki-server, proxied through the /api/search Next.js route. The client sends queries to this proxy, which forwards them to the wiki-server's search endpoint. This replaced an earlier MiniSearch client-side fallback, simplifying the search stack and reducing client bundle size.

12. Entity Ontology with Display Metadata

Location: app/src/data/entity-ontology.ts

A single file defines the canonical ontology for 30+ entity types, each with:

Lucide icon component
iconColor (Tailwind classes, light + dark variants)
badgeColor for explore-page filtering
headerColor for InfoBox headers

Organization subtypes (frontier-lab, safety-org, startup, academic) get their own display metadata via a separate ORG_TYPE_DISPLAY map. Backward-compat aliases (researcher → person, lab-* → organization) allow gradual migrations without breaking existing data.

13. Per-Page Edit Logs

Location: crux/lib/edit-log.ts, data/edit-logs/

Each page has a separate YAML file (data/edit-logs/<page-id>.yaml) tracking who changed it, when, and how:

- date: "2026-02-13"
  tool: crux-improve
  agency: ai-directed
  tier: standard
  note: "Added citations and restructured overview"

By storing edit history outside of page frontmatter, the system separates editorial metadata from content. LLM-generated content can't accidentally corrupt the edit log. The bulk-fix system logs one entry per fixed file automatically.

14. Session Log → Change History Integration

Location: app/scripts/build-data.mjs (lines 45-100)

Claude Code session logs (.claude/sessions/*.md) are parsed at build time and attached to pages in database.json. The structured format:

## 2026-02-13 | branch-name | Short title
**What was done:** Summary text.
**Pages:** page-id-1, page-id-2

...enables the system to show "what changed and why" for any page, correlated with git branches and PRs, without modifying the content files themselves.
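Parsing the structured format shown above might look like the following sketch. The `SessionEntry` shape and the regular expressions are assumptions based only on the three-line example; the real parser may accept more fields.

```typescript
interface SessionEntry {
  date: string;
  branch: string;
  title: string;
  summary: string;
  pages: string[];
}

function parseSessionLog(markdown: string): SessionEntry[] {
  const entries: SessionEntry[] = [];
  let current: SessionEntry | null = null;
  for (const line of markdown.split("\n")) {
    // Header line: "## <date> | <branch> | <title>"
    const header = /^## (\d{4}-\d{2}-\d{2}) \| (.+?) \| (.+)$/.exec(line);
    if (header) {
      current = { date: header[1], branch: header[2], title: header[3], summary: "", pages: [] };
      entries.push(current);
      continue;
    }
    if (!current) continue; // ignore text before the first header
    const done = /^\*\*What was done:\*\*\s*(.*)$/.exec(line);
    if (done) current.summary = done[1];
    const pages = /^\*\*Pages:\*\*\s*(.*)$/.exec(line);
    if (pages) current.pages = pages[1].split(",").map(p => p.trim()).filter(Boolean);
  }
  return entries;
}
```

Each parsed entry can then be attached to every page id in its `pages` list when database.json is assembled.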

15. Frontmatter Entity Auto-Creation

Location: app/scripts/lib/frontmatter-scanner.mjs

Pages don't need a corresponding YAML entity file. The build script auto-creates entities from MDX frontmatter for any page that doesn't have one:

YAML entities (explicit) + frontmatter entities (auto-created) = full entity set

YAML entities take precedence. This means a page can start as just an MDX file with frontmatter, and the system treats it as a first-class entity — it gets a numeric ID, appears in search, and can be linked via <EntityLink>.
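The precedence rule amounts to a keyed merge in which YAML entries are written last. The `MergeableEntity` shape and the `mergeEntities` name below are illustrative assumptions, not the scanner's actual API.

```typescript
interface MergeableEntity {
  id: string;
  title: string;
  source: "yaml" | "frontmatter";
}

// Frontmatter entities go in first; YAML entities overwrite on id collision.
function mergeEntities(
  yamlEntities: MergeableEntity[],
  frontmatterEntities: MergeableEntity[]
): MergeableEntity[] {
  const byId = new Map<string, MergeableEntity>();
  for (const e of frontmatterEntities) byId.set(e.id, e);
  for (const e of yamlEntities) byId.set(e.id, e);
  return Array.from(byId.values());
}
```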

16. Inverse Relationship Labels

Location: app/scripts/build-data.mjs (lines 259-291)

When entity A declares relationship: "mitigates" toward entity B, the system auto-generates the inverse label for the B→A direction using a lookup table:

"mitigates" ↔ "mitigated by"
"causes" ↔ "caused by"
"enables" ↔ "enabled by"
"child-of" ↔ "parent of"

Explicit labels are never overwritten by inferred ones. This gives both sides of a relationship meaningful edge labels without requiring authors to declare both directions.
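A small sketch of the lookup, using the four pairs listed above. The `inverseLabel` helper and its optional `explicit` parameter are assumptions used to show that an author-supplied label wins over an inferred one.

```typescript
// The four pairs from the table above, registered in both directions.
const LABEL_PAIRS: [string, string][] = [
  ["mitigates", "mitigated by"],
  ["causes", "caused by"],
  ["enables", "enabled by"],
  ["child-of", "parent of"],
];

const INVERSE_LABELS = new Map<string, string>();
for (const [forward, backward] of LABEL_PAIRS) {
  INVERSE_LABELS.set(forward, backward);
  INVERSE_LABELS.set(backward, forward);
}

// An explicit label on the reverse edge is kept; otherwise infer one, if known.
function inverseLabel(label: string, explicit?: string): string | undefined {
  return explicit ?? INVERSE_LABELS.get(label);
}
```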

17. Tag Specificity Weighting

Location: app/scripts/build-data.mjs (lines 349-361)

When computing related entities from shared tags, rarer tags get more weight:

specificity = 1 / log2(tagCount + 2)

A tag shared by 3 entities is more informative than one shared by 300. This prevents broad tags like "ai-safety" from drowning out specific connections.
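The formula is easy to check numerically. In the sketch below, `tagSpecificity` implements the formula as written; `sharedTagScore` additionally assumes that the tag signal is the sum of specificities over shared tags and that each tag list has no duplicates, neither of which is confirmed by the text above.

```typescript
// specificity = 1 / log2(tagCount + 2)
function tagSpecificity(tagCount: number): number {
  return 1 / Math.log2(tagCount + 2);
}

// Assumed aggregation: sum the specificity of every tag both entities share.
function sharedTagScore(
  tagsA: string[],
  tagsB: string[],
  tagCounts: Map<string, number>
): number {
  const setB = new Set(tagsB);
  let score = 0;
  for (const tag of tagsA) {
    if (setB.has(tag)) score += tagSpecificity(tagCounts.get(tag) ?? 1);
  }
  return score;
}
```

For a tag used by 3 entities the weight is 1 / log2(5), about 0.43; for a tag used by 300 it is 1 / log2(302), about 0.12, so the rare tag counts roughly three and a half times as much.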

Core Systems

Entity Data Pipeline

Flow: YAML sources → entity-transform.mjs → build-data.mjs → JSON artifacts → React components

Component	Location	Purpose
KB YAML	`packages/kb/data/things/*.yaml`	Authoritative structured facts (valuations, revenue, etc.)
Source YAML	`data/entities/*.yaml`	Human-editable entity definitions
Entity transform	`app/scripts/lib/entity-transform.mjs`	Type mapping and normalization
Build script	`app/scripts/build-data.mjs`	Main compilation pipeline
Generated JSON	`app/src/data/database.json`	Browser-ready merged data
Data layer	`app/src/data/index.ts`	Runtime access with Zod validation
Components	`app/src/components/wiki/`	Display entity data

Key files generated:

database.json — All entities, pages, relations, facts, search data, statistics (includes ID registry)

Wiki-Server (PostgreSQL)

Purpose: Durable storage for citation content, audit results, claims, facts, and other structured data. Provides full-text search and typed API access.

Location: Remote PostgreSQL database accessed via the wiki-server's Hono RPC API.

Table	Purpose
`citation_content`	Full text of fetched source URLs
`citation_audits`	Per-page citation verification results
`claims`	Extracted atomic claims with source references
`resources`	External resource metadata
`entities`	Entity metadata (synced from YAML)
`agent_sessions`	Claude Code session logs

CLI tools access the database through apiRequest() in crux/lib/wiki-server/. The frontend uses typed RPC clients with InferResponseType<> for compile-time type safety.

See: Content Database for the full storage architecture.

Page Creation Pipeline

Purpose: Generate wiki pages with proper citations using AI research and synthesis.

Pipeline phases:

canonical-links → research-perplexity → register-sources → fetch-sources
    → research-scry → synthesize → verify-sources → validate-loop → grade

flowchart LR
  subgraph Research["Research Phase"]
      CL[Canonical Links]
      PP[Perplexity Search]
      RS[Register Sources]
      FS[Fetch Sources]
      SC[SCRY Search]
  end

  subgraph Synthesis["Synthesis Phase"]
      SY[Claude Synthesis]
      VS[Verify Sources]
  end

  subgraph Validation["Validation Phase"]
      VL[Validate Loop]
      VF[Full Validation]
      GR[Grade]
  end

  CL --> PP --> RS --> FS --> SC --> SY --> VS --> VL --> VF --> GR

  FS -.->|"Firecrawl API"| CACHE[(Source Cache)]
  CACHE -.->|"Quote verification"| VS

Key design decisions:

Decision	Rationale
Perplexity for research	Cheap (≈$0.10), good at web search, provides citation URLs
Register + fetch sources	Enables quote verification against actual source content
Verify-sources phase	Catches hallucinated quotes before publication
Validation loop	Iterative fixing ensures build-passing output

Cost tiers: budget ($2-3), standard ($4-6), premium ($8-12) for create; polish ($2-3), standard ($5-8), deep ($10-15) for improve.

See: Page Creator Pipeline for experiment results.

Crux CLI

Purpose: Unified CLI for all wiki tooling.

Architecture: Domain-based command dispatch with 12+ domains:

pnpm crux validate          # Validation suite
pnpm crux content create    # AI page creation
pnpm crux content improve   # AI page improvement
pnpm crux fix escaping      # Auto-fix MDX issues
pnpm crux analyze           # Content analysis
pnpm crux edit-log view     # Per-page edit history

Each domain is a module with a commands export. Commands are async functions returning {output, exitCode}. A --ci flag switches output to JSON for CI integration.

See: crux/README.md for the full domain reference.

Validation System

Purpose: Enforce content quality and consistency at multiple levels.

Architecture: Single-pass validation engine runs composable rules. Each rule checks specific patterns and can optionally auto-fix issues.

Category	Examples	Blocking?
Critical	`dollar-signs`, `entitylink-ids`, `fake-urls`	Yes - breaks build
Quality	`tilde-dollar`, `markdown-lists`, `placeholders`	No - warnings only

Design decision: Two-tier validation allows fast feedback while still catching serious issues. Critical rules run in CI; quality rules are advisory.

Data Flow Diagrams

Page Creation Data Flow

sequenceDiagram
  participant User
  participant PageCreator
  participant Perplexity
  participant Firecrawl
  participant WikiServer
  participant Claude

  User->>PageCreator: Create page "Topic"
  PageCreator->>Perplexity: Research queries
  Perplexity-->>PageCreator: Content + citation URLs
  PageCreator->>WikiServer: Register source URLs
  PageCreator->>Firecrawl: Fetch page content
  Firecrawl-->>WikiServer: Store fetched content
  PageCreator->>Claude: Synthesize with research
  Claude-->>PageCreator: Draft MDX
  PageCreator->>WikiServer: Load fetched content
  PageCreator->>PageCreator: Verify quotes against sources
  PageCreator->>Claude: Validation loop
  Claude-->>PageCreator: Fixed MDX
  PageCreator-->>User: final.mdx

Entity Resolution Flow

sequenceDiagram
  participant MDX as MDX Page
  participant Component as EntityLink
  participant Registry as pathRegistry.json
  participant Database as database.json

  MDX->>Component: <EntityLink id="E521" name="coefficient-giving">
  Component->>Registry: Lookup path for ID
  Registry-->>Component: /knowledge-base/organizations/funders/open-philanthropy
  Component->>Database: Get entity metadata
  Database-->>Component: {title, type, ...}
  Component-->>MDX: Rendered link with icon

Design Principles

1. Source Files as Single Source of Truth

Human-editable files (YAML, MDX) are the canonical source. Everything else — JSON, search indexes, the ID registry — is a derived build artifact that can be regenerated. Generated files are gitignored where appropriate. This means: no merge conflicts on generated data, clear ownership boundaries, and deterministic builds from source.

2. Build-Time Computation, Runtime Speed

Expensive operations (relationship graph computation, redundancy detection, fact evaluation, search index building, entity transformation) all happen at build time. Runtime reads pre-computed JSON through lazy-loaded indexes. The result: a fast site with rich computed data, without runtime computation costs.

3. Progressive Enhancement for AI Features

AI features (summaries, page creation, grading) are optional enhancements. The wiki builds and serves without any API keys. Failures in the AI pipeline don't break the site. Costs are predictable and opt-in per-tier.

4. Validation at Multiple Levels

Level	Tool	When	Blocking?
Syntax	MDX compiler	Build time	Yes
Schema	Zod validation	Build time (with fallback)	Soft
Content rules	Validation engine	CI	4 rules blocking
References	EntityLink validator	CI	Advisory
Quality	Grading pipeline	Manual trigger	No

5. Forward-Compatible by Default

Unknown entity types pass through as GenericEntity (preserving all custom fields). Backward-compat aliases handle gradual migrations. The Zod schema validation logs warnings in dev but doesn't fail builds for unrecognized types. New features can be added to the data layer without updating every consumer.

Key Configuration Files

File	Purpose	When to Edit
`app/next.config.ts`	Next.js + MDX configuration	Adding plugins, redirects
`app/src/data/entity-schemas.ts`	Entity type definitions (Zod)	Adding entity types or fields
`app/src/data/entity-ontology.ts`	Display metadata (icons, colors)	Adding entity display styles
`app/src/data/entity-type-names.ts`	Canonical entity type list	Adding new entity types
`app/src/lib/internal-nav.ts`	Internal sidebar navigation	Adding internal pages
`app/scripts/build-data.mjs`	Main build pipeline	Changing data flow
`crux/lib/validation-engine.ts`	Validation rules framework	Adding validation rules

Environment Variables

Variable	Purpose	Required For
`ANTHROPIC_BILLING_KEY`	Claude API access	Summaries, grading, page creation
`OPENROUTER_API_KEY`	Perplexity via OpenRouter	Page creation research
`FIRECRAWL_KEY`	Web page fetching	Source content fetching
`SCRY_API_KEY`	Academic paper search	Deep research tier

All are optional. Features gracefully degrade when keys are missing.

Repository Structure

longterm-wiki/
├── content/docs/               # ~700 MDX wiki pages
│   ├── knowledge-base/         # Risks, responses, orgs, people
│   ├── models/                 # Analytical frameworks
│   ├── project/                # Public project documentation
│   └── internal/               # Contributor docs (including this page)
├── packages/kb/                # Knowledge Base package
│   ├── data/things/            # Authoritative structured facts (KB YAML)
│   ├── data/schemas/           # Property schemas (60 properties)
│   └── src/                    # KB loader, custom YAML tags (!ref, !date)
├── data/                       # YAML source data
│   ├── entities/               # Entity definitions (split by type)
│   ├── facts/                  # Legacy facts (deprecated for KB entities)
│   ├── resources/              # External resource metadata
│   ├── insights/               # Cross-page insights
│   ├── graphs/                 # Cause-effect graph YAML
│   └── edit-logs/              # Per-page edit history
├── app/                        # Next.js 15 frontend
│   ├── src/
│   │   ├── app/                # App Router pages
│   │   ├── components/         # React components (wiki/, ui/)
│   │   ├── data/               # Data layer + Zod schemas
│   │   └── lib/                # Utilities, search, navigation
│   └── scripts/                # Build scripts + libraries
│       ├── build-data.mjs      # Main data compilation pipeline
│       └── lib/                # Build utilities (transform, metrics, search, etc.)
├── crux/                       # Crux CLI + validation
│   ├── crux.mjs                # CLI entry point
│   ├── commands/               # Domain handlers
│   ├── authoring/              # Page create/improve/grade
│   ├── lib/                    # Validation engine, templates, utilities
│   └── validate/               # Validation rule implementations
└── package.json                # pnpm workspace root

Documentation Maintenance

This architecture documentation should be updated when:

New pipeline phases added — Update the pipeline diagram and phase list
New clever patterns introduced — Add to the "Clever Ideas" section
Database schema changes — Update the ER diagram
New environment variables — Add to the environment variables table
Tech stack changes — Update the stack table and diagrams

Related Documentation

About This Wiki — Contributor overview
Content Database — Storage architecture (PostgreSQL, caching, YAML)
Automation Tools — CLI reference
Page Creator Pipeline — Generation experiments
Schema Overview — Entity types and data relationships
Entity Reference — Complete entity type catalog
Data System Authority Rules — Which data system is authoritative for each entity
Canonical Facts & Calc — KB fact components and usage conventions
