System Architecture
This document provides a technical overview of how the Longterm Wiki is built, the novel design patterns it uses, and the rationale behind key architectural decisions. It's intended both as a reference for contributors and as a catalog of the ideas that make the system work.
When making significant changes to pipelines or data flow, update the relevant sections here. See Documentation Maintenance for guidelines.
High-Level Architecture
The wiki is a Next.js 15 application with a YAML-first data layer, a CLI toolchain (Crux), and an AI-assisted content pipeline.
Tech Stack
| Layer | Technology |
|---|---|
| Framework | Next.js 15 with App Router |
| Components | React 19 + next-mdx-remote |
| Styling | Tailwind CSS v4 + shadcn/ui |
| Type Safety | TypeScript + Zod schemas |
| Graphs | ReactFlow (XYFlow) + Dagre/ELK layout |
| Diagrams | Mermaid 11 |
| Search | PostgreSQL full-text search (wiki-server) |
| CLI | Crux (custom multi-domain CLI) |
| Data | KB YAML + YAML sources → JSON build artifacts |
| Workspace | pnpm workspaces (app + crux) |
Clever Ideas
This section catalogs the novel architectural patterns — the ideas that distinguish this system from a typical documentation site.
1. Multi-Signal Relationship Graph
Location: app/scripts/build-data.mjs (lines 230-424)
Instead of manually curating "related pages" links, the system computes a weighted relationship graph by combining five different signals:
| Signal | Weight | Source |
|---|---|---|
| Explicit YAML relatedEntries | 10 | Human-authored |
| Name/prefix matching (e.g. "anthropic" ↔ "anthropic-ipo") | 6 | Structural |
| Content <EntityLink> references | 5 | Content-derived |
| N-gram content similarity | 0–3 (scaled) | Computed |
| Shared tags (specificity-weighted) | varies | Computed |
Each neighbor's score gets a quality boost based on the target page's quality and importance ratings: boost = 1 + quality/40 + importance/400 (max ~1.45x). Unrated pages default to average values so they aren't penalized.
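The boost formula above can be sketched as follows. The default values used for unrated pages and the function names are illustrative assumptions; the text only says that unrated pages fall back to average values.

```typescript
// Hypothetical defaults for unrated pages; the real averages are not stated here.
const DEFAULT_QUALITY = 5;
const DEFAULT_IMPORTANCE = 50;

interface PageRatings {
  quality?: number;
  importance?: number;
}

// boost = 1 + quality/40 + importance/400
function qualityBoost(target: PageRatings): number {
  const quality = target.quality ?? DEFAULT_QUALITY;
  const importance = target.importance ?? DEFAULT_IMPORTANCE;
  return 1 + quality / 40 + importance / 400;
}

// A neighbor's final score is its raw signal score scaled by the target's boost.
function boostedScore(rawScore: number, target: PageRatings): number {
  return rawScore * qualityBoost(target);
}
```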
The output uses type-diverse selection: at least 2 entries from each entity type are guaranteed before filling remaining slots by score. This prevents the "related" sidebar from being dominated by one type.
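One way to implement type-diverse selection is a two-pass pick: reserve slots per type first, then fill by score. This is a minimal sketch; `Candidate`, `selectDiverse`, and the `limit` parameter are hypothetical names, not taken from build-data.mjs.

```typescript
interface Candidate {
  id: string;
  type: string;
  score: number;
}

// Pick up to `limit` entries: first reserve the top `minPerType` entries of
// every entity type, then fill the remaining slots purely by score.
function selectDiverse(candidates: Candidate[], limit: number, minPerType = 2): Candidate[] {
  const sorted = candidates.slice().sort((a, b) => b.score - a.score);
  const picked: Candidate[] = [];
  const pickedIds = new Set<string>();
  const perType = new Map<string, number>();

  // Pass 1: guarantee representation for each type.
  for (const c of sorted) {
    const count = perType.get(c.type) ?? 0;
    if (count < minPerType && picked.length < limit) {
      picked.push(c);
      pickedIds.add(c.id);
      perType.set(c.type, count + 1);
    }
  }
  // Pass 2: fill what is left by score alone.
  for (const c of sorted) {
    if (picked.length >= limit) break;
    if (!pickedIds.has(c.id)) {
      picked.push(c);
      pickedIds.add(c.id);
    }
  }
  return picked.sort((a, b) => b.score - a.score);
}
```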
Directional relationship labels ("mitigates", "caused by") are preserved through an inverse-label mapping table, so both directions of a relationship get meaningful labels.
2. Stable Numeric ID System
Location: app/scripts/build-data.mjs (lines 704-776)
Every entity gets a stable numeric ID (E1, E42, E552) that enables canonical URLs surviving slug renames.
The key insight: IDs are allocated atomically by the wiki server and written back to source files. This means:
- YAML entities and MDX frontmatter are the single source of truth
- New entities get auto-assigned IDs from the server on their first build
- The server's PostgreSQL database prevents race conditions and ID reassignment
- ID conflicts are detected at build time and fail the build
Resolution at runtime supports both numeric IDs and slugs, so <EntityLink id="E521" name="coefficient-giving"> and path-based lookups both work.
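Dual resolution can be sketched with two lookup maps. The `EntityRef` shape and `buildResolver` are assumptions made for illustration; the real data layer may structure this differently.

```typescript
interface EntityRef {
  numericId: string; // e.g. "E521"
  slug: string; // e.g. "coefficient-giving"
}

// Returns a lookup function that accepts either a numeric ID or a slug.
function buildResolver(entities: EntityRef[]): (idOrSlug: string) => EntityRef | undefined {
  const byNumericId = new Map<string, EntityRef>();
  const bySlug = new Map<string, EntityRef>();
  for (const e of entities) {
    byNumericId.set(e.numericId, e);
    bySlug.set(e.slug, e);
  }
  // Anything shaped like "E<digits>" is treated as a numeric ID.
  return (idOrSlug) =>
    /^E\d+$/.test(idOrSlug) ? byNumericId.get(idOrSlug) : bySlug.get(idOrSlug);
}
```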
3. Content-Derived Backlinks
Location: app/scripts/build-data.mjs (lines 190-225, 867-888)
The system merges two backlink sources intelligently:
- Explicit YAML relatedEntries (semantic, directional)
- MDX content scans for EntityLink components (implicit, extracted by regex)
Content scanning happens before raw MDX is stripped from the build output, creating an inbound-link map that's deduplicated by ID. This means every entity knows both who links to it and who it links to, without manual maintenance.
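The merge can be illustrated roughly as below. The regular expression is a simplified stand-in for whatever pattern the build script uses, and the page shape is assumed.

```typescript
// Simplified pattern: captures the id attribute of <EntityLink ... id="...">.
const ENTITY_LINK_RE = /<EntityLink\s+[^>]*\bid="([^"]+)"/g;

function extractLinkedIds(mdx: string): string[] {
  const ids = new Set<string>();
  let match: RegExpExecArray | null;
  ENTITY_LINK_RE.lastIndex = 0; // reset state on the shared global regex
  while ((match = ENTITY_LINK_RE.exec(mdx)) !== null) {
    ids.add(match[1]);
  }
  return Array.from(ids);
}

// Merge explicit relatedEntries and content links into an inbound-link map,
// deduplicated by ID (target ID -> IDs of the pages that link to it).
function buildBacklinks(
  pages: { id: string; mdx: string; relatedEntries?: string[] }[]
): Map<string, string[]> {
  const inbound = new Map<string, Set<string>>();
  for (const page of pages) {
    const targets = (page.relatedEntries ?? []).concat(extractLinkedIds(page.mdx));
    for (const target of targets) {
      if (target === page.id) continue; // ignore self-links
      if (!inbound.has(target)) inbound.set(target, new Set<string>());
      inbound.get(target)!.add(page.id);
    }
  }
  const result = new Map<string, string[]>();
  inbound.forEach((sources, target) => result.set(target, Array.from(sources)));
  return result;
}
```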
4. N-gram Redundancy Detection
Location: app/scripts/lib/redundancy.mjs
Pages are compared using 5-word n-gram shingling (Jaccard similarity) combined with word overlap. The system:
- Extracts clean text (strips code blocks, JSX, tables, headers, markdown formatting)
- Compares only within the same contentFormat (articles vs. tables vs. diagrams) to avoid false positives
- Uses max(shingleSimilarity, wordSimilarity * 0.8) as combined metric
- Stores top 5 similar pages per page at a 10% threshold
This feeds into the relationship graph (signal #4) and helps editors find pages that overlap.
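A minimal sketch of the combined metric follows. It assumes that "word overlap" means Jaccard similarity over unique words, which the text does not specify, and it skips the clean-text extraction step.

```typescript
// Build the set of n-word shingles for a text (n = 1 gives the set of unique words).
function shingles(text: string, n = 5): Set<string> {
  const words = text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
  const out = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(words.slice(i, i + n).join(" "));
  }
  return out;
}

// Jaccard similarity: |A intersect B| / |A union B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// max(shingleSimilarity, wordSimilarity * 0.8), as described above.
function combinedSimilarity(textA: string, textB: string): number {
  const shingleSimilarity = jaccard(shingles(textA, 5), shingles(textB, 5));
  const wordSimilarity = jaccard(shingles(textA, 1), shingles(textB, 1));
  return Math.max(shingleSimilarity, wordSimilarity * 0.8);
}
```

Under these assumptions, two pages would count as similar when `combinedSimilarity` reaches 0.1, matching the 10% threshold.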
5. Safe Expression Evaluator for Computed Facts
Location: app/scripts/lib/computed-facts.mjs
Facts can reference other facts in expressions like {openai.revenue-2024} * {growth-rate}. Instead of using eval(), the system uses a hand-written recursive descent parser that supports:
- Human-readable numeric parsing: "$350 billion" → 350_000_000_000, "40%" → 0.4
- Arithmetic: +, -, *, /, parentheses
- {entity.factId} references resolved in topological order
- Format strings for display (currency prefixes, unit scaling)
Dependencies are resolved topologically, so a fact referencing another computed fact works correctly. Non-computable facts (qualitative values) are flagged with noCompute and skipped.
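To make the approach concrete, here is a small recursive descent evaluator in the same spirit. It is a sketch, not the code in computed-facts.mjs: `parseHumanNumber`, `evaluateFact`, and the `FactTable` shape are hypothetical, and dependencies are resolved by depth-first recursion with memoization, which visits facts in a valid topological order.

```typescript
type FactTable = Record<string, string>; // fact key -> literal or expression

const MULTIPLIERS: Record<string, number> = {
  thousand: 1e3,
  million: 1e6,
  billion: 1e9,
  trillion: 1e12,
};

// Parses literals such as "$350 billion" or "40%"; returns null otherwise.
function parseHumanNumber(text: string): number | null {
  const match = /^\$?(\d+(?:\.\d+)?)\s*(thousand|million|billion|trillion|%)?$/.exec(text.trim());
  if (!match) return null;
  const value = parseFloat(match[1]);
  const suffix: string | undefined = match[2];
  if (suffix === undefined) return value;
  return suffix === "%" ? value / 100 : value * MULTIPLIERS[suffix];
}

// Evaluates one fact, resolving {references} recursively without eval().
function evaluateFact(
  key: string,
  facts: FactTable,
  cache: Map<string, number> = new Map(),
  visiting: Set<string> = new Set()
): number {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  if (visiting.has(key)) throw new Error(`Circular reference: ${key}`);
  if (!Object.prototype.hasOwnProperty.call(facts, key)) throw new Error(`Unknown fact: ${key}`);
  const source: string = facts[key];
  visiting.add(key);

  let pos = 0;
  const skipSpaces = (): void => {
    while (source.charAt(pos) === " ") pos++;
  };

  // expression := term (('+' | '-') term)*
  function parseExpression(): number {
    let value = parseTerm();
    for (;;) {
      skipSpaces();
      const op = source.charAt(pos);
      if (op !== "+" && op !== "-") return value;
      pos++;
      const right = parseTerm();
      value = op === "+" ? value + right : value - right;
    }
  }

  // term := factor (('*' | '/') factor)*
  function parseTerm(): number {
    let value = parseFactor();
    for (;;) {
      skipSpaces();
      const op = source.charAt(pos);
      if (op !== "*" && op !== "/") return value;
      pos++;
      const right = parseFactor();
      value = op === "*" ? value * right : value / right;
    }
  }

  // factor := number | '{' reference '}' | '(' expression ')' | '-' factor
  function parseFactor(): number {
    skipSpaces();
    const ch = source.charAt(pos);
    if (ch === "(") {
      pos++;
      const value = parseExpression();
      skipSpaces();
      if (source.charAt(pos) !== ")") throw new Error(`Expected ')' in "${key}"`);
      pos++;
      return value;
    }
    if (ch === "{") {
      const end = source.indexOf("}", pos);
      if (end === -1) throw new Error(`Expected '}' in "${key}"`);
      const ref = source.slice(pos + 1, end);
      pos = end + 1;
      return evaluateFact(ref, facts, cache, visiting);
    }
    if (ch === "-") {
      pos++;
      return -parseFactor();
    }
    const match = /^\d+(\.\d+)?/.exec(source.slice(pos));
    if (!match) throw new Error(`Unexpected input at position ${pos} in "${key}"`);
    pos += match[0].length;
    return parseFloat(match[0]);
  }

  let result = parseHumanNumber(source);
  if (result === null) {
    result = parseExpression();
    skipSpaces();
    if (pos !== source.length) throw new Error(`Unexpected input at position ${pos} in "${key}"`);
  }
  visiting.delete(key);
  cache.set(key, result);
  return result;
}
```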
6. Build-Time Entity Transformation
Location: app/scripts/lib/entity-transform.mjs
Raw YAML entities are transformed into strictly typed entities at build time via a pure transformEntity() function. This handles:
- Type migration: Old names (lab-frontier, researcher) map to canonical types (organization, person)
- Subtype extraction: lab-frontier → organization + orgType: frontier-lab
- CustomField extraction: Generic key-value customFields are promoted to typed fields (Role, Affiliation, Founded)
- Risk categorization: Risk entities are auto-categorized (epistemic, misuse, structural) via a mapping table
By doing this at build time, the runtime never encounters legacy type names. Unknown entity types pass through unchanged — the system is forward-compatible.
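A pure transform along these lines might look like the sketch below. Only the mappings named in the text are included; the `RawEntity` and `CustomField` shapes are assumptions, and risk categorization is omitted.

```typescript
interface CustomField {
  label: string;
  value: string;
}

interface RawEntity {
  id: string;
  type: string;
  customFields?: CustomField[];
}

interface TransformedEntity {
  id: string;
  type: string;
  orgType?: string;
  role?: string;
  affiliation?: string;
  founded?: string;
  customFields: CustomField[];
}

// Legacy type name -> canonical type, plus an optional extracted subtype.
const LEGACY_TYPES = new Map<string, { type: string; orgType?: string }>([
  ["lab-frontier", { type: "organization", orgType: "frontier-lab" }],
  ["researcher", { type: "person" }],
]);

// Custom field labels that get promoted to typed fields.
const PROMOTED = new Map<string, "role" | "affiliation" | "founded">([
  ["Role", "role"],
  ["Affiliation", "affiliation"],
  ["Founded", "founded"],
]);

// Pure function: builds a new object and never mutates `raw`.
function transformEntity(raw: RawEntity): TransformedEntity {
  const legacy = LEGACY_TYPES.get(raw.type);
  const result: TransformedEntity = {
    id: raw.id,
    type: legacy ? legacy.type : raw.type, // unknown types pass through unchanged
    customFields: [],
  };
  if (legacy && legacy.orgType) result.orgType = legacy.orgType;
  for (const field of raw.customFields ?? []) {
    const promoted = PROMOTED.get(field.label);
    if (promoted) result[promoted] = field.value;
    else result.customFields.push(field);
  }
  return result;
}
```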
7. Format-Aware Quality Metrics
Location: crux/lib/metrics-extractor.ts
Content quality is measured structurally, but the scoring adapts to the content format:
- Articles are scored on word count, section structure, presence of overview/conclusion
- Tables aren't penalized for low word count
- Diagrams don't need prose length or section counts
Metrics include raw counts (words, tables, diagrams, internal links, footnotes), ratios (bullet density), boolean checks (has overview?), and a composite structural score (0-15 raw, normalized to 0-50). A suggestQuality function proposes quality ratings based on structural scores, and getQualityDiscrepancy flags pages where the LLM-assigned quality disagrees with structural evidence.
8. Single-Pass Validation Engine
Location: crux/lib/validation-engine.ts
Instead of having 20+ separate validator scripts that each re-read all 625 files, the validation engine loads all content once and runs composable rules against it:
engine.load() → read all files once → run each Rule.check() → collect Issues
Each Rule has:
- check(): returns issues (pure function)
- Optional fix(): returns corrected content (declarative FixSpec with oldText/newText)
- scope: 'file' (runs per file) or 'global' (runs once on all files)
Fixes are applied in bulk and logged to edit-logs. Four rules are CI-blocking (comparison-operators, dollar-signs, frontmatter-schema, numeric-id-integrity); the rest are advisory.
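The rule contract can be sketched as follows. The interface shapes are inferred from the description above rather than copied from validation-engine.ts, and `noDoubleSpace` is a made-up rule used only for illustration.

```typescript
interface ContentFile {
  path: string;
  content: string;
}

interface Issue {
  ruleId: string;
  path: string;
  message: string;
}

interface FixSpec {
  oldText: string;
  newText: string;
}

interface Rule {
  id: string;
  scope: "file" | "global";
  // Pure: inspects the files it is given and returns issues.
  check(files: ContentFile[]): Issue[];
  // Optional declarative fixes for a single file.
  fix?(file: ContentFile): FixSpec[];
}

// Files are loaded once by the caller; every rule runs against the same in-memory set.
function runRules(files: ContentFile[], rules: Rule[]): Issue[] {
  const issues: Issue[] = [];
  for (const rule of rules) {
    if (rule.scope === "global") {
      issues.push(...rule.check(files));
    } else {
      for (const file of files) issues.push(...rule.check([file]));
    }
  }
  return issues;
}

// Applies each FixSpec by replacing every occurrence of oldText with newText.
function applyFixes(file: ContentFile, specs: FixSpec[]): ContentFile {
  let content = file.content;
  for (const spec of specs) content = content.split(spec.oldText).join(spec.newText);
  return { path: file.path, content };
}

// Hypothetical rule, not one of the real rule names.
const noDoubleSpace: Rule = {
  id: "no-double-space",
  scope: "file",
  check: (files) =>
    files
      .filter((f) => f.content.indexOf("  ") !== -1)
      .map((f) => ({ ruleId: "no-double-space", path: f.path, message: "double space" })),
  fix: () => [{ oldText: "  ", newText: " " }],
};
```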
9. YAML-First MDX Generation
Location: app/scripts/lib/mdx-generator.mjs
For entities whose content is defined entirely in YAML, minimal MDX stub files are auto-generated. The guard condition for regeneration is carefully conservative:
shouldGenerateMdx = !fileExists OR (isAutoGenerated AND no ## headings AND < 20 lines)
This means: generate if missing, regenerate if it's a short auto-generated stub, but never overwrite custom content (detected by presence of ## headings or significant length). This enables a YAML-first workflow where data authors edit YAML and MDX files appear automatically.
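Expressed as a function, the guard might look like this. How the real generator detects `isAutoGenerated` is not described here, so the sketch takes it as an input; `ExistingMdx` is a hypothetical shape.

```typescript
interface ExistingMdx {
  content: string;
  isAutoGenerated: boolean;
}

// Pass null when the MDX file does not exist yet.
function shouldGenerateMdx(existing: ExistingMdx | null): boolean {
  if (existing === null) return true; // missing file: always generate
  const lines = existing.content.split("\n");
  const hasH2 = lines.some((line) => /^##\s/.test(line));
  // Regenerate only short auto-generated stubs without custom sections.
  return existing.isAutoGenerated && !hasH2 && lines.length < 20;
}
```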
10. Lazy-Loaded Index System
Location: app/src/data/index.ts
The database is loaded once at server startup, but indexes are built lazily on first access:
let _index: Map<string, Entity> | null = null;
function getIndex() {
  // Build the index on first access, then reuse it.
  if (!_index) _index = new Map(getEntities().map(e => [e.id, e]));
  return _index;
}
This avoids building indexes for entity types that are never queried in a given request. Combined with Zod validation at load time (with graceful fallback to GenericEntity for unknown types), it balances strictness with forward-compatibility.
11. Server-Side Search
Location: app/src/lib/search.ts, app/src/app/api/search/route.ts
Search uses PostgreSQL full-text search via the wiki-server, proxied through the /api/search Next.js route. The client sends queries to this proxy, which forwards them to the wiki-server's search endpoint. This replaced an earlier MiniSearch client-side fallback, simplifying the search stack and reducing client bundle size.
12. Entity Ontology with Display Metadata
Location: app/src/data/entity-ontology.ts
A single file defines the canonical ontology for 30+ entity types, each with:
- Lucide icon component
- iconColor (Tailwind classes, light + dark variants)
- badgeColor for explore-page filtering
- headerColor for InfoBox headers
Organization subtypes (frontier-lab, safety-org, startup, academic) get their own display metadata via a separate ORG_TYPE_DISPLAY map. Backward-compat aliases (researcher → person, lab-* → organization) allow gradual migrations without breaking existing data.
13. Per-Page Edit Logs
Location: crux/lib/edit-log.ts, data/edit-logs/
Each page has a separate YAML file (data/edit-logs/<page-id>.yaml) tracking who changed it, when, and how:
- date: "2026-02-13"
  tool: crux-improve
  agency: ai-directed
  tier: standard
  note: "Added citations and restructured overview"
By storing edit history outside of page frontmatter, the system separates editorial metadata from content. LLM-generated content can't accidentally corrupt the edit log. The bulk-fix system logs one entry per fixed file automatically.
14. Session Log → Change History Integration
Location: app/scripts/build-data.mjs (lines 45-100)
Claude Code session logs (.claude/sessions/*.md) are parsed at build time and attached to pages in database.json. The structured format:
## 2026-02-13 | branch-name | Short title
**What was done:** Summary text.
**Pages:** page-id-1, page-id-2
...enables the system to show "what changed and why" for any page, correlated with git branches and PRs, without modifying the content files themselves.
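Parsing this format needs only a few line-level patterns. The sketch below assumes each field sits on its own line as in the example; `SessionEntry` and `parseSessionLog` are hypothetical names.

```typescript
interface SessionEntry {
  date: string;
  branch: string;
  title: string;
  summary: string;
  pages: string[];
}

function parseSessionLog(markdown: string): SessionEntry[] {
  const entries: SessionEntry[] = [];
  let current: SessionEntry | null = null;
  for (const line of markdown.split("\n")) {
    // Header line: "## <date> | <branch> | <title>"
    const header = /^##\s+(\d{4}-\d{2}-\d{2})\s*\|\s*(.+?)\s*\|\s*(.+?)\s*$/.exec(line);
    if (header) {
      current = { date: header[1], branch: header[2], title: header[3], summary: "", pages: [] };
      entries.push(current);
      continue;
    }
    if (current === null) continue; // ignore text before the first header
    const summary = /^\*\*What was done:\*\*\s*(.*)$/.exec(line);
    if (summary) current.summary = summary[1].trim();
    const pages = /^\*\*Pages:\*\*\s*(.*)$/.exec(line);
    if (pages) {
      current.pages = pages[1]
        .split(",")
        .map((p) => p.trim())
        .filter((p) => p.length > 0);
    }
  }
  return entries;
}
```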
15. Frontmatter Entity Auto-Creation
Location: app/scripts/lib/frontmatter-scanner.mjs
Pages don't need a corresponding YAML entity file. The build script auto-creates entities from MDX frontmatter for any page that doesn't have one:
YAML entities (explicit) + frontmatter entities (auto-created) = full entity set
YAML entities take precedence. This means a page can start as just an MDX file with frontmatter, and the system treats it as a first-class entity — it gets a numeric ID, appears in search, and can be linked via <EntityLink>.
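The precedence rule amounts to a keyed merge in which YAML entries are inserted first and frontmatter entries only fill gaps. A minimal sketch, with hypothetical type and function names:

```typescript
interface EntitySource {
  id: string;
  title: string;
}

interface MergedEntity extends EntitySource {
  origin: "yaml" | "frontmatter";
}

function mergeEntities(yamlEntities: EntitySource[], frontmatterPages: EntitySource[]): MergedEntity[] {
  const merged = new Map<string, MergedEntity>();
  for (const e of yamlEntities) {
    merged.set(e.id, { id: e.id, title: e.title, origin: "yaml" });
  }
  for (const p of frontmatterPages) {
    // Auto-create only when no YAML entity already claims this ID.
    if (!merged.has(p.id)) {
      merged.set(p.id, { id: p.id, title: p.title, origin: "frontmatter" });
    }
  }
  return Array.from(merged.values());
}
```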
16. Inverse Relationship Labels
Location: app/scripts/build-data.mjs (lines 259-291)
When entity A declares relationship: "mitigates" toward entity B, the system auto-generates the inverse label for the B→A direction using a lookup table:
"mitigates" ↔ "mitigated by"
"causes" ↔ "caused by"
"enables" ↔ "enabled by"
"child-of" ↔ "parent of"
Explicit labels are never overwritten by inferred ones. This gives both sides of a relationship meaningful edge labels without requiring authors to declare both directions.
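The lookup and the no-overwrite rule can be sketched like this. The four label pairs come from the table above; the `Edge` shape and `withInverseEdges` are assumptions.

```typescript
const INVERSE_PAIRS: [string, string][] = [
  ["mitigates", "mitigated by"],
  ["causes", "caused by"],
  ["enables", "enabled by"],
  ["child-of", "parent of"],
];

// Register each pair in both directions.
const INVERSE_LABELS = new Map<string, string>();
for (const [forward, backward] of INVERSE_PAIRS) {
  INVERSE_LABELS.set(forward, backward);
  INVERSE_LABELS.set(backward, forward);
}

interface Edge {
  from: string;
  to: string;
  label: string;
  explicit: boolean;
}

function withInverseEdges(edges: Edge[]): Edge[] {
  const key = (from: string, to: string): string => `${from}->${to}`;
  const byDirection = new Map<string, Edge>();
  for (const e of edges) byDirection.set(key(e.from, e.to), e);
  for (const e of edges) {
    const inverse = INVERSE_LABELS.get(e.label);
    const reverseKey = key(e.to, e.from);
    // Explicit labels are never overwritten by inferred ones.
    if (inverse !== undefined && !byDirection.has(reverseKey)) {
      byDirection.set(reverseKey, { from: e.to, to: e.from, label: inverse, explicit: false });
    }
  }
  return Array.from(byDirection.values());
}
```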
17. Tag Specificity Weighting
Location: app/scripts/build-data.mjs (lines 349-361)
When computing related entities from shared tags, rarer tags get more weight:
specificity = 1 / log2(tagCount + 2)
A tag shared by 3 entities is more informative than one shared by 300. This prevents broad tags like "ai-safety" from drowning out specific connections.
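The weighting is easy to check numerically. In the sketch below, summing specificity over shared tags is an assumption about how the per-tag weights are combined; the formula itself is the one given above.

```typescript
// specificity = 1 / log2(tagCount + 2)
function tagSpecificity(tagCount: number): number {
  return 1 / Math.log2(tagCount + 2);
}

// Score two entities by their shared tags, weighting rare tags more heavily.
// tagCounts maps each tag to the number of entities that carry it.
function sharedTagScore(tagsA: string[], tagsB: string[], tagCounts: Map<string, number>): number {
  const setB = new Set(tagsB);
  let score = 0;
  for (const tag of tagsA) {
    if (setB.has(tag)) score += tagSpecificity(tagCounts.get(tag) ?? 0);
  }
  return score;
}
```

With this formula a tag on 3 entities weighs about 0.43, while a tag on 300 entities weighs about 0.12.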
Core Systems
Entity Data Pipeline
Flow: YAML sources → entity-transform.mjs → build-data.mjs → JSON artifacts → React components
| Component | Location | Purpose |
|---|---|---|
| KB YAML | packages/kb/data/things/*.yaml | Authoritative structured facts (valuations, revenue, etc.) |
| Source YAML | data/entities/*.yaml | Human-editable entity definitions |
| Entity transform | app/scripts/lib/entity-transform.mjs | Type mapping and normalization |
| Build script | app/scripts/build-data.mjs | Main compilation pipeline |
| Generated JSON | app/src/data/database.json | Browser-ready merged data |
| Data layer | app/src/data/index.ts | Runtime access with Zod validation |
| Components | app/src/components/wiki/ | Display entity data |
Key files generated:
- database.json: All entities, pages, relations, facts, search data, statistics (includes ID registry)
Wiki-Server (PostgreSQL)
Purpose: Durable storage for citation content, audit results, claims, facts, and other structured data. Provides full-text search and typed API access.
Location: Remote PostgreSQL database accessed via the wiki-server's Hono RPC API.
| Table | Purpose |
|---|---|
| citation_content | Full text of fetched source URLs |
| citation_audits | Per-page citation verification results |
| claims | Extracted atomic claims with source references |
| resources | External resource metadata |
| entities | Entity metadata (synced from YAML) |
| agent_sessions | Claude Code session logs |
CLI tools access the database through apiRequest() in crux/lib/wiki-server/. The frontend uses typed RPC clients with InferResponseType<> for compile-time type safety.
See: Content Database for the full storage architecture.
Page Creation Pipeline
Purpose: Generate wiki pages with proper citations using AI research and synthesis.
Pipeline phases:
canonical-links → research-perplexity → register-sources → fetch-sources
→ research-scry → synthesize → verify-sources → validate-loop → grade
Key design decisions:
| Decision | Rationale |
|---|---|
| Perplexity for research | Cheap (≈$0.10), good at web search, provides citation URLs |
| Register + fetch sources | Enables quote verification against actual source content |
| Verify-sources phase | Catches hallucinated quotes before publication |
| Validation loop | Iterative fixing ensures build-passing output |
Cost tiers: budget ($2-3), standard ($4-6), premium ($8-12) for create; polish ($2-3), standard ($5-8), deep ($10-15) for improve.
See: Page Creator Pipeline for experiment results.
Crux CLI
Purpose: Unified CLI for all wiki tooling.
Architecture: Domain-based command dispatch with 12+ domains:
pnpm crux validate # Validation suite
pnpm crux content create # AI page creation
pnpm crux content improve # AI page improvement
pnpm crux fix escaping # Auto-fix MDX issues
pnpm crux analyze # Content analysis
pnpm crux edit-log view # Per-page edit history
Each domain is a module with a commands export. Commands are async functions returning {output, exitCode}. A --ci flag switches output to JSON for CI integration.
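Domain-based dispatch can be sketched as a two-level lookup. The command signature, the `domains` registry, and its single entry are hypothetical; only the {output, exitCode} result shape and the --ci flag come from the description above.

```typescript
interface CommandResult {
  output: string;
  exitCode: number;
}

// Hypothetical command signature: positional args in, result out.
type Command = (args: string[]) => Promise<CommandResult>;

interface DomainModule {
  commands: Record<string, Command>;
}

// Hypothetical registry with one domain, for illustration only.
const domains: Record<string, DomainModule> = {
  "edit-log": {
    commands: {
      view: async (args) => ({ output: `history for ${args[0] ?? "(no page)"}`, exitCode: 0 }),
    },
  },
};

// Own-property lookup, so names like "constructor" do not resolve by accident.
function own<T>(table: Record<string, T>, key: string): T | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

async function dispatch(argv: string[]): Promise<CommandResult> {
  const ci = argv.indexOf("--ci") !== -1;
  const [domainName = "", commandName = "", ...rest] = argv.filter((a) => a !== "--ci");
  const domain = own(domains, domainName);
  const command = domain ? own(domain.commands, commandName) : undefined;
  const result: CommandResult = command
    ? await command(rest)
    : { output: `Unknown command: ${domainName} ${commandName}`.trim(), exitCode: 1 };
  // In CI mode the whole result is serialized as JSON.
  return ci ? { output: JSON.stringify(result), exitCode: result.exitCode } : result;
}
```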
See: crux/README.md for the full domain reference.
Validation System
Purpose: Enforce content quality and consistency at multiple levels.
Architecture: Single-pass validation engine runs composable rules. Each rule checks specific patterns and can optionally auto-fix issues.
| Category | Examples | Blocking? |
|---|---|---|
| Critical | dollar-signs, entitylink-ids, fake-urls | Yes - breaks build |
| Quality | tilde-dollar, markdown-lists, placeholders | No - warnings only |
Design decision: Two-tier validation allows fast feedback while still catching serious issues. Critical rules run in CI; quality rules are advisory.
Data Flow Diagrams
Page Creation Data Flow
Entity Resolution Flow
Design Principles
1. Source Files as Single Source of Truth
Human-editable files (YAML, MDX) are the canonical source. Everything else — JSON, search indexes, the ID registry — is a derived build artifact that can be regenerated. Generated files are gitignored where appropriate. This means: no merge conflicts on generated data, clear ownership boundaries, and deterministic builds from source.
2. Build-Time Computation, Runtime Speed
Expensive operations (relationship graph computation, redundancy detection, fact evaluation, search index building, entity transformation) all happen at build time. Runtime reads pre-computed JSON through lazy-loaded indexes. The result: a fast site with rich computed data, without runtime computation costs.
3. Progressive Enhancement for AI Features
AI features (summaries, page creation, grading) are optional enhancements. The wiki builds and serves without any API keys. Failures in the AI pipeline don't break the site. Costs are predictable and opt-in per-tier.
4. Validation at Multiple Levels
| Level | Tool | When | Blocking? |
|---|---|---|---|
| Syntax | MDX compiler | Build time | Yes |
| Schema | Zod validation | Build time (with fallback) | Soft |
| Content rules | Validation engine | CI | 4 rules blocking |
| References | EntityLink validator | CI | Advisory |
| Quality | Grading pipeline | Manual trigger | No |
5. Forward-Compatible by Default
Unknown entity types pass through as GenericEntity (preserving all custom fields). Backward-compat aliases handle gradual migrations. The Zod schema validation logs warnings in dev but doesn't fail builds for unrecognized types. New features can be added to the data layer without updating every consumer.
Key Configuration Files
| File | Purpose | When to Edit |
|---|---|---|
| app/next.config.ts | Next.js + MDX configuration | Adding plugins, redirects |
| app/src/data/entity-schemas.ts | Entity type definitions (Zod) | Adding entity types or fields |
| app/src/data/entity-ontology.ts | Display metadata (icons, colors) | Adding entity display styles |
| app/src/data/entity-type-names.ts | Canonical entity type list | Adding new entity types |
| app/src/lib/internal-nav.ts | Internal sidebar navigation | Adding internal pages |
| app/scripts/build-data.mjs | Main build pipeline | Changing data flow |
| crux/lib/validation-engine.ts | Validation rules framework | Adding validation rules |
Environment Variables
| Variable | Purpose | Required For |
|---|---|---|
| ANTHROPIC_API_KEY | Claude API access | Summaries, grading, page creation |
| OPENROUTER_API_KEY | Perplexity via OpenRouter | Page creation research |
| FIRECRAWL_KEY | Web page fetching | Source content fetching |
| SCRY_API_KEY | Academic paper search | Deep research tier |
All are optional. Features gracefully degrade when keys are missing.
Repository Structure
longterm-wiki/
├── content/docs/ # ~700 MDX wiki pages
│ ├── knowledge-base/ # Risks, responses, orgs, people
│ ├── models/ # Analytical frameworks
│ ├── project/ # Public project documentation
│ └── internal/ # Contributor docs (including this page)
├── packages/kb/ # Knowledge Base package
│ ├── data/things/ # Authoritative structured facts (KB YAML)
│ ├── data/schemas/ # Property schemas (60 properties)
│ └── src/ # KB loader, custom YAML tags (!ref, !date)
├── data/ # YAML source data
│ ├── entities/ # Entity definitions (split by type)
│ ├── facts/ # Legacy facts (deprecated for KB entities)
│ ├── resources/ # External resource metadata
│ ├── insights/ # Cross-page insights
│ ├── graphs/ # Cause-effect graph YAML
│ └── edit-logs/ # Per-page edit history
├── app/ # Next.js 15 frontend
│ ├── src/
│ │ ├── app/ # App Router pages
│ │ ├── components/ # React components (wiki/, ui/)
│ │ ├── data/ # Data layer + Zod schemas
│ │ └── lib/ # Utilities, search, navigation
│ └── scripts/ # Build scripts + libraries
│ ├── build-data.mjs # Main data compilation pipeline
│ └── lib/ # Build utilities (transform, metrics, search, etc.)
├── crux/ # Crux CLI + validation
│ ├── crux.mjs # CLI entry point
│ ├── commands/ # Domain handlers
│ ├── authoring/ # Page create/improve/grade
│ ├── lib/ # Validation engine, templates, utilities
│ └── validate/ # Validation rule implementations
└── package.json # pnpm workspace root
Documentation Maintenance
This architecture documentation should be updated when:
- New pipeline phases added — Update the pipeline diagram and phase list
- New clever patterns introduced — Add to the "Clever Ideas" section
- Database schema changes — Update the ER diagram
- New environment variables — Add to the environment variables table
- Tech stack changes — Update the stack table and diagrams
Related Documentation
- About This Wiki — Contributor overview
- Content Database — Storage architecture (PostgreSQL, caching, YAML)
- Automation Tools — CLI reference
- Page Creator Pipeline — Generation experiments
- Schema Overview — Entity types and data relationships
- Entity Reference — Complete entity type catalog
- Data System Authority Rules — Which data system is authoritative for each entity
- Canonical Facts & Calc — KB fact components and usage conventions