
Data Architecture Overview

This page provides a high-level map of the entire data system powering the Longterm Wiki. The system is organized into three data layers, each with its own schema, storage, CLI tools, and automations. Data flows from external resources through verification into the wiki's core knowledge bases.

For the detailed naming guide and PG table reference, see Data Architecture: Three Bases and Naming Guide. For the overall system architecture (Next.js, Crux CLI, deployment), see System Architecture.


The Three Data Layers

```mermaid
flowchart TB
  subgraph L1["Resources Layer"]
      L1A["External content archive
citation_content,
resource_content_versions"]
      L1B["News collection
25 RSS/web sources"]
  end
  subgraph L2["Source-Check Layer"]
      L2A["Verification pipeline
claim extraction + LLM check"]
      L2B["Evidence + Verdicts
2 PG tables"]
  end
  subgraph L3["Three Bases Layer"]
      L3A["TableBase
entities, PG-primary tables"]
      L3B["FactBase
structured facts"]
      L3C["WikiBase
MDX prose pages"]
  end
  L1 -->|"cached content"| L2
  L2 -->|"verdicts inform"| L3
  L1 -->|"news triggers
page improvements"| L3
  L3 -.->|"claims to verify"| L2
  L2 -.->|"verdict confidence
filters news priority"| L1
```

Each layer is a self-contained data system with its own schema, storage tables, CLI commands, and health checks. The layers feed into each other: the Resources layer collects and archives external content, the Source-Check layer verifies claims against that content, and the Three Bases layer holds the wiki's own knowledge — informed by verification results.


Resources Layer — External Content

The Resources layer is the wiki's interface with the outside world. It collects, archives, and monitors external content — everything from RSS news items to cached web pages to tracked website snapshots. Other layers read from this archive instead of re-fetching live URLs.

What It Stores

| Table | What It Stores | How Content Arrives |
|---|---|---|
| `resources` | Metadata for external papers, blog posts, reports | PG-native; populated via wiki-server API and build-data sync |
| `resource_tabular_sources` | Configured data sources (grant databases, career pages, etc.) | Manual configuration per tracked website |
| `citation_content` | Latest HTML/text from cited URLs (hot cache) | Fetched on demand during sourcing and citation verification |
| `resource_content_versions` | Versioned content snapshots with content-hash dedup | Fetched by sourcing pipeline and website monitors |
| `website_source_pages` | Monitored pages from tracked websites | Configured per `resource_tabular_sources` entry |
| `website_source_page_snapshots` | Point-in-time snapshots of monitored pages | Periodic fetches; diffed against previous to detect changes |
| `auto_update_news_items` | Discovered news items from RSS/web search | Auto-update pipeline (daily) |
| `auto_update_runs` | Execution history of auto-update pipeline | Recorded per run |
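
The content-hash dedup in `resource_content_versions` works roughly like this (a minimal sketch; `latestVersion` and `insertVersion` are hypothetical stand-ins for the real storage layer):

```ts
import { createHash } from "node:crypto";

interface ContentVersion {
  resourceId: string;
  contentHash: string;
  content: string;
  fetchedAt: Date;
}

// Hypothetical storage helpers standing in for the real PG layer.
declare function latestVersion(resourceId: string): Promise<ContentVersion | null>;
declare function insertVersion(v: ContentVersion): Promise<void>;

// Store a freshly fetched snapshot only if the content actually changed.
async function storeSnapshot(resourceId: string, content: string): Promise<boolean> {
  const contentHash = createHash("sha256").update(content).digest("hex");
  const latest = await latestVersion(resourceId);
  if (latest?.contentHash === contentHash) return false; // unchanged: no new row
  await insertVersion({ resourceId, contentHash, content, fetchedAt: new Date() });
  return true; // new version recorded
}
```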

News Collection (Auto-Update)

The auto-update system is the primary way new content enters the Resources layer. It scrapes 25 configured sources and routes news to wiki pages.

```mermaid
flowchart TB
  subgraph Sources["25 Configured Sources"]
      RSS["RSS/Atom Feeds"]
      WEB["Web Searches"]
  end
  subgraph Pipeline["Auto-Update Pipeline"]
      FETCH["Feed Fetcher"]
      DIGEST["Dedup + Digest"]
      ROUTER["Page Router"]
      FILTER["Source-Check Filter"]
  end
  subgraph Output["Outputs"]
      ARCHIVE["Resource Archive
content cached for
future verification"]
      IMPROVE["Page Improve Pipeline"]
  end
  RSS --> FETCH
  WEB --> FETCH
  FETCH --> DIGEST
  DIGEST --> ROUTER
  ROUTER --> FILTER
  FILTER --> ARCHIVE
  FILTER --> IMPROVE
```

| Source Category | Count | Type | Examples |
|---|---|---|---|
| AI Lab Blogs | 4 | RSS + web-search | OpenAI, Anthropic, DeepMind, Meta AI |
| AI Safety / Alignment | 3 | RSS | Alignment Forum, LessWrong, EA Forum |
| Newsletters / Aggregators | 7 | RSS | Import AI, The Gradient, ML Safety Newsletter, Zvi, CAIS, etc. |
| Arxiv | 3 | RSS | arXiv cs.AI, cs.CL, cs.LG |
| Political / AI Policy | 4 | Web-search | AI policy in Congress, PAC elections, state legislation, biosecurity |
| Policy / Governance | 2 | Web-search | AI safety news, executive orders |
| Compute / Industry | 1 | Web-search | AI industry news |
| Watchlist-Supporting | 1 | Web-search | Entity-specific monitoring searches |

CLI

```bash
# Auto-update (news collection)
pnpm crux w auto-update plan              # Preview what would be updated
pnpm crux w auto-update run --budget=30   # Execute with $30 budget cap
pnpm crux w auto-update digest            # Fetch and display news digest
pnpm crux w auto-update sources           # List configured sources
pnpm crux w auto-update history           # Show past runs

# Link health
pnpm crux w check-links                   # Check external URL health
```

Automation

  • GitHub Actions: .github/workflows/auto-update.yml runs daily at 06:00 UTC
  • State tracking: data/auto-update/state.yaml tracks last-seen items per source (see the sketch below)
  • Watchlist: data/auto-update/watchlist.yaml identifies pages due for scheduled updates
  • Dashboards: Auto-Update Runs, Auto-Update News, Data Sources
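
A minimal sketch of the last-seen state tracking, assuming ISO-8601 timestamps and an illustrative `state.yaml` shape (the real schema may differ):

```ts
import { readFileSync, writeFileSync } from "node:fs";
import { parse, stringify } from "yaml";

// Illustrative shape; the real data/auto-update/state.yaml may differ.
type AutoUpdateState = Record<string, { lastSeenAt: string }>;

interface FeedItem { sourceId: string; url: string; publishedAt: string }

// Keep only items newer than each source's high-water mark, then advance it.
function filterNewItems(items: FeedItem[], statePath: string): FeedItem[] {
  const state: AutoUpdateState = parse(readFileSync(statePath, "utf8")) ?? {};
  const fresh = items.filter(
    (i) => i.publishedAt > (state[i.sourceId]?.lastSeenAt ?? ""),
  );
  for (const i of fresh) {
    const prev = state[i.sourceId]?.lastSeenAt ?? "";
    if (i.publishedAt > prev) state[i.sourceId] = { lastSeenAt: i.publishedAt };
  }
  writeFileSync(statePath, stringify(state));
  return fresh;
}
```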

Key Files

| File | Role |
|---|---|
| `data/auto-update/sources.yaml` | Source configuration (25 sources) |
| `crux/auto-update/orchestrator.ts` | End-to-end pipeline orchestration |
| `crux/auto-update/feed-fetcher.ts` | RSS/Atom feed fetching with caching |
| `crux/auto-update/page-router.ts` | Routes items to wiki pages by topic |
| `crux/lib/sourcing/source-fetcher.ts` | Fetches and caches source content (Firecrawl, HTTP, YouTube) |
| `apps/wiki-server/src/routes/citations.ts` | Citation content fetch and cache API |
| `apps/wiki-server/src/routes/resources.ts` | Resource metadata and content versions API |

Source-Check Layer — Verification

The Source-Check layer verifies that claims in wiki pages and FactBase facts are actually supported by their cited sources. It reads cached content from the Resources layer (instead of re-fetching live URLs) and produces verdicts that inform the Three Bases layer.

How It Works

```mermaid
flowchart TB
  subgraph Input["What Gets Checked"]
      FB_FACTS["FactBase Facts"]
      WIKI_CLAIMS["Wiki Page Claims"]
      TB_RECORDS["TableBase Records
personnel, grants,
investments, etc."]
  end
  subgraph Archive["Resources Layer"]
      CACHED["Cached source content"]
  end
  subgraph Check["Verification Pipeline"]
      COLLECT["Item Collector"]
      VERIFY["LLM Verifier"]
  end
  subgraph Store["Storage"]
      EVIDENCE["source_check_evidence"]
      VERDICTS["source_check_verdicts"]
  end
  FB_FACTS --> COLLECT
  WIKI_CLAIMS --> COLLECT
  TB_RECORDS --> COLLECT
  CACHED --> VERIFY
  COLLECT --> VERIFY
  VERIFY --> EVIDENCE
  EVIDENCE --> VERDICTS
```

What It Stores

| Table | Purpose |
|---|---|
| `source_check_evidence` | Per-source raw checks — verdict, confidence, extracted quote, checker model |
| `source_check_verdicts` | Aggregate per-claim verdicts — roll up evidence into a single verdict per record |

Verdict Categories

| Verdict | Meaning |
|---|---|
| `confirmed` | Source explicitly supports the claim |
| `contradicted` | Source says something different |
| `outdated` | Claim was once true but the source shows newer information |
| `partial` | Source partially supports the claim |
| `unverifiable` | Source doesn't address the claim either way |
| `unchecked` | Not yet checked |

Each verdict carries a confidence score (0.0 to 1.0) and is timestamped so stale checks can be detected.
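
To illustrate how per-source evidence rows might roll up into a single verdict (a hedged sketch, not the actual `verdict-handler.ts` logic):

```ts
type Verdict =
  | "confirmed" | "contradicted" | "outdated"
  | "partial" | "unverifiable" | "unchecked";

interface Evidence { verdict: Verdict; confidence: number; checkedAt: Date }

// One plausible aggregation policy: contradictions dominate, then
// confirmations; otherwise take the highest-confidence evidence row.
function aggregate(evidence: Evidence[]): { verdict: Verdict; confidence: number } {
  if (evidence.length === 0) return { verdict: "unchecked", confidence: 0 };
  const byConfidence = [...evidence].sort((a, b) => b.confidence - a.confidence);
  return (
    byConfidence.find((e) => e.verdict === "contradicted") ??
    byConfidence.find((e) => e.verdict === "confirmed") ??
    byConfidence[0]
  );
}
```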

CLI

```bash
# FactBase facts
pnpm crux fb sourcing --entity=anthropic      # Check all facts for an entity
pnpm crux fb sourcing --fact=f_dW5cR9mJ8q     # Check a single fact

# Wiki page claims
pnpm crux w sourcing-wiki-pages --page=anthropic        # Check one page
pnpm crux w sourcing-wiki-pages --limit=5 --budget=2    # Batch with budget
```

Key Files

| File | Role |
|---|---|
| `crux/lib/sourcing/orchestrator.ts` | Main orchestration logic |
| `crux/lib/sourcing/item-verifier.ts` | LLM-based claim verification |
| `crux/lib/sourcing/deterministic-matcher.ts` | Cross-references claims to footnotes |
| `crux/lib/sourcing/verdict-handler.ts` | Stores evidence and computes aggregate verdicts |
| `crux/lib/sourcing/wiki-page-claims.ts` | Extracts claims from MDX prose via LLM |

Three Bases Layer — The Wiki's Knowledge

The Three Bases layer is the wiki's own knowledge — the entities, facts, and prose pages that readers see. It's organized into three conceptual "Bases," each with its own source of truth, plus a set of PG-primary tables for high-volume relational data.

The Three Bases

| Base | What It Stores | Source of Truth | Build Artifact | Key CLI Group |
|---|---|---|---|---|
| TableBase | Typed entity catalog (≈2,000 entities: orgs, people, models, risks, concepts) | `data/entities/*.yaml` (15 files) + MDX frontmatter | `database.json` | `crux tb` |
| FactBase | Structured temporal facts with provenance | `packages/factbase/data/fb-entities/*.yaml` (539 files) | `factbase-data.json` | `crux fb` |
| WikiBase | Long-form prose articles | `content/docs/**/*.mdx` (≈720 pages) | `database.json` (pages section) | `crux w` |

Each base has:

  • A source of truth (YAML or MDX files in git)
  • A PG mirror (synced at build time for API queries)
  • A build artifact (JSON file consumed by Next.js at build time)
  • A CLI command group (for querying, validating, and enriching)

YAML-Primary vs. PG-Primary

Beyond the three bases, this layer also includes PG-primary tables — data that lives in PostgreSQL directly with no YAML backing. This is one of the most common points of confusion.

| | YAML-Primary (Three Bases) | PG-Primary |
|---|---|---|
| Source of truth | YAML files in git | PostgreSQL tables |
| How data gets in | Human edits or LLM pipeline writes YAML | API endpoints write to PG |
| How data reaches PG | `build-data.mjs` syncs at build time | Already there |
| How the frontend reads it | `database.json` / `factbase-data.json` (build artifact) | wiki-server API at request time (ISR) |
| Version control | Full git history | PG only (no git history) |
| Good for | Catalog entries, prose, facts that benefit from review | High-volume relational data, frequently updated records |
| Examples | Entities, FactBase facts, MDX wiki pages | Grants, personnel, funding rounds, benchmarks, jobs |

PG-primary tables include: personnel, grants, funding_rounds, investments, divisions, funding_programs, benchmarks, benchmark_results, jobs, research_areas, political_votes, political_scores, bluesky_posts, website_sources, prediction_market_questions.
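
To make the sync direction concrete, the build-time mirror step might look roughly like this (a sketch assuming a map-of-slugs YAML shape and a hypothetical `upsertEntities` accessor; not the actual build-data.mjs code):

```ts
import { readFileSync } from "node:fs";
import { parse } from "yaml";

interface EntityRow { id: string; stableId: string; name: string }

// Hypothetical PG accessor standing in for the real Drizzle client.
declare function upsertEntities(rows: EntityRow[]): Promise<void>;

// YAML is the source of truth; PG is overwritten to match it at build time.
async function mirrorEntitiesToPg(yamlPath: string): Promise<void> {
  const doc = parse(readFileSync(yamlPath, "utf8")) as Record<
    string,
    { stableId: string; name: string }
  >;
  const rows = Object.entries(doc).map(([id, e]) => ({
    id,
    stableId: e.stableId,
    name: e.name,
  }));
  await upsertEntities(rows); // PG-primary tables skip this step entirely
}
```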

Build Pipeline

The central transformation is apps/web/scripts/build-data.mjs, which runs 20 sequential phases to compile YAML + MDX + PG data into the JSON artifacts the frontend reads:

```mermaid
flowchart TB
  subgraph Input["Inputs"]
      YAML["YAML Entities"]
      FB_YAML2["FactBase YAML"]
      MDX2["MDX Pages"]
      API["Wiki-server API
facts, resources,
assessments"]
  end
  subgraph Phases["build-data.mjs (20 phases)"]
      direction TB
      P1["1-3. Load YAML, IDs, MDX"]
      P2["4-7. Derived, KB, pages, links"]
      P3["8-18. Risk, resources, refs,
coverage, rankings, etc."]
      P4["19-20. Transform + write"]
      P1 --> P2 --> P3 --> P4
  end
  subgraph Output["Outputs"]
      DB_JSON2["database.json"]
      FB_JSON2["factbase-data.json"]
      PG_SYNC["PostgreSQL sync"]
  end
  YAML --> P1
  FB_YAML2 --> P1
  MDX2 --> P1
  API --> P2
  P4 --> DB_JSON2
  P4 --> FB_JSON2
  P4 --> PG_SYNC
```
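
The phase structure is a plain sequential loop rather than a DAG; schematically (hypothetical names, not the real phase list):

```ts
interface BuildContext { [key: string]: unknown }

// Schematic sketch of a sequential phase runner; the real build-data.mjs
// differs in detail. Each phase reads and mutates a shared context.
interface Phase { name: string; run: (ctx: BuildContext) => Promise<void> }

async function runBuild(phases: Phase[]): Promise<BuildContext> {
  const ctx: BuildContext = {};
  for (const phase of phases) {
    const start = Date.now();
    await phase.run(ctx); // a failure here aborts the whole build
    console.log(`${phase.name}: ${Date.now() - start}ms`);
  }
  return ctx;
}
```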

ID Schemes

Different parts of this layer use different ID formats. A single entity (e.g., Anthropic) might have all of these:

| System | ID Format | Example | How Allocated |
|---|---|---|---|
| TableBase slug | Kebab-case string | `anthropic` | Human-chosen in YAML |
| Wiki numeric ID | `E` + integer | `E22` | `crux tb ids allocate` (sequence) |
| Stable ID | `sid_` + 10 alphanumeric chars | `sid_1LcLlMGLbw` | `crux tb ids allocate` or `generateId()` |
| FactBase entity ID | 10 alphanumeric chars | `mK9pX3rQ7n` | Random, assigned at entity creation |
| WikiBase page ID | Kebab-case path | `internal/data-architecture` | Derived from file path |

The entity_ids table and factbase-data.json's slugToEntityId mapping bridge between these ID systems.
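
For illustration, a generator compatible with the `sid_` format above could look like this (a sketch; the real `generateId()` may use a different alphabet or randomness source):

```ts
import { randomInt } from "node:crypto";

const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

// sid_ + 10 alphanumeric characters, e.g. "sid_1LcLlMGLbw".
// Sketch only; the real generator may differ.
function generateStableId(): string {
  let suffix = "";
  for (let i = 0; i < 10; i++) suffix += ALPHABET[randomInt(ALPHABET.length)];
  return `sid_${suffix}`;
}
```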

Cross-Base Index: The things Table

The things table is a universal search index that gives every identifiable item in the system a single row — regardless of which base it comes from. This enables cross-domain search and a unified browse UI.

| `thing_type` | Source | Example |
|---|---|---|
| `entity` | `entities` table | Anthropic (organization) |
| `fact` | `facts` table | Anthropic revenue 2025 |
| `grant` | `grants` table | Open Philanthropy grant to MIRI |
| `personnel` | `personnel` table | Dario Amodei, CEO of Anthropic |
| `division` | `divisions` table | Anthropic Alignment Science |
| `resource` | `resources` table | Research paper on RLHF |
| `benchmark` | `benchmarks` table | MMLU benchmark |
| `investment` | `investments` table | Google's investment in Anthropic |
| `funding-round` | `funding_rounds` table | Anthropic Series E |
| `funding-program` | `funding_programs` table | NSF AI Safety Program |
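
Cross-domain search then reduces to one query over `things` instead of a UNION across ten per-domain tables. A sketch, assuming a `title` column and a hypothetical raw-SQL client (the real code goes through wiki-server routes):

```ts
// Hypothetical raw-SQL client used for illustration only.
declare function query<T>(sql: string, params: unknown[]): Promise<T[]>;

interface ThingHit { thingType: string; sourceTable: string; sourceId: string; title: string }

async function searchThings(term: string): Promise<ThingHit[]> {
  return query<ThingHit>(
    `SELECT thing_type AS "thingType", source_table AS "sourceTable",
            source_id AS "sourceId", title
       FROM things
      WHERE title ILIKE $1
      LIMIT 20`,
    [`%${term}%`],
  );
}
```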

How Each Base Gets Populated

Most data entry is automated. The goal is a defensive pipeline: data enters through verification gates so records are born with green sourcing dots, rather than being verified after the fact. See Discussion #3958 for the full strategy.

WikiBase: Page Creation and Improvement

WikiBase has two pipeline engines:

| Engine | How It Works | When to Use |
|---|---|---|
| V1 (fixed pipeline) | Sequential phases: research → generate → review. Deterministic order. Default engine. | Single-page work: `crux w create`, `crux w improve` |
| V2 (agent orchestrator) | LLM agent with modules as tools. Decides its own phase order. Supports batch mode + Anthropic Batch API (50% cost savings). | Batch improvements, auto-update: `--engine=v2` |

Page creation (crux w create "Title" --tier=standard):

  • Tiers: budget ($8-12), standard ($15-25), premium ($30-50)
  • Multi-phase: web research via Firecrawl → LLM drafts page → adversarial review
  • Auto-creates YAML entity stubs if the entity doesn't exist

Page improvement (crux w improve <id> --tier=standard --apply):

  • Tiers: polish ($2-3, style only), standard ($5-8, light research), deep ($15-25, full research)
  • V2 batch mode: crux w improve --engine=v2 --batch=anthropic,miri --apply
  • Post-processing: citation audit, semantic diff safety check, auto-enrichment with FactBase references
  • Semantic diff blocks changes that exceed tier scope (exit code 75)

Auto-update pipeline (daily, automated):

  1. Fetches RSS/web sources → builds news digest
  2. Routes news items to relevant pages (LLM-based matching)
  3. Source-check filter prioritizes by verdict confidence
  4. Runs improve pipeline on matched pages (budget-capped, default $50/run)

TableBase: Enrichment Loop

TableBase uses a scan → rank → agent loop to systematically fill gaps in PG-primary tables:

  1. Scanner queries wiki-server for completeness per entity per table
  2. Task ranker scores gaps: `(100 - completeness) * taskWeight * importance` (see the scoring sketch below)
  3. Agent runs web search + LLM to fill specific fields for one entity
  4. Pre-submit verification checks each record against its source URL before writing — records are born with verdicts

```bash
pnpm crux tb tablebase scan        # Per-table completeness scores
pnpm crux tb tablebase gaps        # Ranked gap list
pnpm crux tb tablebase improve     # Fill one gap (~$0.50-1.50/task)
pnpm crux tb tablebase loop        # Autonomous loop with --budget
```
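
The ranking step is a direct application of the formula above (a sketch; field names are illustrative):

```ts
// Illustrative gap record; the real field names may differ.
interface Gap {
  entityId: string;
  table: string;
  completeness: number; // 0-100, from the scanner
  taskWeight: number;   // per-table task weight
  importance: number;   // per-entity importance
}

// Score and sort gaps: (100 - completeness) * taskWeight * importance.
function rankGaps(gaps: Gap[]): Gap[] {
  const score = (g: Gap) => (100 - g.completeness) * g.taskWeight * g.importance;
  return [...gaps].sort((a, b) => score(b) - score(a));
}
```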

People discovery scans 5 data sources (experts, org keyPeople, FactBase, entity refs, paper authors) and creates entity stubs for frequently-mentioned people:

```bash
pnpm crux tb people discover       # Find new people candidates
pnpm crux tb people enrich --source=wikidata  # Add Wikidata facts
```

FactBase: Fact Entry

FactBase facts are currently the least automated — most enter via manual addition or Wikidata enrichment:

```bash
pnpm crux fb add-fact anthropic revenue 5e9 --asOf=2025-06 --source=URL
pnpm crux fb wikidata-enrich --entity=anthropic   # Import from Wikidata
pnpm crux fb show anthropic                        # View all facts
pnpm crux fb validate                              # 40+ validation rules
```

The improve pipeline extracts some facts as a side effect (wrapping claims in <FBFactValue> tags). See Known Gaps for the lack of automated fact discovery.
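
For orientation, a fact with provenance might be shaped roughly like this (an illustrative TypeScript view of the YAML; the real field names may differ):

```ts
// Illustrative shape of a structured temporal fact; not the actual schema.
interface Fact {
  id: string;          // e.g. "f_dW5cR9mJ8q"
  entityId: string;    // 10-char FactBase entity ID, e.g. "mK9pX3rQ7n"
  property: string;    // e.g. "revenue"
  value: number | string;
  asOf: string;        // temporal qualifier, e.g. "2025-06"
  sourceUrl?: string;  // provenance; checked by the sourcing pipeline
}

// Mirrors the add-fact command shown above.
const example: Fact = {
  id: "f_dW5cR9mJ8q",
  entityId: "mK9pX3rQ7n",
  property: "revenue",
  value: 5e9,
  asOf: "2025-06",
};
```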


Naming Confusions

The same words mean different things in different contexts. This is the single biggest source of confusion when working in the codebase.

"Entity" Across Contexts

ContextWhat "entity" meansID formatExample
data/entities/*.yamlYAML catalog entrySlug (anthropic)data/entities/organizations.yaml
entities PG tableMirror of YAML catalogSlug + stableId + wikiIdSELECT * FROM entities WHERE id = 'anthropic'
FactBase Entity typeFactBase thing with facts10-char alphanumericpackages/factbase/data/fb-entities/anthropic.yaml
entity_ids PG tableCentral ID registryMaps slug to E-numberanthropic maps to E22 maps to sid_1LcLlMGLbw

"Things" Across Contexts

ContextWhat "things" meansPurpose
packages/factbase/data/fb-entities/FactBase entity YAML filesOne file per entity, containing facts and metadata
things PG tableCross-base universal indexSearch index spanning all record types

These are completely unrelated despite sharing a name. The FactBase directory predates the PG table.

"Facts" Across Contexts

ContextWhat "facts" meansStatus
packages/factbase/data/fb-entities/*.yamlFactBase structured triples (authoritative)Active, primary
facts PG tableMirror of FactBase YAMLActive, read-only mirror
data/facts/*.yamlLegacy YAML facts systemDeprecated for FactBase-covered entities

Health and Monitoring

Each layer has its own health infrastructure:

| Layer | Validators | Health Checks | Dashboards |
|---|---|---|---|
| Resources | Link health checks, CI audit | Content freshness, auto-update run tracking | Data Sources, Auto-Update Runs, Auto-Update News |
| Source-Check | Source-check coverage metrics | Verdict freshness tracking | Source Checks, Data Quality |
| Three Bases | 96 validators in `crux/validate/`, build phase validation | Gate checks (6 CI-blocking), build-time error detection | Entities, Page Changes, Update Schedule, DB Schema |

CI-Blocking Gate Checks

The gate (`pnpm crux w validate gate --fix`) runs ~50 checks; 17 are advisory (warnings only) and the rest block CI. The blocking checks include:

  • Unified content rules (13 rules in one pass): comparison-operators, dollar-signs, frontmatter-schema, wiki-id-integrity, prefer-entitylink, entitylink-ids, footnote-integrity, kbf-refs, no-deprecated-components, pipeline-artifacts, resource-ref-integrity, url-safety, no-quoted-subcategory
  • Code quality: TypeScript type checks, .returning() guard, no untyped row casts, no console.log in server, prompt escaping, dangerous patterns, conflict markers
  • Data integrity: YAML schema, FactBase stableId usage, KB schema, entity reference integrity, temporal invariants, controlled vocab, cross-base consistency
  • Build: tests, build-data, MDX compilation smoke-test

How Data Flows End-to-End

Putting it all together — here's the auto-update path (the most common flow) from a real-world event to a reader seeing updated information. Other paths exist: TableBase enrichment (scan → rank → agent → pre-submit verify → PG), on-demand sourcing, and manual edits.

  1. Resources: Auto-update pipeline fetches RSS feeds, discovers "Anthropic announces new model"
  2. Resources: Source fetcher downloads the blog post content and caches it in resource_content_versions
  3. Resources: Page router matches the news item to the Anthropic wiki page
  4. Three Bases: The page improve pipeline updates the Anthropic MDX page with new information
  5. Source-Check: Source-check reads the cached content, compares new claims against it, stores verdicts
  6. Three Bases: build-data.mjs compiles updated MDX + YAML into database.json
  7. Deploy: Next.js rebuilds, PostgreSQL synced, frontend serves updated page

Lifecycle of a Single Entity

To make the architecture concrete, here's how a single entity (Anthropic) exists across all three layers:

```mermaid
flowchart TB
  subgraph Resources["Resources Layer"]
      NEWS["Auto-update finds
new Anthropic blog post"]
      CACHED["Blog post cached in
resource_content_versions"]
  end
  subgraph Sourcing["Source-Check Layer"]
      CHECK["Verifies revenue claim
against cached Reuters article"]
  end
  subgraph ThreeBases["Three Bases Layer"]
      YAML_E["TableBase: entities/
organizations.yaml"]
      FB_E["FactBase: things/
anthropic.yaml"]
      MDX_E["WikiBase: content/docs/
anthropic.mdx"]
      BUILD_E["build-data.mjs →
database.json → /wiki/E22"]
  end
  NEWS --> CACHED
  CACHED --> CHECK
  NEWS -->|"triggers improve
pipeline for"| MDX_E
  CHECK -->|"verdicts for"| FB_E
  YAML_E --> BUILD_E
  FB_E --> BUILD_E
  MDX_E --> BUILD_E
```

IDs for Anthropic across systems:

| System | ID | Purpose |
|---|---|---|
| TableBase slug | `anthropic` | YAML key, URL-friendly |
| Wiki numeric ID | `E22` | Stable URL: `/wiki/E22` |
| Stable ID | `sid_1LcLlMGLbw` | Cross-system join key |
| FactBase entity ID | `mK9pX3rQ7n` | FactBase internal key |
| WikiBase page ID | `knowledge-base/organizations/anthropic` | MDX file path |

What Queries What (Runtime vs. Build-Time)

This is a critical distinction: content pages make zero runtime API calls, while internal dashboards call the wiki-server API at request time.

| Consumer | Reads From | When | Examples |
|---|---|---|---|
| Wiki content pages | `database.json`, `factbase-data.json` | Build time (static) | Entity pages, articles, comparison tables |
| Internal dashboards | Wiki-server API via Hono RPC | Runtime (ISR, 300s cache) | Grants, personnel, sourcing, jobs |
| Auto-update pipeline | RSS feeds + wiki-server API | Scheduled (daily) | News digest, page routing, run recording |
| Source-check pipeline | Local data files + wiki-server + external URLs | On-demand / scheduled | Fact verification, claim extraction |
| Crux CLI commands | Local files and/or wiki-server API | On-demand | `crux query search`, `crux fb show` |
| Build pipeline | YAML + MDX + wiki-server API | At build time | `build-data.mjs` 20 phases |
| Groundskeeper daemon | Wiki-server API | Continuous | Health checks, job queue, maintenance tasks |

Key implication: A wiki-server outage does NOT break the public-facing wiki. Content pages are fully static. Only internal dashboards and CLI tools are affected.

ISR Caching

Internal dashboard pages use Next.js Incremental Static Regeneration. Most pages cache for 300 seconds (5 minutes); data-source pages cache for 60 seconds. Pages that fail to fetch from the wiki-server fall back to local static files via withApiFallback.
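
In Next.js App Router terms, the pattern looks roughly like this (a sketch; `withApiFallback`'s real signature may differ):

```ts
// Illustrative dashboard page module, not the actual code.
export const revalidate = 300; // ISR: regenerate at most every 5 minutes

// Hypothetical helper mirroring the withApiFallback behavior described above.
async function withApiFallback<T>(
  fetchLive: () => Promise<T>,
  loadStatic: () => Promise<T>,
): Promise<T> {
  try {
    return await fetchLive();
  } catch {
    return loadStatic(); // wiki-server down: serve the local static snapshot
  }
}
```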


Common Tasks Cheat Sheet

| I want to... | Commands | Key Files |
|---|---|---|
| Add a new organization | `WIKI_SERVER_ENV=prod pnpm crux tb ids allocate my-org` → edit `data/entities/organizations.yaml` → `pnpm crux w create "My Org" --tier=standard` | `data/entities/organizations.yaml`, new MDX page |
| Add structured facts about an entity | `pnpm crux fb add-fact` or edit `packages/factbase/data/fb-entities/<entity>.yaml` directly | `packages/factbase/data/fb-entities/` |
| Check if a fact is accurately sourced | `WIKI_SERVER_ENV=prod pnpm crux fb sourcing --entity=anthropic` | `crux/lib/sourcing/` |
| Find why a page shows stale data | Check `update_frequency` in frontmatter → `WIKI_SERVER_ENV=prod pnpm crux w auto-update history` → check sourcing verdicts | `data/auto-update/sources.yaml` |
| Create a new PG-primary table | Add Drizzle schema → generate migration → add wiki-server route → add to `things` sync | `apps/wiki-server/src/schema.ts` |
| Add a new directory page | Schema in `entity-schemas.ts` → transform in `entity-transform.mjs` → route in `entity-nav.ts` → App Router pages | `apps/web/src/data/entity-schemas.ts` |
| Run all validations before a PR | `pnpm crux w validate gate --fix` → `pnpm build` → `pnpm test` | `crux/validate/` |
| Search across everything | `WIKI_SERVER_ENV=prod pnpm crux query search "topic"` | Cross-base `things` table |
| Get full context on a page | `WIKI_SERVER_ENV=prod pnpm crux context for-page anthropic` | Assembles entity + facts + backlinks + citations |
| Improve an existing page | `pnpm crux w improve anthropic --tier=standard --apply` | `crux/lib/page-templates.ts` |

PG Table Relationship Map

The entities table is the hub — most PG-primary tables reference it via stable_id. Simplified view of major foreign key relationships:

```mermaid
flowchart TB
  ENTITIES["entities
(stable_id)"]
  FACTS["facts"]
  WIKI["wiki_pages"]
  THINGS["things"]
  PERSONNEL["personnel"]
  GRANTS["grants"]
  FUNDING_R["funding_rounds"]
  INVESTMENTS["investments"]
  DIVISIONS["divisions"]
  BENCHMARKS["benchmark_results"]
  RESOURCES["resources"]
  RES_CIT["resource_citations"]
  PAGE_LINKS["page_links"]
  SOURCE_CHK["source_check_evidence"]
  EDIT_LOGS["edit_logs"]
  FACTS -->|"entity_id"| ENTITIES
  PERSONNEL -->|"person + org
entity_id"| ENTITIES
  GRANTS -->|"org + grantee
entity_id"| ENTITIES
  FUNDING_R -->|"company
entity_id"| ENTITIES
  INVESTMENTS -->|"company + investor
entity_id"| ENTITIES
  DIVISIONS -->|"parent_org_id"| ENTITIES
  BENCHMARKS -->|"model_id"| ENTITIES
  RES_CIT -->|"page_id"| WIKI
  RES_CIT -->|"resource_id"| RESOURCES
  PAGE_LINKS -->|"source + target"| WIKI
  EDIT_LOGS -->|"page_id"| WIKI
  SOURCE_CHK -->|"entity_id"| ENTITIES
  THINGS -->|"parent_thing_id"| THINGS

Key patterns:

  • entities.stable_id is the universal join key — personnel, grants, funding rounds, investments, divisions, benchmark results, and facts all FK to it
  • wiki_pages.id is the hub for content relationships — resource citations, page links, edit logs, hallucination risk snapshots all FK to it
  • things is self-referential (parent_thing_id) and indexes all other tables via source_table + source_id
  • resources connects to wiki pages via resource_citations and to facts via source URLs
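
In Drizzle terms, the hub-and-spoke pattern looks roughly like this (an abbreviated sketch; the real column definitions live in apps/wiki-server/src/schema.ts):

```ts
import { pgTable, text } from "drizzle-orm/pg-core";

// The hub: every spoke table joins back to entities via stable_id.
// Abbreviated sketch; see schema.ts for the real columns.
export const entities = pgTable("entities", {
  stableId: text("stable_id").primaryKey(),
  name: text("name").notNull(),
});

// A spoke: grants reference both the granting org and the grantee.
export const grants = pgTable("grants", {
  id: text("id").primaryKey(),
  orgEntityId: text("org_entity_id").references(() => entities.stableId),
  granteeEntityId: text("grantee_entity_id").references(() => entities.stableId),
});
```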

CLI Command Map

The Crux CLI is organized into domain groups. Here's the full tree:

```mermaid
flowchart TB
  CRUX["pnpm crux"]
  W["w (wiki)"]
  FB["fb (factbase)"]
  TB["tb (tablebase)"]
  GH["gh (github)"]
  SYS["sys (system)"]
  QUERY["query"]
  CTX["context"]
  CRUX --> W
  CRUX --> FB
  CRUX --> TB
  CRUX --> GH
  CRUX --> SYS
  CRUX --> QUERY
  CRUX --> CTX
```

| Group | Key Subcommands | Purpose |
|---|---|---|
| `w` (wiki) | `create`, `improve`, `validate gate`, `fix escaping`, `auto-update`, `sourcing-wiki-pages`, `citations`, `qa-sweep` | Content authoring, validation, automated updates |
| `fb` (factbase) | `show`, `validate`, `search`, `add-fact`, `sourcing`, `coverage` | Structured fact management and verification |
| `tb` (tablebase) | `ids allocate`, `tablebase scan/gaps/improve/loop`, `ensure-entities`, `people discover`, `benchmarks`, `sourcing` | Entity catalog, PG table enrichment, ID allocation |
| `gh` (github) | `issues start/done/create`, `pr create/detect`, `ci status`, `epic create`, `release create`, `deploy-tasks` | Issue tracking, PR management, CI monitoring |
| `sys` (system) | `agent-checklist`, `agent-reset`, `audits`, `health`, `wiki-server sync`, `jobs` | Agent workflow, background jobs, system health |
| `query` | `search`, `entity`, `facts`, `related`, `risk`, `stats`, `blocks` | Cross-domain search and data queries |
| `context` | `for-page`, `for-issue`, `for-entity`, `for-topic` | Research context assembly |

What's Not Here (Known Gaps)

The architecture has known limitations and missing pieces. Documenting them prevents wasted effort trying to find features that don't exist.

| Gap | What's Missing | Workaround |
|---|---|---|
| No automated fact discovery | FactBase facts are added manually or via the improve pipeline. There's no system that proactively discovers new facts (e.g., "Anthropic's headcount changed") | Auto-update catches news, but structured facts must be manually extracted from it |
| No cross-base consistency checking | FactBase and TableBase can have conflicting data about the same entity (e.g., different founding dates). No validator catches this | Manual review; `crux fb validate` checks internal FactBase consistency only |
| No YAML-to-PG migration path | When a YAML-primary entity type outgrows YAML (needs aggregation, relationships), there's no automated migration to PG-primary | Manual: create PG table, write migration, add API route, update build-data |
| No incremental builds | `build-data.mjs` rebuilds everything from scratch each time (~30s content-only, ~2min full). No caching or dirty-file detection | Use `--scope=content` for faster content-only builds |
| No real-time updates | Wiki content is fully static; changes require a build + deploy cycle. ISR helps dashboards but content pages are build-time only | Deploy pipeline; auto-update runs daily |
| No FactBase → WikiBase auto-sync | When a FactBase fact changes, the wiki page prose isn't automatically updated to match | `crux w improve` can update pages, but must be triggered manually |
| Source-check coverage is partial | Only about 15% of wiki pages have been source-checked. Many facts lack source URLs entirely | `crux fb sourcing` and `crux w sourcing-wiki-pages` are available but expensive (about $0.07/page) |
| No rollback for PG-primary data | YAML-primary data has full git history. PG-primary data (grants, personnel) has no version history beyond edit logs | Consider adding a changelog table for PG-primary records |
| `things` table can go stale | The cross-base index is synced at build time. Between builds, newly created PG records won't appear in `things` search | Rebuild or wait for next deploy |


Comparison to Conventional Architectures

For contributors coming from other systems, here's how this architecture maps to more common patterns:

| Conventional Pattern | Longterm Wiki Equivalent | Key Difference |
|---|---|---|
| CMS database (WordPress, Strapi) | YAML files + MDX pages | Source of truth is git, not a database. PG is a read mirror. |
| Knowledge graph (Neo4j, Wikidata) | FactBase (`packages/factbase/`) | Triples stored in YAML, not a graph database. No SPARQL; uses a TypeScript graph loader. |
| Data warehouse (Snowflake, BigQuery) | `database.json` + PG tables | Build artifact is a single JSON file, not a query engine. PG provides live queries for dashboards. |
| ETL pipeline (Airflow, dbt) | `build-data.mjs` (20 phases) | Single sequential script, not a DAG. No orchestration framework; runs in CI or locally. |
| Headless CMS API | Wiki-server (Hono) | API serves PG-primary data for dashboards. Content pages don't use the API at all. |
| RSS aggregator (Feedly) | Auto-update pipeline | Not just aggregation; routes items to specific pages and triggers LLM-powered content updates. |
| Fact-checking platform (ClaimBuster) | Source-check system | Integrated into the content pipeline. Verdicts feed back into page risk scores and auto-update priority. |
| Static site generator (Hugo, Astro) | Next.js + `database.json` | Hybrid: content pages are static, dashboard pages use ISR. Data layer is much richer than a typical SSG. |

The unusual parts:

  • YAML as source of truth instead of a database — enables git review workflows but makes queries harder
  • Three separate data models (TableBase, FactBase, WikiBase) instead of one unified schema — each optimized for its domain but creating naming confusion
  • Build-time data compilation into a single JSON file — enables zero-API content pages but requires a full rebuild for any data change
  • LLM-powered automation at every layer — content creation, fact verification, data enrichment, news routing all use LLM calls

  • Data Architecture: Three Bases and Naming Guide — Detailed PG table reference and naming confusions
  • System Architecture — Overall technical architecture
  • Knowledge Base Architecture — Deep dive into FactBase internals
  • DB Schema Overview — Full ER diagrams and migration history
  • Data System Authority Rules — Which system is authoritative for each entity