# Content Database System
The wiki uses a multi-layer storage architecture. There is no single database — different kinds of data live in the storage layer best suited for them.
## Storage Layers

### 1. PostgreSQL (wiki-server)
The wiki-server runs a PostgreSQL database that stores all structured data requiring durability and cross-machine access. This replaced the earlier local SQLite database (.cache/knowledge.db), which was retired in February 2026.
What it stores:
| Table | Purpose |
|---|---|
| `citation_content` | Full text of fetched source URLs (for quote verification) |
| `citation_audits` | Per-page citation verification results |
| `claims` | Extracted atomic claims with source references |
| `facts` | Canonical facts with values and computed expressions |
| `resources` | External resource metadata (papers, blogs, reports) |
| `entities` | Entity metadata synced from YAML |
| `agent_sessions` | Claude Code session logs |
| `edit_logs` | Per-page edit history |
| `hallucination_evals` | Hallucination detection results |
Access pattern: All access goes through the wiki-server's Hono RPC API. CLI tools use `apiRequest()` from `crux/lib/wiki-server/`. The frontend uses typed RPC clients (e.g., `getFactsRpcClient()`).
```sh
# Example CLI commands that read/write PostgreSQL
pnpm crux citations verify <page-id>   # Verify citations → writes audit results
pnpm crux query entity <id>            # Read entity data
pnpm crux query search "topic"         # Full-text search
```
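For programmatic access, the same endpoints are reachable through the `apiRequest()` helper. A minimal sketch of what such a wrapper might look like (the real helper in `crux/lib/wiki-server/` may differ; the default base URL, endpoint path, and injectable `fetchFn` parameter are assumptions for illustration):

```typescript
// Hypothetical sketch of an apiRequest() helper: a thin typed wrapper over
// fetch against the wiki-server RPC API. Names and defaults are assumptions,
// not the actual crux signature.
async function apiRequest<T>(
  path: string,
  init: RequestInit = {},
  fetchFn: typeof fetch = fetch,       // injectable for testing
  baseUrl = "http://localhost:3000",   // assumed default; real port unknown
): Promise<T> {
  const res = await fetchFn(new URL(path, baseUrl), {
    headers: { "content-type": "application/json" },
    ...init,
  });
  if (!res.ok) throw new Error(`wiki-server returned ${res.status} for ${path}`);
  return (await res.json()) as T;
}
```

Injecting `fetchFn` keeps the wrapper testable without a running wiki-server.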
### 2. In-Memory LRU Cache
Source fetching uses a session-scoped in-memory cache (crux/lib/citation-content-cache.ts) to avoid redundant network requests and database lookups within a single process.
| Property | Value |
|---|---|
| Max entries | 500 |
| Eviction | Least Recently Used |
| Scope | Per-process (cleared on exit) |
| Persistence | None — purely ephemeral |
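The eviction behavior in the table can be sketched with a minimal LRU built on `Map` insertion order (a sketch only; `crux/lib/citation-content-cache.ts` may be implemented differently):

```typescript
// Minimal LRU cache sketch: a Map preserves insertion order, so the first
// key in iteration order is always the least recently used. The capacity
// default of 500 matches the table above.
class LruCache<K, V> {
  private map = new Map<K, V>();
  constructor(private maxEntries = 500) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    // Re-insert to mark this entry as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry (first in iteration order).
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }
}
```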
When fetching a URL, the system checks:

- In-memory LRU cache (fastest)
- PostgreSQL `citation_content` table (durable)
- Network fetch via Firecrawl or built-in fallback (slowest)
Results are written back to both the LRU cache and PostgreSQL.
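The lookup-then-write-back sequence can be sketched as a read-through function (all names here are illustrative, not the actual crux API):

```typescript
// Illustrative three-tier read-through lookup with write-back.
// The real logic lives in crux/lib/citation-content-cache.ts and the
// wiki-server API; these interfaces are assumptions for the sketch.
interface DurableStore {
  get(url: string): Promise<string | null>;
  put(url: string, content: string): Promise<void>;
}

async function getSourceContent(
  url: string,
  mem: Map<string, string>,                      // in-memory tier (LRU in the real system)
  db: DurableStore,                              // PostgreSQL citation_content tier
  fetchRemote: (url: string) => Promise<string>, // Firecrawl or built-in fallback
): Promise<string> {
  const cached = mem.get(url);
  if (cached !== undefined) return cached;       // 1. fastest tier

  const stored = await db.get(url);
  if (stored !== null) {
    mem.set(url, stored);                        // warm the memory tier
    return stored;                               // 2. durable tier
  }

  const fetched = await fetchRemote(url);        // 3. slowest tier
  mem.set(url, fetched);                         // write back to both tiers
  await db.put(url, fetched);
  return fetched;
}
```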
### 3. KB YAML (packages/kb/)
The Knowledge Base package (packages/kb/) is the authoritative source for structured entity facts — valuations, revenue, headcounts, founding dates, and other typed properties. As of March 2026, 9+ entities have been migrated here from the older data/facts/ system.
| Path | Content |
|---|---|
| `packages/kb/data/things/*.yaml` | Entity facts with typed properties, time series, sources |
| `packages/kb/data/schemas/` | Property schemas (60 properties across orgs, people, AI models, etc.) |
KB facts are rendered on wiki pages via `<KBF>` and `<KBFactValue>` components, and computed values via `<Calc>`. See Data System Authority Rules for which system is authoritative for which entities.
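A thing file under `packages/kb/data/things/` might look roughly like this (a hypothetical example; the field names and layout are assumptions, so consult `packages/kb/data/schemas/` for the real property schema):

```yaml
# packages/kb/data/things/example-org.yaml (hypothetical layout)
id: example-org
type: organization
properties:
  founded: 2021
  headcount:
    value: 1200
    as_of: 2026-01
    source: https://example.com/annual-report
```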
### 4. YAML Files (data/)
Human-editable YAML files are the source of truth for content metadata:
| Directory | Content |
|---|---|
| `data/entities/` | Entity definitions (type, description, relations) |
| `data/facts/` | Legacy facts (deprecated for entities migrated to KB) |
| `data/resources/` | External resource metadata |
| `data/graphs/` | Cause-effect graph data |
| `data/edit-logs/` | Per-page edit history |
| `data/citation-archive/` | Per-page citation verification YAML |
| `data/auto-update/` | Auto-update system configuration and state |
YAML files are checked into git and are the canonical source for everything they contain. PostgreSQL mirrors some of this data for API access and full-text search.
### 5. File-System Caches (.cache/)
Temporary files for local development workflows:
| Path | Purpose |
|---|---|
| `.cache/sources/` | Fetched source documents (HTML, text, PDF) |
| `.cache/content-hashes.json` | MD5 hashes for change detection during scans |
These are gitignored and can be deleted without data loss.
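The change-detection idea behind `.cache/content-hashes.json` can be sketched as follows (the `{ path: md5 }` JSON layout is an assumption):

```typescript
import { createHash } from "node:crypto";

// Sketch of content-hash change detection: hash each file's content and
// compare it against the hash stored from the previous scan. Files whose
// hash differs (or is absent) are flagged for re-processing.
function md5(content: string): string {
  return createHash("md5").update(content).digest("hex");
}

function changedPaths(
  current: Record<string, string>,   // path -> file content from this scan
  previous: Record<string, string>,  // path -> stored md5 from last scan
): string[] {
  return Object.keys(current).filter((p) => md5(current[p]) !== previous[p]);
}
```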
### 6. Build Artifact (database.json)
The build pipeline (apps/web/scripts/build-data.mjs) compiles YAML + MDX frontmatter into apps/web/src/data/database.json. This single JSON file contains all entities, pages, relations, facts, search data, and statistics needed by the Next.js frontend.
```sh
pnpm build-data           # Full build (~2 min)
pnpm build-data:content   # Content-only rebuild (~15s)
```
The JSON is loaded at server startup with lazy-built indexes (see Architecture).
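Lazy-built indexes can be sketched like this: the entity list from `database.json` is held in memory, and a secondary index is built on first query and then memoized (shapes here are illustrative, not the actual frontend types):

```typescript
// Sketch of lazy index building over database.json contents. The JSON is
// parsed once at startup; secondary indexes (e.g. entities grouped by type)
// are constructed only when first needed, then reused.
interface Entity {
  id: string;
  type: string;
}

class Database {
  private byType?: Map<string, Entity[]>;
  constructor(private entities: Entity[]) {}

  entitiesOfType(type: string): Entity[] {
    if (!this.byType) {
      // Built lazily on the first query, memoized for later calls.
      this.byType = new Map();
      for (const e of this.entities) {
        const bucket = this.byType.get(e.type) ?? [];
        bucket.push(e);
        this.byType.set(e.type, bucket);
      }
    }
    return this.byType.get(type) ?? [];
  }
}
```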
## Data Flow

### Source Fetching Flow

When verifying citations or fetching content for page improvement, the fetcher walks the storage tiers in order: the in-memory LRU cache, then the PostgreSQL `citation_content` table, then a network fetch via Firecrawl or the built-in fallback. Fetched content is written back to both the cache and PostgreSQL.
## CLI Commands
| Command | Purpose |
|---|---|
| `pnpm crux citations verify <page-id>` | Verify all citations on a page |
| `pnpm crux citations audit` | Run citation audits across pages |
| `pnpm crux scan-content` | Scan MDX files for content analysis |
| `pnpm crux query search "topic"` | Full-text search via wiki-server |
| `pnpm crux query entity <id>` | Look up entity data |
| `pnpm crux query related <id>` | Find related pages |
| `pnpm crux context for-page <id>` | Full research context for a page |
| `pnpm build-data` | Rebuild database.json from YAML + MDX |
| `pnpm build-data:content` | Content-only rebuild (≈15s) |
## Limitations
- No offline PostgreSQL access: CLI commands that query the wiki-server require network connectivity
- LRU cache is session-scoped: Restarting a process loses cached content (by design — PostgreSQL is the durable tier)
- database.json must be rebuilt: Changes to YAML or MDX frontmatter are not visible to the frontend until `build-data` runs
- Citation content is append-mostly: Old fetched content is not automatically refreshed
## Related
- Architecture — System overview and design patterns
- Automation Tools — Full CLI reference
- Data System Authority Rules — Which data system is authoritative for each entity