# Content Database System
The wiki uses a multi-layer storage architecture. There is no single database — different kinds of data live in the storage layer best suited for them.
## Storage Layers

### 1. PostgreSQL (wiki-server)
The wiki-server runs a PostgreSQL database that stores all structured data requiring durability and cross-machine access. This replaced the earlier local SQLite database (.cache/knowledge.db), which was retired in February 2026.
What it stores:
| Table | Purpose |
|---|---|
| `citation_content` | Full text of fetched source URLs (for quote verification) |
| `citation_audits` | Per-page citation verification results |
| `claims` | Extracted atomic claims with source references |
| `facts` | Canonical facts with values and computed expressions |
| `resources` | External resource metadata (papers, blogs, reports) |
| `entities` | Entity metadata synced from YAML |
| `agent_sessions` | Claude Code session logs |
| `edit_logs` | Per-page edit history |
| `hallucination_evals` | Hallucination detection results |
Access pattern: All access goes through the wiki-server's Hono RPC API. CLI tools use `apiRequest()` from `crux/lib/wiki-server/`. The frontend uses typed RPC clients (e.g., `getFactsRpcClient()`).
```sh
# Example CLI commands that read/write PostgreSQL
pnpm crux citations verify <page-id>   # Verify citations → writes audit results
pnpm crux query entity <id>            # Read entity data
pnpm crux query search "topic"         # Full-text search
```
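For programmatic access, the same endpoints are reachable through the `apiRequest()` helper. A minimal sketch of what such a wrapper might look like (the real helper in `crux/lib/wiki-server/` may differ; the default base URL, endpoint path, and injectable `fetchFn` parameter are assumptions for illustration):

```typescript
// Hypothetical sketch of an apiRequest() helper: a thin typed wrapper over
// fetch against the wiki-server RPC API. Names and defaults are assumptions,
// not the actual crux signature.
async function apiRequest<T>(
  path: string,
  init: RequestInit = {},
  fetchFn: typeof fetch = fetch,       // injectable for testing
  baseUrl = "http://localhost:3000",   // assumed default; real port unknown
): Promise<T> {
  const res = await fetchFn(new URL(path, baseUrl), {
    headers: { "content-type": "application/json" },
    ...init,
  });
  if (!res.ok) throw new Error(`wiki-server returned ${res.status} for ${path}`);
  return (await res.json()) as T;
}
```

Injecting `fetchFn` keeps the wrapper testable without a running wiki-server.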
### 2. In-Memory LRU Cache
Source fetching uses a session-scoped in-memory cache (crux/lib/citation-content-cache.ts) to avoid redundant network requests and database lookups within a single process.
| Property | Value |
|---|---|
| Max entries | 500 |
| Eviction | Least Recently Used |
| Scope | Per-process (cleared on exit) |
| Persistence | None — purely ephemeral |
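The eviction behavior in the table can be sketched with a minimal LRU built on `Map` insertion order (a sketch only; `crux/lib/citation-content-cache.ts` may be implemented differently):

```typescript
// Minimal LRU cache sketch: a Map preserves insertion order, so the first
// key in iteration order is always the least recently used. The capacity
// default of 500 matches the table above.
class LruCache<K, V> {
  private map = new Map<K, V>();
  constructor(private maxEntries = 500) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    // Re-insert to mark this entry as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry (first in iteration order).
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }
}
```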
When fetching a URL, the system checks:

- In-memory LRU cache (fastest)
- PostgreSQL `citation_content` table (durable)
- Network fetch via Firecrawl or built-in fallback (slowest)
Results are written back to both the LRU cache and PostgreSQL.
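The lookup-then-write-back sequence can be sketched as a read-through function (all names here are illustrative, not the actual crux API):

```typescript
// Illustrative three-tier read-through lookup with write-back.
// The real logic lives in crux/lib/citation-content-cache.ts and the
// wiki-server API; these interfaces are assumptions for the sketch.
interface DurableStore {
  get(url: string): Promise<string | null>;
  put(url: string, content: string): Promise<void>;
}

async function getSourceContent(
  url: string,
  mem: Map<string, string>,                      // in-memory tier (LRU in the real system)
  db: DurableStore,                              // PostgreSQL citation_content tier
  fetchRemote: (url: string) => Promise<string>, // Firecrawl or built-in fallback
): Promise<string> {
  const cached = mem.get(url);
  if (cached !== undefined) return cached;       // 1. fastest tier

  const stored = await db.get(url);
  if (stored !== null) {
    mem.set(url, stored);                        // warm the memory tier
    return stored;                               // 2. durable tier
  }

  const fetched = await fetchRemote(url);        // 3. slowest tier
  mem.set(url, fetched);                         // write back to both tiers
  await db.put(url, fetched);
  return fetched;
}
```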
### 3. KB YAML (packages/kb/)
The Knowledge Base package (packages/kb/) is the authoritative source for structured entity facts — valuations, revenue, headcounts, founding dates, and other typed properties. As of March 2026, 9+ entities have been migrated here from the older data/facts/ system.
| Path | Content |
|---|---|
| `packages/kb/data/things/*.yaml` | Entity facts with typed properties, time series, sources |
| `packages/kb/data/schemas/` | Property schemas (60 properties across orgs, people, AI models, etc.) |
KB facts are rendered on wiki pages via `<KBF>` and `<KBFactValue>` components, and computed values via `<Calc>`. See Data System Authority Rules for which system is authoritative for which entities.
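A thing file under `packages/kb/data/things/` might look roughly like this (a hypothetical example; the field names and layout are assumptions, so consult `packages/kb/data/schemas/` for the real property schema):

```yaml
# packages/kb/data/things/example-org.yaml (hypothetical layout)
id: example-org
type: organization
properties:
  founded: 2021
  headcount:
    value: 1200
    as_of: 2026-01
    source: https://example.com/annual-report
```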
### 4. YAML Files (data/)
Human-editable YAML files are the source of truth for content metadata:
| Directory | Content |
|---|---|
| `data/entities/` | Entity definitions (type, description, relations) |
| `data/facts/` | Legacy facts (deprecated for entities migrated to KB) |
| `data/resources/` | External resource metadata |
| `data/graphs/` | Cause-effect graph data |
| `data/edit-logs/` | Per-page edit history |
| `data/citation-archive/` | Per-page citation verification YAML |
| `data/auto-update/` | Auto-update system configuration and state |
YAML files are checked into git and are the canonical source for everything they contain. PostgreSQL mirrors some of this data for API access and full-text search.
### 5. File-System Caches (.cache/)
Temporary files for local development workflows:
| Path | Purpose |
|---|---|
| `.cache/sources/` | Fetched source documents (HTML, text, PDF) |
| `.cache/content-hashes.json` | MD5 hashes for change detection during scans |
These are gitignored and can be deleted without data loss.
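The change-detection idea behind `.cache/content-hashes.json` can be sketched as follows (the `{ path: md5 }` JSON layout is an assumption):

```typescript
import { createHash } from "node:crypto";

// Sketch of content-hash change detection: hash each file's content and
// compare it against the hash stored from the previous scan. Files whose
// hash differs (or is absent) are flagged for re-processing.
function md5(content: string): string {
  return createHash("md5").update(content).digest("hex");
}

function changedPaths(
  current: Record<string, string>,   // path -> file content from this scan
  previous: Record<string, string>,  // path -> stored md5 from last scan
): string[] {
  return Object.keys(current).filter((p) => md5(current[p]) !== previous[p]);
}
```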
### 6. Build Artifact (database.json)
The build pipeline (apps/web/scripts/build-data.mjs) compiles YAML + MDX frontmatter into apps/web/src/data/database.json. This single JSON file contains all entities, pages, relations, facts, search data, and statistics needed by the Next.js frontend.
```sh
pnpm build-data           # Full build (~2 min)
pnpm build-data:content   # Content-only rebuild (~15s)
```
The JSON is loaded at server startup with lazy-built indexes (see Architecture).
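Lazy-built indexes can be sketched like this: the entity list from `database.json` is held in memory, and a secondary index is built on first query and then memoized (shapes here are illustrative, not the actual frontend types):

```typescript
// Sketch of lazy index building over database.json contents. The JSON is
// parsed once at startup; secondary indexes (e.g. entities grouped by type)
// are constructed only when first needed, then reused.
interface Entity {
  id: string;
  type: string;
}

class Database {
  private byType?: Map<string, Entity[]>;
  constructor(private entities: Entity[]) {}

  entitiesOfType(type: string): Entity[] {
    if (!this.byType) {
      // Built lazily on the first query, memoized for later calls.
      this.byType = new Map();
      for (const e of this.entities) {
        const bucket = this.byType.get(e.type) ?? [];
        bucket.push(e);
        this.byType.set(e.type, bucket);
      }
    }
    return this.byType.get(type) ?? [];
  }
}
```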
## Data Flow

### Source Fetching Flow

When verifying citations or fetching content for page improvement, the fetcher walks the storage tiers in order: the in-memory LRU cache, then the PostgreSQL `citation_content` table, then a network fetch via Firecrawl or the built-in fallback. Fetched content is written back to both the cache and PostgreSQL.
## CLI Commands
| Command | Purpose |
|---|---|
| `pnpm crux citations verify <page-id>` | Verify all citations on a page |
| `pnpm crux citations audit` | Run citation audits across pages |
| `pnpm crux scan-content` | Scan MDX files for content analysis |
| `pnpm crux query search "topic"` | Full-text search via wiki-server |
| `pnpm crux query entity <id>` | Look up entity data |
| `pnpm crux query related <id>` | Find related pages |
| `pnpm crux context for-page <id>` | Full research context for a page |
| `pnpm build-data` | Rebuild database.json from YAML + MDX |
| `pnpm build-data:content` | Content-only rebuild (≈15s) |
## Limitations
- No offline PostgreSQL access: CLI commands that query the wiki-server require network connectivity
- LRU cache is session-scoped: Restarting a process loses cached content (by design — PostgreSQL is the durable tier)
- database.json must be rebuilt: Changes to YAML or MDX frontmatter are not visible to the frontend until `build-data` runs
- Citation content is append-mostly: Old fetched content is not automatically refreshed
## Related
- Architecture — System overview and design patterns
- Automation Tools — Full CLI reference
- Data System Authority Rules — Which data system is authoritative for each entity