
System Architecture

This document provides a technical overview of how the Longterm Wiki is built, how data flows through the system, and the design rationale behind key architectural decisions.


Entity Data System

Purpose: Maintain structured data about people, organizations, and concepts that can be referenced across pages.

Flow: YAML sources → build-data.mjs → JSON artifacts → React components

| Component | Location | Purpose |
| --- | --- | --- |
| Source YAML | `src/data/*.yaml` | Human-editable entity definitions |
| Build script | `scripts/build-data.mjs` | Compiles YAML to JSON |
| Generated JSON | `src/data/*.json` | Browser-ready data |
| Components | `src/components/wiki/` | Display entity data |

Key files generated:

  • database.json - All entities merged
  • pathRegistry.json - Entity ID → URL path mapping
  • backlinks.json - Reverse reference indices

Design decision: YAML for human editing, JSON for runtime. This separation allows manual curation while keeping the site fast.
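The compile step above can be sketched as follows. This is a simplified illustration, not the real `build-data.mjs`: the actual build uses a full YAML parser, while this sketch parses only flat `key: value` records, and the function names are hypothetical.

```javascript
// Sketch of the YAML -> JSON compile step (simplified; parseFlatYaml
// handles only a flat "key: value" subset of YAML for illustration).
const parseFlatYaml = (text) => {
  const entity = {};
  for (const line of text.split("\n")) {
    const m = line.match(/^(\w+):\s*(.+)$/);
    if (m) entity[m[1]] = m[2];
  }
  return entity;
};

const buildDatabase = (yamlSources) => {
  const entities = yamlSources.map(parseFlatYaml);
  return {
    database: entities, // -> database.json (all entities merged)
    pathRegistry: Object.fromEntries( // -> pathRegistry.json (ID -> URL path)
      entities.map((e) => [e.id, `/wiki/${e.id}/`])
    ),
  };
};

const { pathRegistry } = buildDatabase([
  "id: ada-lovelace\nname: Ada Lovelace\ntype: person",
]);
console.log(pathRegistry["ada-lovelace"]); // "/wiki/ada-lovelace/"
```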


Content Database

Purpose: Index content for analysis, cache external sources, and support AI-assisted workflows.

Location: .cache/knowledge.db (gitignored, regenerated per machine)


Key capabilities:

  • Content indexing and search
  • Source fetching via Firecrawl API
  • AI summary generation
  • Change detection via content hashing
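Change detection via content hashing works by comparing a stored digest against a fresh one. A minimal sketch, assuming the store-and-compare shape (the real implementation would use a cryptographic hash such as SHA-256 via `node:crypto`; FNV-1a is used here only to keep the example dependency-free):

```javascript
// FNV-1a stand-in for the real content hash; only the compare-and-skip
// pattern matters for change detection.
function contentHash(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// A page is re-indexed (or re-summarized) only when its hash differs
// from the one stored in the knowledge DB.
const needsReindex = (text, storedHash) => contentHash(text) !== storedHash;
```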

See: Content Database for full schema and API reference.


Page Creator Pipeline

Purpose: Generate new wiki pages with proper citations using AI research and synthesis.

Pipeline phases:

canonical-links → research-perplexity → register-sources → fetch-sources
→ research-scry → synthesize → verify-sources → validate-loop → grade
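The phases above run in sequence, each enriching a shared context. An illustrative runner (phase names are from this doc; `runPhase` and the context shape are hypothetical):

```javascript
// Illustrative sequential phase runner for the page-creation pipeline.
const PHASES = [
  "canonical-links", "research-perplexity", "register-sources",
  "fetch-sources", "research-scry", "synthesize", "verify-sources",
  "validate-loop", "grade",
];

async function runPipeline(ctx, runPhase) {
  for (const phase of PHASES) {
    ctx = await runPhase(phase, ctx); // each phase returns an enriched context
  }
  return ctx;
}
```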

Design decisions:

| Decision | Rationale |
| --- | --- |
| Perplexity for research | Cheap ($0.10), good at web search, provides citation URLs |
| Register + fetch sources | Enables quote verification against actual content |
| Verify-sources phase | Catches hallucinated quotes before publication |
| Validation loop | Iterative fixing ensures build-passing output |

Cost tiers:

  • Budget: $2-3 (no source fetching)
  • Standard: $4-6 (with source fetching + verification)
  • Premium: $8-12 (deep research + review)

See: Page Creator Pipeline for experiment results.


Source Fetching

Purpose: Fetch and cache actual webpage content for citation verification.

Flow:

Citation URLs (from Perplexity)
→ Register in SQLite (sources table)
→ Fetch via Firecrawl API
→ Store in SQLite + .cache/sources/
→ Use in quote verification

Components:

| Component | Location | Purpose |
| --- | --- | --- |
| Knowledge DB | `scripts/lib/knowledge-db.mjs` | SQLite wrapper, source tracking |
| Fetch script | `scripts/utils/fetch-sources.mjs` | Standalone Firecrawl fetcher |
| Page creator | `scripts/content/page-creator.mjs` | Integrated fetch during page creation |

Rate limiting: 7 seconds between requests (Firecrawl free tier limit).
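The 7-second spacing can be sketched as a simple delay between sequential requests. A minimal illustration (`fetchSource` is a stand-in for the real Firecrawl fetcher, not an actual API):

```javascript
// Space out requests to respect the Firecrawl free-tier rate limit.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchAllSpaced(urls, fetchSource, spacingMs = 7000) {
  const results = [];
  for (const [i, url] of urls.entries()) {
    if (i > 0) await sleep(spacingMs); // wait between requests, not before the first
    results.push(await fetchSource(url));
  }
  return results;
}
```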

Design decision: Fetch sources on demand during page creation rather than eagerly in advance. This keeps costs predictable and ensures we only fetch sources we actually need.


Content Validation

Purpose: Ensure content quality and prevent build failures.

Architecture: Unified rules engine with 20+ validators.

```sh
npm run crux -- validate unified --rules=dollar-signs,entitylink-ids
```

Rule categories:

| Category | Examples | Blocking? |
| --- | --- | --- |
| Critical | dollar-signs, entitylink-ids, fake-urls | Yes - breaks build |
| Quality | tilde-dollar, markdown-lists, placeholders | No - warnings only |

Design decision: Two-tier validation allows fast feedback while still catching serious issues. Critical rules run in CI; quality rules are advisory.
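The two-tier split can be sketched as a rules engine that sorts failures into blocking errors and advisory warnings. The rule names below come from the table above; the engine shape and check logic are illustrative, not the real implementation:

```javascript
// Minimal sketch of a two-tier rules engine: critical failures block the
// build, quality failures are surfaced as warnings only.
const RULES = [
  { name: "dollar-signs", tier: "critical", check: (s) => !/(?<!\\)\$/.test(s) },
  { name: "placeholders", tier: "quality",  check: (s) => !s.includes("TODO") },
];

function validate(content) {
  const failures = RULES.filter((rule) => !rule.check(content));
  return {
    errors:   failures.filter((r) => r.tier === "critical").map((r) => r.name),
    warnings: failures.filter((r) => r.tier === "quality").map((r) => r.name),
  };
}
```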


Key Design Principles

1. Separation of Source and Generated Data


Principle: Human-editable files (YAML, MDX) are separate from generated artifacts (JSON, SQLite).

Why:

  • Generated files can be regenerated from source
  • No merge conflicts on generated files (gitignored where appropriate)
  • Clear ownership: humans edit YAML, scripts generate JSON

2. Local Caching of Expensive Work

Principle: Cache computationally expensive results locally (SQLite, .cache/).

Why:

  • AI summaries are expensive; don’t regenerate unnecessarily
  • Source fetching has API costs; cache results
  • Content hashing enables incremental updates

Trade-off: Cache must be rebuilt on new machines. This is acceptable because:

  • Build is deterministic from source files
  • Cache is optimization, not source of truth

3. Progressive Enhancement for AI Features


Principle: AI features (summaries, page creation) are optional enhancements.

Why:

  • Wiki works without API keys
  • Failures in AI pipeline don’t break the site
  • Costs are predictable and opt-in

4. Layered Error Detection

Principle: Catch errors early and at appropriate granularity.

| Level | Tool | When |
| --- | --- | --- |
| Syntax | MDX compiler | Build time |
| Schema | Zod validation | Build time |
| References | EntityLink validator | CI |
| Quality | Grading pipeline | Manual trigger |
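At the schema level, each entity is checked against a typed definition before the build proceeds. The project uses Zod for this (in `src/data/schema.ts`); the plain-JS stand-in below only illustrates the idea, and the field names and rules are assumptions:

```javascript
// Plain-JS stand-in for a Zod-style entity schema check: returns a list
// of validation errors, empty when the entity is well-formed.
function validateEntity(entity) {
  const errors = [];
  if (typeof entity.id !== "string" || !/^[a-z0-9-]+$/.test(entity.id)) {
    errors.push("id must be a kebab-case string");
  }
  if (typeof entity.name !== "string" || entity.name.length === 0) {
    errors.push("name is required");
  }
  return errors;
}
```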

Key Files

| File | Purpose | When to Edit |
| --- | --- | --- |
| `astro.config.mjs` | Sidebar structure, Starlight config | Adding new sections |
| `src/content.config.ts` | MDX frontmatter schema | Adding frontmatter fields |
| `src/data/schema.ts` | Entity type definitions (Zod) | Adding entity types |
| `scripts/lib/knowledge-db.mjs` | SQLite schema | Adding database tables |
| `scripts/content/page-creator.mjs` | Page creation pipeline | Modifying generation flow |

Environment Variables

| Variable | Purpose | Required For |
| --- | --- | --- |
| `ANTHROPIC_API_KEY` | Claude API access | Summaries, grading, page creation |
| `OPENROUTER_API_KEY` | Perplexity via OpenRouter | Page creation research |
| `FIRECRAWL_KEY` | Web page fetching | Source content fetching |

All are optional. Features gracefully degrade when keys are missing.
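Graceful degradation can be sketched as a feature gate derived from which keys are present. The variable names come from the table above; the function and the exact key-to-feature mapping are illustrative assumptions:

```javascript
// Sketch of feature gating on available API keys: missing keys disable
// features rather than breaking the build.
function availableFeatures(env = process.env) {
  return {
    summaries:    Boolean(env.ANTHROPIC_API_KEY),
    pageCreation: Boolean(env.ANTHROPIC_API_KEY && env.OPENROUTER_API_KEY),
    sourceFetch:  Boolean(env.FIRECRAWL_KEY),
  };
}
```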


This architecture documentation should be updated when:

  1. New pipeline phases added - Update the pipeline diagram and phase list
  2. Database schema changes - Update the ER diagram
  3. New environment variables - Add to the environment variables table
  4. New validation rules - Document in the validation section

Each internal doc should include:

  • lastEdited in frontmatter (updated when content changes)
  • Verification notes for time-sensitive information

Consider adding:

```sh
# Check if docs mention deprecated scripts
npm run crux -- validate docs-freshness
```

When code changes affect documentation:

  1. Update the relevant internal doc
  2. Add a comment in the code: // Docs: /internal/architecture/#section-name
  3. Run npm run build to verify links still work