System Architecture
This document provides a technical overview of how the Longterm Wiki is built, how data flows through the system, and the design rationale behind key architectural decisions.
High-Level Architecture
Core Systems
1. Entity Data System
Purpose: Maintain structured data about people, organizations, and concepts that can be referenced across pages.
Flow: YAML sources → build-data.mjs → JSON artifacts → React components
| Component | Location | Purpose |
|---|---|---|
| Source YAML | src/data/*.yaml | Human-editable entity definitions |
| Build script | scripts/build-data.mjs | Compiles YAML to JSON |
| Generated JSON | src/data/*.json | Browser-ready data |
| Components | src/components/wiki/ | Display entity data |
Key files generated:
- database.json - All entities merged
- pathRegistry.json - Entity ID → URL path mapping
- backlinks.json - Reverse reference indices
Design decision: YAML for human editing, JSON for runtime. This separation allows manual curation while keeping the site fast.
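For illustration, here is a minimal sketch of what the compile step could look like, assuming the js-yaml package and YAML files keyed by entity ID; the real logic lives in scripts/build-data.mjs and may differ:

```js
// Hypothetical sketch of the YAML → JSON compile step; not the actual build-data.mjs.
import { readFile, writeFile, readdir } from 'node:fs/promises';
import path from 'node:path';
import yaml from 'js-yaml';

const srcDir = 'src/data';
const entities = {};

// Merge every YAML source file into a single entity map keyed by entity ID (assumed shape).
for (const file of await readdir(srcDir)) {
  if (!file.endsWith('.yaml')) continue;
  const parsed = yaml.load(await readFile(path.join(srcDir, file), 'utf8'));
  Object.assign(entities, parsed);
}

// Emit browser-ready artifacts for the components to consume.
await writeFile(path.join(srcDir, 'database.json'), JSON.stringify(entities, null, 2));
await writeFile(
  path.join(srcDir, 'pathRegistry.json'),
  JSON.stringify(
    Object.fromEntries(Object.entries(entities).map(([id, e]) => [id, e.path])),
    null,
    2
  )
);
```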
2. Knowledge Database (SQLite)
Purpose: Index content for analysis, cache external sources, and support AI-assisted workflows.
Location: .cache/knowledge.db (gitignored, regenerated per machine)
Key capabilities:
- Content indexing and search
- Source fetching via Firecrawl API
- AI summary generation
- Change detection via content hashing
See: Content Database for full schema and API reference.
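To illustrate the source-tracking capability above, a hedged sketch using better-sqlite3; the table and column names are assumptions, and the real schema and wrapper API are implemented in scripts/lib/knowledge-db.mjs:

```js
// Hypothetical sketch of source tracking in the knowledge DB; schema is assumed, not actual.
import Database from 'better-sqlite3';

const db = new Database('.cache/knowledge.db');

db.prepare(`
  CREATE TABLE IF NOT EXISTS sources (
    url TEXT PRIMARY KEY,
    fetched_at TEXT,
    content_hash TEXT
  )
`).run();

// Register a citation URL so it can be fetched and verified later.
export function registerSource(url) {
  db.prepare('INSERT OR IGNORE INTO sources (url) VALUES (?)').run(url);
}

// Record a successful fetch along with a hash of the fetched content.
export function markFetched(url, contentHash) {
  db.prepare('UPDATE sources SET fetched_at = ?, content_hash = ? WHERE url = ?')
    .run(new Date().toISOString(), contentHash, url);
}
```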
3. Page Creation Pipeline
Purpose: Generate new wiki pages with proper citations using AI research and synthesis.
Pipeline phases:
canonical-links → research-perplexity → register-sources → fetch-sources → research-scry → synthesize → verify-sources → validate-loop → grade
Design decisions:
| Decision | Rationale |
|---|---|
| Perplexity for research | Cheap ($0.10), good at web search, provides citation URLs |
| Register + fetch sources | Enables quote verification against actual content |
| Verify-sources phase | Catches hallucinated quotes before publication |
| Validation loop | Iterative fixing ensures build-passing output |
Cost tiers:
- Budget: $2-3 (no source fetching)
- Standard: $4-6 (with source fetching + verification)
- Premium: $8-12 (deep research + review)
See: Page Creator Pipeline for experiment results.
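Conceptually, the pipeline is a sequence of phases that each transform shared state. A minimal runner sketch follows; the phase names come from the list above, but the handler signature is an assumption, not the page-creator.mjs API:

```js
// Hypothetical phase runner; the real orchestration lives in scripts/content/page-creator.mjs.
const phases = [
  'canonical-links', 'research-perplexity', 'register-sources', 'fetch-sources',
  'research-scry', 'synthesize', 'verify-sources', 'validate-loop', 'grade',
];

export async function runPipeline(topic, handlers) {
  let state = { topic, sources: [], draft: null };
  for (const phase of phases) {
    // Each handler receives the accumulated state and returns an updated copy.
    state = await handlers[phase](state);
    console.log(`completed phase: ${phase}`);
  }
  return state;
}
```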
4. Source Fetching System
Purpose: Fetch and cache actual webpage content for citation verification.
Flow:
Citation URLs (from Perplexity) → Register in SQLite (sources table) → Fetch via Firecrawl API → Store in SQLite + .cache/sources/ → Use in quote verification
Components:
| Component | Location | Purpose |
|---|---|---|
| Knowledge DB | scripts/lib/knowledge-db.mjs | SQLite wrapper, source tracking |
| Fetch script | scripts/utils/fetch-sources.mjs | Standalone Firecrawl fetcher |
| Page creator | scripts/content/page-creator.mjs | Integrated fetch during page creation |
Rate limiting: 7 seconds between requests (Firecrawl free tier limit).
Design decision: Fetch sources on demand during page creation rather than eagerly in advance. This keeps costs predictable and ensures we only fetch sources we actually need.
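The throttle itself is easy to sketch; fetchSource below is a hypothetical stand-in for the actual Firecrawl call in scripts/utils/fetch-sources.mjs:

```js
// Sketch of the 7-second throttle between requests (Firecrawl free-tier limit).
const RATE_LIMIT_MS = 7_000;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

export async function fetchAll(urls, fetchSource) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchSource(url)); // hypothetical Firecrawl wrapper
    await sleep(RATE_LIMIT_MS);           // respect the free-tier rate limit
  }
  return results;
}
```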
5. Validation System
Purpose: Ensure content quality and prevent build failures.
Architecture: Unified rules engine with 20+ validators.
```bash
npm run crux -- validate unified --rules=dollar-signs,entitylink-ids
```
Rule categories:
| Category | Examples | Blocking? |
|---|---|---|
| Critical | dollar-signs, entitylink-ids, fake-urls | Yes - breaks build |
| Quality | tilde-dollar, markdown-lists, placeholders | No - warnings only |
Design decision: Two-tier validation allows fast feedback while still catching serious issues. Critical rules run in CI; quality rules are advisory.
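For illustration, a single rule in such an engine could look roughly like this; this is a sketch only, and the actual rule interface in the crux tooling may differ:

```js
// Hypothetical shape of a unified validation rule; the real interface may differ.
export const dollarSigns = {
  id: 'dollar-signs',
  blocking: true, // critical rules fail the build
  check(file) {
    const issues = [];
    file.content.split('\n').forEach((line, i) => {
      // Illustrative check: flag a bare "$" followed by a digit unless it is escaped.
      if (/(?<!\\)\$\d/.test(line)) {
        issues.push({ line: i + 1, message: 'Unescaped dollar sign; escape as \\$' });
      }
    });
    return issues;
  },
};
```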
Data Flow Diagrams
Page Creation Data Flow
Entity Resolution Flow
Design Principles
1. Separation of Source and Generated Data
Principle: Human-editable files (YAML, MDX) are separate from generated artifacts (JSON, SQLite).
Why:
- Generated files can be regenerated from source
- No merge conflicts on generated files (gitignored where appropriate)
- Clear ownership: humans edit YAML, scripts generate JSON
2. Local-First Caching
Principle: Cache computationally expensive results locally (SQLite, .cache/).
Why:
- AI summaries are expensive; don’t regenerate unnecessarily
- Source fetching has API costs; cache results
- Content hashing enables incremental updates
Trade-off: Cache must be rebuilt on new machines. This is acceptable because:
- Build is deterministic from source files
- Cache is optimization, not source of truth
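A sketch of the hash-gated regeneration idea (illustrative only; the real logic lives in the knowledge DB layer):

```js
// Sketch: only regenerate an expensive artifact (e.g. an AI summary) when the content hash changes.
import { createHash } from 'node:crypto';

const hashOf = (text) => createHash('sha256').update(text).digest('hex');

export async function maybeRegenerate(content, cachedHash, regenerate) {
  const hash = hashOf(content);
  if (hash === cachedHash) return { hash, changed: false }; // cache hit: skip the expensive call
  const result = await regenerate(content);                 // cache miss: recompute and cache
  return { hash, changed: true, result };
}
```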
3. Progressive Enhancement for AI Features
Principle: AI features (summaries, page creation) are optional enhancements.
Why:
- Wiki works without API keys
- Failures in AI pipeline don’t break the site
- Costs are predictable and opt-in
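In practice this can be as simple as wrapping AI-dependent steps so that a missing key or a failed call degrades to "no summary" rather than a broken build. A sketch, where generateSummary is hypothetical:

```js
// Sketch of progressive enhancement: the AI summary is optional, never load-bearing.
export async function withOptionalSummary(page, generateSummary) {
  if (!process.env.ANTHROPIC_API_KEY) return page; // no key: ship the page without a summary
  try {
    return { ...page, summary: await generateSummary(page.content) };
  } catch (err) {
    console.warn(`summary generation failed, continuing without it: ${err.message}`);
    return page; // AI failure must not break the site
  }
}
```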
4. Validation at Multiple Levels
Principle: Catch errors early and at appropriate granularity.
| Level | Tool | When |
|---|---|---|
| Syntax | MDX compiler | Build time |
| Schema | Zod validation | Build time |
| References | EntityLink validator | CI |
| Quality | Grading pipeline | Manual trigger |
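The schema level uses Zod (src/data/schema.ts, listed in the next section). A minimal sketch of the idea, with made-up fields:

```js
// Hypothetical Zod entity schema; the real definitions live in src/data/schema.ts.
import { z } from 'zod';

const entitySchema = z.object({
  id: z.string(),
  type: z.enum(['person', 'organization', 'concept']),
  name: z.string(),
  path: z.string().optional(),
});

// safeParse reports schema violations at build time instead of throwing at runtime.
const result = entitySchema.safeParse({ id: 'example', type: 'person', name: 'Example Person' });
if (!result.success) console.error(result.error.issues);
```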
Key Configuration Files
| File | Purpose | When to Edit |
|---|---|---|
| astro.config.mjs | Sidebar structure, Starlight config | Adding new sections |
| src/content.config.ts | MDX frontmatter schema | Adding frontmatter fields |
| src/data/schema.ts | Entity type definitions (Zod) | Adding entity types |
| scripts/lib/knowledge-db.mjs | SQLite schema | Adding database tables |
| scripts/content/page-creator.mjs | Page creation pipeline | Modifying generation flow |
Environment Variables
| Variable | Purpose | Required For |
|---|---|---|
| ANTHROPIC_API_KEY | Claude API access | Summaries, grading, page creation |
| OPENROUTER_API_KEY | Perplexity via OpenRouter | Page creation research |
| FIRECRAWL_KEY | Web page fetching | Source content fetching |
All are optional. Features gracefully degrade when keys are missing.
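As a sketch, the key checks can be thought of as a feature-availability map (illustrative; the actual checks live in the individual scripts):

```js
// Sketch: derive which optional features are available from the environment.
export function availableFeatures(env = process.env) {
  return {
    summaries: Boolean(env.ANTHROPIC_API_KEY),   // Claude-based summaries, grading, page creation
    research: Boolean(env.OPENROUTER_API_KEY),   // Perplexity research via OpenRouter
    sourceFetching: Boolean(env.FIRECRAWL_KEY),  // Firecrawl source content fetching
  };
}
```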
Documentation Maintenance
Keeping Docs Updated
This architecture documentation should be updated when:
- New pipeline phases added - Update the pipeline diagram and phase list
- Database schema changes - Update the ER diagram
- New environment variables - Add to the environment variables table
- New validation rules - Document in the validation section
Freshness Indicators
Each internal doc should include:
- lastEdited in frontmatter (updated when content changes)
- Verification notes for time-sensitive information
Automated Checks
Consider adding:
```bash
# Check if docs mention deprecated scripts
npm run crux -- validate docs-freshness
```
Cross-References
When code changes affect documentation:
- Update the relevant internal doc
- Add a comment in the code: // Docs: /internal/architecture/#section-name
- Run npm run build to verify links still work
Related Documentation
- Content Database - SQLite schema and API
- Automation Tools - CLI reference
- Page Creator Pipeline - Generation experiments
- About This Wiki - Contributor overview