# Content Pipeline Architecture: Faster Page Creation
## The Problem
Adding a single page to the wiki currently takes an AI agent 15-30 minutes end-to-end. Compare this to a POST request to a database, which takes about 20 seconds.
| Step | Time | Bottleneck |
|---|---|---|
| Agent writes MDX + YAML | 2-5 min | LLM generation |
| `assign-ids.mjs` (requires wiki-server) | 5-10 sec | Server dependency |
| `build-data.mjs` | 30-90 sec | Full rebuild of all 700+ pages |
| Local validations (gate) | 60-120 sec | Sequential checks |
| `git push` + CI | 5-10 min | Full install + build + validate |
| Merge + Vercel deploy | 3-5 min | Full static site rebuild |
| **Total** | **≈15-30 min** | |
The core tension: git gives versioning, review, and auditability, but forces a batch rebuild model. A database gives instant writes, but loses the review workflow.
## Current Architecture
### What Lives Where
**In Git (authoritative source of truth):**

- `packages/kb/data/things/*.yaml` — KB structured facts (authoritative for migrated entities)
- `data/entities/*.yaml` — entity definitions, relationships, sources
- `data/facts/*.yaml` — legacy numeric facts (deprecated for entities migrated to KB)
- `content/docs/**/*.mdx` — wiki page content
- `numericId` fields in both (assigned by server, written back to files)
**In Postgres (wiki-server — read mirrors + operational data):**

- `entity_ids` — single source of truth for ID allocation (PostgreSQL sequence)
- `entities`, `wiki_pages`, `facts`, `resources` — read mirrors of YAML/MDX data
- `edit_logs`, `sessions`, `hallucination_risk_snapshots` — operational data
- `citation_quotes`, `citation_accuracy_snapshots` — verification results
- `auto_update_runs`, `auto_update_news_items` — auto-update pipeline records
### The Build Pipeline
`build-data.mjs` (2,050 lines) is the heart of the data pipeline. It:

- Loads all YAML files from `data/` into memory
- Merges entity sources (YAML entities + auto-created frontmatter entities)
- Builds the ID registry from `numericId` fields
- Computes backlinks from YAML `relatedEntries`
- Builds tag index, path registry
- Loads canonical facts, normalizes values
- Builds git date maps (one pass of `git log --name-only` for every content file)
- Fetches enrichment data from wiki-server (edit logs, citation stats, session history)
- Scans MDX for EntityLinks, fact usage, block-level IR
- Computes hallucination risk scores per page
- Computes TF-IDF similarity between all page pairs
- Builds the related graph (5-signal weighted bidirectional)
- Computes derived scores (coverage, rankings, staleness)
- Writes `database.json` (~50-100MB) plus individual data files
This takes 30-90 seconds locally. The output `database.json` is the single artifact consumed by Next.js — all 700+ pages are pre-rendered from it at build time.
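The all-pairs TF-IDF step is a big part of why a one-page change cannot be cheap under this design: its cost grows quadratically with page count. A minimal sketch of the idea (function names and the token-based input are illustrative, not the actual `build-data.mjs` internals):

```typescript
// Sketch of the all-pairs TF-IDF similarity step (illustrative, not the
// real build-data.mjs code). Cost grows quadratically with page count.

type Tfidf = Map<string, number>;

// Term frequency for one document, weighted by smoothed inverse document frequency.
function tfidfVector(tokens: string[], df: Map<string, number>, nDocs: number): Tfidf {
  const tf = new Map<string, number>();
  for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
  const vec: Tfidf = new Map();
  for (const [term, count] of tf) {
    const idf = Math.log(1 + nDocs / (1 + (df.get(term) ?? 0)));
    vec.set(term, (count / tokens.length) * idf);
  }
  return vec;
}

function cosine(a: Tfidf, b: Tfidf): number {
  let dot = 0, na = 0, nb = 0;
  for (const [term, w] of a) {
    dot += w * (b.get(term) ?? 0);
    na += w * w;
  }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// All-pairs similarity: the O(n^2) loop a full rebuild cannot avoid.
function allPairs(docs: Record<string, string[]>): Array<[string, string, number]> {
  const ids = Object.keys(docs);
  const df = new Map<string, number>();
  for (const id of ids) {
    for (const term of new Set(docs[id])) df.set(term, (df.get(term) ?? 0) + 1);
  }
  const vecs = new Map(ids.map((id) => [id, tfidfVector(docs[id], df, ids.length)]));
  const out: Array<[string, string, number]> = [];
  for (let i = 0; i < ids.length; i++)
    for (let j = i + 1; j < ids.length; j++)
      out.push([ids[i], ids[j], cosine(vecs.get(ids[i])!, vecs.get(ids[j])!)]);
  return out;
}
```

With 700+ pages the inner loop runs roughly 250,000 pair comparisons, which is why Option 6 below keeps this step out of any incremental path.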
### Key Architectural Properties
- `database.json` is monolithic. The entire wiki's data is in one file. Changing one page triggers recomputation of cross-page data (related graph, similarity scores, backlinks).
- Static generation means full rebuilds. A content change requires a complete Vercel build (~3-5 min) to appear live. No incremental page rebuilding.
- The wiki-server runs continuously (Hono/Node.js, port 3100). It's accessed during local dev (for ID assignment and queries) and during CI builds (for enriching `database.json`).
- Two parallel content flows exist: manual agent edits (committed directly) and automated daily updates (`auto-update.yml` creates PRs). Both merge to main and trigger Vercel deploys.
## Options Evaluated
### Option 1: Move Content to Postgres (Full DB-First)
MDX content lives in Postgres as source of truth. Git becomes a downstream mirror or is dropped. Next.js reads from the DB at request time (SSR) or via ISR.
Strengths:
- Write a page = one API call. Instantly queryable.
- No build step for content changes. ISR revalidates individual pages in seconds.
- ID assignment is trivial (auto-increment column).
- Agent workflow becomes: call API, done. No git, no CI, no Vercel rebuild.
Weaknesses:
- Loses the git review workflow. PRs are the primary quality control mechanism. Without them, every agent write goes live immediately — or you need to build a review/staging system in the DB.
- Loses offline editing. Can't open an MDX file in VS Code.
- Migration is massive. 700 pages of MDX with complex frontmatter, EntityLinks, Squiggle components, custom MDX components — all need DB storage, versioning, and rendering.
- MDX compilation at request time is expensive (~100-500ms per page). Needs aggressive caching or pre-compilation.
- Loses the entire crux validation pipeline. All validators assume files on disk.
- Single point of failure. DB down = wiki down. Currently the static site serves from Vercel CDN even when wiki-server is offline.
- Version history is harder. Git gives full history for free. In a DB you need shadow tables or event sourcing.
Red team: This is the "rewrite the whole system" option. The migration is 2-4 weeks of work, and you lose the one thing git is genuinely good at: review before publish. The wiki's quality depends heavily on validation gates and PR review.
Verdict: Too much disruption for the payoff. The problem isn't "git is the wrong model for content" — it's "the build pipeline is too heavy for small changes."
### Option 2: Separate Content Repo (Git Split)
Split `content/docs/` and `data/` into a dedicated `longterm-wiki-content` repo. The main repo has only the Next.js app and crux tooling.
Strengths:
- Content PRs don't trigger app builds — lighter CI.
- Content repo can have simplified CI (just validations, no Next.js build).
- Agents working on content don't need the full app checkout.
Weaknesses:
- Doesn't solve the speed problem. You still need `build-data.mjs` + Vercel deploy for changes to go live. You've just moved where files live.
- Submodule complexity. Git submodules are painful. Every developer, CI job, and agent needs to manage two repos.
- Crux tooling depends on both. Validators read MDX files AND import TypeScript from `apps/web/`. Splitting means duplicating code or cross-repo dependencies.
- Cross-repo PRs. Content changes that require component changes (common with new Squiggle models, new EntityLink types) need coordinated PRs.
- `build-data.mjs` needs everything. It reads YAML, MDX, and app TypeScript. It can't run in the content repo alone.
Red team: Adds complexity (submodules, cross-repo coordination) without fixing the latency. The deploy pipeline is still: merge content PR, trigger app rebuild, Vercel build, live. You save maybe 2-3 minutes on CI by skipping app tests, but the Vercel build is unchanged.
Verdict: Pain without sufficient gain. The build pipeline, not git structure, is the bottleneck.
### Option 3: Hybrid — Draft in DB, Publish via Git
Add a "draft" mode to the wiki-server. Agents POST content to the DB. Drafts are instantly visible at a preview URL. When ready, a "publish" action creates a git commit + PR automatically.
Architecture:
```
Agent writes    → POST /api/pages/draft   → Postgres (instant, ~1 sec)
                                          → Preview at /preview/E894 (SSR from DB)
Agent publishes → POST /api/pages/publish → Creates git branch
                                          → Writes MDX file
                                          → Runs validations
                                          → Opens PR
                                          → (merge → Vercel deploy → live)
```
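The draft endpoint's value comes from running cheap validation before storing anything. A sketch of what that check might look like (the payload fields and rules are illustrative assumptions, not the real wiki-server schema):

```typescript
// Sketch of the instant validation a hypothetical POST /api/pages/draft
// handler could run before storing a draft. Field names and rules are
// illustrative, not the actual wiki-server schema.

interface DraftPayload {
  title: string;
  slug: string;
  mdx: string;
}

function validateDraft(d: DraftPayload): string[] {
  const errors: string[] = [];
  if (!d.title.trim()) errors.push("title: must not be empty");
  // Slugs become URL segments, so keep them strictly lowercase-kebab.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(d.slug)) errors.push("slug: must be kebab-case");
  // Cheap escaping check: an unescaped '<' in prose breaks MDX compilation
  // unless it opens a JSX/HTML tag, closing tag, or comment.
  if (/<(?![A-Za-z/!])/.test(d.mdx)) errors.push("mdx: unescaped '<' found");
  return errors;
}
```

Returning the error list in the API response gives the agent feedback in about a second, instead of after a full gate run.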
Strengths:
- Instant feedback loop. Agent sees its page immediately. Iterates fast.
- Git review preserved. Publishing still goes through PR review, CI, validations.
- Incremental adoption. Can add alongside the existing system. Old workflow still works.
- Preview URLs for review. Humans review draft pages before they enter the git pipeline.
- Batch publishing. An agent could draft 10 pages, then publish all at once in one PR.
- Simpler agent workflow. No git operations needed during drafting. Just API calls.
- Validation on draft. The API runs schema validation, escaping checks, etc. on draft content — instant feedback without the full gate.
Weaknesses:
- Two sources of truth during draft phase. A page exists in the DB but not in git. Needs clear status tracking (draft/published/outdated).
- MDX rendering from DB differs from file-based. Must ensure preview renderer matches production renderer exactly.
- Still need the full pipeline to publish. 15-30 minute publish latency doesn't go away — but feedback latency drops from 15 min to 1 sec.
- Drift risk. If someone edits the MDX file directly while a draft exists in the DB, you have a conflict.
- Need to build preview infrastructure. SSR pages, draft management UI, publish automation.
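One way to contain the drift risk: record a hash of the on-disk file when the draft is created, and refuse to publish if the file has changed since. A sketch under that assumption (the `Draft` shape and `baseHash` field are hypothetical):

```typescript
import { createHash } from "node:crypto";

// Sketch of drift detection for the draft workflow: hash the on-disk MDX
// at draft creation, compare before publishing. The Draft shape and
// baseHash field are illustrative, not an existing schema.

interface Draft {
  id: string;
  mdx: string;
  baseHash: string; // hash of the file as it existed when drafting began
}

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

function createDraft(id: string, mdx: string, currentFileContent: string): Draft {
  return { id, mdx, baseHash: sha256(currentFileContent) };
}

// Publishing is safe only if nobody touched the file since the draft began;
// otherwise the publish action should surface a conflict instead of clobbering.
function canPublish(draft: Draft, currentFileContent: string): boolean {
  return sha256(currentFileContent) === draft.baseHash;
}
```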
Red team: The main risk is complexity — maintaining two rendering paths (file-based SSG and DB-based SSR). Mitigated by using the same MDX compilation pipeline for both, just with different content sources. Drift risk is manageable with status flags and last-modified timestamps.
Verdict: Strong option. Solves the feedback latency problem while preserving the review workflow.
### Option 4: Incremental Static Regeneration (ISR) + Webhook Deploy
Switch from full static builds to Next.js ISR. When content changes, a webhook triggers revalidation of only the affected pages.
Architecture:
```
Merge to main → GitHub webhook → POST /api/revalidate?pages=E894,E22
              → Next.js ISR revalidates only those pages
              → Live in ~10-30 seconds
```
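The webhook handler mostly reduces to parsing the `pages` parameter into a safe, deduplicated set of paths. A sketch (the `/wiki/` prefix and `E`-prefixed ID format are assumptions; the real App Router handler would call Next.js `revalidatePath` for each result):

```typescript
// Sketch of turning the webhook's `pages` query parameter into the set of
// paths to revalidate. The /wiki/ prefix and E-prefixed entity-ID format
// are assumptions about this codebase, not confirmed conventions.

function pagesToRevalidate(pagesParam: string | null): string[] {
  if (!pagesParam) return [];
  const ids = pagesParam
    .split(",")
    .map((p) => p.trim())
    .filter((p) => /^E\d+$/.test(p)); // drop anything that isn't an entity ID
  return [...new Set(ids)].map((id) => `/wiki/${id}`);
}
```

Validating the ID format matters here: the endpoint is reachable from a webhook, so arbitrary strings must never flow into revalidation paths.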
Strengths:
- Dramatically faster deploy. Individual page revalidation is 1-5 seconds, not 3-5 minutes.
- Minimal architectural change. Next.js ISR is built in.
- Still git-based. No change to source of truth or review workflow.
Weaknesses:
- `build-data.mjs` is the real bottleneck, not Next.js. Even with ISR, you need to rebuild `database.json` when content changes, because it contains cross-page data. Changing one page affects scores on dozens of others.
- `database.json` is monolithic. ISR works for page-level revalidation, but the data layer is wiki-level, not page-level.
- Cold start penalty. First request after revalidation compiles MDX + fetches data.
- Vercel ISR quirks. Revalidation is best-effort, not guaranteed. Stale-while-revalidate semantics can be confusing.
- Doesn't help the local dev/CI pipeline. The agent still runs build-data + validations locally.
Red team: This treats the symptom (slow Vercel builds) rather than the disease (the monolithic build-data pipeline). The pipeline that takes 30-90 seconds locally and forces sequential validation is the actual agent bottleneck. ISR helps deployment but not authoring.
Verdict: Worth doing eventually as a deployment optimization, but doesn't solve the core agent workflow problem.
### Option 5: Lightweight Write Path + Async Validation
Decouple the "write" path from the "validate + build" path. Agent commits the MDX file and pushes immediately with only minimal validation. Full validation happens asynchronously in CI.
What to cut from the local gate:
- Drop `build-data.mjs` from pre-push (runs in CI anyway)
- Drop `tsc --noEmit` from pre-push
- Keep only: escaping check, frontmatter schema, YAML schema (<5 sec)
Strengths:
- Near-zero local overhead. Push in <15 seconds.
- No architectural changes. Same git workflow, same CI, same Vercel deploys.
- Easy to implement. Just modify `.githooks/pre-push` and the `validate gate`.
- CI still catches everything. No validation actually skipped — runs in a different place.
Weaknesses:
- Feedback loop is slower. Agent doesn't know if the page is valid until CI finishes (5-10 min).
- More failed CI runs. Moving validation to CI means more red builds, more fix-and-push cycles.
- Doesn't help with deployment latency.
- Risk of "push and forget." Agent pushes, assumes it's fine, session ends. CI fails, nobody fixes it.
Red team: This is the "just make the local step faster" option. It helps agent throughput at the cost of reliability. The pre-push gate exists precisely because agents were pushing broken content. The fix-in-CI loop (push, wait for CI, fix, push again) could be slower than fixing locally.
Verdict: Quick win for reducing local friction, but risky without good CI-failure recovery automation.
### Option 6: Split `build-data.mjs` Into Incremental + Full Modes
Refactor the monolithic build script to support incremental builds. When adding one page, only compute data for that page + its direct neighbors.
Architecture:
```
Full build (CI/deploy):  build-data.mjs --full        → 30-90 sec (unchanged)
Incremental (local):     build-data.mjs --incremental → 3-5 sec (new)
```
Incremental mode would:

- Read the existing `database.json`
- Parse only the changed MDX file(s) (from git diff)
- Update: that page's entry, its direct backlinks, its entity data
- Skip: similarity scores, full related graph, git date maps, wiki-server enrichment
- Write a patched `database.json`
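The core of the patch step is page-local: replace the page's entry and fix up direct backlinks, leaving wiki-global data stale until CI. A sketch on a heavily simplified `database.json` shape (the real file tracks far more per page):

```typescript
// Sketch of the incremental patch step on a heavily simplified
// database.json shape. Only shows the page-local update plus
// direct-backlink maintenance; everything wiki-global stays stale.

interface Page {
  title: string;
  relatedEntries: string[];
}

interface Db {
  pages: Record<string, Page>;
  backlinks: Record<string, string[]>; // target page -> pages linking to it
}

function patchPage(db: Db, id: string, page: Page): Db {
  // Remove stale backlinks from entries the page no longer references.
  for (const target of db.pages[id]?.relatedEntries ?? []) {
    db.backlinks[target] = (db.backlinks[target] ?? []).filter((p) => p !== id);
  }
  // Add backlinks for the page's current references.
  for (const target of page.relatedEntries) {
    const links = db.backlinks[target] ?? [];
    if (!links.includes(id)) links.push(id);
    db.backlinks[target] = links;
  }
  db.pages[id] = page; // page-local data is simply replaced
  return db; // similarity/risk scores left stale until the full CI build
}
```

The hard part the sketch elides is exactly the cache-invalidation question below: knowing which neighbor computations this replacement silently invalidated.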
Strengths:
- Dramatically faster local builds. 3-5 sec instead of 30-90 sec.
- Preserves accuracy. Full build still runs in CI and on deploy.
- No architectural change. Same files, same format, same consumers.
Weaknesses:
- Incremental data is slightly stale. Related graph, similarity scores, and risk scores for neighboring pages won't update until the full CI build.
- Complex to implement correctly. `build-data.mjs` is 2,050 lines with deep interdependencies. Making it incremental requires understanding which computations are page-local vs. wiki-global.
- Cache invalidation. When do a page's neighbors need recomputation? Entity renames, relationship changes, and tag changes all have ripple effects.
- Testing burden. Need to verify incremental mode produces correct-enough results and that full mode catches drift.
Red team: Engineering-intensive but solves the right problem. The risk is "incremental" and "full" modes slowly diverge. Needs good tests comparing outputs. But the payoff is significant: the local dev loop drops from 2-3 minutes to ~10 seconds.
Verdict: High effort, high reward. Best combined with Option 5.
## Recommendation: Phased Approach
### Phase 1: Quick Wins (1-2 days)
Combine Option 5 (lighter gate) + minimal Option 6 (skip expensive build-data steps locally).
- Add a `--quick` mode to `build-data.mjs` that skips: similarity computation, git date maps, wiki-server enrichment, block-level IR extraction. This alone probably cuts local build time from 60-90s to 10-20s.
- Add a `--quick` mode to `validate gate` that runs only the 3 fastest blocking checks (escaping, frontmatter schema, YAML schema) and skips build-data, tsc, and tests. For use during rapid iteration; the full gate still runs on push.
- Move `assign-ids.mjs` to be a CI-only step (or make it gracefully no-op when the server is unavailable, using a local counter for provisional IDs that get replaced in CI).
Expected result: Local page creation drops from ~5 min to ~30 sec.
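The provisional-ID idea can be kept trivially safe by drawing local IDs from a range the PostgreSQL sequence can never produce, e.g. negative integers. A sketch (the allocator and predicate are hypothetical helpers, not existing code):

```typescript
// Sketch of provisional local ID allocation for when wiki-server is
// unreachable. Negative IDs can never collide with the Postgres
// sequence; CI swaps them for real numericIds. Hypothetical helpers.

function makeProvisionalAllocator(start = -1): () => number {
  let next = start;
  return () => next--; // yields -1, -2, -3, ...
}

// CI (and validators) can cheaply detect IDs that still need replacement.
function isProvisional(numericId: number): boolean {
  return numericId < 0;
}
```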
### Phase 2: Draft/Preview System (1-2 weeks)
Implement Option 3 (hybrid draft-in-DB).
- Add `POST /api/pages/draft` to wiki-server. Store MDX content + frontmatter in a `page_drafts` table.
- Add a `/preview/[id]` route that renders from DB (SSR, not SSG).
- Add `pnpm crux content draft "Title"` that creates a page via API instead of a file.
- Add `pnpm crux content publish <draft-id>` that writes the MDX file to git, commits, and opens a PR.
- Run lightweight validation (escaping, schema) on draft creation for instant feedback.
Expected result: Agent creates a page in ~20 seconds. Preview available immediately. Publishing still goes through git review.
### Phase 3: ISR + Incremental Builds (longer term)
Implement Options 4 + 6.
- Refactor `build-data.mjs` into page-local and wiki-global computations.
- Switch Vercel to ISR with on-demand revalidation webhooks.
- On merge, revalidate only affected pages instead of a full rebuild.
Expected result: Time from merge to live drops from 3-5 min to 10-30 seconds.
## What to Avoid
- Full DB migration (Option 1): Too much disruption, loses git's strengths for review and auditability.
- Repo split (Option 2): Adds complexity without solving the actual bottleneck.
- Doing nothing: The current 15-30 minute cycle is a genuine productivity drain, especially as the wiki grows past 700 pages.
## Key Insight
The feedback latency, not the publish latency, is what hurts agents most. An agent doesn't mind if deployment takes 5 minutes after merge. It minds that it takes 5 minutes to find out if its content is even valid. Phase 1 attacks that directly with minimal effort.