Wiki-Server Environment Architecture
This document analyzes how the wiki-server handles (or fails to handle) environment isolation between development, preview, and production. It identifies the core architectural tension — entity ID allocation must be globally shared, but content data should not — and proposes a tiered remediation plan.
Status: Analysis complete (Feb 2026). Scoped API keys were implemented then removed (Mar 2026) — the added complexity wasn't justified. The system uses a single LONGTERMWIKI_SERVER_API_KEY for all API access, with timing-safe comparison.
Background: What the Wiki-Server Does
The wiki-server (apps/wiki-server/) is a Hono + PostgreSQL service deployed to Kubernetes at wiki-server.k8s.quantifieduncertainty.org. It provides:
| Capability | Endpoints | Used by |
|---|---|---|
| Entity ID allocation | /api/ids/* | assign-ids.mjs (build-time), crux CLI |
| Content sync | /api/pages/sync, /api/entities/sync, /api/facts/sync | CI (main-branch only), crux CLI |
| Query/read | /api/search/*, /api/explore/*, /api/citations/*, /api/facts/* | Next.js runtime (ISR), crux CLI |
| Operational data | /api/sessions/*, /api/edit-logs/*, /api/jobs/* | Crux CLI, internal dashboards |
All of these share one PostgreSQL database with zero environment isolation — no environment column, no tenant separation, no namespace filtering.
The Core Tension: What Must Be Shared vs What Should Be Isolated
The server's responsibilities fall into three categories with different sharing requirements:
Project-wide coordination (must be shared across all environments)
These are append-only or idempotent writes that track the state of the wiki project as a whole, regardless of which environment or branch the caller is on:
- Entity ID allocation — PostgreSQL sequence (entity_id_seq): E1, E2, E3... Monotonic, globally unique, write-once. Dev and prod must share the same sequence to avoid ID conflicts on merge.
- Session logs — "Claude session on branch feature/foo improved page X." This is project-level operational history. A dev session's log belongs in the same history as a production session's.
- Edit logs — Per-page edit history. If you improve a page on a dev branch, the edit should be recorded so it's visible when the branch merges.
- Agent coordination — Which agents are working on which issues, to prevent duplicate work across sessions and environments.
- Research artifacts — Citation verification results, quote extractions, research context bundles.
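To make the write-once property concrete: allocation reduces to one atomic counter fetch. A minimal sketch with an in-memory stand-in for the sequence (allocateEntityId is a hypothetical name; the real allocation goes through the /api/ids/* endpoints backed by entity_id_seq):

```typescript
// Minimal sketch of shared ID allocation (hypothetical names).
// The only state is a monotonic counter -- in production this is the
// PostgreSQL sequence entity_id_seq; an in-memory stand-in is used here.
let nextId = 1; // stands in for SELECT nextval('entity_id_seq')

function allocateEntityId(): string {
  // Every caller, dev or prod, draws from the same pool, so IDs never
  // collide when branches merge.
  return `E${nextId++}`;
}

const first = allocateEntityId();  // "E1"
const second = allocateEntityId(); // "E2"
```

Because the counter only moves forward, the worst a dev branch can do is consume IDs (the "sequence burn" scenario below); it can never corrupt another environment's IDs.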
Content state (should be per-environment)
These are destructive upserts that represent the canonical rendered state of wiki content. A dev sync here overwrites what production sees:
- Page content sync (/api/pages/sync)
- Entity/fact/citation sync
- Explore index, risk scores, staleness data, backlink counts
Derived query data (read-only, derived from content state)
Search, explore pagination, related pages, backlinks — these are computed from content state and served as read-only endpoints. They inherit whatever isolation the content layer has.
Why the current architecture is problematic
All three categories share one database with no partitioning. The project-wide coordination concern (IDs, sessions, edit logs) drags content into shared space. Because there is only one server, any authenticated client can overwrite production page content alongside legitimately writing session logs.
Why This Is Getting Worse: The Runtime Read Migration
Before February 2026, the wiki-server was primarily a build-time dependency. The Next.js app read from a static database.json generated at build time. The server was used for ID allocation, content sync, and internal dashboards.
Recent PRs have been migrating data fetching from static build-time to live runtime reads via the withApiFallback pattern:
| PR | What moved to runtime server reads |
|---|---|
| #947 | Related pages and backlinks |
| #945 | Claims data |
| #952 | Citation health and hallucination risk |
| #954 | Update schedule (staleness) |
| #955 | Facts and timeseries |
| #951 (open) | Explore page with pagination, filtering, sorting |
All of these use ISR with revalidate: 300 (5-minute cache), meaning the production site constantly re-fetches from the server. This changes the server from "build convenience" to "runtime dependency." If server data is corrupted, the live production site is affected.
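Concretely, each of these routes opts into ISR via the standard Next.js App Router segment config (the file path here is illustrative):

```typescript
// apps/web/src/app/<some-route>/page.tsx (illustrative path)
// ISR window: Next.js serves the cached page and re-fetches from the
// wiki-server at most once every 300 seconds.
export const revalidate = 300;
```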
The withApiFallback Pattern
// apps/web/src/lib/wiki-server.ts
async function withApiFallback<T>(
apiLoader: () => Promise<FetchResult<T> | T | null>,
localLoader: () => T | null
): Promise<WithSource<T | null>>
This tries the server first, falls back to local database.json, and tracks which source was used. It handles availability well (site works without the server), but masks the isolation problem — production reads might return data corrupted by a dev sync, with no way to detect this.
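A minimal sketch of how such a helper can behave (assumed shapes: the FetchResult wrapper from the real signature is dropped for brevity, and the actual implementation in apps/web/src/lib/wiki-server.ts may differ):

```typescript
// Sketch of the fallback pattern with an explicit source tag (assumed types).
type WithSource<T> = { data: T; source: "api" | "local" | "none" };

async function withApiFallback<T>(
  apiLoader: () => Promise<T | null>,
  localLoader: () => T | null
): Promise<WithSource<T | null>> {
  try {
    const fromApi = await apiLoader();
    if (fromApi !== null) return { data: fromApi, source: "api" };
  } catch {
    // Server down or unreachable: fall through to the local snapshot.
  }
  const fromLocal = localLoader();
  return fromLocal !== null
    ? { data: fromLocal, source: "local" }
    : { data: null, source: "none" };
}
```

Note that the catch only covers availability: a successful response containing corrupted rows looks identical to good data, which is exactly the masking problem described above.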
Current Safeguards
| Safeguard | What it protects against | Weakness |
|---|---|---|
| CI sync is main-only | Feature branch CI writing to prod DB | Local pnpm crux wiki-server sync bypasses this |
| API key required | Unauthorized writes | Anyone with the key can write from any branch |
| withApiFallback | Server downtime | Doesn't detect corrupted data |
| Vercel skips claude/* branches | Unnecessary preview builds | Doesn't address server isolation |
Missing safeguards:
- No write guard that checks the source branch/environment
- No audit trail (who synced what, from which branch)
- No versioning or rollback capability on synced data
- No way to distinguish "prod data" from "dev data" in the database
Risk Scenarios
- Accidental dev sync: A developer runs pnpm crux wiki-server sync from a feature branch with .env configured. Production database now has half-written page content.
- ID sequence burn: A feature branch runs assign-ids.mjs for entities that are later deleted before merge. Those IDs are permanently consumed from the global sequence.
- Race condition: Two CI runs (main merge + scheduled maintenance) sync simultaneously. Last-write-wins produces undefined row state.
- Preview pollution: Vercel preview deployments inherit the same LONGTERMWIKI_SERVER_URL. They read production data, potentially mixing it with local branch content in confusing ways.
- Operational noise: Dev session logs, test edit-log entries, and experiment jobs are mixed into production operational data, cluttering internal dashboards.
Proposed Solution: Tiered Approach
Tier 1 — Immediate (single PR, minimal risk)
A. Write guards on sync endpoints. Add a required X-Wiki-Environment header. Only production is allowed to write content:
// Middleware for content write routes
app.use("/api/pages/sync", async (c, next) => {
  const env = c.req.header("X-Wiki-Environment");
  if (env !== "production") {
    return c.json({ error: "Content writes restricted to production" }, 403);
  }
  await next();
});
CI passes X-Wiki-Environment: production. Local dev and preview are blocked from writing content. ID allocation remains unguarded (shared by design).
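On the caller side, CI is the only client that sets the header to production. A sketch of the header construction (buildSyncHeaders is a hypothetical helper; the env-variable name follows this document):

```typescript
// Sketch: the headers a sync client would send (hypothetical helper).
function buildSyncHeaders(env: string, apiKey: string): Record<string, string> {
  return {
    Authorization: `Bearer ${apiKey}`,
    // The Tier 1 guard rejects any value other than "production" with a 403.
    "X-Wiki-Environment": env,
    "Content-Type": "application/json",
  };
}

// CI on main declares production; a local run would send "development"
// and be blocked from content writes.
const ciHeaders = buildSyncHeaders(
  "production",
  process.env.LONGTERMWIKI_SERVER_API_KEY ?? ""
);
```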
B. Default local dev to local-only mode. Remove LONGTERMWIKI_SERVER_URL from the default .env setup. Most dev work edits existing pages and doesn't need the server. Only set it explicitly when running assign-ids.mjs for new entities.
C. Add branch/commit tracking to syncs. Add synced_from_branch and synced_from_commit columns to wiki_pages. This creates an audit trail and enables future rollback.
Tier 2 — Medium-term (moderate effort)
D. Logical service separation within one deployment. Group routes into services with different auth/isolation rules:
| Service | Routes | Write access | Read access |
|---|---|---|---|
| ID Registry | /api/ids/* | All environments (shared) | All environments |
| Content | /api/pages/*, /api/entities/*, /api/facts/* | Production only | All (production from DB, dev from local) |
| Query | /api/search/*, /api/explore/*, /api/citations/* | N/A (read-only) | All environments |
| Operations | /api/sessions/*, /api/edit-logs/*, /api/jobs/* | All, but tagged with environment | Filtered by environment |
This doesn't require separate deployments — just middleware per route group.
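The table above can be captured in a single policy function that the per-group middleware consults (sketch; canWrite and the Service type are hypothetical names, not existing code):

```typescript
// Sketch: the write policy from the table as one pure lookup (hypothetical names).
type Service = "ids" | "content" | "query" | "operations";

function canWrite(service: Service, env: string): boolean {
  switch (service) {
    case "ids":
      return true; // shared by design: every environment may allocate
    case "content":
      return env === "production"; // guarded: production CI only
    case "query":
      return false; // read-only endpoints, no writes at all
    case "operations":
      return true; // any environment, but rows are tagged with env
  }
}
```

Keeping the policy in one place means the route-group middleware stays trivial: map the request path prefix to a Service and check canWrite before dispatching.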
E. Add environment column to operational tables. For sessions, edit_logs, auto_update_runs, jobs: add environment TEXT NOT NULL DEFAULT 'production'. Reads filter by environment. Dev session logs stop cluttering production dashboards.
Tier 3 — Longer-term (cleanest result, inverts the model)
Instead of adding guards to a monolithic shared server, split along the real boundary: project-wide coordination (shared) vs content state (per-environment).
F. Shared project database. A single shared service handles all project-wide coordination — things that are append-only or idempotent and should be visible across all environments:
- Entity ID allocation (/api/ids/*)
- Session logs (/api/sessions/*)
- Edit logs (/api/edit-logs/*)
- Agent coordination / job queue (/api/jobs/*)
- Research artifacts
All environments write here. The key property: these writes are append-only (log a session, allocate an ID, record an edit), never destructive upserts. A dev branch writing a session log doesn't corrupt anything.
Configured via: LONGTERMWIKI_PROJECT_DB_URL + LONGTERMWIKI_PROJECT_DB_KEY (all environments get these).
G. Per-environment content server. Content sync, query endpoints, and the explore index are per-environment:
- Production: Full content server with its own DB, written to only by CI on main
- Development: Local-only (database.json fallback), or optional local server for testing server-driven features
- Preview: Local-only (no server)
Configured via: LONGTERMWIKI_SERVER_URL + LONGTERMWIKI_SERVER_API_KEY. Dev uses localhost:3002, prod uses the k8s URL. The URL separation is the isolation mechanism.
H. ID allocation becomes a pre-commit step. Instead of assign-ids.mjs running during the build and silently calling production, creating a new entity is an explicit action: run crux ids allocate, get the ID, commit it to the YAML file. The build pipeline becomes fully offline — no production server dependency at build time.
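A sketch of the resulting workflow: the allocated ID becomes part of the committed YAML, so the build reads it from the source tree (toEntityYaml and the exact YAML field layout are assumptions, not the actual crux implementation):

```typescript
// Sketch: once `crux ids allocate` returns an ID, commit it into the
// entity's YAML file (hypothetical helper and field layout).
function toEntityYaml(id: string, name: string): string {
  // The ID now lives in the source tree; the build never calls the server.
  return `id: ${id}\nname: ${name}\n`;
}

const yaml = toEntityYaml("E1234", "Example Entity");
// yaml === "id: E1234\nname: Example Entity\n"
```

The design payoff is that only the explicit allocate step touches the shared project DB; everything downstream (build, preview, CI) is a pure function of the repository contents.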
How PR #951 (Server-Driven Explore) Fits In
PR #951 adds a complex /api/explore endpoint with SQL-driven pagination, filtering, and faceted counts. This accelerates the runtime-read trend significantly — database.json cannot replicate this functionality well.
The ExploreGrid hybrid mode (server when available, client-side fallback when not) is the right pattern for graceful degradation, but it means the production explore UX depends on the server returning correct, non-corrupted data. This makes Tier 1 write guards more urgent.
Recommendation
Start with Tier 1A + 1B (#966). Adding write guards and defaulting dev to local-only mode is the highest-value, lowest-effort change. It directly prevents the most likely failure mode (accidental dev sync) while preserving the correctly-shared ID allocation.
Tier 2 follows as the runtime-read pattern stabilizes: logical service separation (#967) and environment tagging for operational tables (#968).
Tier 3 (#969) is the target architecture: split into a shared project database (IDs, sessions, edit logs, agent coordination) and per-environment content servers. This eliminates the need for guards entirely — dev can't corrupt prod content because they're different databases, while project-wide coordination data flows freely across environments.
Key principle: The real boundary isn't "prod vs dev" — it's "project-wide coordination (append-only, safe to share)" vs "content state (destructive upserts, must be isolated)."