Longterm Wiki · Updated 2026-03-13
Summary

Technical proposal for three-phase wiki cross-linking automation: Phase 1 (deterministic matching, implemented) found 546 matches across 236 files; Phase 2 proposes vector embeddings (LanceDB recommended, ~$0.002 cost for 500 entities); Phase 3 adds LLM verification (~$0.06 for 2,500 verifications using Haiku). Total implementation estimated at 10 hours with monthly costs under $0.02.

Change History

Internal pages entity infrastructure (#142, 4 weeks ago)

Added full entity infrastructure to internal pages (style guides, architecture docs, research reports, schema docs). Internal pages now have the `internal` entity type, get auto-assigned E* numeric IDs (E698-E731), are included in the search index, and participate in backlinks/related graph computation. Includes review fixes: filtering internal pages from public explore/home, converting all 7 remaining .md files, adding `internal` to data/schema.ts, and updating all `shouldSkipValidation`/`pageType === 'documentation'` checks.

Cross-Link Automation Proposal

Status: Proposal · Author: Claude Code · Date: February 2026

Executive Summary

This proposal outlines a multi-phase approach to improving cross-linking across the wiki. Phase 1 (deterministic matching) is implemented. Phases 2 and 3 propose vector embeddings and LLM verification to catch semantic relationships the deterministic approach misses.

Current State

Phase 1: Deterministic Matching (Implemented ✓)

npm run crux -- fix cross-links              # Preview
npm run crux -- fix cross-links --apply      # Apply
npm run crux -- fix cross-links --fuzzy      # Include fuzzy suggestions

Results:

  • 546 exact matches across 236 files
  • Uses case-insensitive exact name matching with word boundaries
  • Includes basic fuzzy matching via Levenshtein distance on proper nouns

Limitations:

  • Only catches exact name matches (e.g., "Anthropic" but not "Anthropic's research team")
  • Misses semantic relationships (e.g., "RLHF paper" should link to RLHF page)
  • Can't detect when a paragraph discusses a topic without naming it explicitly
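As a rough illustration of the word-boundary matching the deterministic pass performs (a sketch, not the actual fix-cross-links.mjs implementation; `findExactMatches` is a hypothetical name):

```javascript
// Case-insensitive exact name matching at word boundaries, as in Phase 1.
function findExactMatches(text, entityNames) {
  const matches = [];
  for (const name of entityNames) {
    // Escape regex metacharacters so entity names are matched literally
    const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    const re = new RegExp(`\\b${escaped}\\b`, 'gi');
    for (const m of text.matchAll(re)) {
      matches.push({ name, index: m.index });
    }
  }
  return matches;
}

// Example: both entity names are found in the sentence
findExactMatches('Anthropic published new RLHF results.', ['Anthropic', 'RLHF']);
```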

Phase 2: Vector Embedding Index

Goal

Build a semantic search index that can:

  1. Find entities related to any text passage
  2. Suggest links based on meaning, not just name matching
  3. Enable "find similar entities" queries
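All three goals reduce to nearest-neighbor search over embedding vectors; the underlying comparison is cosine similarity, sketched here for reference:

```javascript
// Cosine similarity between two embedding vectors of equal length.
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice the vector store computes this (or a distance derived from it) internally; the sketch only shows what "similar" means for the queries below.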

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Entity Embedding Index                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐   │
│  │   Entity    │     │  Embedding  │     │   Vector    │   │
│  │   Loader    │────▶│   Model     │────▶│    Store    │   │
│  │             │     │ (local/API) │     │  (LanceDB)  │   │
│  └─────────────┘     └─────────────┘     └─────────────┘   │
│         │                                       │           │
│         ▼                                       ▼           │
│  ┌─────────────┐                        ┌─────────────┐    │
│  │ - Title     │                        │  Similarity │    │
│  │ - Summary   │                        │   Search    │    │
│  │ - LLMSummary│                        │   Query     │    │
│  │ - Body      │                        └─────────────┘    │
│  └─────────────┘                                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Implementation Options

Option A: LanceDB (Recommended)

Pros:

  • Native JavaScript/TypeScript
  • Local-first, no external dependencies
  • Fast (Rust-based)
  • Supports incremental updates

Setup:

npm install @lancedb/lancedb

Usage:


// Sketch using @lancedb/lancedb; loadEntities/embedTexts/embed are project helpers
import * as lancedb from '@lancedb/lancedb';

// Create/load database
const db = await lancedb.connect('./.vector-db');

// Create embeddings table
const entities = await loadEntities();
const embeddings = await embedTexts(entities.map(e => e.summary));

const table = await db.createTable('entities', entities.map((e, i) => ({
  id: e.id,
  title: e.title,
  embedding: embeddings[i],
})));

// Query similar entities (search runs on the table, not the connection)
const query = await embed("reinforcement learning from human feedback");
const results = await table.search(query).limit(5).toArray();

Option B: SQLite + sqlite-vss

Pros:

  • Uses existing SQLite infrastructure
  • Single file database
  • No new dependencies

Cons:

  • Requires sqlite-vss extension compilation
  • Limited vector operations

Option C: Turbopuffer (Serverless)

Pros:

  • No local setup
  • Managed infrastructure
  • Good for larger scale

Cons:

  • External dependency
  • Network latency
  • Cost ($0.10/1M vectors/month)

Embedding Model Options

Model                     Dimensions  Speed  Quality  Cost
Nomic Embed (local)       768         Fast   Good     Free
text-embedding-3-small    1536        API    Good     $0.02/1M tokens
text-embedding-3-large    3072        API    Best     $0.13/1M tokens
GTE-base (local)          768         Fast   Good     Free

Recommendation: Start with OpenAI text-embedding-3-small for quality, migrate to local model (Nomic) once validated.

Data to Embed

For each entity, embed concatenation of:

const textToEmbed = [
  entity.title,
  entity.description,
  entity.llmSummary,
  // Optionally: first 500 chars of body
].filter(Boolean).join('\n');

Estimated tokens: ~500 entities × ~200 tokens = 100K tokens
Embedding cost: $0.002 (one-time)
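The arithmetic behind those figures, spelled out (numbers taken from this proposal; pricing per the text-embedding-3-small rate quoted above):

```javascript
// One-time embedding cost estimate for the full entity set
const entities = 500;
const tokensPerEntity = 200;
const totalTokens = entities * tokensPerEntity;            // 100,000 tokens
const costPerMillionTokens = 0.02;                         // $/1M tokens
const oneTimeCost = (totalTokens / 1e6) * costPerMillionTokens; // ≈ $0.002
```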

CLI Commands

# Build/rebuild embedding index
npm run crux -- embeddings build

# Search for similar entities
npm run crux -- embeddings search "deceptive AI behavior"

# Suggest links for a page
npm run crux -- embeddings suggest-links knowledge-base/risks/accident/scheming.mdx

Integration with fix-cross-links

// In fix-cross-links.mjs
async function findSemanticSuggestions(pageContent, existingLinks) {
  const db = await loadVectorDB();
  const table = await db.openTable('entities');

  // Embed page paragraphs
  const paragraphs = splitIntoParagraphs(pageContent);

  const suggestions = [];
  for (const para of paragraphs) {
    const embedding = await embed(para.text);
    const similar = await table.search(embedding).limit(5).toArray();

    for (const result of similar) {
      // LanceDB rows carry a _distance field; convert to a similarity
      // score (this assumes a cosine distance metric)
      const score = 1 - result._distance;
      if (!existingLinks.has(result.id) && score > 0.75) {
        suggestions.push({
          entityId: result.id,
          entityTitle: result.title,
          context: para.text.slice(0, 100),
          score,
        });
      }
    }
  }

  return suggestions;
}
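The sketch above assumes a `splitIntoParagraphs` helper; a minimal version might simply split on blank lines:

```javascript
// Split page content into paragraph objects on blank-line boundaries
// (one possible implementation of the helper assumed above).
function splitIntoParagraphs(content) {
  return content
    .split(/\n\s*\n/)            // blank-line separated blocks
    .map(block => block.trim())
    .filter(Boolean)
    .map(text => ({ text }));
}
```

A real implementation would likely also skip code blocks and frontmatter, and track character offsets so links can be inserted back into the source.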

Phase 3: LLM Verification Layer

Goal

Use a cheap LLM (Haiku/Flash) to verify suggestions before applying, catching:

  • False positives from semantic search
  • Context-inappropriate links
  • Redundant links

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     LLM Verification                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Input:                                                      │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Page: "The alignment problem involves..."           │    │
│  │ Suggestion: Link "alignment" to alignment.mdx       │    │
│  │ Context: "...solving the alignment problem for..."  │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Haiku / Gemini Flash                    │    │
│  │                                                      │    │
│  │  Prompt: "Should this text link to this entity?     │    │
│  │           Reply YES/NO with brief reason."          │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
│                          ▼                                   │
│  Output: { approve: true, reason: "Direct discussion" }     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Batch Processing

// Process all pages in batches
async function verifyAllSuggestions() {
  const pages = await loadAllPages();
  const vectorDB = await loadVectorDB();

  for (const page of pages) {
    // Get semantic suggestions; getExistingLinks is an assumed helper
    // returning the Set of entity IDs the page already links to
    const suggestions = await findSemanticSuggestions(page.content, getExistingLinks(page));

    // Batch verify with LLM
    const verified = await verifyWithLLM(suggestions, {
      model: 'claude-3-haiku',
      batchSize: 20,
    });

    // Apply approved changes
    if (verified.length > 0) {
      await applyLinks(page.path, verified);
    }
  }
}
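A verifyWithLLM implementation would need to group suggestions into groups of `batchSize` before sending them to the model; a minimal chunking helper (hypothetical name) could look like:

```javascript
// Split an array into consecutive batches of at most batchSize items
function chunk(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```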

Cost Estimate

  • ~500 pages × ~5 suggestions each = 2,500 verifications
  • ~100 tokens per verification
  • 250K tokens total
  • Haiku cost: ≈$0.06
  • Gemini Flash cost: ≈$0.02

Prompt Template

You are reviewing suggested cross-links for a wiki about AI safety.

Page excerpt:
"{context}"

Suggested link: "{entityTitle}" (page about {entityDescription})
Suggested text to link: "{matchedText}"

Should this text be linked to the suggested page?
Consider:
1. Is the text actually discussing this specific entity/concept?
2. Would a reader benefit from this link?
3. Is it the first mention (wiki convention)?

Reply with JSON: {"approve": true/false, "reason": "brief explanation"}
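Since the model is asked to reply with JSON, the verifier needs defensive parsing; one possible sketch (hypothetical `parseVerdict` helper) that tolerates prose wrapped around the JSON:

```javascript
// Extract and parse the {"approve": ..., "reason": ...} object from a
// model reply, treating anything unparseable as a rejection.
function parseVerdict(reply) {
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) return { approve: false, reason: 'unparseable reply' };
  try {
    const parsed = JSON.parse(match[0]);
    return { approve: Boolean(parsed.approve), reason: String(parsed.reason ?? '') };
  } catch {
    return { approve: false, reason: 'invalid JSON' };
  }
}
```

Defaulting to rejection on parse failure is the conservative choice here: a dropped link suggestion is cheaper than a wrong link applied to a page.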

Implementation Timeline

Phase                         Effort  Dependencies   Status
Phase 1: Deterministic        4h      None           ✅ Complete
Phase 2a: Vector DB setup     2h      LanceDB        Proposed
Phase 2b: Embedding pipeline  3h      OpenAI API     Proposed
Phase 2c: CLI integration     2h      Phase 2a, 2b   Proposed
Phase 3: LLM verification     3h      Haiku API      Proposed

Total remaining: ~10 hours

Cost Summary

Component                       One-time  Monthly
Embedding 500 entities          $0.002    -
Re-embedding on changes         -         ≈$0.001
LLM verification (batch)        $0.06     -
LLM verification (incremental)  -         ≈$0.01
Total                           ≈$0.10    ≈$0.01

Decision Points

  1. Vector DB choice: LanceDB vs SQLite-vss vs Turbopuffer?
  2. Embedding model: API (OpenAI) vs local (Nomic)?
  3. LLM verification: Haiku vs Gemini Flash?
  4. Scope: All pages vs high-importance only?

Next Steps

  1. Approve this proposal
  2. Set up LanceDB in project
  3. Create embedding pipeline script
  4. Test on 10 sample pages
  5. Full rollout if quality is acceptable