Longterm Wiki · Updated 2026-03-13
Summary

Technical proposal for three-phase wiki cross-linking automation: Phase 1 (deterministic matching, implemented) found 546 matches across 236 files; Phase 2 proposes vector embeddings (LanceDB recommended, ~$0.002 cost for 500 entities); Phase 3 adds LLM verification (~$0.06 for 2,500 verifications using Haiku). Total implementation estimated at 10 hours with monthly costs under $0.02.

Change History

Internal pages entity infrastructure (#142, 4 weeks ago)

Added full entity infrastructure to internal pages (style guides, architecture docs, research reports, schema docs). Internal pages now have the `internal` entity type, get auto-assigned E* numeric IDs (E698-E731), are included in the search index, and participate in backlinks/related graph computation. Includes review fixes: filtering internal pages from public explore/home, converting all 7 remaining .md files, adding `internal` to data/schema.ts, and updating all `shouldSkipValidation`/`pageType === 'documentation'` checks.

Cross-Link Automation Proposal

Status: Proposal · Author: Claude Code · Date: February 2026

Executive Summary

This proposal outlines a multi-phase approach to improving cross-linking across the wiki. Phase 1 (deterministic matching) is implemented. Phases 2 and 3 propose vector embeddings and LLM verification to catch semantic relationships the deterministic approach misses.

Current State

Phase 1: Deterministic Matching (Implemented ✓)

npm run crux -- fix cross-links              # Preview
npm run crux -- fix cross-links --apply      # Apply
npm run crux -- fix cross-links --fuzzy      # Include fuzzy suggestions

Results:

  • 546 exact matches across 236 files
  • Uses case-insensitive exact name matching with word boundaries
  • Includes basic fuzzy matching via Levenshtein distance on proper nouns

Limitations:

  • Only catches exact name matches (e.g., "Anthropic" but not "Anthropic's research team")
  • Misses semantic relationships (e.g., "RLHF paper" should link to RLHF page)
  • Can't detect when a paragraph discusses a topic without naming it explicitly
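As a rough illustration of the word-boundary matching the deterministic pass performs (a sketch, not the actual fix-cross-links.mjs implementation; `findExactMatches` is a hypothetical name):

```javascript
// Case-insensitive exact name matching at word boundaries, as in Phase 1.
function findExactMatches(text, entityNames) {
  const matches = [];
  for (const name of entityNames) {
    // Escape regex metacharacters so entity names are matched literally
    const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    const re = new RegExp(`\\b${escaped}\\b`, 'gi');
    for (const m of text.matchAll(re)) {
      matches.push({ name, index: m.index });
    }
  }
  return matches;
}

// Example: both entity names are found in the sentence
findExactMatches('Anthropic published new RLHF results.', ['Anthropic', 'RLHF']);
```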

Phase 2: Vector Embedding Index

Goal

Build a semantic search index that can:

  1. Find entities related to any text passage
  2. Suggest links based on meaning, not just name matching
  3. Enable "find similar entities" queries
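All three goals reduce to nearest-neighbor search over embedding vectors; the underlying comparison is cosine similarity, sketched here for reference:

```javascript
// Cosine similarity between two embedding vectors of equal length.
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice the vector store computes this (or a distance derived from it) internally; the sketch only shows what "similar" means for the queries below.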

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Entity Embedding Index                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐   │
│  │   Entity    │     │  Embedding  │     │   Vector    │   │
│  │   Loader    │────▶│   Model     │────▶│    Store    │   │
│  │             │     │ (local/API) │     │  (LanceDB)  │   │
│  └─────────────┘     └─────────────┘     └─────────────┘   │
│         │                                       │           │
│         ▼                                       ▼           │
│  ┌─────────────┐                        ┌─────────────┐    │
│  │ - Title     │                        │  Similarity │    │
│  │ - Summary   │                        │   Search    │    │
│  │ - LLMSummary│                        │   Query     │    │
│  │ - Body      │                        └─────────────┘    │
│  └─────────────┘                                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Implementation Options

Option A: LanceDB (Recommended)

Pros:

  • Native JavaScript/TypeScript
  • Local-first, no external dependencies
  • Fast (Rust-based)
  • Supports incremental updates

Setup:

npm install @lancedb/lancedb

Usage:


// Sketch using @lancedb/lancedb; loadEntities/embedTexts/embed are project helpers
import * as lancedb from '@lancedb/lancedb';

// Create/load database
const db = await lancedb.connect('./.vector-db');

// Create embeddings table
const entities = await loadEntities();
const embeddings = await embedTexts(entities.map(e => e.summary));

const table = await db.createTable('entities', entities.map((e, i) => ({
  id: e.id,
  title: e.title,
  embedding: embeddings[i],
})));

// Query similar entities (search runs on the table, not the connection)
const query = await embed("reinforcement learning from human feedback");
const results = await table.search(query).limit(5).toArray();

Option B: SQLite + sqlite-vss

Pros:

  • Uses existing SQLite infrastructure
  • Single file database
  • No new dependencies

Cons:

  • Requires sqlite-vss extension compilation
  • Limited vector operations

Option C: Turbopuffer (Serverless)

Pros:

  • No local setup
  • Managed infrastructure
  • Good for larger scale

Cons:

  • External dependency
  • Network latency
  • Cost ($0.10/1M vectors/month)

Embedding Model Options

Model                     Dimensions  Speed  Quality  Cost
Nomic Embed (local)       768         Fast   Good     Free
text-embedding-3-small    1536        API    Good     $0.02/1M tokens
text-embedding-3-large    3072        API    Best     $0.13/1M tokens
GTE-base (local)          768         Fast   Good     Free

Recommendation: Start with OpenAI text-embedding-3-small for quality, migrate to local model (Nomic) once validated.

Data to Embed

For each entity, embed concatenation of:

const textToEmbed = [
  entity.title,
  entity.description,
  entity.llmSummary,
  // Optionally: first 500 chars of body
].filter(Boolean).join('\n');

Estimated tokens: ~500 entities × ~200 tokens = 100K tokens
Embedding cost: $0.002 (one-time)
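The arithmetic behind those figures, spelled out (numbers taken from this proposal; pricing per the text-embedding-3-small rate quoted above):

```javascript
// One-time embedding cost estimate for the full entity set
const entities = 500;
const tokensPerEntity = 200;
const totalTokens = entities * tokensPerEntity;            // 100,000 tokens
const costPerMillionTokens = 0.02;                         // $/1M tokens
const oneTimeCost = (totalTokens / 1e6) * costPerMillionTokens; // ≈ $0.002
```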

CLI Commands

# Build/rebuild embedding index
npm run crux -- embeddings build

# Search for similar entities
npm run crux -- embeddings search "deceptive AI behavior"

# Suggest links for a page
npm run crux -- embeddings suggest-links knowledge-base/risks/accident/scheming.mdx

Integration with fix-cross-links

// In fix-cross-links.mjs
async function findSemanticSuggestions(pageContent, existingLinks) {
  const db = await loadVectorDB();
  const table = await db.openTable('entities');

  // Embed page paragraphs
  const paragraphs = splitIntoParagraphs(pageContent);

  const suggestions = [];
  for (const para of paragraphs) {
    const embedding = await embed(para.text);
    const similar = await table.search(embedding).limit(5).toArray();

    for (const result of similar) {
      // LanceDB rows carry a _distance field; convert to a similarity
      // score (this assumes a cosine distance metric)
      const score = 1 - result._distance;
      if (!existingLinks.has(result.id) && score > 0.75) {
        suggestions.push({
          entityId: result.id,
          entityTitle: result.title,
          context: para.text.slice(0, 100),
          score,
        });
      }
    }
  }

  return suggestions;
}
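The sketch above assumes a `splitIntoParagraphs` helper; a minimal version might simply split on blank lines:

```javascript
// Split page content into paragraph objects on blank-line boundaries
// (one possible implementation of the helper assumed above).
function splitIntoParagraphs(content) {
  return content
    .split(/\n\s*\n/)            // blank-line separated blocks
    .map(block => block.trim())
    .filter(Boolean)
    .map(text => ({ text }));
}
```

A real implementation would likely also skip code blocks and frontmatter, and track character offsets so links can be inserted back into the source.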

Phase 3: LLM Verification Layer

Goal

Use a cheap LLM (Haiku/Flash) to verify suggestions before applying, catching:

  • False positives from semantic search
  • Context-inappropriate links
  • Redundant links

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     LLM Verification                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Input:                                                      │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Page: "The alignment problem involves..."           │    │
│  │ Suggestion: Link "alignment" to alignment.mdx       │    │
│  │ Context: "...solving the alignment problem for..."  │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
│                          ▼                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Haiku / Gemini Flash                    │    │
│  │                                                      │    │
│  │  Prompt: "Should this text link to this entity?     │    │
│  │           Reply YES/NO with brief reason."          │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
│                          ▼                                   │
│  Output: { approve: true, reason: "Direct discussion" }     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Batch Processing

// Process all pages in batches
async function verifyAllSuggestions() {
  const pages = await loadAllPages();
  const vectorDB = await loadVectorDB();

  for (const page of pages) {
    // Get semantic suggestions; getExistingLinks is an assumed helper
    // returning the Set of entity IDs the page already links to
    const suggestions = await findSemanticSuggestions(page.content, getExistingLinks(page));

    // Batch verify with LLM
    const verified = await verifyWithLLM(suggestions, {
      model: 'claude-3-haiku',
      batchSize: 20,
    });

    // Apply approved changes
    if (verified.length > 0) {
      await applyLinks(page.path, verified);
    }
  }
}
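A verifyWithLLM implementation would need to group suggestions into groups of `batchSize` before sending them to the model; a minimal chunking helper (hypothetical name) could look like:

```javascript
// Split an array into consecutive batches of at most batchSize items
function chunk(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```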

Cost Estimate

  • ~500 pages × ~5 suggestions each = 2,500 verifications
  • ~100 tokens per verification
  • 250K tokens total
  • Haiku cost: ≈$0.06
  • Gemini Flash cost: ≈$0.02

Prompt Template

You are reviewing suggested cross-links for a wiki about AI safety.

Page excerpt:
"{context}"

Suggested link: "{entityTitle}" (page about {entityDescription})
Suggested text to link: "{matchedText}"

Should this text be linked to the suggested page?
Consider:
1. Is the text actually discussing this specific entity/concept?
2. Would a reader benefit from this link?
3. Is it the first mention (wiki convention)?

Reply with JSON: {"approve": true/false, "reason": "brief explanation"}
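Since the model is asked to reply with JSON, the verifier needs defensive parsing; one possible sketch (hypothetical `parseVerdict` helper) that tolerates prose wrapped around the JSON:

```javascript
// Extract and parse the {"approve": ..., "reason": ...} object from a
// model reply, treating anything unparseable as a rejection.
function parseVerdict(reply) {
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) return { approve: false, reason: 'unparseable reply' };
  try {
    const parsed = JSON.parse(match[0]);
    return { approve: Boolean(parsed.approve), reason: String(parsed.reason ?? '') };
  } catch {
    return { approve: false, reason: 'invalid JSON' };
  }
}
```

Defaulting to rejection on parse failure is the conservative choice here: a dropped link suggestion is cheaper than a wrong link applied to a page.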

Implementation Timeline

Phase                         Effort  Dependencies   Status
Phase 1: Deterministic        4h      None           ✅ Complete
Phase 2a: Vector DB setup     2h      LanceDB        Proposed
Phase 2b: Embedding pipeline  3h      OpenAI API     Proposed
Phase 2c: CLI integration     2h      Phase 2a, 2b   Proposed
Phase 3: LLM verification     3h      Haiku API      Proposed

Total remaining: ~10 hours

Cost Summary

Component                       One-time  Monthly
Embedding 500 entities          $0.002    -
Re-embedding on changes         -         ≈$0.001
LLM verification (batch)        $0.06     -
LLM verification (incremental)  -         ≈$0.01
Total                           ≈$0.10    ≈$0.01

Decision Points

  1. Vector DB choice: LanceDB vs SQLite-vss vs Turbopuffer?
  2. Embedding model: API (OpenAI) vs local (Nomic)?
  3. LLM verification: Haiku vs Gemini Flash?
  4. Scope: All pages vs high-importance only?

Next Steps

  1. Approve this proposal
  2. Set up LanceDB in project
  3. Create embedding pipeline script
  4. Test on 10 sample pages
  5. Full rollout if quality is acceptable