Cross-Link Automation Proposal
Status: Proposal
Author: Claude Code
Date: February 2026
Executive Summary
This proposal outlines a multi-phase approach to improving cross-linking across the wiki. Phase 1 (deterministic matching) is implemented. Phases 2 and 3 propose vector embeddings and LLM verification to catch semantic relationships the deterministic approach misses.
Current State
Phase 1: Deterministic Matching (Implemented ✓)
```bash
npm run crux -- fix cross-links          # Preview
npm run crux -- fix cross-links --apply  # Apply
npm run crux -- fix cross-links --fuzzy  # Include fuzzy suggestions
```

Results:
- 546 exact matches across 236 files
- Uses case-insensitive exact name matching with word boundaries (sketched after this list)
- Includes basic fuzzy matching via Levenshtein distance on proper nouns
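For orientation, a minimal sketch of the exact-match step. The entity shape and helper name here are illustrative assumptions, not the actual fix-cross-links.mjs code:

```js
// Illustrative sketch: case-insensitive exact matching with word boundaries.
// Assumes `entities` is an array of { id, title } loaded from the wiki.
function findExactMatches(text, entities) {
  const matches = [];
  for (const entity of entities) {
    // Escape regex metacharacters in the title, then require word boundaries.
    const escaped = entity.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    const pattern = new RegExp(`\\b${escaped}\\b`, 'gi');
    for (const m of text.matchAll(pattern)) {
      matches.push({ entityId: entity.id, matchedText: m[0], index: m.index });
    }
  }
  return matches;
}
```

The fuzzy pass then compares proper nouns against entity titles by Levenshtein distance and surfaces near-misses as suggestions (the --fuzzy flag above).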
Limitations:
- Only catches exact name matches (e.g., “Anthropic” but not “Anthropic’s research team”)
- Misses semantic relationships (e.g., “RLHF paper” should link to RLHF page)
- Can’t detect when a paragraph discusses a topic without naming it explicitly
Phase 2: Vector Embedding Index
Build a semantic search index that can:
- Find entities related to any text passage
- Suggest links based on meaning, not just name matching
- Enable “find similar entities” queries
Architecture
```
                        Entity Embedding Index

  ┌───────────────┐     ┌─────────────────┐     ┌───────────────┐
  │ Entity Loader │────▶│ Embedding Model │────▶│ Vector Store  │
  │               │     │ (local/API)     │     │ (LanceDB)     │
  └───────────────┘     └─────────────────┘     └───────────────┘
          │                                             │
          ▼                                             ▼
   Fields embedded:                            Similarity search
   Title, Summary,                                  queries
   LLMSummary, Body
```

Implementation Options
Option A: LanceDB (Recommended)
Pros:
- Native JavaScript/TypeScript
- Local-first, no external dependencies
- Fast (Rust-based)
- Supports incremental updates
Setup:
```bash
npm install @lancedb/lancedb
```

Usage:
```js
import * as lancedb from '@lancedb/lancedb';

// Create/load database
const db = await lancedb.connect('./.vector-db');

// Create embeddings table
const entities = await loadEntities();
const embeddings = await embedTexts(entities.map(e => e.summary));

const table = await db.createTable('entities', entities.map((e, i) => ({
  id: e.id,
  title: e.title,
  embedding: embeddings[i],
})));

// Query similar entities (search runs against the table, not the connection)
const query = await embed("reinforcement learning from human feedback");
const results = await table.search(query).limit(5).toArray();
```

Option B: SQLite + sqlite-vss
Pros:
- Uses existing SQLite infrastructure
- Single file database
- Minimal new dependencies (only the extension)
Cons:
- Requires sqlite-vss extension compilation
- Limited vector operations
Option C: Turbopuffer (Serverless)
Pros:
- No local setup
- Managed infrastructure
- Good for larger scale
Cons:
- External dependency
- Network latency
- Cost ($0.10/1M vectors/month)
Embedding Model Options
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| Nomic Embed (local) | 768 | Fast | Good | Free |
| text-embedding-3-small | 1536 | API (network-bound) | Good | $0.02/1M tokens |
| text-embedding-3-large | 3072 | API (network-bound) | Best | $0.13/1M tokens |
| GTE-base (local) | 768 | Fast | Good | Free |
Recommendation: Start with OpenAI text-embedding-3-small for quality, then migrate to a local model (Nomic) once the approach is validated.
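For reference, a minimal sketch of what the embedding call could look like with the official `openai` npm package. The package and the `embedTexts` helper name are assumptions for illustration; neither is a project dependency yet:

```js
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed a batch of texts with text-embedding-3-small (1536 dimensions).
async function embedTexts(texts) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });
  // Results are returned in input order: one vector per input string.
  return response.data.map(d => d.embedding);
}
```

Swapping to a local model later would only mean replacing the body of `embedTexts()`; the rest of the pipeline would not need to change.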
Data to Embed
For each entity, embed the concatenation of:
```js
const textToEmbed = [
  entity.title,
  entity.description,
  entity.llmSummary,
  // Optionally: first 500 chars of body
].filter(Boolean).join('\n');
```

Estimated tokens: ~500 entities × ~200 tokens = 100K tokens
Embedding cost: ~$0.002 one-time (100K tokens at $0.02/1M tokens)
CLI Commands
```bash
# Build/rebuild embedding index
npm run crux -- embeddings build

# Search for similar entities
npm run crux -- embeddings search "deceptive AI behavior"

# Suggest links for a page
npm run crux -- embeddings suggest-links knowledge-base/risks/accident/scheming.mdx
```

Integration with Cross-Link Fixer
```js
// In fix-cross-links.mjs
async function findSemanticSuggestions(pageContent, existingLinks) {
  const db = await loadVectorDB();

  // Embed page paragraphs
  const paragraphs = splitIntoParagraphs(pageContent);

  const suggestions = [];
  for (const para of paragraphs) {
    const embedding = await embed(para.text);
    const similar = await db.search(embedding).limit(5).execute();

    for (const result of similar) {
      if (!existingLinks.has(result.id) && result.score > 0.75) {
        suggestions.push({
          entityId: result.id,
          entityTitle: result.title,
          context: para.text.slice(0, 100),
          score: result.score,
        });
      }
    }
  }

  return suggestions;
}
```

Phase 3: LLM Verification Layer
Use a cheap LLM (Haiku/Flash) to verify suggestions before applying them, catching:
- False positives from semantic search
- Context-inappropriate links
- Redundant links
Architecture
```
                       LLM Verification

  Input:
    Page:        "The alignment problem involves..."
    Suggestion:  Link "alignment" to alignment.mdx
    Context:     "...solving the alignment problem for..."
                          │
                          ▼
              Haiku / Gemini Flash
              Prompt: "Should this text link to this entity?
                       Reply YES/NO with brief reason."
                          │
                          ▼
  Output: { approve: true, reason: "Direct discussion" }
```

Batch Processing
```js
// Process all pages in batches
async function verifyAllSuggestions() {
  const pages = await loadAllPages();
  const vectorDB = await loadVectorDB();

  for (const page of pages) {
    // Get semantic suggestions
    const suggestions = await findSemanticSuggestions(page.content);

    // Batch verify with LLM
    const verified = await verifyWithLLM(suggestions, {
      model: 'claude-3-haiku',
      batchSize: 20,
    });

    // Apply approved changes
    if (verified.length > 0) {
      await applyLinks(page.path, verified);
    }
  }
}
```

Cost Estimate
- ~500 pages × ~5 suggestions each = 2,500 verifications
- ~100 tokens per verification
- 250K tokens total
- Haiku cost: ≈$0.06
- Gemini Flash cost: ≈$0.02
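For concreteness, a minimal sketch of the per-suggestion call that the `verifyWithLLM` helper in the batch-processing code above could wrap. The `@anthropic-ai/sdk` package and the `verifySuggestion` name are assumptions for illustration; the prompt it sends is the filled-in template from the next section:

```js
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Verify one suggestion. `prompt` is the filled-in template from the
// "Prompt Template" section below; a batch helper like verifyWithLLM would
// chunk suggestions (e.g. 20 at a time) and run these calls concurrently.
async function verifySuggestion(prompt) {
  const message = await anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 100,
    messages: [{ role: 'user', content: prompt }],
  });
  // Expect the JSON object the template asks for; a production version should
  // handle malformed replies instead of assuming valid JSON.
  return JSON.parse(message.content[0].text);
}
```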
Prompt Template
You are reviewing suggested cross-links for a wiki about AI safety.
Page excerpt:
"{context}"

Suggested link: "{entityTitle}" (page about {entityDescription})
Suggested text to link: "{matchedText}"

Should this text be linked to the suggested page? Consider:

1. Is the text actually discussing this specific entity/concept?
2. Would a reader benefit from this link?
3. Is it the first mention (wiki convention)?

Reply with JSON: {"approve": true/false, "reason": "brief explanation"}

Implementation Timeline
| Phase | Effort | Dependencies | Status |
|---|---|---|---|
| Phase 1: Deterministic | 4h | None | ✅ Complete |
| Phase 2a: Vector DB setup | 2h | LanceDB | Proposed |
| Phase 2b: Embedding pipeline | 3h | OpenAI API | Proposed |
| Phase 2c: CLI integration | 2h | Phase 2a, 2b | Proposed |
| Phase 3: LLM verification | 3h | Haiku API | Proposed |
Total remaining: ~10 hours
Cost Summary
| Component | One-time | Monthly |
|---|---|---|
| Embedding 500 entities | $0.002 | - |
| Re-embedding on changes | - | ≈$0.001 |
| LLM verification (batch) | $0.06 | - |
| LLM verification (incremental) | - | ≈$0.01 |
| Total | ≈$0.06 | ≈$0.01 |
Decision Points
- Vector DB choice: LanceDB vs SQLite-vss vs Turbopuffer?
- Embedding model: API (OpenAI) vs local (Nomic)?
- LLM verification: Haiku vs Gemini Flash?
- Scope: All pages vs high-importance only?
Next Steps
- Approve this proposal
- Set up LanceDB in project
- Create embedding pipeline script
- Test on 10 sample pages
- Full rollout if quality is acceptable