Research-First Page Creation Pipeline

📋Page Status

Page Type:ContentStyle Guide →Standard knowledge base article

Quality:49 (Adequate)

Importance:2 (Peripheral)

Words:1.1k

Structure:

📊 7📈 0🔗 0📚 0•24%Score: 7/15

LLM Summary:Internal experiment documenting a multi-phase LLM pipeline that enforces citation discipline, reducing table rows by 97% while increasing citation density from 0.9 to 2.3 per 100 words. Standard tier ($10.50) achieved same quality (78/100) as premium ($15), with research-first approach producing 42 citations per article across three test topics.

Executive Summary

Finding	Result	Implication
Research-first works	All 3 test articles found Wikipedia, primary sources, AND critical perspectives	Front-loading research prevents hallucination
Citation discipline enforced	42 inline citations per article (vs ≈40 poorly-sourced in original)	“Only use facts.json” rule eliminates unsourced claims
Tables dramatically reduced	196 → 5 table rows (97% reduction)	Prose-first prompting produces readable content
Standard tier is optimal	$10.50 achieved same quality (78) as $15 Premium	Review→gap-fill→polish cycle is worth the cost; extra rewrite isn’t
Budget tier has known gaps	Verify phase identifies issues but can’t fix them	Good for drafts, not final articles

Background

Why Single-Pass Generation Fails

The standard approach of prompting an LLM to “write an article about X” fails because:

Writing before researching - LLM generates plausible content without verified sources
No citation requirement - Facts appear without URLs
Tables as a crutch - LLMs over-produce tables because they’re easy to generate
No verification - Errors and gaps persist to final output

The Solution: Research-First Pipeline

We built scripts/content/page-creator-v2.mjs with this structure:

Research → Extract → Synthesize → Review → Gap-fill → Polish

Key innovation: The synthesis phase receives only extracted facts with citations, not raw sources. The prompt explicitly says “If a fact isn’t in facts.json, DO NOT include it.”

Pipeline Design

Phase Structure

Phase	Model	Budget	Purpose
research	Sonnet	$1-4	Gather 10-16 sources via WebSearch/WebFetch
extract	Sonnet	$1.50	Pull facts into structured JSON with citations
synthesize	Opus/Sonnet	$1.50-2.50	Write article from facts.json ONLY
review	Opus	$1.50-2	Identify gaps, bias, missing perspectives
gap-fill	Sonnet	$1.50	Research topics identified as missing
polish	Opus	$1.50	Integrate new facts, improve prose

Tier Configurations

budget:   research-lite → extract → synthesize-sonnet → verify     (~$4.50)
standard: research → extract → synthesize-opus → review → gap-fill → polish (~$10.50)
premium:  research-deep → extract → synthesize-opus → critical-review → gap-fill → rewrite → polish (~$15)

Citation Enforcement

The extract phase outputs structured JSON:

{
  "facts": [
    {
      "claim": "LessWrong was founded in February 2009",
      "sourceUrl": "https://en.wikipedia.org/wiki/LessWrong",
      "sourceTitle": "Wikipedia",
      "confidence": "high"
    }
  ],
  "controversies": [...],
  "statistics": [...],
  "gaps": ["Topics we have no facts for"]
}

The synthesize prompt then says:

“Every factual claim MUST have an inline citation. If a fact isn’t in facts.json, DO NOT include it.”

Experiment Design

Test Topics

Topic	Tier	Why Chosen
MIRI	Budget	Well-documented nonprofit, good for testing minimal pipeline
LessWrong	Standard	Existing page to compare against (quality 43)
Anthropic	Premium	High-profile, controversial, tests deep research

Metrics Tracked

Total cost and time
Citation count
Table row count
Word count
Whether controversies section included
Self-assessed quality score

Results

Pipeline Completion

All three pipelines completed successfully:

Topic	Tier	Time	Cost	Phases
MIRI	Budget	10m	$4.50	4/4
LessWrong	Standard	16m	$10.50	6/6
Anthropic	Premium	24m	$15.00	7/7

Quality Metrics

Metric	MIRI (Budget)	LessWrong (Standard)	Anthropic (Premium)
Final Quality	75*	78	78
Word Count	≈2,700	2,480	2,850
Citations	≈35	42	42
Tables	≈3	1	1
Has Controversies	Yes	Yes	Yes

*Budget tier’s verify phase identified gaps but couldn’t fix them.

Comparison: Original vs New LessWrong

Aspect	Original	New (Standard)
Table rows	196	5
URLs/Citations	41	46
Citation density	0.9/100 words	2.3/100 words
Critical sources cited	0	4
Controversies	Superficial table	Full section with quotes

Key Findings

1. Research Quality Was Excellent

All three pipelines found diverse source types:

LessWrong sources found:

Wikipedia article
Official LessWrong posts (history, surveys, FAQ)
EA Forum discussions
Critical perspectives: Bryan Caplan (Econlib), Tyler Cowen, Greg Epstein (NYT), RationalWiki

Anthropic sources found:

Wikipedia, official company page
Financial data (valuations, revenue)
Critical: SaferAI critique, White House feud coverage, deceptive AI behavior reports
Policy positions on SB 1047, export controls

2. The Review Phase Catches Real Problems

The Anthropic critical-review phase identified:

“Quick Assessment table is overwhelmingly favorable” “Company culture section reads like PR” “Several interpretive statements presented as fact without sources” “Missing: lobbying positions, concrete safety failures, competitor comparisons”

The gap-fill phase then researched exactly those topics and the rewrite integrated them.

3. Standard Tier Hits Diminishing Returns

Tier	Cost	Quality	Notes
Budget	$4.50	75	Gaps identified but not fixed
Standard	$10.50	78	Gaps fixed
Premium	$15.00	78	Same quality, more thorough

The extra $4.50 from Standard to Premium didn’t improve the quality score. The review→gap-fill→polish cycle is where the value is.

4. Prose-First Prompting Works

Explicit instructions matter:

“Maximum 4 tables”
“Minimum 60% prose”
“Tables are for genuinely comparative data, not lists”

Result: 97% reduction in table rows.

Recommendations

For This Wiki

Use Standard tier ($10.50) for most pages - Best quality/cost ratio
Use Budget tier ($4.50) for drafts - Good starting point for human editing
Reserve Premium ($15) for controversial topics - Extra scrutiny is valuable for Anthropic, OpenAI, etc.

For the Pipeline

Add to package.json for easy access:

"scripts": {
  "create-page": "node scripts/content/page-creator-v2.mjs"
}

Consider batch mode - Run multiple Standard-tier pages overnight
Integrate with grading - Auto-grade output and re-run if below threshold

For Future Work

Perplexity Deep Research integration - Could improve research phase
Human-in-the-loop review - Show review.json before gap-fill for approval
Incremental updates - Re-run pipeline on existing pages to improve them

Appendix: Files Created

scripts/content/page-creator-v2.mjs    # The pipeline script

.claude/temp/page-creator/
├── miri/
│   ├── sources.json      # 10 sources
│   ├── facts.json        # 18 facts, 20 stats, 8 controversies
│   ├── draft.mdx         # Final output (budget has no polish)
│   └── review.json       # Identified but unfixed gaps
├── lesswrong/
│   ├── sources.json      # Research results
│   ├── facts.json        # Extracted claims
│   ├── draft.mdx         # Initial synthesis
│   ├── review.json       # Gap analysis
│   ├── additional-facts.json  # Gap-fill results
│   ├── final.mdx         # Polished output
│   └── summary.json      # Quality metrics
└── anthropic/
    └── [same structure as lesswrong]

Usage

# Standard tier (recommended)
node scripts/content/page-creator-v2.mjs "Topic Name" --tier standard

# Budget tier (for drafts)
node scripts/content/page-creator-v2.mjs "Topic Name" --tier budget

# Premium tier (for controversial topics)
node scripts/content/page-creator-v2.mjs "Topic Name" --tier premium

# Copy output to specific location
node scripts/content/page-creator-v2.mjs "Topic Name" --tier standard --output ./my-article.mdx

Conclusion

The research-first pipeline successfully addresses the core problem of AI-generated content: unsourced, table-heavy data dumps. By structuring the process as Research → Extract → Synthesize with explicit citation requirements, we produce articles that are:

Well-sourced (42 citations with URLs)
Readable (90% prose, not tables)
Balanced (includes critical perspectives)
Cost-effective ($10.50 for production quality)

The Standard tier is recommended for most use cases. The key insight is that research quality matters more than generation quality - you can’t synthesize what you haven’t found.