
# Rating System

LongtermWiki uses a multi-dimensional rating system combining LLM-graded subscores with automated metrics to produce a derived quality score (0-100).

```sh
# Grade a single page
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --page scheming

# Grade all pages (with cost estimate)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --dry-run

# Grade and apply to frontmatter
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --apply

# Parallel processing (faster, higher API cost)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --parallel 5 --apply
```

## Importance (0-100)

How significant is this page for AI risk prioritization work?

| Range | Description | Expected Count |
|---|---|---|
| 90-100 | Essential for prioritization decisions | 5-10 pages |
| 70-89 | High value for practitioners | 30-50 pages |
| 50-69 | Useful context | 80-100 pages |
| 30-49 | Reference material | 60-80 pages |
| 0-29 | Peripheral | 30-50 pages |

Category adjustments applied to the base assessment (see the sketch below):

- Responses/interventions: +10
- Capabilities: +5
- Core risks: +5
- Risk factors: 0
- Models/analysis: -5
- Arguments/debates: -10
- People/organizations: -15
- Internal/infrastructure: -30
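
A minimal sketch of how an adjustment might be applied, in the style of the repo's Node scripts. The category keys are hypothetical slugs for the labels above, and the clamp to 0-100 is an assumption, not documented behavior:

```js
// Sketch: apply a category adjustment to a base importance assessment.
// Category keys are hypothetical slugs for the labels above; clamping
// to 0-100 is an assumption about how out-of-range results are handled.
const CATEGORY_ADJUSTMENTS = {
  responses: +10,
  capabilities: +5,
  coreRisks: +5,
  riskFactors: 0,
  models: -5,
  arguments: -10,
  people: -15,
  internal: -30,
};

function adjustImportance(baseScore, category) {
  const adjusted = baseScore + (CATEGORY_ADJUSTMENTS[category] ?? 0);
  return Math.max(0, Math.min(100, adjusted));
}
```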

## Subscores (0-10)

Four LLM-graded subscores feed the quality formula. Scoring is harsh: a 7 is exceptional, 8+ is world-class. Most content should score 3-5.

### Novelty

How original is the content beyond its sources?

| Score | Meaning |
|---|---|
| 9-10 | Groundbreaking original research (academic publication level) |
| 7-8 | Significant original synthesis not found elsewhere |
| 5-6 | Some original framing, modest value beyond sources |
| 3-4 | Accurate summary with minimal original perspective |
| 1-2 | Mostly restates common knowledge |

### Rigor

How well-evidenced and precise are the claims?

| Score | Meaning |
|---|---|
| 9-10 | Every claim sourced to primary sources, quantified with uncertainty |
| 7-8 | Nearly all claims well-sourced and quantified |
| 5-6 | Most major claims sourced, some quantification |
| 3-4 | Mix of sourced and unsourced, vague claims common |
| 1-2 | Few sources, mostly assertions |

### Actionability

How useful is this for making decisions?

| Score | Meaning |
|---|---|
| 9-10 | Specific decision procedures with quantified tradeoffs |
| 7-8 | Clear concrete recommendations with supporting analysis |
| 5-6 | Some actionable takeaways, general guidance |
| 3-4 | Mostly abstract, implications unclear |
| 1-2 | Purely descriptive, no practical application |

### Completeness

How comprehensive is the coverage?

| Score | Meaning |
|---|---|
| 9-10 | Exhaustive authoritative reference (textbook-level) |
| 7-8 | Covers all major aspects thoroughly with depth |
| 5-6 | Covers main points, some gaps |
| 3-4 | Notable gaps, missing important aspects |
| 1-2 | Very incomplete, barely started |

## Automated Metrics

These are computed directly from content, not LLM-graded:

| Metric | What It Measures | How Computed |
|---|---|---|
| wordCount | Prose words (excluding tables) | Strip tables, code blocks, imports, components |
| citations | External sources | Count `<R id=...>` + markdown links `[](https://...)` |
| tables | Data tables | Count `\|---\|` patterns |
| diagrams | Visual elements | Count `<Mermaid>` + `![](...)` images |
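
A rough sketch of how these could be computed; the regexes here are illustrative stand-ins, not the script's actual patterns:

```js
// Sketch: derive automated metrics from raw page source.
// All regexes are illustrative approximations of the rules above.
function computeMetrics(source) {
  const count = (re) => (source.match(re) ?? []).length;

  const tables = count(/\|\s*-{3,}\s*\|/g); // |---| separator rows
  const diagrams = count(/<Mermaid/g) + count(/!\[[^\]]*\]\([^)]*\)/g);
  const citations = count(/<R id=/g) + count(/[^!]\[[^\]]*\]\(https?:\/\//g);

  // Word count: strip tables, code blocks, imports, and components first.
  const prose = source
    .replace(/```[\s\S]*?```/g, " ") // fenced code blocks
    .replace(/^\|.*\|\s*$/gm, " ")   // table rows
    .replace(/^import .*$/gm, " ")   // imports
    .replace(/<[^>]+>/g, " ");       // components and tags
  const wordCount = prose.split(/\s+/).filter(Boolean).length;

  return { wordCount, citations, tables, diagrams };
}
```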
## Quality Score

The derived quality score combines the subscores with the automated metrics:

```text
quality = (avgSubscore × 8) + lengthBonus + evidenceBonus
```

Where:

- avgSubscore = (novelty + rigor + actionability + completeness) / 4 → contributes 0-80
- lengthBonus = min(10, wordCount / 500) → contributes 0-10
- evidenceBonus = min(10, citations × 0.5) → contributes 0-10

Subscores are the primary driver (~80% of score). Bonuses reward depth but can’t compensate for weak content.
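
The formula translates directly to code. Rounding to an integer is an assumption, made to match the integer quality values shown in the frontmatter example below:

```js
// Derived quality score (0-100), per the formula above.
function deriveQuality(ratings, metrics) {
  const { novelty, rigor, actionability, completeness } = ratings;
  const avgSubscore = (novelty + rigor + actionability + completeness) / 4; // 0-10
  const lengthBonus = Math.min(10, metrics.wordCount / 500);                // 0-10
  const evidenceBonus = Math.min(10, metrics.citations * 0.5);              // 0-10
  return Math.round(avgSubscore * 8 + lengthBonus + evidenceBonus);         // rounding assumed
}
```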

The derived score maps to a label:

| Quality Range | Label | Meaning |
|---|---|---|
| 80-100 | Comprehensive | Fully developed, authoritative |
| 60-79 | Good | Solid content, minor gaps |
| 40-59 | Adequate | Useful but needs work |
| 20-39 | Draft | Early stage, significant gaps |
| 0-19 | Stub | Placeholder only |
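
The label boundaries translate to a straightforward lookup:

```js
// Map a derived quality score (0-100) to its label, per the table above.
function qualityLabel(quality) {
  if (quality >= 80) return "Comprehensive";
  if (quality >= 60) return "Good";
  if (quality >= 40) return "Adequate";
  if (quality >= 20) return "Draft";
  return "Stub";
}
```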

## Frontmatter Output

After grading, pages have this frontmatter structure:

```yaml
---
title: "Page Title"
description: "Executive summary with methodology AND conclusions"
quality: 65 # Derived 0-100
importance: 75 # LLM-assessed 0-100
lastEdited: "2025-01-28"
ratings:
  novelty: 4.5 # 0-10 scale
  rigor: 5.2
  actionability: 4.8
  completeness: 5.0
metrics:
  wordCount: 1250 # Automated
  citations: 12
  tables: 3
  diagrams: 1
llmSummary: "This page analyzes X using Y methodology. It finds that Z with N% probability."
---
```
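
To read grades back out programmatically, any YAML frontmatter parser works. This sketch assumes the gray-matter package and a hypothetical content path; the repo's actual parser and layout may differ:

```js
// Sketch: read grades from a page's frontmatter.
// gray-matter is an assumed dependency; the path is hypothetical.
import { readFileSync } from "node:fs";
import matter from "gray-matter";

const { data } = matter(readFileSync("src/content/pages/scheming.mdx", "utf8"));
console.log(data.quality, data.importance, data.ratings.novelty);
```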

## CLI Reference

```text
node scripts/content/grade-content.mjs [options]

Options:
  --page ID       Grade single page by ID or partial match
  --dry-run       Preview without API calls
  --limit N       Process only N pages
  --parallel N    Concurrent API requests (default: 1)
  --category X    Filter by category (models, risks, responses)
  --skip-graded   Skip pages with existing importance
  --output FILE   JSON output path (default: .claude/temp/grades-output.json)
  --apply         Write grades directly to frontmatter
```
Approximate costs (as reported by --dry-run estimates):

| Scenario | Input Tokens | Output Tokens | Cost |
|---|---|---|---|
| Single page | ≈4K | ≈200 | ≈$0.05 |
| All 300 pages | ≈1.2M | ≈60K | ≈$15 |
| 10 pages parallel | ≈40K | ≈2K | ≈$0.50 |
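
These figures follow from per-page token counts (≈4K in, ≈200 out) times per-token prices. A sketch with prices left as parameters, since the model's pricing is not specified here:

```js
// Estimate grading cost in dollars. Prices (per million tokens) are
// parameters because the model's pricing is not specified in this doc.
function estimateCost(pages, inputPricePerM, outputPricePerM) {
  const inputTokens = pages * 4_000; // ≈4K input tokens per page
  const outputTokens = pages * 200;  // ≈200 output tokens per page
  return (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1e6;
}
```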

## Validation

Pages are validated against quality criteria based on their type:

```sh
npm run crux -- validate templates                     # Template structure
npm run crux -- validate unified --rules=placeholders  # Incomplete content
```

See Page Types for which pages are validated.