
# Rating System

LongtermWiki uses a multi-dimensional rating system combining LLM-graded subscores with automated metrics to produce a derived quality score (0-100).

```sh
# Grade a single page
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --page scheming

# Grade all pages (with cost estimate)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --dry-run

# Grade and apply to frontmatter
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --apply

# Parallel processing (faster, higher API cost)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --parallel 5 --apply
```

## Importance (0-100)

How significant is this page for AI risk prioritization work?

| Range | Description | Expected Count |
|---|---|---|
| 90-100 | Essential for prioritization decisions | 5-10 pages |
| 70-89 | High value for practitioners | 30-50 pages |
| 50-69 | Useful context | 80-100 pages |
| 30-49 | Reference material | 60-80 pages |
| 0-29 | Peripheral | 30-50 pages |

Category adjustments applied to the base assessment (see the sketch below):

- Responses/interventions: +10
- Capabilities: +5
- Core risks: +5
- Risk factors: 0
- Models/analysis: -5
- Arguments/debates: -10
- People/organizations: -15
- Internal/infrastructure: -30
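
A minimal sketch of how an adjustment might be applied, in the style of the repo's Node scripts. The category keys are hypothetical slugs for the labels above, and the clamp to 0-100 is an assumption, not documented behavior:

```js
// Sketch: apply a category adjustment to a base importance assessment.
// Category keys are hypothetical slugs for the labels above; clamping
// to 0-100 is an assumption about how out-of-range results are handled.
const CATEGORY_ADJUSTMENTS = {
  responses: +10,
  capabilities: +5,
  coreRisks: +5,
  riskFactors: 0,
  models: -5,
  arguments: -10,
  people: -15,
  internal: -30,
};

function adjustImportance(baseScore, category) {
  const adjusted = baseScore + (CATEGORY_ADJUSTMENTS[category] ?? 0);
  return Math.max(0, Math.min(100, adjusted));
}
```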

## Subscores (0-10)

Four LLM-graded subscores feed the quality formula. Scoring is harsh: a 7 is exceptional, 8+ is world-class. Most content should score 3-5.

### Novelty

How original is the content beyond its sources?

| Score | Meaning |
|---|---|
| 9-10 | Groundbreaking original research (academic publication level) |
| 7-8 | Significant original synthesis not found elsewhere |
| 5-6 | Some original framing, modest value beyond sources |
| 3-4 | Accurate summary with minimal original perspective |
| 1-2 | Mostly restates common knowledge |

### Rigor

How well-evidenced and precise are the claims?

| Score | Meaning |
|---|---|
| 9-10 | Every claim sourced to primary sources, quantified with uncertainty |
| 7-8 | Nearly all claims well-sourced and quantified |
| 5-6 | Most major claims sourced, some quantification |
| 3-4 | Mix of sourced and unsourced, vague claims common |
| 1-2 | Few sources, mostly assertions |

### Actionability

How useful is this for making decisions?

| Score | Meaning |
|---|---|
| 9-10 | Specific decision procedures with quantified tradeoffs |
| 7-8 | Clear concrete recommendations with supporting analysis |
| 5-6 | Some actionable takeaways, general guidance |
| 3-4 | Mostly abstract, implications unclear |
| 1-2 | Purely descriptive, no practical application |

### Completeness

How comprehensive is the coverage?

| Score | Meaning |
|---|---|
| 9-10 | Exhaustive authoritative reference (textbook-level) |
| 7-8 | Covers all major aspects thoroughly with depth |
| 5-6 | Covers main points, some gaps |
| 3-4 | Notable gaps, missing important aspects |
| 1-2 | Very incomplete, barely started |

## Automated Metrics

These are computed directly from content, not LLM-graded:

| Metric | What It Measures | How Computed |
|---|---|---|
| wordCount | Prose words (excluding tables) | Strip tables, code blocks, imports, components |
| citations | External sources | Count `<R id=...>` + markdown links `[](https://...)` |
| tables | Data tables | Count `\|---\|` patterns |
| diagrams | Visual elements | Count `<Mermaid>` + `![](...)` images |
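
A rough sketch of how these could be computed; the regexes here are illustrative stand-ins, not the script's actual patterns:

```js
// Sketch: derive automated metrics from raw page source.
// All regexes are illustrative approximations of the rules above.
function computeMetrics(source) {
  const count = (re) => (source.match(re) ?? []).length;

  const tables = count(/\|\s*-{3,}\s*\|/g); // |---| separator rows
  const diagrams = count(/<Mermaid/g) + count(/!\[[^\]]*\]\([^)]*\)/g);
  const citations = count(/<R id=/g) + count(/[^!]\[[^\]]*\]\(https?:\/\//g);

  // Word count: strip tables, code blocks, imports, and components first.
  const prose = source
    .replace(/```[\s\S]*?```/g, " ") // fenced code blocks
    .replace(/^\|.*\|\s*$/gm, " ")   // table rows
    .replace(/^import .*$/gm, " ")   // imports
    .replace(/<[^>]+>/g, " ");       // components and tags
  const wordCount = prose.split(/\s+/).filter(Boolean).length;

  return { wordCount, citations, tables, diagrams };
}
```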
## Quality Score

The derived quality score combines the subscores with the automated metrics:

```text
quality = (avgSubscore × 8) + lengthBonus + evidenceBonus
```

Where:

- avgSubscore = (novelty + rigor + actionability + completeness) / 4 → contributes 0-80
- lengthBonus = min(10, wordCount / 500) → contributes 0-10
- evidenceBonus = min(10, citations × 0.5) → contributes 0-10

Subscores are the primary driver (~80% of score). Bonuses reward depth but can’t compensate for weak content.
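
The formula translates directly to code. Rounding to an integer is an assumption, made to match the integer quality values shown in the frontmatter example below:

```js
// Derived quality score (0-100), per the formula above.
function deriveQuality(ratings, metrics) {
  const { novelty, rigor, actionability, completeness } = ratings;
  const avgSubscore = (novelty + rigor + actionability + completeness) / 4; // 0-10
  const lengthBonus = Math.min(10, metrics.wordCount / 500);                // 0-10
  const evidenceBonus = Math.min(10, metrics.citations * 0.5);              // 0-10
  return Math.round(avgSubscore * 8 + lengthBonus + evidenceBonus);         // rounding assumed
}
```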

The derived score maps to a label:

| Quality Range | Label | Meaning |
|---|---|---|
| 80-100 | Comprehensive | Fully developed, authoritative |
| 60-79 | Good | Solid content, minor gaps |
| 40-59 | Adequate | Useful but needs work |
| 20-39 | Draft | Early stage, significant gaps |
| 0-19 | Stub | Placeholder only |
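
The label boundaries translate to a straightforward lookup:

```js
// Map a derived quality score (0-100) to its label, per the table above.
function qualityLabel(quality) {
  if (quality >= 80) return "Comprehensive";
  if (quality >= 60) return "Good";
  if (quality >= 40) return "Adequate";
  if (quality >= 20) return "Draft";
  return "Stub";
}
```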

## Frontmatter Output

After grading, pages have this frontmatter structure:

```yaml
---
title: "Page Title"
description: "Executive summary with methodology AND conclusions"
quality: 65 # Derived 0-100
importance: 75 # LLM-assessed 0-100
lastEdited: "2025-01-28"
ratings:
  novelty: 4.5 # 0-10 scale
  rigor: 5.2
  actionability: 4.8
  completeness: 5.0
metrics:
  wordCount: 1250 # Automated
  citations: 12
  tables: 3
  diagrams: 1
llmSummary: "This page analyzes X using Y methodology. It finds that Z with N% probability."
---
```
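
To read grades back out programmatically, any YAML frontmatter parser works. This sketch assumes the gray-matter package and a hypothetical content path; the repo's actual parser and layout may differ:

```js
// Sketch: read grades from a page's frontmatter.
// gray-matter is an assumed dependency; the path is hypothetical.
import { readFileSync } from "node:fs";
import matter from "gray-matter";

const { data } = matter(readFileSync("src/content/pages/scheming.mdx", "utf8"));
console.log(data.quality, data.importance, data.ratings.novelty);
```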

## CLI Reference

```text
node scripts/content/grade-content.mjs [options]

Options:
  --page ID       Grade single page by ID or partial match
  --dry-run       Preview without API calls
  --limit N       Process only N pages
  --parallel N    Concurrent API requests (default: 1)
  --category X    Filter by category (models, risks, responses)
  --skip-graded   Skip pages with existing importance
  --output FILE   JSON output path (default: .claude/temp/grades-output.json)
  --apply         Write grades directly to frontmatter
```
Approximate costs (as reported by --dry-run estimates):

| Scenario | Input Tokens | Output Tokens | Cost |
|---|---|---|---|
| Single page | ≈4K | ≈200 | ≈$0.05 |
| All 300 pages | ≈1.2M | ≈60K | ≈$15 |
| 10 pages parallel | ≈40K | ≈2K | ≈$0.50 |
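
These figures follow from per-page token counts (≈4K in, ≈200 out) times per-token prices. A sketch with prices left as parameters, since the model's pricing is not specified here:

```js
// Estimate grading cost in dollars. Prices (per million tokens) are
// parameters because the model's pricing is not specified in this doc.
function estimateCost(pages, inputPricePerM, outputPricePerM) {
  const inputTokens = pages * 4_000; // ≈4K input tokens per page
  const outputTokens = pages * 200;  // ≈200 output tokens per page
  return (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1e6;
}
```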

## Validation

Pages are validated against quality criteria based on their type:

```sh
npm run crux -- validate templates                     # Template structure
npm run crux -- validate unified --rules=placeholders  # Incomplete content
```

See Page Types for which pages are validated.