Rating System
LongtermWiki uses a multi-dimensional rating system combining LLM-graded subscores with automated metrics to produce a derived quality score (0-100).
Quick Reference
```sh
# Grade a single page
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --page scheming

# Grade all pages (with cost estimate)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --dry-run

# Grade and apply to frontmatter
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --apply

# Parallel processing (faster, higher API cost)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --parallel 5 --apply
```

Score Components
1. Importance (0-100)
How significant is this page for AI risk prioritization work?
| Range | Description | Expected Count |
|---|---|---|
| 90-100 | Essential for prioritization decisions | 5-10 pages |
| 70-89 | High value for practitioners | 30-50 pages |
| 50-69 | Useful context | 80-100 pages |
| 30-49 | Reference material | 60-80 pages |
| 0-29 | Peripheral | 30-50 pages |
Category adjustments are applied to the base assessment:
- Responses/interventions: +10
- Capabilities: +5
- Core risks: +5
- Risk factors: 0
- Models/analysis: -5
- Arguments/debates: -10
- People/organizations: -15
- Internal/infrastructure: -30
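As a sketch, the adjustment step above might look like the following (the helper and category keys are hypothetical; the actual grade-content.mjs implementation may differ):

```javascript
// Hypothetical sketch of the category adjustment step; key names
// are illustrative, not taken from the real script.
const CATEGORY_ADJUSTMENTS = {
  responses: 10,
  capabilities: 5,
  "core-risks": 5,
  "risk-factors": 0,
  models: -5,
  arguments: -10,
  people: -15,
  internal: -30,
};

// Apply the category delta, clamping back into the 0-100 range.
function adjustImportance(baseScore, category) {
  const delta = CATEGORY_ADJUSTMENTS[category] ?? 0;
  return Math.min(100, Math.max(0, baseScore + delta));
}
```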
2. Quality Subscores (0-10 each)
Scoring is harsh: a 7 is exceptional, and 8+ is world-class. Most content should score 3-5.
Novelty
How original is the content beyond its sources?
| Score | Meaning |
|---|---|
| 9-10 | Groundbreaking original research (academic publication level) |
| 7-8 | Significant original synthesis not found elsewhere |
| 5-6 | Some original framing, modest value beyond sources |
| 3-4 | Accurate summary with minimal original perspective |
| 1-2 | Mostly restates common knowledge |
Rigor
How well-evidenced and precise are the claims?
| Score | Meaning |
|---|---|
| 9-10 | Every claim sourced to primary sources, quantified with uncertainty |
| 7-8 | Nearly all claims well-sourced and quantified |
| 5-6 | Most major claims sourced, some quantification |
| 3-4 | Mix of sourced and unsourced, vague claims common |
| 1-2 | Few sources, mostly assertions |
Actionability
How useful is this for making decisions?
| Score | Meaning |
|---|---|
| 9-10 | Specific decision procedures with quantified tradeoffs |
| 7-8 | Clear concrete recommendations with supporting analysis |
| 5-6 | Some actionable takeaways, general guidance |
| 3-4 | Mostly abstract, implications unclear |
| 1-2 | Purely descriptive, no practical application |
Completeness
How comprehensive is the coverage?
| Score | Meaning |
|---|---|
| 9-10 | Exhaustive authoritative reference (textbook-level) |
| 7-8 | Covers all major aspects thoroughly with depth |
| 5-6 | Covers main points, some gaps |
| 3-4 | Notable gaps, missing important aspects |
| 1-2 | Very incomplete, barely started |
3. Automated Metrics
These are computed directly from content, not LLM-graded:
| Metric | What It Measures | How Computed |
|---|---|---|
| `wordCount` | Prose words (excluding tables) | Strip tables, code blocks, imports, components |
| `citations` | External sources | Count `<R id=...>` components + markdown links `[](https://...)` |
| `tables` | Data tables | Count `\|---\|` separator patterns |
| `diagrams` | Visual elements | Count `<Mermaid>` components + markdown images |
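The counters in the table above could be sketched roughly as follows (the regexes are illustrative assumptions; the real script's patterns may differ):

```javascript
// Hypothetical sketch of the automated metric counters described above.
function computeMetrics(markdown) {
  // Citations: <R id=...> components plus external markdown links.
  const citations =
    (markdown.match(/<R\s+id=/g) ?? []).length +
    (markdown.match(/\]\(https?:\/\//g) ?? []).length;

  // Tables: separator rows made of pipes, dashes, and colons.
  const tables = (markdown.match(/^\|[\s:-]*-[\s|:-]*$/gm) ?? []).length;

  // Diagrams: <Mermaid> components plus markdown images.
  const diagrams =
    (markdown.match(/<Mermaid/g) ?? []).length +
    (markdown.match(/!\[/g) ?? []).length;

  // Word count: prose only, with code fences, table rows,
  // and import lines stripped first.
  const prose = markdown
    .replace(/```[\s\S]*?```/g, " ")
    .replace(/^\|.*\|$/gm, " ")
    .replace(/^import .*$/gm, " ");
  const wordCount = (prose.match(/\S+/g) ?? []).length;

  return { wordCount, citations, tables, diagrams };
}
```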
4. Derived Quality Score (0-100)
quality = (avgSubscore × 8) + lengthBonus + evidenceBonus

Where:

- avgSubscore = (novelty + rigor + actionability + completeness) / 4 → contributes 0-80
- lengthBonus = min(10, wordCount / 500) → contributes 0-10
- evidenceBonus = min(10, citations × 0.5) → contributes 0-10
Subscores are the primary driver (~80% of score). Bonuses reward depth but can’t compensate for weak content.
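The formula above translates directly into code. A minimal sketch (the source does not specify whether the script rounds the result, so this returns the raw value):

```javascript
// Derived quality score, following the documented formula:
// quality = (avgSubscore × 8) + lengthBonus + evidenceBonus
function deriveQuality({ ratings, metrics }) {
  const { novelty, rigor, actionability, completeness } = ratings;
  const avgSubscore = (novelty + rigor + actionability + completeness) / 4;
  const lengthBonus = Math.min(10, metrics.wordCount / 500);
  const evidenceBonus = Math.min(10, metrics.citations * 0.5);
  return avgSubscore * 8 + lengthBonus + evidenceBonus;
}
```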
| Quality Range | Label | Meaning |
|---|---|---|
| 80-100 | Comprehensive | Fully developed, authoritative |
| 60-79 | Good | Solid content, minor gaps |
| 40-59 | Adequate | Useful but needs work |
| 20-39 | Draft | Early stage, significant gaps |
| 0-19 | Stub | Placeholder only |
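The label boundaries in the table above can be sketched as a simple threshold mapping (a hypothetical helper, not taken from the script):

```javascript
// Map a derived quality score (0-100) to its label, per the ranges above.
function qualityLabel(quality) {
  if (quality >= 80) return "Comprehensive";
  if (quality >= 60) return "Good";
  if (quality >= 40) return "Adequate";
  if (quality >= 20) return "Draft";
  return "Stub";
}
```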
Frontmatter Schema
After grading, pages have this frontmatter structure:

```yaml
---
title: "Page Title"
description: "Executive summary with methodology AND conclusions"
quality: 65          # Derived 0-100
importance: 75       # LLM-assessed 0-100
lastEdited: "2025-01-28"
ratings:
  novelty: 4.5       # 0-10 scale
  rigor: 5.2
  actionability: 4.8
  completeness: 5.0
metrics:
  wordCount: 1250    # Automated
  citations: 12
  tables: 3
  diagrams: 1
llmSummary: "This page analyzes X using Y methodology. It finds that Z with N% probability."
---
```

Script Options
```sh
node scripts/content/grade-content.mjs [options]

Options:
  --page ID       Grade single page by ID or partial match
  --dry-run       Preview without API calls
  --limit N       Process only N pages
  --parallel N    Concurrent API requests (default: 1)
  --category X    Filter by category (models, risks, responses)
  --skip-graded   Skip pages with existing importance
  --output FILE   JSON output path (default: .claude/temp/grades-output.json)
  --apply         Write grades directly to frontmatter
```

Cost Estimates
| Scenario | Input Tokens | Output Tokens | Cost |
|---|---|---|---|
| Single page | ≈4K | ≈200 | ≈$1.05 |
| All 300 pages | ≈1.2M | ≈60K | ≈$15 |
| 10 pages parallel | ≈40K | ≈2K | ≈$1.50 |
Validation
Pages are validated against quality criteria based on their type:

```sh
npm run crux -- validate templates                      # Template structure
npm run crux -- validate unified --rules=placeholders   # Incomplete content
```

See Page Types for which pages are validated.
Related Documentation
- Models Style Guide - Requirements for analytical model pages
- Risk Style Guide - Requirements for risk analysis pages
- Response Style Guide - Requirements for intervention pages
- Page Types - How page types affect validation