Response Pages Style Guide

Response pages describe interventions, policies, and technical approaches that address AI risks. They explain how something works and assess its effectiveness.

Response pages are at: /knowledge-base/responses/**/*.mdx

---
title: "Response Name"
description: "One sentence explaining what this response does and its key mechanism."
quality: 60 # 0-100
importance: 70 # 0-100
lastEdited: "2025-01-28"
---
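
The ranges noted in the frontmatter comments can be checked mechanically before a page is committed. The following is a minimal sketch, not part of the site's tooling: the `ResponseFrontmatter` interface and the `validateFrontmatter` function are hypothetical names, and the integer and `YYYY-MM-DD` requirements are assumptions inferred from the example values above.

```typescript
// Hypothetical shape of the frontmatter template; field names come from the example above.
interface ResponseFrontmatter {
  title: string;
  description: string;
  quality: number;    // 0-100
  importance: number; // 0-100
  lastEdited: string; // assumed to be "YYYY-MM-DD", as in the example
}

// Returns a list of human-readable problems; an empty list means nothing was flagged.
function validateFrontmatter(fm: ResponseFrontmatter): string[] {
  const problems: string[] = [];

  if (fm.title.trim() === "") problems.push("title is empty");
  if (fm.description.trim() === "") problems.push("description is empty");

  // Assumption: scores are whole numbers within the 0-100 range noted in the template.
  for (const key of ["quality", "importance"] as const) {
    const value = fm[key];
    if (!Number.isInteger(value) || value < 0 || value > 100) {
      problems.push(`${key} must be an integer from 0 to 100`);
    }
  }

  // Checks the YYYY-MM-DD shape and that the string is parseable as a date.
  const looksLikeDate = /^\d{4}-\d{2}-\d{2}$/.test(fm.lastEdited);
  if (!looksLikeDate || Number.isNaN(Date.parse(fm.lastEdited))) {
    problems.push("lastEdited must be a YYYY-MM-DD date");
  }

  return problems;
}
```

Returning a list of problems rather than throwing on the first one lets an author fix every field in a single pass.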

Overview: What is this response and why does it matter?

## Quick Assessment
| Dimension | Rating | Notes |
|-----------|--------|-------|
| Tractability | Medium | Requires significant research investment |
| Scalability | High | Applies to most foundation models |
| Current Maturity | Low | Early research stage |
| Time Horizon | 5-10 years | Needs fundamental advances |
| Key Proponents | Anthropic, DeepMind | Active research programs |

Technical explanation with diagram:

## How It Works
<Mermaid client:load chart={`
flowchart LR
A[Input] --> B[Safety Layer]
B --> C{Check}
C -->|Pass| D[Output]
C -->|Fail| E[Block/Modify]
`} />

[Detailed explanation of mechanism]

## Risks Addressed
| Risk | Relevance | How It Helps |
|------|-----------|--------------|
| [Deceptive Alignment](/...) | High | Detects hidden goals |
| [Reward Hacking](/...) | Medium | Identifies misspecified rewards |

Limitations: What this approach cannot do or gets wrong.

Current State: Who is working on this and what progress has been made.

Open Questions: Unsolved problems and research directions.


Task({
subagent_type: 'general-purpose',
prompt: `Create a response page for [RESPONSE_NAME].
FIRST: Read /internal/response-style-guide/.
THEN: Research using WebSearch:
- Academic papers on the technique
- Lab blog posts and announcements
- Current implementations and results
Create at: src/content/docs/knowledge-base/responses/[category]/[name].mdx
Include:
1. Overview (2-3 paragraphs)
2. Quick Assessment table
3. How It Works (with diagram)
4. Risks Addressed table
5. Limitations
6. Current State
7. Open Questions`
})
Task({
subagent_type: 'general-purpose',
prompt: `Improve response page at [PATH].
Read /internal/response-style-guide/ first.
Add:
1. Quick Assessment table (if missing)
2. Mermaid diagram
3. Risks Addressed cross-links
4. Citations from recent papers
Use WebSearch for current research.`
})
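
After either workflow runs, it can help to confirm that the page actually contains the seven sections listed in the creation prompt. The sketch below is one possible check, under the assumption that every section (including Overview) appears as a `##` heading, as the template does for Quick Assessment, How It Works, and Risks Addressed. The names `REQUIRED_SECTIONS` and `missingSections` are hypothetical.

```typescript
// Section names taken from the "Include:" list in the creation prompt.
const REQUIRED_SECTIONS: readonly string[] = [
  "Overview",
  "Quick Assessment",
  "How It Works",
  "Risks Addressed",
  "Limitations",
  "Current State",
  "Open Questions",
];

// Returns the required sections that have no matching "## Heading" line in the page source.
// Assumption: sections are level-2 headings; matching is case-insensitive.
function missingSections(
  mdx: string,
  required: readonly string[] = REQUIRED_SECTIONS
): string[] {
  const headings = new Set<string>();
  for (const line of mdx.split("\n")) {
    const match = /^##\s+(.+?)\s*$/.exec(line);
    if (match !== null) {
      headings.add((match[1] ?? "").toLowerCase());
    }
  }
  return required.filter((name) => !headings.has(name.toLowerCase()));
}

// Usage: a draft that only has two of the seven sections so far.
const draft = "## Quick Assessment\n| Dimension | Rating |\n\n## How It Works\nText.";
console.log(missingSections(draft));
```

Note that a line-based scan would also match `##` lines inside fenced code blocks, so a page that quotes headings as examples could produce a false pass.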

Pages are scored on four dimensions (0-10 scale). Scoring is harsh: a 7 is exceptional, and most content should score 3-5.

| Dimension | 3-4 (Adequate) | 5-6 (Good) | 7+ (Exceptional) |
|-----------|----------------|------------|------------------|
| Novelty | Accurate summary | Some original framing | Significant original insight |
| Rigor | Mixed sourcing | Mostly sourced | Fully sourced with quantification |
| Actionability | Abstract implications | Some actionable takeaways | Concrete decision guidance |
| Completeness | Notable gaps | Covers main points | Thorough coverage |
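
The rubric bands can be expressed as a small lookup, which is useful when reporting subscores alongside their labels. This is a sketch only: the function name `bandFor` is hypothetical, the "Below adequate" label for scores under 3 is an assumption (the rubric does not name that band), and fractional scores are assigned to the band whose lower bound they have reached.

```typescript
type Band = "Below adequate" | "Adequate" | "Good" | "Exceptional";

// Maps a 0-10 subscore to the rubric band it falls into.
// Assumption: "Below adequate" is an illustrative label for scores under 3.
function bandFor(score: number): Band {
  if (Number.isNaN(score) || score < 0 || score > 10) {
    throw new RangeError("score must be between 0 and 10");
  }
  if (score >= 7) return "Exceptional";
  if (score >= 5) return "Good";
  if (score >= 3) return "Adequate";
  return "Below adequate";
}

// Usage: label each of the four dimensions for one page.
const subscores = { novelty: 4, rigor: 6, actionability: 3, completeness: 7 };
for (const [dimension, score] of Object.entries(subscores)) {
  console.log(`${dimension}: ${score} (${bandFor(score)})`);
}
```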

Derived quality (0-100) combines the four subscores with word count and citation bonuses. See CLAUDE.md for the formula.


See the Mechanistic Interpretability page for a reference example.