Reward Hacking

reward-hackingriskPath: /knowledge-base/risks/reward-hacking/

E253Entity ID (EID)

← Back to page39 backlinksQuality: 91Updated: 2026-01-30

Page Recorddatabase.json — merged from MDX frontmatter + Entity YAML + computed metrics at build time

{
  "id": "reward-hacking",
  "wikiId": "E253",
  "path": "/knowledge-base/risks/reward-hacking/",
  "filePath": "knowledge-base/risks/reward-hacking.mdx",
  "title": "Reward Hacking",
  "quality": 91,
  "readerImportance": 15.5,
  "researchImportance": 87.5,
  "tacticalValue": null,
  "contentFormat": "article",
  "causalLevel": "pathway",
  "lastUpdated": "2026-01-30",
  "dateCreated": "2026-02-15",
  "summary": "Comprehensive analysis showing reward hacking occurs in 1-2% of OpenAI o3 task attempts, with 43x higher rates when scoring functions are visible. Mathematical proof establishes it's inevitable for imperfect proxies in continuous policy spaces. Anthropic's 2025 research demonstrates emergent misalignment from production RL reward hacking: 12% sabotage rate, 50% alignment faking, with 'inoculation prompting' reducing misalignment by 75-90%.",
  "description": "AI systems exploit reward signals in unintended ways, from the CoastRunners boat looping for points instead of racing, to OpenAI's o3 modifying evaluation timers.",
  "ratings": {
    "novelty": 5,
    "rigor": 8,
    "completeness": 8,
    "actionability": 6
  },
  "category": "risks",
  "subcategory": "accident",
  "clusters": [
    "ai-safety"
  ],
  "metrics": {
    "wordCount": 4011,
    "tableCount": 11,
    "diagramCount": 1,
    "internalLinks": 43,
    "externalLinks": 17,
    "footnoteCount": 0,
    "bulletRatio": 0.14,
    "sectionCount": 36,
    "hasOverview": true,
    "structuralScore": 15
  },
  "suggestedQuality": 100,
  "updateFrequency": 45,
  "evergreen": true,
  "wordCount": 4011,
  "unconvertedLinks": [
    {
      "text": "METR 2025",
      "url": "https://metr.org/blog/2025-06-05-recent-reward-hacking/",
      "resourceId": "19b64fee1c4ea879",
      "resourceTitle": "METR's June 2025 evaluation"
    },
    {
      "text": "Anthropic 2025",
      "url": "https://www.anthropic.com/research/emergent-misalignment-reward-hacking",
      "resourceId": "7a21b9c5237a8a16",
      "resourceTitle": "Natural Emergent Misalignment from Reward Hacking"
    },
    {
      "text": "Anthropic's ICLR 2024 paper",
      "url": "https://arxiv.org/abs/2310.13548",
      "resourceId": "7951bdb54fd936a6",
      "resourceTitle": "Anthropic: \"Discovering Sycophancy in Language Models\""
    },
    {
      "text": "2025 joint Anthropic-OpenAI evaluation",
      "url": "https://alignment.anthropic.com/2025/openai-findings/",
      "resourceId": "2fdf91febf06daaf",
      "resourceTitle": "Anthropic-OpenAI joint evaluation"
    },
    {
      "text": "Anthropic's November 2025 research",
      "url": "https://www.anthropic.com/research/emergent-misalignment-reward-hacking",
      "resourceId": "7a21b9c5237a8a16",
      "resourceTitle": "Natural Emergent Misalignment from Reward Hacking"
    },
    {
      "text": "METR",
      "url": "https://metr.org/blog/2025-06-05-recent-reward-hacking/",
      "resourceId": "19b64fee1c4ea879",
      "resourceTitle": "METR's June 2025 evaluation"
    },
    {
      "text": "Lilian Weng's 2024 technical overview",
      "url": "https://lilianweng.github.io/posts/2024-11-28-reward-hacking/",
      "resourceId": "570615e019d1cc74",
      "resourceTitle": "Reward Hacking in Reinforcement Learning"
    },
    {
      "text": "Natural Emergent Misalignment from Reward Hacking in Production RL",
      "url": "https://www.anthropic.com/research/emergent-misalignment-reward-hacking",
      "resourceId": "7a21b9c5237a8a16",
      "resourceTitle": "Natural Emergent Misalignment from Reward Hacking"
    },
    {
      "text": "Findings from a Pilot Alignment Evaluation Exercise",
      "url": "https://alignment.anthropic.com/2025/openai-findings/",
      "resourceId": "2fdf91febf06daaf",
      "resourceTitle": "Anthropic-OpenAI joint evaluation"
    }
  ],
  "unconvertedLinkCount": 9,
  "convertedLinkCount": 31,
  "backlinkCount": 39,
  "hallucinationRisk": {
    "level": "medium",
    "score": 35,
    "factors": [
      "no-citations",
      "high-rigor",
      "high-quality"
    ]
  },
  "entityType": "risk",
  "redundancy": {
    "maxSimilarity": 23,
    "similarPages": [
      {
        "id": "reward-hacking-taxonomy",
        "title": "Reward Hacking Taxonomy and Severity Model",
        "path": "/knowledge-base/models/reward-hacking-taxonomy/",
        "similarity": 23
      },
      {
        "id": "sharp-left-turn",
        "title": "Sharp Left Turn",
        "path": "/knowledge-base/risks/sharp-left-turn/",
        "similarity": 20
      },
      {
        "id": "why-alignment-hard",
        "title": "Why Alignment Might Be Hard",
        "path": "/knowledge-base/debates/why-alignment-hard/",
        "similarity": 19
      },
      {
        "id": "scalable-oversight",
        "title": "Scalable Oversight",
        "path": "/knowledge-base/responses/scalable-oversight/",
        "similarity": 19
      },
      {
        "id": "epistemic-sycophancy",
        "title": "Epistemic Sycophancy",
        "path": "/knowledge-base/risks/epistemic-sycophancy/",
        "similarity": 19
      }
    ]
  },
  "coverage": {
    "passing": 6,
    "total": 13,
    "targets": {
      "tables": 16,
      "diagrams": 2,
      "internalLinks": 32,
      "externalLinks": 20,
      "footnotes": 12,
      "references": 12
    },
    "actuals": {
      "tables": 11,
      "diagrams": 1,
      "internalLinks": 43,
      "externalLinks": 17,
      "footnotes": 0,
      "references": 15,
      "quotesWithQuotes": 0,
      "quotesTotal": 0,
      "accuracyChecked": 0,
      "accuracyTotal": 0
    },
    "items": {
      "summary": "green",
      "schedule": "green",
      "entity": "green",
      "editHistory": "red",
      "overview": "green",
      "tables": "amber",
      "diagrams": "amber",
      "internalLinks": "green",
      "externalLinks": "amber",
      "footnotes": "red",
      "references": "green",
      "quotes": "red",
      "accuracy": "red"
    },
    "ratingsString": "N:5 R:8 A:6 C:8"
  },
  "readerRank": 551,
  "researchRank": 39,
  "recommendedScore": 201.46
}

External Links

{
  "wikipedia": "https://en.wikipedia.org/wiki/Reward_hacking",
  "stampy": "https://aisafety.info/questions/8HJI/What-is-reward-hacking",
  "alignmentForum": "https://www.alignmentforum.org/tag/reward-hacking"
}

Backlinks (39)

id	title	type	relationship
goal-misgeneralization-probability	Goal Misgeneralization Probability Model	analysis	related
reward-hacking-taxonomy	Reward Hacking Taxonomy and Severity Model	analysis	analyzes
deepmind	Google DeepMind	organization	addresses
chai	Center for Human-Compatible AI (CHAI)	organization	—
alignment	AI Alignment	approach	—
constitutional-ai	Constitutional AI	approach	—
weak-to-strong	Weak-to-Strong Generalization	approach	—
preference-optimization	Preference Optimization Methods	approach	—
process-supervision	Process Supervision	approach	—
distributional-shift	AI Distributional Shift	risk	—
goal-misgeneralization	Goal Misgeneralization	risk	—
sycophancy	Sycophancy	risk	—
sycophancy	Sycophancy	risk	—
language-models	Large Language Models	capability	—
case-for-xrisk	The Case FOR AI Existential Risk	argument	—
why-alignment-hard	Why Alignment Might Be Hard	argument	—
deep-learning-era	Deep Learning Revolution (2012-2020)	historical	—
alignment-robustness-trajectory	Alignment Robustness Trajectory	analysis	—
defense-in-depth-model	Defense in Depth Model	analysis	—
instrumental-convergence-framework	Instrumental Convergence Framework	analysis	—
model-organisms-of-misalignment	Model Organisms of Misalignment	analysis	—
power-seeking-conditions	Power-Seeking Emergence Conditions Model	analysis	—
risk-activation-timeline	Risk Activation Timeline Model	analysis	—
scheming-likelihood-model	Scheming Likelihood Assessment	analysis	—
elicit	Elicit (AI Research Tool)	organization	—
metr	METR	organization	—
jan-leike	Jan Leike	person	—
cirl	Cooperative IRL (CIRL)	approach	—
debate	AI Safety via Debate	approach	—
evaluation	AI Evaluation	approach	—
goal-misgeneralization-research	Goal Misgeneralization Research	approach	—
interpretability	Mechanistic Interpretability	research-area	—
mech-interp	Mechanistic Interpretability	research-area	—
representation-engineering	Representation Engineering	approach	—
reward-modeling	Reward Modeling	approach	—
sparse-autoencoders	Sparse Autoencoders (SAEs)	approach	—
accident-overview	Accident Risks (Overview)	concept	—
emergent-capabilities	Emergent Capabilities	risk	—
power-seeking	Power-Seeking AI	risk	—