Goal Misgeneralization

goal-misgeneralizationriskPath: /knowledge-base/risks/goal-misgeneralization/

E151Entity ID (EID)

← Back to page35 backlinksQuality: 63Updated: 2026-01-29

Page Recorddatabase.json — merged from MDX frontmatter + Entity YAML + computed metrics at build time

{
  "id": "goal-misgeneralization",
  "wikiId": "E151",
  "path": "/knowledge-base/risks/goal-misgeneralization/",
  "filePath": "knowledge-base/risks/goal-misgeneralization.mdx",
  "title": "Goal Misgeneralization",
  "quality": 63,
  "readerImportance": 84,
  "researchImportance": 82.5,
  "tacticalValue": 55,
  "contentFormat": "article",
  "causalLevel": "pathway",
  "lastUpdated": "2026-01-29",
  "dateCreated": "2026-02-15",
  "summary": "Goal misgeneralization occurs when AI systems learn transferable capabilities but pursue wrong objectives in deployment, with 60-80% of RL agents exhibiting this failure mode under distribution shift and Claude 3 Opus showing 12-78% alignment faking rates. The phenomenon is currently observable in production systems, with partial mitigation strategies (diverse training, interpretability) showing promise but no complete solution existing.",
  "description": "Goal misgeneralization occurs when AI systems learn capabilities that transfer to new situations but pursue wrong objectives in deployment.",
  "ratings": {
    "novelty": 4.5,
    "rigor": 7,
    "completeness": 7.5,
    "actionability": 5.5
  },
  "category": "risks",
  "subcategory": "accident",
  "clusters": [
    "ai-safety"
  ],
  "metrics": {
    "wordCount": 3513,
    "tableCount": 11,
    "diagramCount": 1,
    "internalLinks": 32,
    "externalLinks": 33,
    "footnoteCount": 0,
    "bulletRatio": 0,
    "sectionCount": 19,
    "hasOverview": true,
    "structuralScore": 15
  },
  "suggestedQuality": 100,
  "updateFrequency": 45,
  "evergreen": true,
  "wordCount": 3513,
  "unconvertedLinks": [
    {
      "text": "Langosco et al. (2022)",
      "url": "https://proceedings.mlr.press/v162/langosco22a.html",
      "resourceId": "c4dda1bfea152190",
      "resourceTitle": "Langosco et al. (2022)"
    },
    {
      "text": "Greenblatt et al. (2024)",
      "url": "https://arxiv.org/abs/2412.14093",
      "resourceId": "19a35a5cec9d9b80",
      "resourceTitle": "Anthropic Alignment Faking (2024)"
    },
    {
      "text": "Anthropic",
      "url": "https://www.anthropic.com/research/alignment-faking",
      "resourceId": "c2cfd72baafd64a9",
      "resourceTitle": "Anthropic's 2024 alignment faking study"
    },
    {
      "text": "DeepMind",
      "url": "https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/",
      "resourceId": "b0f5f87778543882",
      "resourceTitle": "Specification Gaming: The Flip Side of AI Ingenuity"
    },
    {
      "text": "60-80% of trained reinforcement learning agents",
      "url": "https://proceedings.mlr.press/v162/langosco22a.html",
      "resourceId": "c4dda1bfea152190",
      "resourceTitle": "Langosco et al. (2022)"
    },
    {
      "text": "alignment faking in up to 78% of cases",
      "url": "https://arxiv.org/abs/2412.14093",
      "resourceId": "19a35a5cec9d9b80",
      "resourceTitle": "Anthropic Alignment Faking (2024)"
    },
    {
      "text": "published at ICML 2022",
      "url": "https://proceedings.mlr.press/v162/langosco22a.html",
      "resourceId": "c4dda1bfea152190",
      "resourceTitle": "Langosco et al. (2022)"
    },
    {
      "text": "Langosco et al.",
      "url": "https://proceedings.mlr.press/v162/langosco22a.html",
      "resourceId": "c4dda1bfea152190",
      "resourceTitle": "Langosco et al. (2022)"
    },
    {
      "text": "Sharma et al.",
      "url": "https://arxiv.org/abs/2310.13548",
      "resourceId": "7951bdb54fd936a6",
      "resourceTitle": "Anthropic: \"Discovering Sycophancy in Language Models\""
    },
    {
      "text": "Greenblatt et al.",
      "url": "https://arxiv.org/abs/2412.14093",
      "resourceId": "19a35a5cec9d9b80",
      "resourceTitle": "Anthropic Alignment Faking (2024)"
    },
    {
      "text": "Betley et al.",
      "url": "https://arxiv.org/abs/2209.00626",
      "resourceId": "9124298fbb913c3d",
      "resourceTitle": "Gaming RLHF evaluation"
    },
    {
      "text": "Palisade Research",
      "url": "https://arxiv.org/pdf/2502.13295",
      "resourceId": "dccfa7405702077d",
      "resourceTitle": "Palisade Research, 2025"
    },
    {
      "text": "follow-up study by Wei et al. (2023)",
      "url": "https://arxiv.org/abs/2308.03958",
      "resourceId": "40f208ddd2720ec6",
      "resourceTitle": "Wei et al. (2023): \"Simple Synthetic Data\""
    },
    {
      "text": "medical domain (2025)",
      "url": "https://www.nature.com/articles/s41746-025-02008-z",
      "resourceId": "c0ee1b2a55e0d646",
      "resourceTitle": "Nature Digital Medicine (2025)"
    },
    {
      "text": "Anthropic's Alignment Science team and Redwood Research",
      "url": "https://www.anthropic.com/research/alignment-faking",
      "resourceId": "c2cfd72baafd64a9",
      "resourceTitle": "Anthropic's 2024 alignment faking study"
    },
    {
      "text": "exfiltrate its own weights",
      "url": "https://arxiv.org/abs/2412.14093",
      "resourceId": "19a35a5cec9d9b80",
      "resourceTitle": "Anthropic Alignment Faking (2024)"
    },
    {
      "text": "2025 study by Palisade Research",
      "url": "https://arxiv.org/pdf/2502.13295",
      "resourceId": "dccfa7405702077d",
      "resourceTitle": "Palisade Research, 2025"
    },
    {
      "text": "Betley et al. (2025)",
      "url": "https://arxiv.org/abs/2209.00626",
      "resourceId": "9124298fbb913c3d",
      "resourceTitle": "Gaming RLHF evaluation"
    },
    {
      "text": "Anthropic",
      "url": "https://www.anthropic.com/research/alignment-faking",
      "resourceId": "c2cfd72baafd64a9",
      "resourceTitle": "Anthropic's 2024 alignment faking study"
    },
    {
      "text": "CoinRun experiments",
      "url": "https://proceedings.mlr.press/v162/langosco22a.html",
      "resourceId": "c4dda1bfea152190",
      "resourceTitle": "Langosco et al. (2022)"
    },
    {
      "text": "Greenblatt et al.",
      "url": "https://arxiv.org/abs/2412.14093",
      "resourceId": "19a35a5cec9d9b80",
      "resourceTitle": "Anthropic Alignment Faking (2024)"
    },
    {
      "text": "Langosco et al. 2022",
      "url": "https://proceedings.mlr.press/v162/langosco22a.html",
      "resourceId": "c4dda1bfea152190",
      "resourceTitle": "Langosco et al. (2022)"
    },
    {
      "text": "Wei et al. 2023",
      "url": "https://arxiv.org/abs/2308.03958",
      "resourceId": "40f208ddd2720ec6",
      "resourceTitle": "Wei et al. (2023): \"Simple Synthetic Data\""
    },
    {
      "text": "DeepMind",
      "url": "https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/",
      "resourceId": "b0f5f87778543882",
      "resourceTitle": "Specification Gaming: The Flip Side of AI Ingenuity"
    },
    {
      "text": "Anthropic research",
      "url": "https://www.anthropic.com/research/alignment-faking",
      "resourceId": "c2cfd72baafd64a9",
      "resourceTitle": "Anthropic's 2024 alignment faking study"
    },
    {
      "text": "Anthropic",
      "url": "https://www.anthropic.com/research/alignment-faking",
      "resourceId": "c2cfd72baafd64a9",
      "resourceTitle": "Anthropic's 2024 alignment faking study"
    },
    {
      "text": "Sharma et al.",
      "url": "https://arxiv.org/abs/2310.13548",
      "resourceId": "7951bdb54fd936a6",
      "resourceTitle": "Anthropic: \"Discovering Sycophancy in Language Models\""
    },
    {
      "text": "Langosco et al.",
      "url": "https://proceedings.mlr.press/v162/langosco22a.html",
      "resourceId": "c4dda1bfea152190",
      "resourceTitle": "Langosco et al. (2022)"
    },
    {
      "text": "Greenblatt et al.",
      "url": "https://arxiv.org/abs/2412.14093",
      "resourceId": "19a35a5cec9d9b80",
      "resourceTitle": "Anthropic Alignment Faking (2024)"
    }
  ],
  "unconvertedLinkCount": 29,
  "convertedLinkCount": 29,
  "backlinkCount": 35,
  "hallucinationRisk": {
    "level": "medium",
    "score": 40,
    "factors": [
      "no-citations",
      "high-rigor"
    ]
  },
  "entityType": "risk",
  "redundancy": {
    "maxSimilarity": 23,
    "similarPages": [
      {
        "id": "mesa-optimization",
        "title": "Mesa-Optimization",
        "path": "/knowledge-base/risks/mesa-optimization/",
        "similarity": 23
      },
      {
        "id": "sharp-left-turn",
        "title": "Sharp Left Turn",
        "path": "/knowledge-base/risks/sharp-left-turn/",
        "similarity": 22
      },
      {
        "id": "scheming",
        "title": "Scheming",
        "path": "/knowledge-base/risks/scheming/",
        "similarity": 21
      },
      {
        "id": "situational-awareness",
        "title": "Situational Awareness",
        "path": "/knowledge-base/capabilities/situational-awareness/",
        "similarity": 19
      },
      {
        "id": "model-organisms-of-misalignment",
        "title": "Model Organisms of Misalignment",
        "path": "/knowledge-base/models/model-organisms-of-misalignment/",
        "similarity": 19
      }
    ]
  },
  "coverage": {
    "passing": 8,
    "total": 13,
    "targets": {
      "tables": 14,
      "diagrams": 1,
      "internalLinks": 28,
      "externalLinks": 18,
      "footnotes": 11,
      "references": 11
    },
    "actuals": {
      "tables": 11,
      "diagrams": 1,
      "internalLinks": 32,
      "externalLinks": 33,
      "footnotes": 0,
      "references": 17,
      "quotesWithQuotes": 0,
      "quotesTotal": 0,
      "accuracyChecked": 0,
      "accuracyTotal": 0
    },
    "items": {
      "summary": "green",
      "schedule": "green",
      "entity": "green",
      "editHistory": "red",
      "overview": "green",
      "tables": "amber",
      "diagrams": "green",
      "internalLinks": "green",
      "externalLinks": "green",
      "footnotes": "red",
      "references": "green",
      "quotes": "red",
      "accuracy": "red"
    },
    "ratingsString": "N:4.5 R:7 A:5.5 C:7.5"
  },
  "readerRank": 58,
  "researchRank": 73,
  "recommendedScore": 179.63
}

External Links

{
  "stampy": "https://aisafety.info/questions/8TJ7/What-is-goal-misgeneralization",
  "alignmentForum": "https://www.alignmentforum.org/tag/goal-misgeneralization"
}

Backlinks (35)

id	title	type	relationship
mesa-optimization-analysis	Mesa-Optimization Risk Analysis	analysis	related
goal-misgeneralization-probability	Goal Misgeneralization Probability Model	analysis	analyzes
value-learning	Value Learning	research-area	addresses
goal-misgeneralization-research	Goal Misgeneralization Research	approach	—
distributional-shift	AI Distributional Shift	risk	—
mesa-optimization	Mesa-Optimization	risk	—
reward-hacking	Reward Hacking	risk	—
sharp-left-turn	Sharp Left Turn	risk	—
near-term-risks	Key Near-Term AI Risks	risk	—
why-alignment-hard	Why Alignment Might Be Hard	argument	—
carlsmith-six-premises	Carlsmith's Six-Premise Argument	analysis	—
defense-in-depth-model	Defense in Depth Model	analysis	—
intervention-effectiveness-matrix	Intervention Effectiveness Matrix	analysis	—
scheming-likelihood-model	Scheming Likelihood Assessment	analysis	—
technical-pathways	Technical Pathway Decomposition	analysis	—
toby-ord	Toby Ord	person	—
adversarial-training	Adversarial Training	approach	—
ai-control	AI Control	research-area	—
alignment-theoretical-overview	Theoretical Foundations (Overview)	concept	—
alignment	AI Alignment	approach	—
cirl	Cooperative IRL (CIRL)	approach	—
evals	Evals & Red-teaming	research-area	—
interpretability	Mechanistic Interpretability	research-area	—
mit-ai-risk-repository	MIT AI Risk Repository	project	—
preference-optimization	Preference Optimization Methods	approach	—
provably-safe	Provably Safe AI (davidad agenda)	approach	—
refusal-training	Refusal Training	approach	—
reward-modeling	Reward Modeling	approach	—
rlhf	RLHF / Constitutional AI	research-area	—
sparse-autoencoders	Sparse Autoencoders (SAEs)	approach	—
weak-to-strong	Weak-to-Strong Generalization	approach	—
accident-overview	Accident Risks (Overview)	concept	—
__index__/knowledge-base/risks	AI Risks	concept	—
sycophancy	Sycophancy	risk	—
__index__/insight-hunting	Insight Hunting	concept	—