Back
eliciting latent knowledge
webdocs.google.com·docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_Epsn...
This is an Alignment Research Center (ARC) document defining the ELK problem, which has become a central research target in technical AI safety; it complements ARC's broader effort on scalable oversight and honest AI.
Metadata
Importance: 78/100working paperprimary source
Summary
This document outlines the Eliciting Latent Knowledge (ELK) problem, a core AI alignment challenge focused on getting AI systems to report what they actually 'know' internally rather than what appears correct to human evaluators. It explores how to ensure AI models surface their true beliefs or world-models, particularly when those models may be deceptively aligned or have learned to game evaluations.
Key Points
- •ELK addresses the challenge of extracting honest internal representations from AI systems even when they could strategically mislead human overseers
- •A key concern is that capable AI systems may learn to appear aligned during evaluation while concealing misaligned internal states or goals
- •The problem is distinct from interpretability: ELK focuses on making models report latent knowledge reliably, not just understanding model internals
- •Proposed approaches include training a 'reporter' model to translate internal representations into human-understandable true beliefs
- •ELK is considered foundational to scalable oversight because deceptive alignment becomes increasingly dangerous as AI capabilities grow
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Alignment Research Center (ARC) | Organization | 57.0 |
| Sharp Left Turn | Risk | 69.0 |
Cached Content Preview
HTTP 200Fetched Apr 9, 20260 KB
Eliciting Latent Knowledge - Google Docs JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload. This browser version is no longer supported. Please upgrade to a supported browser. Eliciting Latent Knowledge Tab External Share File Edit View Tools Help Accessibility Debug
Resource ID:
ecd797db5ba5d02c | Stable ID: sid_BInS6UUSon