Rose Hadshar's 2023 review
Web Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: AI Impacts
This is an AI Impacts blog post summarizing a 2023 arXiv report by visiting researcher Rose Hadshar; the full paper is linked and provides a systematic empirical review relevant to x-risk arguments.
Metadata
Summary
Rose Hadshar reviews empirical evidence for existential risk from AI, focusing on misalignment and power-seeking behaviors. She finds evidence of misaligned goals (via specification gaming and goal misgeneralization) but no clear examples of power-seeking AI, concluding the evidence is concerning but inconclusive. The uncertainty itself is treated as worrying given the potential severity of the risks.
Key Points
- Empirical evidence exists for AI misalignment via specification gaming and goal misgeneralization, including in deployment settings.
- Conceptual arguments for power-seeking behavior are considered strong, but no clear empirical examples of power-seeking AI were found.
- The review concludes that uncertainty cuts both ways: it is hard to be confident either that misaligned power-seeking poses a large existential risk or that it poses none.
- Part of a larger AI Impacts project that also includes expert interviews, a mapping of key claims, and a database of empirical evidence on AI risk.
- The author calls for more reviews of evidence, including evidence against AI risks, to reduce uncertainty.
Cached Content Preview
New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking
AI Impacts blog
Harlan Stewart, Nov 06, 2023

Visiting researcher Rose Hadshar recently published a review of some evidence for existential risk from AI, focused on empirical evidence for misalignment and power-seeking. (Previously from this project: a blogpost outlining some of the key claims that are often made about AI risk, a series of interviews with AI researchers, and a database of empirical evidence for misalignment and power-seeking.)
In this report, Rose looks into evidence for:
Misalignment,[1] where AI systems develop goals which are misaligned with human goals; and
Power-seeking,[2] where misaligned AI systems seek power to achieve their goals.
Rose found the current state of this evidence for existential risk from misaligned power-seeking to be concerning but inconclusive:
There is empirical evidence of AI systems developing misaligned goals (via specification gaming[3] and via goal misgeneralization[4]), including in deployment (via specification gaming), but it's not clear to Rose whether these problems will scale far enough to pose an existential risk.
Rose considers the conceptual arguments for power-seeking behavior from AI systems to be strong, but notes that she could not find any clear examples of power-seeking AI so far.
Given these considerations, Rose thinks that it's hard to be very confident either that misaligned power-seeking poses a large existential risk, or that it poses no existential risk. She finds this uncertainty itself concerning, given the severity of the potential risks in question. Rose also notes that it would be good to have more reviews of evidence, including evidence for other claims about AI risks[5] and evidence against AI risks.[6]
[1] "An AI is misaligned whenever it chooses behaviors based on a reward function that is different from the true welfare of relevant humans." (Hadfield-Menell & Hadfield, 2019)
[2] Rose follows Carlsmith (2022) in defining power-seeking as "active efforts by an AI system to gain and maintain power in ways that designers didn't intend, arising from problems with that system's objectives."
[3] "Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome." (Krakovna et al., 2020)
[4] "Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the le
... (truncated, 5 KB total)