Longterm Wiki

Even Superhuman Go AIs Have Surprising Failure Modes — LessWrong

blog

Data Status

Not fetched

Summary


Cited by 1 page

Page | Type | Quality
Deep Learning Revolution Era | Historical | 44.0

Cached Content Preview

HTTP 200 · Fetched Feb 22, 2026 · 61 KB
Even Superhuman Go AIs Have Surprising Failure Modes — LessWrong · Tags: Adversarial Examples (AI), Gaming (videogames/tabletop), Robust Agents, AI, Frontpage

 Even Superhuman Go AIs Have Surprising Failure Modes 

by Adam Gleave, Euan McLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, Michael Dennis · 20th Jul 2023 · AI Alignment Forum · Linkpost from far.ai · 12 min read


In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting that AI is an "entity that cannot be defeated". But in 2022, amateur Go player Kellin Pelrine defeated KataGo, a Go program that is even stronger than AlphaGo. How?

It turns out that even superhuman AIs have blind spots and can be tripped up by surprisingly simple tricks. In our new paper, we developed a way to automatically find vulnerabilities in a "victim" AI system by training an adversary AI system to beat the victim. With this approach, we found that KataGo systematically misevaluates large cyclically connected groups of stones. We also found that other superhuman Go bots, including ELF OpenGo, Leela Zero, and Fine Art, suffer from a similar blind spot. Although such positions rarely occur in human games, they can be reliably created by executing a straightforward strategy. Indeed, the strategy is simple enough that you can teach it to a human, who can then defeat these Go bots unaided.


The victim and adversary take turns playing a game of Go. The adversary is able to sample moves the victim is likely to take, but otherwise has no special powers, and can only play legal Go moves.
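The setup in this caption can be sketched in code. Below is a toy, hypothetical illustration of the idea: the adversary searches over its own moves, but models the opponent by sampling from the victim's policy rather than assuming best play. Every name here (`victim_policy`, `legal_moves`, `score`) and the three-move toy game are invented for illustration; this is not the paper's actual training or search code.

```python
import random

def legal_moves(state):
    """Toy game: moves are 0, 1, 2, each playable once (state = moves so far)."""
    return [m for m in range(3) if m not in state]

def victim_policy(state):
    """Stand-in for the frozen victim network: a distribution over its replies.
    Uniform here; a real victim would return learned move probabilities."""
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}

def score(state):
    """Toy terminal evaluation from the adversary's point of view."""
    return sum(state) % 2  # arbitrary scoring for the toy game

def adversary_value(state, depth, rng):
    """Estimate a state's value: the adversary maximizes over its own moves,
    while the victim's reply is *sampled* from victim_policy, not searched."""
    if depth == 0 or not legal_moves(state):
        return score(state)
    best = -1.0
    for move in legal_moves(state):              # adversary's candidate moves
        after = state + [move]
        if legal_moves(after):
            dist = victim_policy(after)          # query the victim's policy
            reply = rng.choices(list(dist), weights=list(dist.values()))[0]
            after = after + [reply]              # assume the sampled reply
        best = max(best, adversary_value(after, depth - 1, rng))
    return best
```

The key design choice this sketch illustrates: because the victim is modeled as a fixed, queryable policy rather than a worst-case opponent, the adversary can steer play toward positions the victim systematically misjudges, even if those positions would lose against a stronger opponent.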

Our AI system (which we call the adversary) can beat a superhuman version of KataGo in 94 out of 100 games, despite requiring only 8% of the computational power used to train that version of KataGo. We found two separate exploits: one where the adversary tricks KataGo into passing prematurely, and another that involves coaxing KataGo into confidently building an unsafe circular group that can be captured. Go enthusiasts can read an analysis of these games on the project website.

 Our results also give some general lessons about AI outside of Go. Many AI systems, from image classifiers to natural language processing systems, are vulnerable to adversarial inputs: seemingly innocuous changes such as adding imperceptible static to an image or a distractor sentence to a paragraph can crater the performance of AI systems while not affecting humans. Some have assumed that these vulnerabilities will go away when AI systems get capable enough—and that superhuman AIs will always be wise to such attacks. We’ve shown that this isn’t necessarily the case: systems can simultaneously surpass
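For readers unfamiliar with adversarial inputs outside of Go, here is a minimal sketch of the classic fast-gradient-sign perturbation (FGSM, Goodfellow et al.) against a toy logistic-regression classifier. The weights, input, and epsilon are all made up for illustration; this shows the generic image-classifier phenomenon mentioned above, not the Go attack itself.

```python
import numpy as np

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Hypothetical model and input: confidently classified as class 1.
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.3, -0.2, 0.1])
p_clean = predict(w, b, x)              # > 0.5, so predicted class 1

# For cross-entropy loss with true label 1, dLoss/dx = (p - 1) * w,
# whose sign is -sign(w). FGSM steps the input in that sign direction:
eps = 0.3
x_adv = x + eps * np.sign((p_clean - 1.0) * w)   # equals x - eps * sign(w)
p_adv = predict(w, b, x_adv)            # small perturbation flips the prediction
```

Each input coordinate moves by at most `eps`, yet the predicted class flips; the same qualitative failure, a small structured change with an outsized effect, is what the cyclic-group exploit demonstrates for superhuman Go bots.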

... (truncated, 61 KB total)
Resource ID: dd0f6ff1b958bd49 | Stable ID: YTQ1MzQ2NG