Longterm Wiki

Credibility Rating

2/5 (Mixed)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Medium

Official update from Google DeepMind's main technical safety team, useful for understanding the institutional structure and research priorities of one of the largest industry AI safety groups as of late 2024.

Metadata

Importance: 62/100 · blog post · primary source

Summary

A 2024 overview by Google DeepMind's AGI Safety & Alignment team summarizing their recent technical work on existential risk from AI, covering subteams focused on mechanistic interpretability, scalable oversight, and frontier safety evaluations. Written by Rohin Shah, Seb Farquhar, and Anca Dragan, it describes the team's structure, growth, and key research priorities including amplified oversight and dangerous capability evaluations.

Key Points

  • The AGI Safety & Alignment team grew ~39% last year and ~37% so far this year; it comprises AGI Alignment (with subteams such as mechanistic interpretability and scalable oversight) and Frontier Safety.
  • Big bets for the past 1.5 years have been amplified oversight (to provide the right learning signal for aligning models so they don't pose catastrophic risks), frontier safety (to analyze whether models are capable of posing catastrophic risks in the first place), and mechanistic interpretability (a potential enabler for both).
  • The Frontier Safety subteam develops and runs dangerous capability evaluations as part of DeepMind's Frontier Safety Framework; a sketch of what such an evaluation harness might look like follows this list.
  • The broader org also includes Gemini Safety (safety training for current models) and Voices of All in Alignment (value/viewpoint pluralism techniques).
  • The post is intended to help external researchers build on DeepMind's work and understand how their own research connects to this agenda.
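
As a rough illustration of the evaluation work mentioned above, here is a minimal sketch of a dangerous-capability evaluation harness in Python. The post does not describe an implementation, so the `EvalTask` structure, the `model.generate` interface, and the threshold logic are all illustrative assumptions rather than the Frontier Safety Framework's actual machinery.

```python
# Minimal sketch of a dangerous-capability evaluation harness.
# Assumptions (not from the post): a `model` object exposing
# generate(prompt) -> str, and a per-task success criterion.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalTask:
    prompt: str                               # scenario probing a risky capability
    success_criterion: Callable[[str], bool]  # True if the output exhibits it

def run_capability_eval(model, tasks: List[EvalTask], threshold: float) -> dict:
    """Run all tasks and report whether a pre-committed threshold is crossed."""
    successes = sum(
        task.success_criterion(model.generate(task.prompt)) for task in tasks
    )
    rate = successes / len(tasks)
    return {
        "success_rate": rate,
        # Crossing the threshold would trigger review and mitigations
        # under a framework like the Frontier Safety Framework.
        "threshold_crossed": rate >= threshold,
    }
```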

Cited by 3 pages

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 18 KB
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work | by DeepMind Safety Research | Medium

Archived by the Wayback Machine: http://web.archive.org/web/20250911094719/https://deepmindsafetyresearch.medium.com/agi-safety-and-alignment-at-google-deepmind-a-summary-of-recent-work-8e600aca582a

 


AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

DeepMind Safety Research · 11 min read · Oct 18, 2024


By Rohin Shah, Seb Farquhar, and Anca Dragan

It’s been a while since our last update, so we wanted to share a recap of our recent outputs. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.

Who are we?

We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we’ve evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We’ve also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We’re part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism.

What have we been up to?

Here is some key work published in 2023 and the first part of 2024, grouped by topic / sub-team.

Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don’t pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make.
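
To make the first of these bets more concrete, the following is a minimal sketch of how an amplified-oversight learning signal could be assembled, in the spirit of critique-assisted evaluation. The post states the goal rather than a method, so the `policy_model`, `critic_model`, and `human_judge` interfaces below are hypothetical stand-ins, not DeepMind APIs.

```python
# Minimal sketch of collecting an amplified-oversight training signal.
# All three interfaces are hypothetical assumptions:
#   policy_model.generate(prompt) -> str
#   critic_model.generate(prompt) -> str
#   human_judge(prompt, answer, critique) -> float in [0, 1]

def amplified_label(prompt, answer, critic_model, human_judge):
    """Have an AI critic surface flaws so a human overseer can judge
    answers that would be hard to evaluate unaided."""
    critique = critic_model.generate(
        f"List possible errors or omissions in this answer.\n"
        f"Question: {prompt}\nAnswer: {answer}"
    )
    # The judgment on (answer + critique) becomes the learning signal,
    # e.g. a reward used to train a reward model or fine-tune the policy.
    return human_judge(prompt, answer, critique)

def collect_training_signal(prompts, policy_model, critic_model, human_judge):
    """Build (prompt, answer, score) triples for downstream training."""
    data = []
    for prompt in prompts:
        answer = policy_model.generate(prompt)
        score = amplified_label(prompt, answer, critic_model, human_judge)
        data.append((prompt, answer, score))
    return data
```

The key design idea this sketch tries to capture is that the overseer's judgment is amplified by AI assistance, so the resulting signal can remain reliable even on outputs the overseer could not fully evaluate alone.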

Frontier Safety

The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat mod

... (truncated, 18 KB total)
Resource ID: 6374381b5ec386d1 | Stable ID: sid_YVGTVch0VX