Longterm Wiki

Not Covered: October 2024 Alignment - Bluedot Blog

web

A supplementary blog post from BlueDot Impact's alignment course, useful for learners wanting brief introductions to developmental interpretability, agent foundations, and shard theory beyond the core curriculum.

Metadata

Importance: 35/100 · blog post · educational

Summary

BlueDot Impact introduces three AI alignment research areas omitted from their 8-week October 2024 course: developmental interpretability (studying model structure changes during training), agent foundations (theoretical research on agentic AI drawing from math and philosophy), and shard theory (framing RL agents as driven by multiple contextual 'shards' rather than unified goals). Each section provides brief overviews and pointers to deeper resources.

Key Points

  • Developmental interpretability studies phases and phase transitions during training, inspired by developmental biology, rather than static model features and circuits.
  • Agent foundations is theoretical research providing building blocks for understanding powerful agentic AI, drawing from mathematics, philosophy, and theoretical computer science.
  • Shard theory proposes RL agents are better understood as driven by many contextual 'shards' influencing decisions, rather than following specified goals.
  • These areas were excluded from the 8-week course due to scope constraints but represent active and emerging research agendas in alignment.
  • Each topic includes pointers to foundational reading (e.g., MIRI's Technical Agenda for agent foundations, LessWrong posts for dev interp and shard theory).

Cited by 1 page

Page | Type | Quality
--- | --- | ---
Model Organisms of Misalignment | Analysis | 65.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 9 KB
What we didn't cover in our October 2024 AI Alignment course 
 BlueDot Impact 


Adam Jones · Dec 12, 2024

An 8-week part-time course can't cover everything there is to know about AI alignment, especially given that it's a fast-moving field with many budding research agendas.

This resource gives a brief introduction to a few areas we didn't touch on, with pointers to resources if you want to explore them further.

 Developmental interpretability 

We covered mechanistic interpretability (or "mech interp") in session 5. To recap, this is an approach that 'zooms in' on models to make sense of their learned representations and weights through methods like feature or circuit analysis.

 Developmental interpretability (or “dev interp”) instead studies how the structure of models changes as they are trained. Rather than features and circuits, here the object of study is phases and phase transitions during the training process. This takes inspiration from developmental biology, which increased our understanding of biology by studying the key steps of how living organisms develop.
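One way to make this concrete is a toy sketch (invented for illustration, not from the post) of the kind of measurement dev interp relies on: evaluate some fixed metric at a series of saved training checkpoints and look for abrupt jumps between consecutive checkpoints, which is one rough signature of a phase transition. The function name, threshold, and data below are all hypothetical.

```python
# Toy sketch: scan a metric across training checkpoints and flag
# abrupt jumps, the kind of "phase transition" developmental
# interpretability studies. All names and numbers are illustrative.

def find_transitions(metric_by_step, threshold=0.2):
    """Return (step, jump) pairs where the metric changes by more than
    `threshold` between consecutive checkpoints."""
    steps = sorted(metric_by_step)
    transitions = []
    for prev, curr in zip(steps, steps[1:]):
        jump = round(metric_by_step[curr] - metric_by_step[prev], 3)
        if abs(jump) > threshold:
            transitions.append((curr, jump))
    return transitions

# Hypothetical accuracy of a fixed probe task, measured at checkpoints:
history = {0: 0.10, 1000: 0.12, 2000: 0.13, 3000: 0.55, 4000: 0.58}

print(find_transitions(history))  # -> [(3000, 0.42)]
```

Real dev interp work uses far more principled tools than a jump threshold, but the underlying shift is the same: the object of study is the trajectory of training, not a single finished model.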

Read more.

 Agent foundations 

Most of this course has explored fairly empirical research agendas targeting the kinds of AI systems we have today: neural network architectures trained with gradient descent. However, it's unclear whether we'll continue to build future AI systems with similar technology. You'll also likely have realised that many of the fundamentals underpinning the field remain unclear: there is no agreed definition of what alignment is, or of the properties we might want from an aligned system.

Agent foundations research is primarily theoretical research that aims to give us the building blocks to solve key problems in the field, particularly around how powerful agentic AI systems might behave. It draws heavily from maths, philosophy, and theoretical computer science, and is in some ways the basic science of alignment.

Read more.

 Shard theory 

Shard theory is a research approach that rejects the notion that agents follow explicitly specified goals, and instead suggests that reinforcement learning agents are better understood as being driven by many 'shards'. Each shard influences the model's decision process in certain contexts.
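As a conceptual toy (invented here, not from the post or the shard theory literature), you can picture this as a set of context-gated shards that each bid on actions, with the agent taking whichever action receives the highest total weighted bid. The shard names, activation rules, and bid values are all made up for illustration.

```python
# Toy illustration of shard theory's picture of decision-making:
# context-gated "shards" each bid on actions, and no single shard is
# the agent's "goal". All names and numbers here are invented.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Shard:
    name: str
    # Activation strength in [0, 1] for a given context.
    activation: Callable[[dict], float]
    # Per-action bids this shard contributes when active.
    bids: dict

def decide(context: dict, shards: list) -> str:
    """Pick the action with the highest total shard-weighted bid."""
    totals: dict = {}
    for shard in shards:
        weight = shard.activation(context)
        for action, bid in shard.bids.items():
            totals[action] = totals.get(action, 0.0) + weight * bid
    return max(totals, key=totals.get)

shards = [
    Shard("hunger",
          lambda ctx: 1.0 if ctx.get("hours_since_meal", 0) > 5 else 0.1,
          {"eat": 1.0, "work": -0.2}),
    Shard("diligence",
          lambda ctx: 0.8 if ctx.get("deadline_soon") else 0.3,
          {"work": 1.0}),
]

print(decide({"hours_since_meal": 6, "deadline_soon": False}, shards))  # -> eat
print(decide({"hours_since_meal": 1, "deadline_soon": True}, shards))   # -> work
```

The point of the toy is that behaviour varies with context even though no single objective function is being optimised, which is the contrast shard theory draws with goal-directed framings.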

 This takes inspiration from how humans appear to learn. Most people do not optimise some specific objective function all the time, but behave differently in different contexts: for example when their ‘hunger’ shard activates this might influence a decisi

... (truncated, 9 KB total)
Resource ID: 27d9202885cc6bcf | Stable ID: sid_8gXDRuCzp3