Not Covered: October 2024 Alignment - Bluedot Blog
blog.bluedot.org/p/not-covered-2410-alignment
A supplementary blog post from BlueDot Impact's alignment course, useful for learners wanting brief introductions to developmental interpretability, agent foundations, and shard theory beyond the core curriculum.
Metadata
Importance: 35/100 · blog post · educational
Summary
BlueDot Impact introduces three AI alignment research areas omitted from their 8-week October 2024 course: developmental interpretability (studying model structure changes during training), agent foundations (theoretical research on agentic AI drawing from math and philosophy), and shard theory (framing RL agents as driven by multiple contextual 'shards' rather than unified goals). Each section provides brief overviews and pointers to deeper resources.
Key Points
- Developmental interpretability studies phases and phase transitions during training, inspired by developmental biology, rather than static model features and circuits.
- Agent foundations is theoretical research providing building blocks for understanding powerful agentic AI, drawing from mathematics, philosophy, and theoretical computer science.
- Shard theory proposes RL agents are better understood as driven by many contextual 'shards' influencing decisions, rather than following specified goals.
- These areas were excluded from the 8-week course due to scope constraints but represent active and emerging research agendas in alignment.
- Each topic includes pointers to foundational reading (e.g., MIRI's Technical Agenda for agent foundations, LessWrong posts for dev interp and shard theory).
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 7, 2026 · 9 KB
What we didn't cover in our October 2024 AI Alignment course
BlueDot Impact
Adam Jones · Dec 12, 2024
An 8-week part-time course can’t cover everything there is to know about AI alignment, especially given it’s a fast-moving field with many budding research agendas.
This resource gives a brief introduction to a few areas we didn’t touch on, with pointers to resources if you want to explore them further.
Developmental interpretability
We covered mechanistic interpretability (or “mech interp”) in session 5. To recap, this is an approach that ‘zooms in’ to models to make sense of their learned representations and weights through methods like feature or circuit analysis.
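As a loose, invented illustration of that ‘zooming in’ (a minimal sketch, assuming a tiny untrained PyTorch MLP rather than a real trained model): register a forward hook on one layer and inspect which hidden units activate on which inputs. Actual mech interp work involves much richer feature and circuit analysis than this toy probe.

```python
# Toy sketch of "zooming in" on a model's internals via a forward hook.
# Everything here (the model, the data) is invented for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),   # toy first layer
    nn.ReLU(),
    nn.Linear(8, 2),   # toy output layer
)

captured = {}

def save_activations(module, inputs, output):
    # Store the post-ReLU hidden activations for later inspection.
    captured["hidden"] = output.detach()

model[1].register_forward_hook(save_activations)

x = torch.randn(16, 4)   # a small batch of inputs
_ = model(x)             # forward pass triggers the hook

# "Zoom in": which hidden units fire most strongly across this batch?
mean_activation = captured["hidden"].mean(dim=0)
print("mean activation per hidden unit:", mean_activation)
print("most active unit:", mean_activation.argmax().item())
```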
Developmental interpretability (or “dev interp”) instead studies how the structure of models changes as they are trained. Rather than features and circuits, here the object of study is phases and phase transitions during the training process. This takes inspiration from developmental biology, which increased our understanding of biology by studying the key steps of how living organisms develop.
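To make the contrast concrete, here is a minimal sketch of the underlying idea, assuming a toy PyTorch model and using training loss and total weight norm as stand-in statistics (actual dev interp work studies more principled quantities, e.g. ones grounded in singular learning theory): log statistics at many points during training and look for abrupt changes.

```python
# Toy illustration only: log simple statistics across training steps and
# look for sudden jumps ("phase transitions"), instead of inspecting a
# single finished model. The task and statistics are invented for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(256, 2)
y = (x[:, :1] * x[:, 1:]).detach()   # a simple nonlinear target

history = []
for step in range(500):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 25 == 0:
        # Track loss and a crude "structure" proxy (total weight norm);
        # sharp changes over training are candidate phase transitions.
        weight_norm = sum(p.norm().item() for p in model.parameters())
        history.append((step, loss.item(), weight_norm))

for step, loss_val, norm in history:
    print(f"step {step:4d}  loss {loss_val:.4f}  weight-norm {norm:.2f}")
```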
Read more.
Agent foundations
Most of this course has explored fairly empirical research agendas targeting the kinds of AI systems we have today: neural network architectures trained with gradient descent. However, it’s unclear whether we’ll continue to build future AI systems with similar technology. You’ll also likely have realised that many of the fundamentals underpinning the field remain unclear: the field has no agreed definition of what alignment is, or of the kinds of properties we might want from an aligned system.
Agent foundations research is primarily theoretical research that aims to give us the building blocks to solve key problems in the field, particularly around how powerful agentic AI systems might behave. It draws a lot from maths, philosophy, and theoretical computer science, and in some ways is the basic science of alignment.
Read more.
Shard theory
Shard theory is a research approach that rejects the notion that agents follow specified goals, and instead suggests that reinforcement learning agents are better understood as being driven by many ‘shards’. Each shard then influences the model’s decision process in certain contexts.
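A deliberately simplistic sketch of this framing, with every shard name and number invented for illustration (this is a caricature, not a formalisation of shard theory): the agent’s choice is the combined vote of whichever contextual shards activate, rather than the argmax of a single global objective.

```python
# Caricature of the shard framing: context-dependent "shards" each vote on
# actions when they activate; the decision is the action with the highest
# total vote. All shards, contexts, and values below are invented.
from dataclasses import dataclass
from typing import Callable, Dict

Context = Dict[str, float]   # e.g. {"hunger": 0.9, "danger": 0.1}

@dataclass
class Shard:
    name: str
    activates: Callable[[Context], bool]   # when does this shard fire?
    action_values: Dict[str, float]        # its vote over actions

shards = [
    Shard("hunger", lambda c: c.get("hunger", 0) > 0.5,
          {"eat": 2.0, "explore": 0.2, "flee": 0.0}),
    Shard("safety", lambda c: c.get("danger", 0) > 0.5,
          {"eat": -1.0, "explore": -0.5, "flee": 3.0}),
    Shard("curiosity", lambda c: True,      # weakly active everywhere
          {"eat": 0.1, "explore": 1.0, "flee": 0.0}),
]

def choose_action(context: Context) -> str:
    # Sum the votes of whichever shards this context activates,
    # then pick the action with the highest total.
    totals: Dict[str, float] = {}
    for shard in shards:
        if shard.activates(context):
            for action, value in shard.action_values.items():
                totals[action] = totals.get(action, 0.0) + value
    return max(totals, key=totals.get)

print(choose_action({"hunger": 0.9, "danger": 0.0}))  # -> "eat"
print(choose_action({"hunger": 0.9, "danger": 0.9}))  # -> "flee"
```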
This takes inspiration from how humans appear to learn. Most people do not optimise some specific objective function all the time, but behave differently in different contexts: for example when their ‘hunger’ shard activates this might influence a decisi
... (truncated, 9 KB total)
Resource ID: 27d9202885cc6bcf | Stable ID: sid_8gXDRuCzp3