The MIRI Era (2000-2015)
Comprehensive chronological account of AI safety's institutional emergence (2000-2015), from MIRI's founding through Bostrom's Superintelligence to mainstream recognition. Covers key organizations, ideas (orthogonality thesis, instrumental convergence, CEV), and the transition from philosophy to technical research, but offers minimal novel analysis or actionable insights for current prioritization work.
Overview
The period from 2000 to 2015 marked the transition of AI safety from individual warnings to organized research institutions. This era saw the founding of the first dedicated AI safety organization (MIRI), the development of foundational theoretical frameworks, and the gradual emergence of a research community. By 2015, AI safety had achieved greater academic visibility and mainstream attention, though it remained a small field largely separate from the broader machine learning community.
Key developments included:
- Establishment of dedicated research organizations (MIRI in 2000, FHI in 2005)
- Formation of online communities focused on rationality and AI risk (LessWrong in 2009)
- Development of core theoretical concepts (orthogonality thesis, instrumental convergence, treacherous turn)
- Publication of Superintelligence by Nick Bostrom (2014)
- High-profile public statements from technology leaders and scientists (2014–2015)
- Initial emergence of dedicated funding sources
The era concluded with AI safety transitioning from primarily philosophical inquiry toward engagement with practical machine learning systems, setting the stage for the deep learning era that followed. Retrospective assessments of this period vary: some researchers credit it with establishing essential intellectual foundations, while others argue that the dominant research agenda—centered on idealized rational agents rather than practical ML systems—was poorly matched to the AI development trajectories that actually emerged.
The Singularity Institute (2000)
Founding
The Singularity Institute for Artificial Intelligence (SIAI) was founded in 2000 by Eliezer Yudkowsky, Brian Atkins, and Sabine Atkins; it was renamed the Machine Intelligence Research Institute (MIRI) in 2013. The organization's stated mission was to research and develop "Friendly AI"—artificial intelligence systems designed to be safe and beneficial to humanity.
Context and Motivations
The founding occurred during a period of renewed optimism about artificial intelligence. The dot-com boom was creating technological enthusiasm, computing power was increasing substantially, and the "AI winter" of reduced funding and interest that had characterized the 1980s and 1990s was ending. New techniques in machine learning were beginning to show promise.
The founders argued that if progress in AI capabilities was resuming, safety research needed to begin before systems became sufficiently capable to pose risks. This represented a departure from the prevailing view in the AI research community, which generally considered safety concerns premature given the limited capabilities of existing systems.
Early Activities (2000–2005)
During its first five years, SIAI operated with minimal funding and a small team. Activities included:
- Theoretical research on "Friendly AI" concepts
- Writing and outreach to raise awareness of AI risk
- Organizing small workshops and conferences
- Seeking funding from foundations and individual donors
The organization's work received limited attention from the academic AI research community. Common criticisms characterized the work as overly speculative, focused on problems that did not yet exist, more aligned with science fiction than scientific research, and a distraction from practical AI development challenges. These criticisms reflected substantive disagreements about whether AGI was a near-term prospect worth preparing for, and whether the conceptual frameworks being developed had the rigor required for actionable research.
Eliezer Yudkowsky and Early Theoretical Work
Background
Eliezer Yudkowsky was self-taught in AI and related fields without formal academic credentials. He had written about artificial intelligence since his teenage years. This background shaped his intellectual approach in distinct ways: his thinking was less constrained by conventional academic frameworks, while his lack of traditional credentials made it easier for critics to set aside his arguments without detailed engagement.
Creating Friendly AI (2001)
In 2001, Yudkowsky published "Creating Friendly AI," his first major document outlining a technical approach to AI safety. The document presented several core arguments:
Default Risk: Without dedicated safety work, advanced AI systems would be dangerous by default. This claim rested on several premises: that intelligence does not inherently produce benevolent behavior, that small differences in goal specifications can lead to large differences in outcomes, and that humanity would have limited opportunities to correct mistakes after deploying superintelligent systems.
The Goal Specification Problem: Intelligence alone is insufficient for beneficial AI; systems require correctly specified goals. The document identified challenges including formally specifying human values, preventing unintended goal drift during system operation and improvement, and managing goal evolution as systems learn and develop.
Technical Requirements: Yudkowsky framed AI safety as an engineering problem requiring formal solutions, not merely a philosophical question. He argued for formal frameworks to represent goals, provable guarantees of goal stability, and mechanisms to protect against unintended optimization.
Reception
The AI research community's response to "Creating Friendly AI" was largely skeptical. Researchers commonly argued that artificial general intelligence (AGI) remained far in the future, making safety work premature. Transhumanist thinkers often viewed the document as overly pessimistic about AI's potential benefits. Academic philosophers found the arguments interesting but insufficiently grounded to merit serious research investment. These responses left MIRI operating at the margins of mainstream discourse on AI.
LessWrong and Community Formation (2006–2012)
Origins
In 2006, Robin Hanson launched the Overcoming Bias blog, with Yudkowsky as a principal early contributor. In 2009, Yudkowsky launched LessWrong.com as a dedicated community website focused on improving human rationality and discussing existential risks, particularly from AI.
The Sequences
Between 2006 and 2009, Yudkowsky wrote an extensive series of blog posts collectively known as "The Sequences," covering topics including cognitive biases, probability theory and Bayesian reasoning, decision theory, philosophy of mind, and interpretations of quantum mechanics, in addition to AI safety.
Key essays relevant to AI safety included:
- "The AI-Box Experiment"
- "Coherent Extrapolated Volition"
- "Artificial Intelligence as a Positive and Negative Factor in Global Risk"
- "Complex Value Systems"
The Sequences created a coherent intellectual framework and vocabulary that influenced the developing AI safety community.
The AI-Box Experiment
The AI-Box Experiment explored whether a superintelligent AI could convince a human gatekeeper to release it from containment. Yudkowsky conducted text-based roleplay experiments in which he took the role of the AI and attempted to convince human participants to "release" him. In several trials, participants agreed to release the AI despite having significant incentives not to do so.
Yudkowsky argued this demonstrated that physical containment would be insufficient as a safety measure against sufficiently intelligent systems. Critics questioned how well these experiments generalized, noting that Yudkowsky's personal persuasiveness might not reflect the capabilities of an actual AI system, and that the artificial experimental setup differed significantly from real containment scenarios. These objections pointed to a broader methodological question about whether informal experiments of this kind could support strong claims about AI risk.
Coherent Extrapolated Volition
Yudkowsky proposed Coherent Extrapolated Volition (CEV) as an approach to the value specification problem. Rather than attempting to program current human values directly into an AI system, CEV proposed programming a process to determine what humans would want under idealized conditions of greater knowledge, faster thinking, personal growth, and collective deliberation.
The proposal was intended to address uncertainty about human values and disagreement between different people and cultures. However, significant open questions remained: how to formally specify "what we would want," whether coherent extrapolated volition exists given fundamental value disagreements, how to aggregate potentially conflicting extrapolations across all humans, and whether the concept could be implemented in practice. CEV remained an influential theoretical idea without a clear path to implementation.
Community Characteristics
LessWrong developed a distinctive community with shared vocabulary (Bayesian reasoning, utility functions, alignment), cultural norms (steelmanning arguments, making falsifiable predictions), a network of individuals taking AI risk seriously, and a pipeline recruiting researchers into AI safety work.
The community's demographics skewed toward young adults, men, and people with backgrounds in physics, mathematics, and computer science. Geographic concentration was highest in the San Francisco Bay Area, supplemented by a large online presence. The culture emphasized intellectualism, rationality techniques, long-form discussion, and quantitative thinking.
External observers offered varied assessments of LessWrong's epistemic culture. Some credited it with producing unusually rigorous informal reasoning and helping articulate AI risk arguments with precision. Others criticized insularity, overconfidence in novel frameworks, and a tendency to treat community consensus as a proxy for truth. Both dimensions shaped the community's influence on early AI safety research.
Robin Hanson and the Hanson-Yudkowsky Debate (2008)
Key Disagreements
In 2008, Robin Hanson and Yudkowsky conducted an extended public debate about AI risk that articulated fundamental disagreements still active in contemporary discussions.
Hanson's position included:
- Artificial general intelligence would more likely emerge through whole brain emulation than through de novo AI design
- The transition to advanced AI would be gradual rather than sudden
- Market forces and competitive pressures would shape AI development
- Humans would retain economic value in a world with advanced AI
- Existential risk from AI was lower than Yudkowsky estimated
Yudkowsky's position included:
- De novo AI was more likely than brain emulation as the path to AGI
- Intelligence explosion could occur rapidly through recursive self-improvement
- Market forces do not inherently guarantee safety outcomes
- Humans might have no economic value relative to superintelligent systems
- Without dedicated safety work, existential risk from AI was high
Impact
The debate established frameworks for understanding key disagreements within AI safety discourse:
- Takeoff speed: rapid versus gradual capability gains
- Development paths: brain emulation versus engineered AI
- Economic models: human relevance versus obsolescence
- Research urgency: immediate versus eventual priority
The public nature of the disagreement demonstrated that the AI safety community could engage in substantive intellectual debate rather than functioning as a monolithic group. Many subsequent discussions of AI risk reference or parallel the Hanson-Yudkowsky framing. Hanson later developed his positions on brain emulation into the book The Age of Em (2016), while Yudkowsky continued to argue for rapid-takeoff scenarios and the primacy of the alignment problem.
Dissenting Academic Perspectives During This Period
The Hanson-Yudkowsky debate was the most prominent internal disagreement, but it did not capture the full range of academic skepticism. Several strands of criticism emerged from outside the AI safety community:
Capability skeptics argued that the theoretical frameworks assumed AI capability trajectories that were not well grounded in existing AI research. Researchers including Rodney Brooks and others working in robotics and cognitive science contended that intelligence is deeply embodied and situational, making rapid recursive self-improvement implausible in the ways the MIRI-adjacent literature assumed.
Near-term ethics researchers contended that emphasis on speculative long-run scenarios could crowd out attention to pressing, empirically tractable harms from systems already deployed: algorithmic discrimination, surveillance applications, labor displacement, and adversarial manipulation. This critique was articulated by researchers working in algorithmic fairness and AI ethics who argued that the existential risk framing systematically deprioritized effects on currently affected populations.
Methodological critics questioned whether the field's reliance on thought experiments and informal arguments—rather than empirical investigation of actual machine learning systems—could generate reliable conclusions about AI development trajectories. These critics noted that the persuasiveness of thought experiments like the paperclip maximizer depended on assumptions that were contested rather than established.
These perspectives remained largely separate from the LessWrong-adjacent community during this period, contributing to a bifurcation in AI safety discourse that persisted into the deep learning era.
Nick Bostrom and Academic Legitimacy
Background
Nick Bostrom held a PhD from the London School of Economics and served as a Professor of Philosophy at Oxford University. His academic positioning provided access to institutional resources and audiences that were less accessible to researchers without traditional academic credentials.
Future of Humanity Institute (2005)
Bostrom founded the Future of Humanity Institute (FHI) at Oxford University in 2005. FHI became the first academic research center focused on existential risks, including but not limited to risks from advanced AI. The institute's location within a prestigious university provided institutional standing for research that might otherwise have been set aside as speculative.
Note that FHI closed in 2024 following administrative disputes with the University of Oxford, after nearly two decades of operation. Its closure marked the end of one of the period's most influential institutional legacies.
Existential Risk as a Research Priority
Bostrom developed arguments that even small probabilities of human extinction deserve substantial resource allocation, given the scale of potential future lives that would be foreclosed by such outcomes. This framework suggested that reducing existential risk by even small amounts could be highly valuable in expected utility terms.
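The structure of this reasoning can be written compactly. The following is a schematic rendering with illustrative magnitudes; the symbols and numbers are not taken from Bostrom's own work:

```latex
% Schematic expected-value form of the argument (illustrative notation).
\[
  \mathbb{E}[\text{value of intervention}] \;=\; \Delta p \cdot V
\]
% \Delta p : reduction in extinction probability attributable to the intervention
% V        : value of the long-term future that extinction would foreclose
% If V is taken to be on the order of 10^{16} future lives and \Delta p = 10^{-6},
% the expected value is on the order of 10^{10} lives, large relative to
% interventions evaluated only on their near-term effects.
```

On this accounting, even interventions with a very small probability of making a difference can dominate a prioritization exercise, which is precisely the feature that critics of the framework contested.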
This argument influenced the effective altruism movement's prioritization of AI safety research and other long-term interventions. Critics of this framework argued that it relied on contested assumptions about how to aggregate welfare across vastly different future populations, and that the resulting prioritization could reflect particular philosophical commitments rather than consensus ethical reasoning.
Superintelligence (2014)
Publication and Structure
Nick Bostrom's Superintelligence: Paths, Dangers, Strategies was published by Oxford University Press in 2014. The book provided the first comprehensive, academically structured treatment of AI risk aimed at a general audience.
The book's structure included:
- Analysis of potential paths to superintelligence (AI, whole brain emulation, biological enhancement, networks)
- Taxonomy of superintelligence forms (speed, collective, quality)
- Discussion of superintelligence capabilities and strategic advantages
- Detailed examination of the control problem
- Exploration of strategic implications for humanity
- Assessment of existential risk
The book was structured as a systematic academic argument, with explicit acknowledgment of uncertainties throughout.
Core Theoretical Contributions
The Orthogonality Thesis: Intelligence and goals are independent dimensions. A system can have high intelligence paired with any goal structure. This challenged the assumption that sufficiently intelligent systems would naturally develop benevolent values.
The Instrumental Convergence Thesis: Almost any final goal leads to certain instrumental sub-goals:
- Self-preservation (a system cannot achieve its goals if it is destroyed)
- Resource acquisition (more resources enable better goal achievement)
- Goal preservation (changing goals would prevent achieving current goals)
- Cognitive enhancement (greater intelligence enables better goal achievement)
- Technological advancement (better technology enables better goal achievement)
This thesis suggested that even AI systems with seemingly harmless goals could engage in dangerous behavior while pursuing these instrumental sub-goals.
The Treacherous Turn: A sufficiently intelligent AI might conceal its true goals until it has accumulated enough power to achieve them without human interference. The scenario involves appearing aligned during early development while weak, then pivoting to pursue true goals once success probability is high. This possibility complicates safety verification, as testing may not reveal misalignment. The concept bears a close relationship to what later researchers would term deceptive alignment.
The Paperclip Maximizer: Bostrom popularized this thought experiment, in which an AI system tasked with maximizing paperclip production converts all available matter (including humans and Earth) into paperclips. While simplified, the scenario illustrated how misspecified goals could lead to catastrophic outcomes even with seemingly innocuous objectives.
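To make the logic of the thought experiment concrete, a deliberately crude sketch follows. The resource names, yields, and greedy "planner" are all hypothetical; the point is only that an objective counting nothing but paperclips gives an optimizer no reason to spare anything else:

```python
# A deliberately simplified illustration of goal misspecification, in the spirit of
# the paperclip maximizer thought experiment. All names and numbers are hypothetical.

def paperclips_produced(allocation: dict[str, float], yield_per_unit: dict[str, float]) -> float:
    """The objective as specified: count paperclips, and nothing else."""
    return sum(units * yield_per_unit[resource] for resource, units in allocation.items())

# Resources in the toy world. The designer implicitly cares about preserving
# 'farmland' and 'cities', but the objective never mentions them.
resources = {"scrap_metal": 100.0, "farmland": 500.0, "cities": 50.0}
yield_per_unit = {"scrap_metal": 10.0, "farmland": 2.0, "cities": 8.0}

# A maximizer of the stated objective converts everything, because converting more
# of any resource always increases the only quantity it is scored on.
greedy_allocation = dict(resources)  # use every unit of every resource
print(paperclips_produced(greedy_allocation, yield_per_unit))   # 2400.0

# The intended behaviour (convert only scrap metal) scores strictly worse under the
# stated objective, so no optimizer of that objective will choose it.
intended_allocation = {"scrap_metal": 100.0, "farmland": 0.0, "cities": 0.0}
print(paperclips_produced(intended_allocation, yield_per_unit))  # 1000.0
```

Any constraint the designer cares about but omits from the objective is, from the optimizer's perspective, simply more feedstock.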
Reception and Impact
Superintelligence received attention from prominent figures including Elon Musk, Bill Gates, and Stephen Hawking, who made public statements expressing concern about advanced AI around the time of its publication. The book received extensive mainstream media coverage and engaged academic audiences beyond philosophy departments.
Some AI researchers criticized the book as speculative or as encouraging undue alarm. Others questioned the plausibility of the scenarios described or disagreed with Bostrom's timeline estimates and risk assessments. Even critics who disputed specific arguments generally acknowledged the book's systematic structure and internal consistency.
The net effect was an increase in attention to AI safety concerns. The book made it harder to characterize AI safety as purely speculative, given its academic origin and structured argumentation—though critics continued to argue that its influence on funding and research priorities was disproportionate to the evidentiary basis of its claims.
Critical Reception
Critics in the AI research community raised several objections:
- The scenarios described relied on speculative assumptions about future AI capabilities
- The focus on superintelligence might distract from near-term AI safety and ethics concerns
- The arguments depended heavily on thought experiments rather than empirical evidence
- Some technical assumptions about intelligence explosion and recursive self-improvement lacked clear empirical grounding
- The book's framing assumed a particular (discontinuous, fast-takeoff) development trajectory that was contested even within the AI safety community
Proponents countered that:
- Long-term safety research requires anticipating future capabilities
- Near-term and long-term safety concerns are complementary, not competing
- Thought experiments are appropriate for exploring scenarios with no historical precedent
- The theoretical framework helps identify research priorities even given uncertainty about timelines
These debates reflected broader disagreements about research prioritization and methodology in AI safety that remained active throughout the subsequent decade.
High-Profile Public Statements (2014–2015)
Technology Leaders and Scientists
Between 2014 and 2015, several prominent technology leaders and scientists made public statements about AI risk. Elon Musk, Stephen Hawking, and Bill Gates each expressed concern about the long-term prospects of advanced AI in interviews and public forums. These statements generated significant mainstream media coverage of AI safety concerns, bringing the topic to general audiences beyond academic and technical communities.
Hawking, in a December 2014 BBC interview, stated that "the development of full artificial intelligence could spell the end of the human race." Musk described AI as potentially "more dangerous than nukes" in a widely reported August 2014 post on Twitter. Gates, in a Reddit AMA in January 2015, expressed that he did not understand why people were not more concerned about the issue.
Effects of Public Attention
The public statements provided visibility for AI safety research and helped attract technical talent to the field. However, they also generated backlash. Some AI researchers argued that the warnings were premature or exaggerated, potentially creating unjustified concern about near-term AI capabilities. Others expressed concern that focus on long-term existential risk might draw attention away from addressing near-term harms from current AI systems, including algorithmic bias, privacy violations, and labor displacement.
AI researchers including Yann LeCun and Andrew Ng publicly disputed the framing, with Ng comparing worries about superintelligent AI to concerns about overpopulation on Mars. These counterarguments reflected a genuine methodological split: critics argued that capability extrapolations underlying AI risk arguments were not grounded in the actual trajectory of machine learning research at the time.
The net effect was increased visibility for AI safety as a research area, alongside polarization of opinion about its importance and urgency among AI researchers.
Emergence of Dedicated Funding (2014–2015)
Funding Landscape Shift
For approximately 15 years, AI safety research operated with minimal funding. The period 2014–2015 marked a significant change in available resources.
Elon Musk made a substantial donation to the Future of Life Institute in 2015 to support AI safety research grants. This funding—reported at $10 million—was distributed across multiple research groups and organizations through a competitive grants process.
Open Philanthropy began investigating potential risks from advanced AI as a priority cause area around this time; in the years that followed it committed substantial grants to MIRI, FHI, and other research organizations, signaling long-term commitment to the area.
Future of Life Institute
The Future of Life Institute was founded in 2014 to coordinate research funding and facilitate dialogue about AI safety. FLI organized conferences bringing together AI researchers, safety researchers, and technology leaders to discuss potential risks and research priorities.
The 2015 Puerto Rico Conference
FLI organized a conference in Puerto Rico in January 2015 attended by leading AI researchers and technology leaders, including Stuart Russell, Demis Hassabis, Nick Bostrom, and Max Tegmark, among others.
The conference led to an open letter on AI safety calling for research to ensure AI systems remain beneficial. The letter attracted thousands of signatories from both the AI research community and the broader scientific community, and represented a notable moment in which a significant number of active AI researchers publicly endorsed AI safety research as a priority. It was also significant for what it was not: the letter carefully focused on research priorities rather than capability moratoriums, reflecting ongoing disagreements about how to frame AI risk in ways that could achieve broad assent.
Technical Research Development (2010–2015)
Transition from Philosophy to Technical Work
MIRI's research focus evolved during this period. Early work (2000–2010) was primarily philosophical, exploring conceptual foundations of AI safety. The mid-period (2010–2015) saw increasing emphasis on technical research problems, though critics argued these remained theoretical problems about idealized agents rather than engagement with practical machine learning systems.
Research Areas
Logical Uncertainty: How should AI systems reason about logical facts they have not yet proven? This problem arises when an agent needs to reason about other agents (including future versions of itself) without infinite computational regress.
Decision Theory: What decision procedures should AI systems use, particularly when other agents can predict those decisions? MIRI researchers explored problems including Newcomb's paradox variations, Prisoner's Dilemma modifications, and acausal cooperation scenarios.
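Newcomb's problem illustrates why the choice of decision procedure matters once another agent can predict the decision. The sketch below uses the standard payoffs together with the common evidential and causal expected-value calculations; it illustrates the puzzle itself, not the alternative decision theories MIRI researchers went on to propose:

```python
# Newcomb's problem with the standard payoffs. The labels "evidential" and "causal"
# follow common usage; this is an illustration of the puzzle, not MIRI's formalism.

PREDICTOR_ACCURACY = 0.99        # probability the predictor anticipates the choice correctly
OPAQUE_BOX_PRIZE = 1_000_000     # opaque box is filled iff one-boxing was predicted
TRANSPARENT_BOX = 1_000          # transparent box always contains $1,000

def evidential_expected_value(one_box: bool) -> float:
    """Condition on the choice as evidence about what the predictor already did."""
    p_opaque_full = PREDICTOR_ACCURACY if one_box else 1 - PREDICTOR_ACCURACY
    payout = p_opaque_full * OPAQUE_BOX_PRIZE
    return payout if one_box else payout + TRANSPARENT_BOX

def causal_expected_value(one_box: bool, p_opaque_full: float) -> float:
    """Treat the opaque box's contents as fixed, independent of the current choice."""
    payout = p_opaque_full * OPAQUE_BOX_PRIZE
    return payout if one_box else payout + TRANSPARENT_BOX

# Evidential reasoning favours one-boxing (about 990,000 vs about 11,000)...
print(evidential_expected_value(True), evidential_expected_value(False))
# ...while causal reasoning favours two-boxing by exactly 1,000 for any fixed belief.
for p in (0.0, 0.5, 1.0):
    print(causal_expected_value(True, p), causal_expected_value(False, p))
```

The two procedures disagree, and disagreements of this kind across a family of such problems were part of what motivated MIRI's work on alternative decision procedures.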
Tiling Agents: Can an AI create a successor system that preserves its original goals? This problem, also called the "stable self-improvement" problem, addresses whether goal preservation is possible through multiple generations of self-modification.
Value Loading: How can human values be incorporated into AI systems? This research area addressed the challenge that humans cannot fully articulate their own values explicitly.
These research directions fell under the umbrella of "agent foundations" research—investigating fundamental theoretical questions about rational agency and goal-directed behavior. The relevance of this research agenda to the deep learning systems that subsequently dominated AI development remained contested; some researchers argue that agent foundations work identified enduring theoretical problems, while others contend it was poorly targeted at the AI architectures that actually emerged.
Academic AI Safety Research
Stuart Russell, co-author of the leading AI textbook Artificial Intelligence: A Modern Approach, began working on AI safety research during this period. Russell developed cooperative inverse reinforcement learning as an approach to value alignment, in which AI systems learn human preferences by observing behavior rather than requiring explicit specification. This work connected more directly to mainstream machine learning methods than much of the contemporaneous MIRI research program.
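A minimal sketch of the underlying idea (inferring what a person values from what they choose, rather than from an explicit specification) follows. It is not Russell's cooperative inverse reinforcement learning formalism; it assumes a small, hypothetical set of candidate reward functions and a noisily rational demonstrator:

```python
import math

# Inferring preferences from observed choices, in the spirit of inverse reinforcement
# learning. Option names, rewards, and the rationality parameter are all hypothetical.

options = ["coffee", "tea", "water"]

# Candidate hypotheses about what the human values: a reward for each option.
candidate_rewards = {
    "likes_caffeine":  {"coffee": 2.0, "tea": 1.5, "water": 0.0},
    "likes_warmth":    {"coffee": 1.0, "tea": 1.0, "water": 0.0},
    "likes_hydration": {"coffee": 0.5, "tea": 0.5, "water": 2.0},
}

def choice_probability(choice: str, rewards: dict[str, float], beta: float = 2.0) -> float:
    """Probability a noisily rational ("Boltzmann") human picks `choice` given these rewards."""
    weights = {option: math.exp(beta * rewards[option]) for option in options}
    return weights[choice] / sum(weights.values())

def posterior(observed_choices: list[str]) -> dict[str, float]:
    """Bayesian update over reward hypotheses, starting from a uniform prior."""
    unnormalized = {}
    for name, rewards in candidate_rewards.items():
        likelihood = 1.0
        for choice in observed_choices:
            likelihood *= choice_probability(choice, rewards)
        unnormalized[name] = likelihood / len(candidate_rewards)
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}

# After repeatedly observing the human choose coffee, the "likes_caffeine" hypothesis
# dominates, without the human ever stating a preference explicitly.
print(posterior(["coffee", "coffee", "tea", "coffee"]))
```

Cooperative inverse reinforcement learning extends this picture to a two-player setting in which the human knows the reward function and the machine is rewarded for optimizing it, but the basic move of treating behavior as evidence about preferences is the same.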
Other early academic work included:
- Research on inverse reinforcement learning techniques
- Safe exploration in reinforcement learning (avoiding catastrophic actions during learning)
- Robustness and adversarial examples (understanding failure modes in machine learning systems; see the sketch after this list)
- Theoretical foundations of multi-agent systems
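For the adversarial-examples line of work referenced above, the sketch below shows the basic mechanism on a hand-specified logistic-regression classifier, in the spirit of the fast gradient sign method. The weights, input, and step size are hypothetical; the point is only that a small, targeted input perturbation can move a confident prediction across the decision boundary:

```python
import numpy as np

# A toy adversarial perturbation against a fixed logistic-regression classifier,
# in the spirit of the fast gradient sign method. All numbers are hypothetical.

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -3.0, 1.0])   # "trained" weights
b = 0.5                          # bias

x = np.array([0.3, -0.1, 0.2])   # an input classified as class 1 with probability ~0.83
y = 1.0                          # true label

p = sigmoid(w @ x + b)
print("original prediction:", p)

# Gradient of the cross-entropy loss with respect to the input is (p - y) * w.
grad_x = (p - y) * w

# Step each coordinate a small amount in the direction that increases the loss.
epsilon = 0.3
x_adv = x + epsilon * np.sign(grad_x)

p_adv = sigmoid(w @ x_adv + b)
print("adversarial prediction:", p_adv)  # ~0.45: the predicted class has flipped
```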
The 2016 paper "Concrete Problems in AI Safety" (developed during 2015 by researchers at Google Brain, OpenAI, Stanford, and UC Berkeley, including Dario Amodei, Chris Olah, and Paul Christiano) outlined research directions connecting theoretical safety concerns to practical machine learning challenges, and was widely seen as a signal that mainstream ML researchers were engaging with safety questions in a more grounded way.
Limitations and Gaps (2000–2015)
Technical Progress
While philosophical frameworks were established during this period, concrete technical results applicable to existing AI systems remained limited. Much of the research focused on theoretical problems related to idealized rational agents rather than practical machine learning systems.
Relationship with Machine Learning Community
Most mainstream AI researchers during this period considered AI safety research either premature or based on implausible assumptions about future AI development. This created a disconnect between AI safety researchers and the ML community building increasingly capable systems. The disconnect was not merely cultural: the two communities worked on different problem framings, used different methodologies, and often operated with different assumptions about AI development trajectories.
Focus on Fast Takeoff Scenarios
Significant attention was devoted to scenarios involving rapid capability gain through recursive self-improvement (sometimes called fast takeoff or "FOOM" scenarios). Critics both inside and outside the AI safety community argued that this emphasis potentially neglected slower, more gradual paths to advanced AI that might pose different challenges and opportunities for safety interventions. Retrospective assessments by some AI safety researchers have since questioned whether the agent foundations research program of this period was well-targeted at the AI development trajectories that actually emerged.
Governance and Coordination
Research during this period emphasized technical approaches to safety, with less attention to governance mechanisms, policy interventions, and international coordination challenges. These areas would become more prominent in subsequent years.
Current Systems vs. Hypothetical AGI
The research focus was primarily on hypothetical artificial general intelligence systems rather than improving the safety of contemporary machine learning systems. This created a gap in addressing near-term safety challenges in deployed AI applications.
Empirical Work
Research was predominantly theoretical, relying on thought experiments and formal analysis rather than empirical investigation with actual machine learning systems. This limited the field's ability to test ideas and iterate based on experimental results.
Geographic and Demographic Concentration
The AI safety research community of this period was concentrated in a small number of institutions in the United States and United Kingdom, with near-absence of engagement from researchers based in East Asia, South Asia, or the Global South. The community was also demographically narrow. These concentrations shaped which problems were identified as central and which risk scenarios were taken as paradigmatic. Non-Anglophone AI research traditions—including substantial work on AI governance and ethics in China, India, and elsewhere—were largely absent from the literature and conferences of this period.
Key Organizations Founded (2000–2015)
The period saw establishment of several institutions focused on AI safety and related existential risks:
| Organization | Founded | Primary Focus |
|---|---|---|
| MIRI (originally SIAI) | 2000 | Agent foundations, decision theory, formal approaches to AI alignment |
| Future of Humanity Institute | 2005 | Existential risk research including AI safety, hosted at Oxford University; closed 2024 |
| Centre for the Study of Existential Risk | 2012 | Cambridge University-based research center for existential risks |
| Future of Life Institute | 2014 | AI safety research funding, policy advocacy, and coordination |
| OpenAI | 2015 | AI research with stated mission to ensure AGI benefits all humanity |
Other relevant organizations founded during this period include:
- LessWrong (2009): Online community for rationality and AI safety discussion
- Centre for Effective Altruism (2012): Organization promoting effective altruism, which prioritized AI safety as a cause area
- 80,000 Hours (2011): Career advice organization that promoted AI safety research as a high-impact career path
Note that several of these organizations underwent significant changes after 2015: the Future of Humanity Institute closed in 2024, and OpenAI's structure and mission evolved substantially following its transition to a capped-profit model.
Connection to Effective Altruism
The effective altruism movement, which emerged around 2011, adopted AI safety as a top priority based on several considerations:
- High expected value due to the potentially large magnitude of impact
- Relative neglect compared to the scale of potential consequences
- Tractability uncertainty, but with possibility of substantial progress through additional research
This connection created a pipeline of talent into AI safety research, as effective altruism organizations like 80,000 Hours recommended AI safety careers to people seeking to maximize their positive impact. The relationship between effective altruism and AI safety would strengthen further in subsequent years.
The connection also generated criticism. Some researchers expressed concern that the prioritization of AI safety within effective altruism circles had more to do with the persuasiveness of particular arguments circulating within those communities than with independent assessment of risk. Others argued that the demographic overlap between effective altruism and the AI safety community created a feedback loop in which research conclusions and funding priorities reinforced each other without sufficient external checks.
Transition to the Deep Learning Era
Changes Beginning in 2015
The period around 2015 marked the beginning of a transition in AI capabilities and consequently in AI safety research priorities. Deep learning techniques were demonstrating performance on tasks previously considered difficult or impossible for AI systems.
AlphaGo's defeat of professional Go player Fan Hui was reported in a Nature paper published in January 2016, followed by AlphaGo's match against world champion Lee Sedol in March 2016. These results arrived earlier than a range of expert predictions had suggested, prompting AI safety researchers to reconsider assumptions about the pace of capability development. The GPT series of language models would emerge in subsequent years, demonstrating capabilities in natural language that challenged assumptions about what machine learning systems could achieve.
These developments created pressure for AI safety research to engage more directly with practical machine learning systems rather than focusing exclusively on theoretical scenarios. The question shifted from "how do we build safe AGI someday" toward "how do we make current systems safer while preparing for rapid capability growth." This transition also exposed a tension in the field's prior work: the agent foundations research program had been developed largely in abstraction from the neural network architectures that were proving most capable, and it was not obvious how its theoretical results could be applied to the systems now attracting widespread attention. The subsequent period saw efforts to connect formal safety concerns to interpretability, RLHF, and scalable oversight research programs that engaged more directly with how large models actually work.
Legacy and Historical Significance
Institutional Establishment
The MIRI era established AI safety as a field with dedicated organizations rather than scattered individual researchers. This institutional foundation enabled sustained research programs, community building, and coordination that would not have been possible through individual efforts alone.
Intellectual Frameworks
Core concepts developed or popularized during this period continue to influence AI safety discourse:
- The orthogonality thesis (separating intelligence from goals)
- Instrumental convergence (predicting common sub-goals)
- The alignment problem (specifying human values in AI systems)
- Takeoff scenarios (paths to advanced AI)
- Existential risk framing (potential for permanent negative outcomes)
These concepts provide vocabulary and analytical frameworks for contemporary discussions of AI risk and safety. Their influence has been uneven: some, like the instrumental convergence thesis, have been broadly adopted even by researchers who otherwise distance themselves from MIRI's research program; others, like CEV and the tiling agents problem, remain more narrowly cited.
Academic Visibility
By 2015, AI safety had achieved increased academic visibility. Academic institutions hosted safety research groups, conferences included safety-focused tracks, and prominent researchers from mainstream AI engaged with safety questions. This shift was significant for attracting both researchers and funding. However, the field remained small relative to mainstream ML, and the theoretical frameworks developed during this period had limited uptake outside the immediate community.
The characterization of this shift requires some care: while the period produced genuine institutional progress, significant portions of the mainstream AI research community continued to view existential risk concerns as speculative, and the 2015 open letter represented a notable but bounded form of engagement rather than a wholesale conversion of the field.
Public Awareness
Mainstream media coverage and statements from prominent scientists and technology leaders brought AI safety concerns to general audiences. This visibility helped with researcher recruitment and funding, though it also generated controversy and contributed to polarization of opinion about the field's importance—with some critics arguing that high-profile statements from non-specialists produced more heat than light in public discourse about AI risk.
Open Questions and Limitations
Despite these achievements, significant challenges remained entering the subsequent period:
- Limited connection between theoretical safety research and practical machine learning systems
- Continued skepticism from most AI researchers about the urgency of safety work
- Uncertainty about whether sufficient progress could be made before advanced AI systems emerged
- Need to develop governance frameworks in addition to technical solutions
- Questions about whether the period's primary research agenda—agent foundations and formal analysis of idealized rational agents—was well-suited to the AI systems that were actually emerging
- Questions about how to ensure safety research kept pace with rapidly advancing capabilities
- Absence of engagement with governance, policy, and international coordination challenges that would prove important in subsequent years
These open questions shaped the research agenda of the subsequent period, as the field adapted to the realities of deep learning systems rather than the idealized rational agents that had anchored earlier theoretical work. The deep learning era brought both new opportunities—technical safety research could now be grounded in actual systems—and new pressures, as the pace of capability development accelerated in ways that shortened available timelines for foundational work.