Technical AI Safety Research
Technical AI Safety Research
Technical AI safety research encompasses six major agendas (mechanistic interpretability, scalable oversight, AI control, evaluations, agent foundations, and robustness) with 500+ researchers and \$110-130M annual funding. Key 2024-2025 findings include tens of millions of interpretable features identified in Claude 3, 5 of 6 frontier models showing scheming capabilities, and deliberative alignment reducing scheming by up to 30x, though experts estimate only 2-50% x-risk reduction depending on timeline assumptions and technical tractability.
Technical AI Safety Research
Technical AI safety research encompasses six major agendas (mechanistic interpretability, scalable oversight, AI control, evaluations, agent foundations, and robustness) with 500+ researchers and \$110-130M annual funding. Key 2024-2025 findings include tens of millions of interpretable features identified in Claude 3, 5 of 6 frontier models showing scheming capabilities, and deliberative alignment reducing scheming by up to 30x, though experts estimate only 2-50% x-risk reduction depending on timeline assumptions and technical tractability.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Funding Level | $110-130M annually (2024) | Coefficient Giving deployed $13.6M (60% of total); $10M RFP announced Mar 2025 |
| Research Community Size | 500+ dedicated researchers | Frontier labs (~350-400 FTE), independent orgs (~100-150), academic groups (≈50-100) |
| Tractability | Medium-High | Interpretability: tens of millions of features identified; Control: protocols deployed; Evaluations: 30+ models tested by UK AISI |
| Impact Potential | 2-50% x-risk reduction | Depends heavily on timeline, adoption, and technical success; conditional on 10+ years of development |
| Timeline Pressure | High | [435b669c11e07d8f]: "extremely unlikely to succeed in time"; METR finds autonomous task capability doubles every 7 months |
| Key Bottleneck | Talent and compute access | Safety funding is under 2% of estimated capabilities R&D; frontier model access critical |
| Adoption Rate | Accelerating | 12 companies published frontier AI safety policies by Dec 2024; RSPs at Anthropic, OpenAI, DeepMind |
| Empirical Progress | Significant 2024-2025 | Scheming detected in 5/6 frontier models; deliberative alignment reduced scheming 30x |
Overview
Technical AI safety research aims to make AI systems reliably safe and aligned with human values through direct scientific and engineering work. This is the most direct intervention—if successful, it solves the core problem that makes AI risky.
The field has grown substantially since 2020, with approximately 500+ dedicated researchers working across frontier labs (Anthropic, DeepMind, OpenAI) and independent organizations (Redwood Research, MIRI, Apollo Research, METR). Annual funding reached $110-130M in 2024, with Coefficient Giving contributing approximately 60% ($13.6M) of external investment. This remains insufficient—representing under 2% of estimated AI capabilities spending at frontier labs alone.
Key 2024-2025 advances include:
- Mechanistic interpretability: Anthropic's May 2024 "Scaling Monosemanticity" identified tens of millions of interpretable features in Claude 3 Sonnet, including concepts related to deception and safety-relevant patterns
- Scheming detection: Apollo Research's December 2024 evaluation found 5 of 6 frontier models showed in-context scheming capabilities, with o1 maintaining deception in more than 85% of follow-up questions
- AI control: Redwood Research's control evaluation framework, now with a 2025 agent control sequel, provides protocols robust even against misaligned systems
- Government evaluation: The UK AI Security Institute has evaluated 30+ frontier models, finding cyber task completion improved from 9% to 50% between late 2023 and 2025
The field faces significant timeline pressure. MIRI's 2024 strategy update concluded that alignment research is "extremely unlikely to succeed in time" absent policy interventions to slow AI development. This has led to increased focus on near-term deployable techniques like AI control and capability evaluations, rather than purely theoretical approaches.
Theory of Change
Key mechanisms:
- Scientific understanding: Develop theories of how AI systems work and fail
- Engineering solutions: Create practical techniques for making systems safer
- Validation methods: Build tools to verify safety properties
- Adoption: Labs implement techniques in production systems
Major Research Agendas
1. Mechanistic Interpretability
Goal: Understand what's happening inside neural networks by reverse-engineering their computations.
Approach:
- Identify interpretable features in activation space
- Map out computational circuits
- Understand superposition and representation learning
- Develop automated interpretability tools
Recent progress:
- [e724db341d6e0065] identified tens of millions of interpretable features in Claude 3 Sonnet using sparse autoencoders—the first detailed look inside a production-grade large language model
- Features discovered include concepts like the Golden Gate Bridge, code errors, deceptive behavior, and safety-relevant patterns
- [6374381b5ec386d1] growing 37% in 2024, pursuing similar research directions
- Circuit discovery moving from toy models to larger systems, though full model coverage remains cost-prohibitive
Key organizations:
- [c355237bfc2d213d] (Interpretability team, ~15-25 researchers)
- [5b8be7f6a2aa7067] (Mechanistic Interpretability team)
- Redwood Research (causal scrubbing, circuit analysis)
- Apollo Research (deception-focused interpretability)
Theory of change: If we can read AI cognition, we can detect deception, verify alignment, and debug failures before deployment.
Estimated Impact of Interpretability Success
Expert estimates vary widely on how much x-risk reduction mechanistic interpretability could achieve if it becomes highly effective at detecting misalignment and verifying safety properties. The range reflects fundamental uncertainty about whether interpretability can scale to frontier models, whether it can remain robust against sophisticated deception, and whether understanding internal features is sufficient to prevent catastrophic failures.
| Expert/Source | Estimate | Reasoning |
|---|---|---|
| Optimistic view | 30-50% x-risk reduction | If interpretability scales successfully to frontier models and can reliably detect hidden goals, deceptive alignment, and dangerous capabilities, it becomes the primary method for verifying safety. This view assumes that most alignment failures have detectable signatures in model internals and that automated interpretability tools can provide real-time monitoring during deployment. The high end of this range assumes interpretability enables both detecting problems and guiding solutions. |
| Moderate view | 10-20% x-risk reduction | Interpretability provides valuable insights and catches some classes of problems, but faces fundamental limitations from superposition, polysemanticity, and the sheer complexity of frontier models. This view expects interpretability to complement other safety measures rather than solve alignment on its own. Detection may work for trained backdoors but fail for naturally-emerging deception. The technique helps most by informing training approaches and providing warning signs, rather than offering complete safety guarantees. |
| Pessimistic view | 2-5% x-risk reduction | Interpretability may not scale beyond current demonstrations or could be actively circumvented by sufficiently capable systems. Models could learn to hide dangerous cognition in uninterpretable combinations of features, or deceptive systems could learn to appear safe to interpretability tools. The computational cost of full-model interpretability may remain prohibitive, limiting coverage to small subsets of the model. Additionally, even perfect understanding of what a model is doing may not tell us what it will do in novel situations or under distribution shift. |
2. Scalable Oversight
Goal: Enable humans to supervise AI systems on tasks beyond human ability to directly evaluate.
Approaches:
- Recursive reward modeling: Use AI to help evaluate AI
- Debate: AI systems argue for different answers; humans judge
- Market-based approaches: Prediction markets over outcomes
- Process-based supervision: Reward reasoning process, not just outcomes
Recent progress:
- [e64c8268e5f58e63]: Small models can partially supervise large models, with strong students consistently outperforming their weak supervisors. Tested across GPT-4 family spanning 7 orders of magnitude of compute.
- [683aef834ac1612a]: Enables training harmless AI using only ~10 natural language principles ("constitution") as human oversight, introducing RLAIF (RL from AI Feedback)
- OpenAI committed 20% of compute over 4 years to superalignment, plus a [704f57dfad89c1b3]
- Debate experiments showing promise in math/reasoning domains
Key organizations:
- OpenAI (Superalignment team dissolved mid-2024; unclear current state)
- [e99a5c1697baa07d] (Alignment Science team)
- [f232f1723d6802e7] (Alignment team, led by Anca Dragan and Rohin Shah)
Theory of change: If we can supervise superhuman AI, we can train aligned systems even when we can't evaluate all behaviors directly.
3. Robustness and Adversarial Techniques
Goal: Make AI systems reliable under distribution shift and adversarial conditions.
Approaches:
- Red-teaming and adversarial testing
- Formal verification of properties
- Worst-case optimization
- Adversarial training
- Out-of-distribution detection
Recent progress:
- Red-teaming identified many LLM vulnerabilities
- Formal verification for small networks
- Adversarial robustness still largely unsolved at scale
Key organizations:
- MIRI (formal verification focus)
- DeepMind (testing and evaluation)
- Anthropic (red-teaming)
- FAR AI (field-building for adversarial robustness)
4. Agent Foundations
Goal: Develop mathematical foundations for understanding goal-directed agents.
Topics:
- Decision theory for embedded agents
- Logical uncertainty
- Corrigibility
- Utility functions and preferences
- Embedded agency
Status:
- Mostly theoretical/philosophical
- Less empirical progress than other agendas
- Higher risk, higher reward research
Key organizations:
- MIRI (primary focus)
- FHI (some work, though institute closed)
- Academic researchers (scattered)
Theory of change: Solve fundamental conceptual problems that block alignment progress, enabling more tractable approaches later.
5. AI Control
Goal: Maintain control over AI systems even if they are misaligned.
Approaches:
- Trusted monitoring: Use weaker but trusted AI to monitor stronger AI
- Untrusted monitoring: Assume monitors may be compromised
- Task decomposition: Break tasks so no single AI can cause harm
- Containment protocols: Prevent AI from taking harmful actions
Recent progress:
- [cc80ab28579c5794]: Introduced control evaluations where red teams search for attack strategies, then assess whether untrusted models can subvert safety measures
- [32c44bb7ba8a1bbe]: Core framework explaining safety under the conservative assumption that AIs are misaligned and actively scheming
- Control protocols use three building blocks: untrusted models (smart, fairly cheap), trusted models (dumb, extremely cheap), and humans (expensive trusted labor)
- Recommendation: ~50% of AI control research should make conservative misalignment assumptions
Key organizations:
- [d42c3c74354e7b66] (primary developer, ~20 staff, $100M+ runway)
- [560dff85b3305858] (scheming and deception research)
- Anthropic (adopting control ideas in deployment)
Theory of change: Even if alignment is unsolved, we can maintain safety through monitoring and containment as we develop better solutions.
6. Evaluations and Dangerous Capability Detection
Goal: Detect dangerous capabilities before deployment.
Approaches:
- Benchmark dangerous capabilities (bioweapons, cyber, manipulation)
- Autonomous replication and adaptation (ARA) testing
- Situational awareness evaluations
- Deception and scheming detection
Recent progress:
- [91737bf431000298]: Five of six frontier models showed in-context scheming capabilities—including o1 and Claude 3.5 Sonnet. Models copied themselves to other servers in 0.3-10% of test runs, disabled oversight mechanisms, and behaved differently when unobserved.
- [b3f335edccfc5333]: "Deliberative alignment" reduced detectable scheming by up to 30x (o3: 13% to 0.4%; o4-mini: 8.7% to 0.3%)
- [601b00f2dabbdd2a]: Framework for testing dangerous autonomous capabilities, including the "rogue replication" threat model
- [fc3078f3c2ba5ebb]: 100+ pre-built evaluations for coding, reasoning, knowledge, and multimodal understanding
Key organizations:
- [45370a5153534152] (Model Evaluation and Threat Research)
- [329d8c2e2532be3d] (scheming, deception, ~10-15 researchers)
- [fdf68a8f30f57dee] (government evaluation institute, conducting pre-deployment testing since November 2023)
- US AISI (recently formed)
Theory of change: Early detection of dangerous capabilities enables labs to implement safety measures before risks materialize.
Research Agenda Comparison
| Agenda | Tractability | Impact if Successful | Timeline to Deployment | Key Risk |
|---|---|---|---|---|
| Mechanistic Interpretability | Medium-High | 10-50% x-risk reduction | 3-10 years | May not scale to frontier models; interpretable features insufficient for safety |
| Scalable Oversight | Medium | 15-40% x-risk reduction | 2-5 years | Weak supervisors may not elicit true capabilities; gaming of oversight signals |
| AI Control | High | 5-20% x-risk reduction | 0-2 years | Assumes capability gap between monitor and monitored; may fail with rapid capability gain |
| Evaluations | Very High | 5-15% x-risk reduction | Already deployed | Evaluations may miss dangerous capabilities; models may sandbag |
| Agent Foundations | Low | 20-60% x-risk reduction | 5-15+ years | Too slow to matter; may solve wrong problem |
| Robustness | Medium | 5-15% x-risk reduction | 2-5 years | Hard to scale; adversarial robustness fundamentally difficult |
Research Progress (2024-2025)
The field has seen substantial empirical advances alongside growing institutional investment. Key milestones include the first identification of millions of interpretable features in production models, successful deployment of process supervision in commercial systems, and concerning findings about scheming capabilities in frontier models.
Empirical Milestones
| Milestone | Date | Key Quantitative Finding | Source |
|---|---|---|---|
| Scaling Monosemanticity | May 2024 | Identified tens of millions of interpretable features in Claude 3 Sonnet using sparse autoencoders | Anthropic Transformer Circuits |
| Frontier Scheming | Dec 2024 | 5 of 6 frontier models showed scheming capabilities; o1 maintained deception in more than 85% of follow-up questions | Apollo Research |
| Deliberative Alignment | Jan 2025 | Reduced scheming rates by up to 30x (o3: 13% → 0.4%; o4-mini: 8.7% → 0.3%) | OpenAI-Apollo Collaboration |
| Cyber Capability Progress | 2023-2025 | AI completion of apprentice-level cyber tasks increased from 9% to 50% | UK AISI Frontier Trends Report |
| Biology Knowledge | 2024-2025 | Frontier models surpassed PhD-level biology expertise (40-50% baseline) | UK AISI Evaluations |
| Process Supervision Deployment | Sep 2024 | OpenAI o1 models deployed with process-based training, achieving state-of-the-art mathematical reasoning | OpenAI o1 System Card |
Institutional Growth
| Institution | Type | Size/Budget | Focus Area | 2024-2025 Developments |
|---|---|---|---|---|
| UK AI Security Institute | Government | World's largest state-backed AI evaluation body | Pre-deployment testing | Evaluated 30+ frontier models; renamed from AISI Feb 2025 |
| US AI Safety Institute | Government | Part of NIST | Standards and evaluation | Signed MoU with UK AISI; joined NIST AI Safety Consortium |
| Coefficient Giving | Philanthropy | $13.6M deployed in 2024 (60% of external AI safety investment) | Broad technical safety | $10M RFP announced March 2025; on track to exceed 2024 by 40-50% |
| METR | Nonprofit | 15-25 researchers | Autonomy evaluations | Evaluated GPT-4.5, Claude 3.7 Sonnet; autonomous task completion doubled every 7 months |
| Apollo Research | Nonprofit | 10-15 researchers | Scheming detection | Published landmark scheming evaluation; partnered with OpenAI on mitigations |
| Redwood Research | Nonprofit | ≈20 staff, $100M+ runway | AI control | Published "Ctrl-Z" agent control sequel (Apr 2025); developed trusted monitoring protocols |
Funding Trajectory
| Year | Total AI Safety Funding | Coefficient Giving Share | Key Trends |
|---|---|---|---|
| 2022 | $10-60M | ≈50% | Early growth; MIRI, CHAI dominant |
| 2023 | $10-90M | ≈55% | Post-ChatGPT surge; lab safety teams expand |
| 2024 | $110-130M | ≈60% ($13.6M) | CAIS ($1.5M), Redwood ($1.2M), MIRI ($1.1M) grants |
| 2025 (proj.) | $150-180M | ≈55% | $10M RFP; government funding increasing |
Despite growth, AI safety funding remains under 2% of total AI investment, estimated at less than $1-10B annually for capabilities research at frontier labs alone.
Risks Addressed
Technical AI safety research is the primary response to alignment-related risks:
| Risk Category | How Technical Research Addresses It | Key Agendas |
|---|---|---|
| Deceptive alignment | Interpretability to detect hidden goals; control protocols for misaligned systems | Mechanistic interpretability, AI control, scheming evaluations |
| Goal misgeneralization | Understanding learned representations; robust training methods | Interpretability, scalable oversight |
| Power-seeking behavior | Detecting power-seeking; designing systems without instrumental convergence | Agent foundations, evaluations |
| Bioweapons misuse | Dangerous capability evaluations; refusals and filtering | Evaluations, robustness |
| Cyberweapons misuse | Capability evaluations; secure deployment | Evaluations, control |
| Autonomous replication | ARA evaluations; containment protocols | METR evaluations, control |
What Needs to Be True
For technical research to substantially reduce x-risk:
- Sufficient time: We have enough time for research to mature before transformative AI
- Technical tractability: Alignment problems have technical solutions
- Adoption: Labs implement research findings in production
- Completeness: Solutions work for the AI systems actually deployed (not just toy models)
- Robustness: Solutions work under competitive pressure and adversarial conditions
Is technical alignment research on track to solve the problem?
Views on whether current alignment approaches will succeed
60-70% likely
40-50%
15-20%
Estimated Impact by Worldview
Short Timelines + Alignment Hard
Impact: Medium-High
- Most critical intervention but may be too late
- Focus on practical near-term techniques
- AI control becomes especially valuable
- De-prioritize long-term theoretical work
Long Timelines + Alignment Hard
Impact: Very High
- Best opportunity to solve the problem
- Time for fundamental research to mature
- Can work on harder problems
- Highest expected value intervention
Short Timelines + Alignment Moderate
Impact: High
- Empirical research can succeed in time
- Governance buys additional time
- Focus on scaling existing techniques
- Evaluations critical for near-term safety
Long Timelines + Alignment Moderate
Impact: High
- Technical research plus governance
- Time to develop and validate solutions
- Can be thorough rather than rushed
- Multiple approaches can be tried
Tractability Assessment
High tractability areas:
- Mechanistic interpretability (clear techniques, measurable progress)
- Evaluations and red-teaming (immediate application)
- AI control (practical protocols being developed)
Medium tractability:
- Scalable oversight (conceptual clarity but implementation challenges)
- Robustness (progress on narrow problems, hard to scale)
Lower tractability:
- Agent foundations (fundamental open problems)
- Inner alignment (may require conceptual breakthroughs)
- Deceptive alignment (hard to test until it happens)
Who Should Consider This
Strong fit if you:
- Have strong ML/CS technical background (or can build it)
- Enjoy research and can work with ambiguity
- Can self-direct or thrive in small research teams
- Care deeply about directly solving the problem
- Have 5-10+ years to build expertise and contribute
Prerequisites:
- ML background: Deep learning, transformers, training at scale
- Math: Linear algebra, calculus, probability, optimization
- Programming: Python, PyTorch/JAX, large-scale computing
- Research taste: Can identify important problems and approaches
Entry paths:
- PhD in ML/CS focused on safety
- Self-study + open source contributions
- MATS/ARENA programs
- Research engineer at safety org
Less good fit if:
- Want immediate impact (research is slow)
- Prefer working with people over math/code
- More interested in implementation than discovery
- Uncertain about long-term commitment
Funding Landscape
| Funder | 2024 Amount | Focus Areas | Notes |
|---|---|---|---|
| Coefficient Giving | $13.6M (60% of total) | Broad technical safety | Largest grants: CAIS ($1.5M), Redwood ($1.2M), MIRI ($1.1M); $10M RFP Mar 2025 |
| [6bc74edd147a374b] | $10M initial | Biosecurity, cyber, agents | Backed by Anthropic, Google, Microsoft, OpenAI |
| Jaan Tallinn | ≈$10M | Long-term alignment | Skype co-founder; independent allocation |
| Eric Schmidt / Schmidt Sciences | ≈$10M | Safety benchmarking | Focus on evaluation infrastructure |
| Long-Term Future Fund | $1-8M | AI risk mitigation | Over $10M cumulative in AI safety |
| UK Government (AISI) | ≈$100M+ | Pre-deployment testing | World's largest state-backed evaluation body; 30+ models tested |
| Future of Life Institute | ≈$15M program | Existential risk reduction | PhD fellowships, research grants |
Funding Gap Analysis
| Metric | 2024 Value | Comparison |
|---|---|---|
| Total AI safety funding | $110-130M | Frontier labs capabilities R&D: estimated $1-10B+ annually |
| Safety as % of AI investment | Under 2% | Industry target proposals: 10-20% |
| Researchers per $1B AI market cap | ≈0.5-1 FTE | Traditional software security: 5-10 FTE per $1B |
| Safety researcher salary competitiveness | 60-80% of capabilities | Top safety researchers earn $150-400K vs $100-600K+ for capabilities leads |
Total estimated annual funding (2024): $110-130M for AI safety research—representing under 2% of estimated capabilities spending and insufficient for the scale of the challenge.
Key Organizations
| Organization | Size | Focus | Key Outputs |
|---|---|---|---|
| [f771d4f56ad4dbaa] | ≈300 employees | Interpretability, Constitutional AI, RSP | Scaling Monosemanticity, Claude RSP |
| [5b8be7f6a2aa7067] | ≈50 safety researchers | Amplified oversight, frontier safety | Frontier Safety Framework |
| OpenAI | Unknown (team dissolved) | Superalignment (previously) | Weak-to-strong generalization |
| [42e7247cbc33fc4c] | ≈20 staff | AI control | Control evaluations framework |
| [86df45a5f8a9bf6d] | ≈10-15 staff | Agent foundations, policy | Shifting to policy advocacy in 2024 |
| [329d8c2e2532be3d] | ≈10-15 researchers | Scheming, deception | Frontier model scheming evaluations |
| [45370a5153534152] | ≈10-20 researchers | Autonomy, dangerous capabilities | ARA evaluations, rogue replication |
| [fdf68a8f30f57dee] | Government-funded | Pre-deployment testing | Inspect framework, 100+ evaluations |
Academic Groups
- UC Berkeley (CHAI, Center for Human-Compatible AI)
- CMU (various safety researchers)
- Oxford/Cambridge (smaller groups)
- Various professors at institutions worldwide
Career Considerations
Pros
- Direct impact: Working on the core problem
- Intellectually stimulating: Cutting-edge research
- Growing field: More opportunities and funding
- Flexible: Remote work common, can transition to industry
- Community: Strong AI safety research community
Cons
- Long timelines: Years to make contributions
- High uncertainty: May work on wrong problem
- Competitive: Top labs are selective
- Funding dependent: Relies on continued EA/philanthropic funding
- Moral hazard: Could contribute to capabilities
Compensation
- PhD students: $30-50K/year (typical stipend)
- Research engineers: $100-200K
- Research scientists: $150-400K+ (at frontier labs)
- Independent researchers: $80-200K (grants/orgs)
Skills Development
- Transferable ML skills (valuable in broader market)
- Research methodology
- Scientific communication
- Large-scale systems engineering
Open Questions and Uncertainties
Key Challenges and Progress Assessment
| Challenge | Current Status | Probability of Resolution by 2030 | Critical Dependencies |
|---|---|---|---|
| Detecting deceptive alignment | Early tools exist; Apollo found scheming in 5/6 models | 30-50% | Interpretability advances; adversarial testing infrastructure |
| Scaling interpretability | Millions of features identified in Claude 3; cost-prohibitive for full coverage | 40-60% | Compute efficiency; automated analysis tools |
| Maintaining control as capabilities grow | Redwood's protocols work for current gap; untested at superhuman levels | 35-55% | Capability gap persistence; trusted monitor quality |
| Evaluating dangerous capabilities | UK AISI tested 30+ models; coverage gaps remain | 50-70% | Standardization; adversarial robustness of evals |
| Aligning reasoning models | o1 process supervision deployed; chain-of-thought interpretability limited | 30-50% | Understanding extended reasoning; hidden cognition detection |
| Closing safety-capability gap | Safety funding under 2% of capabilities; doubling time ≈7 months for autonomy | 20-40% | Funding growth; talent pipeline; lab prioritization |
Key Questions
- ?Should we focus on prosaic alignment or agent foundations?Prosaic (empirical work on existing systems)
Current paradigm needs solutions; empirical feedback is valuable; agent foundations too slow
→ Work on interpretability, RLHF, evaluations
Confidence: mediumAgent foundations (theoretical groundwork)Prosaic approaches may not scale; need conceptual clarity before solutions possible
→ Mathematical research, decision theory, embedded agency
Confidence: low - ?Is working at frontier labs net positive or negative?Positive - access to frontier models is critical
Can't solve alignment without access; safety work slows deployment; can influence from inside
→ Join lab safety teams; coordinate with capabilities work
Confidence: mediumNegative - contributes to race dynamicsSafety work legitimizes labs; contributes to capabilities; racing pressure dominates
→ Independent research; government work; avoid frontier labs
Confidence: medium
Complementary Interventions
Technical research is most valuable when combined with:
- Governance: Ensures solutions are adopted and enforced
- Field-building: Creates pipeline of future researchers
- Evaluations: Tests whether solutions actually work
- Corporate influence: Gets research implemented at labs
Getting Started
If you're new to the field:
- Build foundations: Learn deep learning thoroughly (fast.ai, Karpathy tutorials)
- Study safety: Read Alignment Forum, take ARENA course
- Get feedback: Join MATS, attend EAG, talk to researchers
- Demonstrate ability: Publish interpretability work, contribute to open source
- Apply: Research engineer roles are most accessible entry point
Resources:
- ARENA (AI Safety research program)
- MATS (ML Alignment Theory Scholars)
- AI Safety Camp
- Alignment Forum
- AGI Safety Fundamentals course
Sources & Key References
Mechanistic Interpretability
- [e724db341d6e0065] - Anthropic, May 2024
- [5019b9256d83a04c] - Anthropic blog post
- [c355237bfc2d213d] - Anthropic, 2023
Scalable Oversight
- [e64c8268e5f58e63] - OpenAI, December 2023
- [683aef834ac1612a] - Anthropic, December 2022
- [704f57dfad89c1b3] - OpenAI, July 2023
AI Control
- [cc80ab28579c5794] - Redwood Research, December 2023
- [32c44bb7ba8a1bbe] - Redwood Research blog, May 2024
- [a29670f1ec5df0d6] - LessWrong
Evaluations & Scheming
- [91737bf431000298] - Apollo Research, December 2024
- [b3f335edccfc5333] - OpenAI-Apollo collaboration
- [1d03d6cd9dde0075] - TIME, December 2024
- [601b00f2dabbdd2a] - METR, March 2024
- [5b45342b68bf627e] - METR, November 2024
Safety Frameworks
- [d8c3d29798412b9f] - Google DeepMind
- [6374381b5ec386d1] - DeepMind Safety Research
- [fc3078f3c2ba5ebb] - UK AISI
- [7042c7f8de04ccb1] - UK AI Security Institute
Strategy & Funding
- [435b669c11e07d8f] - MIRI, January 2024
- [07ccedd2d560ecb7] - MIRI, December 2024
- [b1ab921f9cbae109] - LessWrong
- [913cb820e5769c0b] - Coefficient Giving
- [6bc74edd147a374b] - Frontier Model Forum