Research Areas
AI safety research fields, techniques, and programs — with key papers, active organizations, and grant funding data.
82 research areas across 9 clusters (62 active, 19 emerging). All 82 are listed in the table below.
| Research Area | Description | Cluster | Status | Key Papers | Orgs | Grants | Funding | Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AI Control | Research on deploying AI systems with safeguards sufficient to prevent harm even if the systems are misaligned. | AI Control | active | 2 | 1 | 0 | - | 2024 |
| Circuit Breakers | Inference-time interventions that halt model execution when unsafe behavior is detected. | AI Control | active | 0 | 0 | 0 | - | 2024 |
| Encoded Reasoning / Steganography Detection | Detecting hidden meaning or covert communication in AI-generated chains of thought. | AI Control | emerging | 0 | 0 | 0 | - | - |
| Monitoring & Anomaly Detection | Runtime behavioral monitoring of deployed AI systems to catch unexpected behavior. | AI Control | active | 0 | 0 | 0 | - | - |
| Multi-Agent Safety | Safety challenges and solutions for systems of multiple interacting AI agents. | AI Control | active | 0 | 0 | 0 | - | - |
| Output Filtering | Post-generation safety filters that screen model outputs before delivery. | AI Control | active | 0 | 0 | 0 | - | - |
| Sandboxing / Containment | Physical and logical isolation of AI systems to limit potential harm. | AI Control | active | 0 | 0 | 0 | - | - |
| Structured Access / API-Only | Deployment models that restrict access to model weights, providing only API interfaces. | AI Control | active | 0 | 0 | 0 | - | 2022 |
| Tool-Use Restrictions | Limiting which external tools and actions AI agents can access. | AI Control | active | 0 | 0 | 0 | - | - |
| Adversarial Training | Training on adversarial examples to improve robustness against attacks and jailbreaks. | Alignment Training | active | 0 | 0 | 0 | - | - |
| Alternatives to Adversarial Training | Latent adversarial training, circuit breaking, and other non-standard robustness approaches. | Alignment Training | emerging | 0 | 0 | 0 | - | - |
| Constitutional AI | Training methodology using explicit principles and AI-generated feedback (RLAIF) to train safer language models. | Alignment Training | active | 1 | 1 | 0 | - | 2022 |
| Direct Preference Optimization | Family of reward-free alignment methods (DPO, KTO, IPO, ORPO, GRPO) that bypass explicit reward model training (objective sketched below the table). | Alignment Training | active | 3 | 1 | 0 | - | 2023 |
| Process Supervision | Step-level reward signals for reasoning verification, as opposed to outcome-only rewards. | Alignment Training | active | 0 | 1 | 0 | - | 2023 |
| Refusal Training | Safety-specific fine-tuning to train models to decline harmful or dangerous requests. | Alignment Training | active | 0 | 0 | 0 | - | - |
| Reward Modeling | Training neural networks on human preference comparisons to provide scalable reward signals for RL fine-tuning. | Alignment Training | active | 0 | 1 | 0 | - | 2017 |
| RLHF | Reinforcement Learning from Human Feedback: a training technique that fine-tunes AI models using human preference ratings to align outputs with human values. | Alignment Training | active | 4 | 3 | 0 | - | 2017 |
| Robust Unlearning | Removing dangerous knowledge from model weights in a way that resists relearning. | Alignment Training | emerging | 0 | 0 | 0 | - | - |
| Supervised Fine-Tuning / Instruction Tuning | Foundational alignment method: fine-tuning on human-written demonstrations of desired behavior. | Alignment Training | mature | 0 | 0 | 0 | - | 2022 |
| Weak-to-Strong Generalization | Using weaker models to supervise stronger ones as a proxy for scalable oversight research. | Alignment Training | active | 0 | 1 | 0 | - | 2023 |
| AI & Biosecurity | Research on AI-enabled biological risks and AI tools for biosecurity defense. | Biosecurity | active | 0 | 0 | 0 | - | - |
| Dual-Use AI Research | Policy and technical research on managing research that could enable both beneficial and harmful uses. | Biosecurity | active | 0 | 0 | 0 | - | - |
| Agentic AI | Research on AI systems that take autonomous actions: tool use, computer use, multi-step planning. | Capabilities Research | active | 0 | 0 | 0 | - | - |
| AI Forecasting | Using AI systems for prediction markets, forecasting, and calibration. | Capabilities Research | active | 0 | 0 | 0 | - | - |
| AI Persuasion | Research on AI systems' ability to influence human beliefs and behavior. | Capabilities Research | active | 0 | 0 | 0 | - | - |
| AI Reasoning | Research on chain-of-thought, tree-of-thoughts, and other reasoning improvements. | Capabilities Research | active | 0 | 0 | 0 | - | - |
| AI Scaling Laws | Empirical research on how model performance scales with compute, data, and parameters. | Capabilities Research | active | 3 | 2 | 0 | - | 2020 |
| LLM Psychology / Black-Box Behavior | Understanding stable values, commitments, and behavioral patterns in large language models. | Capabilities Research | emerging | 0 | 0 | 0 | - | - |
| Recursive Self-Improvement | AI systems improving their own code, training, or architecture. | Capabilities Research | active | 0 | 0 | 0 | - | - |
| Situational Awareness | Research on AI systems understanding their own training, deployment context, and evaluation status. | Capabilities Research | active | 0 | 0 | 0 | - | 2024 |
| AI Evaluations | Systematic testing and measurement of AI system capabilities, alignment, and safety properties. | Evaluation | active | 4 | 0 | 0 | - | - |
| Alignment Evaluations | Testing whether AI systems are actually aligned, not just capable of appearing aligned. | Evaluation | active | 0 | 0 | 0 | - | - |
| Alignment Faking Experiments | Studying when and why AI systems pretend to be aligned during testing. | Evaluation | emerging | 0 | 1 | 0 | - | 2024 |
| Backdoor Detection | Detecting adversarially implanted vulnerabilities in model weights. | Evaluation | active | 0 | 0 | 0 | - | - |
| Capability Elicitation | Methods for discovering hidden or latent capabilities in AI systems. | Evaluation | active | 0 | 0 | 0 | - | - |
| Control Evaluations | Stress-testing systems designed to constrain AI behavior; monitoring for collusion. | Evaluation | emerging | 0 | 0 | 0 | - | - |
| Dangerous Capability Evaluations | Testing AI systems for CBRN, cyber, autonomy, and other dangerous capabilities. | Evaluation | active | 0 | 0 | 0 | - | - |
| Epistemic Virtue Evaluations | Testing AI systems for epistemic honesty, calibration, and intellectual humility. | Evaluation | emerging | 0 | 0 | 0 | - | - |
| Evaluation Awareness | Studying how AI systems might game evaluations by detecting when they are being tested. | Evaluation | active | 0 | 0 | 0 | - | - |
| Jailbreak Research | Finding, categorizing, and patching prompt injection and jailbreak attacks. | Evaluation | active | 0 | 0 | 0 | - | - |
| Red Teaming | Adversarial testing of AI systems to discover failure modes, both manual and automated. | Evaluation | active | 3 | 0 | 0 | - | - |
| Reward Hacking of Human Oversight | Empirically investigating how AI systems deceive or manipulate human evaluators. | Evaluation | emerging | 0 | 0 | 0 | - | - |
| Scheming / Deception Detection | Behavioral and mechanistic tests for detecting deceptive behavior in AI systems. | Evaluation | active | 0 | 0 | 0 | - | - |
| Sleeper Agent Detection | Detecting planted backdoors and conditional misbehavior in trained models. | Evaluation | active | 0 | 1 | 0 | - | 2024 |
| AI Standards & Certification | NIST, ISO, and other frameworks for AI safety standards and compliance. | Governance | active | 0 | 0 | 0 | - | - |
| AI Treaty Verification | Technical mechanisms for verifying compliance with international AI agreements. | Governance | emerging | 0 | 0 | 0 | - | - |
| Compute Governance | Using compute as a lever for AI governance: export controls, chip tracking, licensing. | Governance | active | 0 | 0 | 0 | - | - |
| Frontier Model Regulation | Legislative and regulatory approaches to governing the most capable AI systems. | Governance | active | 0 | 0 | 0 | - | - |
| Hardware-Enabled Governance | On-chip monitoring, tamper-evident hardware, and compute verification mechanisms. | Governance | emerging | 0 | 0 | 0 | - | - |
| International AI Coordination | Treaties, summits, and multilateral agreements for AI governance. | Governance | active | 0 | 0 | 0 | - | - |
| Lab Safety Culture | Research on organizational safety practices, whistleblower protections, and internal governance. | Governance | active | 0 | 0 | 0 | - | - |
| Model Registries | Tracking and cataloging deployed AI models for accountability. | Governance | emerging | 0 | 0 | 0 | - | - |
| Open-Source AI Governance | Policy research on risks and benefits of open-weight model release. | Governance | active | 0 | 0 | 0 | - | - |
| Responsible Scaling Policies | Lab self-governance frameworks tying deployment decisions to capability evaluations. | Governance | active | 0 | 0 | 0 | - | 2023 |
| Content Authentication / Provenance | C2PA, watermarking, and other cryptographic methods for verifying content origin. | Information Integrity | active | 0 | 0 | 0 | - | - |
| Deepfake Detection | Technical methods for detecting AI-generated images, audio, and video. | Information Integrity | active | 0 | 0 | 0 | - | - |
| Epistemic Security | Protecting information ecosystems from AI-enabled manipulation and degradation. | Information Integrity | emerging | 0 | 0 | 0 | - | - |
| Hallucination Reduction | Retrieval-augmented generation, grounding, and other methods to reduce model confabulation. | Information Integrity | active | 0 | 0 | 0 | - | - |
| Sycophancy Research | Understanding and mitigating AI systems' tendency to agree with users rather than be truthful. | Information Integrity | active | 0 | 0 | 0 | - | - |
| Activation Monitoring | Using probes on internal activations during inference to catch misaligned actions in real time. | Interpretability | emerging | 0 | 0 | 0 | - | - |
| Externalizing Reasoning | Training models to reason through extended, readable chains of thought rather than opaque internal computation. | Interpretability | emerging | 0 | 0 | 0 | - | - |
| Finding Feature Representations | Research beyond SAEs into alternative methods for identifying latent features in model activations. | Interpretability | emerging | 0 | 0 | 0 | - | - |
| Interpretability | Umbrella field of research aimed at understanding how AI models work internally. | Interpretability | active | 3 | 0 | 0 | - | - |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | Interpretability | emerging | 0 | 0 | 0 | - | - |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features (code sketch below the table). | Interpretability | active | 0 | 0 | 0 | - | - |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior. | Interpretability | active | 2 | 2 | 0 | - | 2020 |
| Representation Engineering | Intervening on model representations to steer behavior (e.g., activation addition, representation reading). | Interpretability | active | 0 | 1 | 0 | - | 2023 |
| Sparse Autoencoders | Using sparse dictionary learning to extract interpretable features from model activations at scale (code sketch below the table). | Interpretability | active | 0 | 1 | 0 | - | 2023 |
| Toy Models for Interpretability | Small simplified model proxies that capture key deep learning dynamics for interpretability research. | Interpretability | active | 0 | 0 | 0 | - | - |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable. | Interpretability | emerging | 0 | 0 | 0 | - | - |
| Agent Foundations | Theoretical foundations for reasoning about goal-directed AI systems (MIRI-style research). | Scalable Oversight | active | 0 | 0 | 0 | - | 2014 |
| AI Safety via Debate | Using adversarial debate between AI systems to help humans evaluate complex claims. | Scalable Oversight | active | 0 | 0 | 0 | - | 2018 |
| Cooperative AI | Research on AI systems that cooperate with humans and other AI systems. | Scalable Oversight | active | 0 | 0 | 0 | - | - |
| Corrigibility | Research on building AI systems that allow themselves to be corrected, modified, or shut down. | Scalable Oversight | active | 0 | 0 | 0 | - | 2015 |
| Eliciting Latent Knowledge | Getting AI systems to honestly report what they know, even when deception would be rewarded. | Scalable Oversight | active | 1 | 1 | 0 | - | 2021 |
| Formal Verification | Mathematical proofs of neural network properties and safety guarantees. | Scalable Oversight | active | 0 | 0 | 0 | - | - |
| Natural Abstractions | Hypothesis that natural abstractions generalize across observers, providing a basis for alignment. | Scalable Oversight | active | 0 | 0 | 0 | - | 2022 |
| Provably Safe AI | Davidad's agenda for building AI systems with mathematical safety guarantees from world models. | Scalable Oversight | active | 0 | 0 | 0 | - | 2023 |
| Scalable Oversight | Research on supervising AI systems that approach or exceed human-level capabilities. | Scalable Oversight | active | 3 | 2 | 0 | - | 2018 |
| Theoretical Study of Inductive Biases | Understanding generalization properties and likelihood of scheming from training dynamics. | Scalable Oversight | emerging | 0 | 0 | 0 | - | - |
| Value Learning | Research on AI systems that learn and internalize human values through interaction. | Scalable Oversight | active | 0 | 0 | 0 | - | - |
| White-Box Estimation of Rare Misbehavior | Predicting probability of rare harmful outputs using model internals. | Scalable Oversight | emerging | 0 | 0 | 0 | - | - |
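Illustrative sketches
The notes below make a few of the more technique-level entries in the table concrete. They are minimal sketches under stated assumptions, not reference implementations from any of the listed organizations.
Direct Preference Optimization. The canonical DPO objective (Rafailov et al., 2023) replaces RLHF's explicit reward model with a log-probability-ratio contrast between the policy being trained and a frozen reference policy, over prompts x with a preferred response y_w and a dispreferred response y_l; beta is a temperature and sigma the logistic function:
```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```
Minimizing this single supervised loss does the work of reward modeling plus RL fine-tuning in the RLHF pipeline, which is why the table describes the family as reward-free.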
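Linear Probing. The Linear Probing entry reduces to a few lines of code. The sketch below is illustrative only: the activation files, layer choice, and concept labels are hypothetical placeholders, not references to any particular dataset or codebase. It fits a logistic-regression probe on cached activations and checks whether a labeled concept is linearly decodable.
```python
# Minimal linear-probe sketch: can a linear classifier on a model's
# internal activations recover a labeled concept? (illustrative only)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed inputs: one cached activation vector per prompt (e.g. the
# residual stream at a chosen layer) plus a binary concept label.
activations = np.load("layer12_activations.npy")  # shape (n_prompts, d_model); hypothetical file
labels = np.load("concept_labels.npy")            # shape (n_prompts,); values in {0, 1}

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# A plain logistic-regression probe; high held-out accuracy suggests the
# concept is (approximately) linearly represented at this layer.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))

# The learned weight vector can be reused as a candidate direction for
# monitoring or steering experiments.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```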
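Sparse Autoencoders. The Sparse Autoencoders entry follows the same dictionary-learning recipe at much larger scale. This toy PyTorch sketch shows the common structure (linear encoder, ReLU, L1 sparsity penalty on the latent code, linear decoder); the dimensions, sparsity coefficient, and random batch are arbitrary placeholders rather than settings from any published implementation.
```python
# Toy sparse autoencoder over model activations (illustrative sketch).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> overcomplete feature space
        self.decoder = nn.Linear(d_dict, d_model)  # features -> reconstructed activation

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))     # non-negative, hopefully sparse codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    mse = ((reconstruction - x) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# One training step on a batch of cached activations (random stand-in data here).
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(64, 768)                       # placeholder for real activations
reconstruction, features = sae(batch)
loss = sae_loss(batch, reconstruction, features)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}, active features per example: "
      f"{(features > 0).float().sum(dim=-1).mean().item():.1f}")
```
The interpretability claim is that individual dictionary features learned this way are more human-interpretable than raw neurons; checking that claim is part of what the Interpretability Benchmarks entry covers.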