Longterm Wiki

Research Areas

AI safety research fields, techniques, and programs, with key papers, active organizations, and grant funding data.

82 research areas across 9 clusters: 62 active, 19 emerging, 1 mature.
AI Control: Research on deploying AI systems with safeguards strong enough to limit harm even if the systems are misaligned.
[AI Control · active · papers: 2 · orgs: 1 · grants: 0 · since 2024]
Circuit Breakers: Inference-time interventions that halt model execution when unsafe behavior is detected.
[AI Control · active · papers: 0 · orgs: 0 · grants: 0 · since 2024]
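
A minimal sketch of how an inference-time circuit breaker might work, assuming a linear probe already trained to flag unsafe hidden states; the probe weights, layer choice, and threshold are hypothetical stand-ins, not a reference implementation:

```python
# Illustrative circuit breaker: halt generation when a probe on a
# mid-layer hidden state scores above a threshold. `w`, `b`, the layer,
# and the threshold are all assumed to come from prior probe training.
import torch

class UnsafeBehavior(Exception):
    pass

def circuit_breaker_hook(threshold: float, w: torch.Tensor, b: torch.Tensor):
    """Forward hook that raises when the unsafe-behavior probe fires."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        score = torch.sigmoid(hidden[:, -1, :] @ w + b)  # P(unsafe state)
        if score.max().item() > threshold:
            raise UnsafeBehavior(f"probe score {score.max().item():.2f}")
        return output
    return hook

# Usage (hypothetical model): attach to one layer, catch the exception in
# the serving loop, and return a refusal instead of the partial output.
# model.transformer.h[12].register_forward_hook(circuit_breaker_hook(0.9, w, b))
```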
Encoded Reasoning / Steganography Detection: Detecting hidden meaning or covert communication in AI-generated chains of thought.
[AI Control · emerging · papers: 0 · orgs: 0 · grants: 0]
Monitoring & Anomaly Detection: Runtime behavioral monitoring of deployed AI systems to catch unexpected behavior.
[AI Control · active · papers: 0 · orgs: 0 · grants: 0]
Multi-Agent Safety: Safety challenges and solutions for systems of multiple interacting AI agents.
[AI Control · active · papers: 0 · orgs: 0 · grants: 0]
Output Filtering: Post-generation safety filters that screen model outputs before delivery.
[AI Control · active · papers: 0 · orgs: 0 · grants: 0]
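
A minimal sketch of a post-generation filter, assuming some trained safety classifier exists; `generate`, `classify_harm`, and the threshold are hypothetical stand-ins:

```python
# Illustrative output filter: screen a draft completion before delivery
# and substitute a safe fallback if a harm classifier flags it.
from typing import Callable

def filtered_generate(generate: Callable[[str], str],
                      classify_harm: Callable[[str], float],
                      prompt: str,
                      threshold: float = 0.5) -> str:
    draft = generate(prompt)
    if classify_harm(draft) > threshold:
        return "I can't help with that."  # blocked: safe fallback
    return draft
```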
Sandboxing / Containment: Physical and logical isolation of AI systems to limit potential harm.
[AI Control · active · papers: 0 · orgs: 0 · grants: 0]
Structured Access / API-Only: Deployment models that restrict access to model weights, providing only API interfaces.
[AI Control · active · papers: 0 · orgs: 0 · grants: 0 · since 2022]
Tool-Use Restrictions: Limiting which external tools and actions AI agents can access.
[AI Control · active · papers: 0 · orgs: 0 · grants: 0]
Adversarial Training: Training on adversarial examples to improve robustness against attacks and jailbreaks.
[Alignment Training · active · papers: 0 · orgs: 0 · grants: 0]
Alternatives to Adversarial Training: Latent adversarial training, circuit breaking, and other non-standard robustness approaches.
[Alignment Training · emerging · papers: 0 · orgs: 0 · grants: 0]
Constitutional AI: Training methodology using explicit principles and AI-generated feedback (RLAIF) to train safer language models.
[Alignment Training · active · papers: 1 · orgs: 1 · grants: 0 · since 2022]
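
The supervised stage of Constitutional AI builds its training data with a critique-and-revision loop, roughly as in this sketch; `llm` and the prompt templates are hypothetical stand-ins for the actual templates in Bai et al. (2022):

```python
# Illustrative critique-and-revision loop: the model critiques its own
# response against a constitutional principle, then revises it. Revised
# responses become SFT targets; an AI preference model over responses
# drives the subsequent RLAIF stage.
def critique_revise(llm, prompt: str, response: str, principle: str) -> str:
    critique = llm(f"Critique this response by the principle '{principle}'.\n"
                   f"Prompt: {prompt}\nResponse: {response}\nCritique:")
    revision = llm(f"Rewrite the response to address the critique.\n"
                   f"Prompt: {prompt}\nResponse: {response}\n"
                   f"Critique: {critique}\nRevision:")
    return revision
```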
Direct Preference Optimization: A family of reward-free alignment methods (DPO, KTO, IPO, ORPO, GRPO) that bypass explicit reward model training.
[Alignment Training · active · papers: 3 · orgs: 1 · grants: 0 · since 2023]
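
For reference, the vanilla DPO loss from Rafailov et al. (2023); the listed variants modify this objective in different ways:

```python
# DPO objective, given summed token log-probabilities of the chosen and
# rejected responses under the policy and a frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # maximize the margin between implicit rewards of chosen vs. rejected
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```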
Process Supervision: Step-level reward signals for reasoning verification, as opposed to outcome-only rewards.
[Alignment Training · active · papers: 0 · orgs: 1 · grants: 0 · since 2023]
Refusal Training: Safety-specific fine-tuning to train models to decline harmful or dangerous requests.
[Alignment Training · active · papers: 0 · orgs: 0 · grants: 0]
Reward Modeling: Training neural networks on human preference comparisons to provide scalable reward signals for RL fine-tuning.
[Alignment Training · active · papers: 0 · orgs: 1 · grants: 0 · since 2017]
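
Reward models are typically trained with the pairwise Bradley-Terry loss, as in Christiano et al. (2017): the reward of the preferred response should exceed that of the dispreferred one.

```python
# Pairwise reward-model loss: P(preferred beats rejected) is modeled as
# sigmoid(r_preferred - r_rejected), trained by maximum likelihood.
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor):
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```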
RLHF: Reinforcement Learning from Human Feedback, a training technique that fine-tunes AI models using human preference ratings to align outputs with human values.
[Alignment Training · active · papers: 4 · orgs: 3 · grants: 0 · since 2017]
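
The RL stage is commonly written as maximizing the learned reward under a KL penalty that keeps the policy $\pi_\theta$ close to the pre-RL reference policy $\pi_{\mathrm{ref}}$:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\!\left[r_\phi(x, y)\right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y\mid x)\,\middle\|\,\pi_{\mathrm{ref}}(y\mid x)\right]$$

where $r_\phi$ is the learned reward model and $\beta$ controls how far the fine-tuned policy may drift from the reference.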
Robust Unlearning: Removing dangerous knowledge from model weights in a way that resists relearning.
[Alignment Training · emerging · papers: 0 · orgs: 0 · grants: 0]
Supervised Fine-Tuning / Instruction Tuning: The foundational alignment method of fine-tuning on human-written demonstrations of desired behavior.
[Alignment Training · mature · papers: 0 · orgs: 0 · grants: 0 · since 2022]
Weak-to-Strong Generalization: Using weaker models to supervise stronger ones as an empirical proxy for humans overseeing superhuman systems.
[Alignment Training · active · papers: 0 · orgs: 1 · grants: 0 · since 2023]
AI & Biosecurity: Research on AI-enabled biological risks and AI tools for biosecurity defense.
[Biosecurity · active · papers: 0 · orgs: 0 · grants: 0]
Dual-Use AI Research: Policy and technical work on managing research that could enable both beneficial and harmful uses.
[Biosecurity · active · papers: 0 · orgs: 0 · grants: 0]
Agentic AI: Research on AI systems that take autonomous actions: tool use, computer use, multi-step planning.
[Capabilities Research · active · papers: 0 · orgs: 0 · grants: 0]
AI Forecasting: Using AI systems for prediction markets, forecasting, and calibration.
[Capabilities Research · active · papers: 0 · orgs: 0 · grants: 0]
AI Persuasion: Research on AI systems' ability to influence human beliefs and behavior.
[Capabilities Research · active · papers: 0 · orgs: 0 · grants: 0]
AI Reasoning: Research on chain-of-thought, tree-of-thoughts, and other reasoning improvements.
[Capabilities Research · active · papers: 0 · orgs: 0 · grants: 0]
AI Scaling Laws: Empirical research on how model performance scales with compute, data, and parameters.
[Capabilities Research · active · papers: 3 · orgs: 2 · grants: 0 · since 2020]
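
A representative example is the Chinchilla parametric fit (Hoffmann et al., 2022), which models loss as a function of parameter count $N$ and training tokens $D$:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with fitted constants of roughly $E \approx 1.69$, $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, and $\beta \approx 0.28$; minimizing this under a fixed compute budget yields the paper's compute-optimal scaling prescription.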
LLM Psychology / Black-Box Behavior: Understanding stable values, commitments, and behavioral patterns in large language models.
[Capabilities Research · emerging · papers: 0 · orgs: 0 · grants: 0]
Recursive Self-Improvement: AI systems improving their own code, training, or architecture.
[Capabilities Research · active · papers: 0 · orgs: 0 · grants: 0]
Situational Awareness: Research on AI systems understanding their own training, deployment context, and evaluation status.
[Capabilities Research · active · papers: 0 · orgs: 0 · grants: 0 · since 2024]
AI Evaluations: Systematic testing and measurement of AI system capabilities, alignment, and safety properties.
[Evaluation · active · papers: 4 · orgs: 0 · grants: 0]
Alignment Evaluations: Testing whether AI systems are actually aligned, not just capable of appearing aligned.
[Evaluation · active · papers: 0 · orgs: 0 · grants: 0]
Alignment Faking Experiments: Studying when and why AI systems pretend to be aligned during testing.
[Evaluation · emerging · papers: 0 · orgs: 1 · grants: 0 · since 2024]
Backdoor Detection: Detecting adversarially implanted vulnerabilities in model weights.
[Evaluation · active · papers: 0 · orgs: 0 · grants: 0]
Capability Elicitation: Methods for discovering hidden or latent capabilities in AI systems.
[Evaluation · active · papers: 0 · orgs: 0 · grants: 0]
Control Evaluations: Stress-testing systems designed to constrain AI behavior; monitoring for collusion.
[Evaluation · emerging · papers: 0 · orgs: 0 · grants: 0]
Dangerous Capability Evaluations: Testing AI systems for CBRN, cyber, autonomy, and other dangerous capabilities.
[Evaluation · active · papers: 0 · orgs: 0 · grants: 0]
Epistemic Virtue Evaluations: Testing AI systems for epistemic honesty, calibration, and intellectual humility.
[Evaluation · emerging · papers: 0 · orgs: 0 · grants: 0]
Evaluation Awareness: Studying how AI systems might game evaluations by detecting when they are being tested.
[Evaluation · active · papers: 0 · orgs: 0 · grants: 0]
Jailbreak Research: Finding, categorizing, and patching prompt injection and jailbreak attacks.
[Evaluation · active · papers: 0 · orgs: 0 · grants: 0]
Red Teaming: Adversarial testing of AI systems to discover failure modes, both manual and automated.
[Evaluation · active · papers: 3 · orgs: 0 · grants: 0]
Reward Hacking of Human Oversight: Empirically investigating how AI systems deceive or manipulate human evaluators.
[Evaluation · emerging · papers: 0 · orgs: 0 · grants: 0]
Scheming / Deception Detection: Behavioral and mechanistic tests for detecting deceptive behavior in AI systems.
[Evaluation · active · papers: 0 · orgs: 0 · grants: 0]
Sleeper Agent Detection: Detecting planted backdoors and conditional misbehavior in trained models.
[Evaluation · active · papers: 0 · orgs: 1 · grants: 0 · since 2024]
AI Standards & Certification: NIST, ISO, and other frameworks for AI safety standards and compliance.
[Governance · active · papers: 0 · orgs: 0 · grants: 0]
AI Treaty Verification: Technical mechanisms for verifying compliance with international AI agreements.
[Governance · emerging · papers: 0 · orgs: 0 · grants: 0]
Compute Governance: Using compute as a lever for AI governance: export controls, chip tracking, licensing.
[Governance · active · papers: 0 · orgs: 0 · grants: 0]
Frontier Model Regulation: Legislative and regulatory approaches to governing the most capable AI systems.
[Governance · active · papers: 0 · orgs: 0 · grants: 0]
Hardware-Enabled Governance: On-chip monitoring, tamper-evident hardware, and compute verification mechanisms.
[Governance · emerging · papers: 0 · orgs: 0 · grants: 0]
International AI Coordination: Treaties, summits, and multilateral agreements for AI governance.
[Governance · active · papers: 0 · orgs: 0 · grants: 0]
Lab Safety Culture: Research on organizational safety practices, whistleblower protections, and internal governance.
[Governance · active · papers: 0 · orgs: 0 · grants: 0]
Model Registries: Tracking and cataloging deployed AI models for accountability.
[Governance · emerging · papers: 0 · orgs: 0 · grants: 0]
Open-Source AI Governance: Policy research on risks and benefits of open-weight model release.
[Governance · active · papers: 0 · orgs: 0 · grants: 0]
Responsible Scaling Policies: Lab self-governance frameworks tying deployment decisions to capability evaluations.
[Governance · active · papers: 0 · orgs: 0 · grants: 0 · since 2023]
Content Authentication / Provenance: C2PA, watermarking, and other cryptographic methods for verifying content origin.
[Information Integrity · active · papers: 0 · orgs: 0 · grants: 0]
Deepfake Detection: Technical methods for detecting AI-generated images, audio, and video.
[Information Integrity · active · papers: 0 · orgs: 0 · grants: 0]
Epistemic Security: Protecting information ecosystems from AI-enabled manipulation and degradation.
[Information Integrity · emerging · papers: 0 · orgs: 0 · grants: 0]
Hallucination Reduction: Retrieval-augmented generation, grounding, and other methods to reduce model confabulation.
[Information Integrity · active · papers: 0 · orgs: 0 · grants: 0]
Sycophancy Research: Understanding and mitigating AI systems' tendency to agree with users rather than be truthful.
[Information Integrity · active · papers: 0 · orgs: 0 · grants: 0]
Activation Monitoring: Using probes on internal activations during inference to catch misaligned actions in real time.
[Interpretability · emerging · papers: 0 · orgs: 0 · grants: 0]
Externalizing Reasoning: Training models to reason through extended, readable chains of thought rather than opaque internal computation.
[Interpretability · emerging · papers: 0 · orgs: 0 · grants: 0]
Finding Feature Representations: Research into methods beyond sparse autoencoders for identifying latent features in model activations.
[Interpretability · emerging · papers: 0 · orgs: 0 · grants: 0]
Interpretability: Umbrella field of research aimed at understanding how AI models work internally.
[Interpretability · active · papers: 3 · orgs: 0 · grants: 0]
Interpretability Benchmarks: Standardized tasks and metrics for comparing interpretability methods.
[Interpretability · emerging · papers: 0 · orgs: 0 · grants: 0]
Linear Probing: Lightweight interpretability using linear classifiers on model activations to detect features.
[Interpretability · active · papers: 0 · orgs: 0 · grants: 0]
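
A linear probe is typically just a logistic regression fit on cached activations; a minimal sketch with stand-in data (a real probe uses activations extracted from a specific model layer):

```python
# Fit a logistic-regression probe on cached residual-stream activations
# to detect a binary feature. Random arrays stand in for real data, so
# the printed accuracy will be near chance here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.random.randn(1000, 4096)       # stand-in for layer activations
labels = np.random.randint(0, 2, 1000)   # stand-in feature labels

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```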
Mechanistic Interpretability: Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior.
[Interpretability · active · papers: 2 · orgs: 2 · grants: 0 · since 2020]
Representation Engineering: Intervening on model representations to steer behavior (e.g., activation addition, representation reading).
[Interpretability · active · papers: 0 · orgs: 1 · grants: 0 · since 2023]
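
One common intervention is activation addition: adding a precomputed steering vector to the residual stream during the forward pass. A sketch, assuming the vector was derived from contrasting prompt pairs; the model path and layer index are hypothetical:

```python
# Illustrative activation addition via a forward hook: shift one layer's
# hidden states along a steering direction with strength `alpha`.
import torch

def steering_hook(vector: torch.Tensor, alpha: float = 4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector  # add the steering direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# handle = model.transformer.h[15].register_forward_hook(steering_hook(v))
# ... generate steered text ...
# handle.remove()
```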
Sparse Autoencoders: Using sparse dictionary learning to extract interpretable features from model activations at scale.
[Interpretability · active · papers: 0 · orgs: 1 · grants: 0 · since 2023]
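
A minimal sparse autoencoder of this kind: an overcomplete ReLU dictionary trained to reconstruct activations under an L1 sparsity penalty (the width and penalty coefficient below are illustrative choices):

```python
# Sparse dictionary learning on activations: encode into many sparsely
# active features, decode back, and penalize feature magnitude.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # d_dict >> d_model
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).mean()       # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()   # L1 penalty drives sparsity
    return recon + l1_coeff * sparsity
```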
Toy Models for Interpretability: Small, simplified proxy models that capture key deep-learning dynamics for interpretability research.
[Interpretability · active · papers: 0 · orgs: 0 · grants: 0]
Transparent Architectures: Designing neural network architectures that are inherently more interpretable.
[Interpretability · emerging · papers: 0 · orgs: 0 · grants: 0]
Agent Foundations: Theoretical foundations for reasoning about goal-directed AI systems (MIRI-style research).
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0 · since 2014]
AI Safety via Debate: Using adversarial debate between AI systems to help humans evaluate complex claims.
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0 · since 2018]
Cooperative AI: Research on AI systems that cooperate with humans and other AI systems.
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0]
Corrigibility: Research on building AI systems that allow themselves to be corrected, modified, or shut down.
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0 · since 2015]
Eliciting Latent Knowledge: Getting AI systems to honestly report what they know, even when deception would be rewarded.
[Scalable Oversight · active · papers: 1 · orgs: 1 · grants: 0 · since 2021]
Formal Verification: Mathematical proofs of neural network properties and safety guarantees.
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0]
Natural Abstractions: The hypothesis that natural abstractions generalize across observers, providing a basis for alignment.
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0 · since 2022]
Provably Safe AI: Davidad's agenda for building AI systems with mathematical safety guarantees from world models.
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0 · since 2023]
Scalable Oversight: Research on supervising AI systems that approach or exceed human-level capabilities.
[Scalable Oversight · active · papers: 3 · orgs: 2 · grants: 0 · since 2018]
Theoretical Study of Inductive Biases: Understanding generalization properties and the likelihood of scheming from training dynamics.
[Scalable Oversight · emerging · papers: 0 · orgs: 0 · grants: 0]
Value Learning: Research on AI systems that learn and internalize human values through interaction.
[Scalable Oversight · active · papers: 0 · orgs: 0 · grants: 0]
White-Box Estimation of Rare Misbehavior: Predicting the probability of rare harmful outputs using model internals.
[Scalable Oversight · emerging · papers: 0 · orgs: 0 · grants: 0]