Large Language Models
Comprehensive analysis of LLM capabilities showing rapid progress from GPT-2 (1.5B parameters, 2019) to GPT-5 and Gemini 2.5 (2025), with training costs growing 2.4x annually and projected to exceed \$1B by 2027. Documents emergence of inference-time scaling paradigm, mechanistic interpretability advances including Gemma Scope 2, multilingual alignment research, factuality benchmarking via FACTS suite, and identifies key safety concerns including 8-45% hallucination rates, persuasion capabilities, and growing autonomous agent capabilities.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Capability Level | Near-human to superhuman on structured tasks | o3 achieves 87.5% on ARC-AGI (human baseline ≈85%)1; 87.7% on GPQA Diamond1 |
| Progress Rate | 2–3× capability improvement per year | Stanford AI Index 2025[^2]: benchmark scores rose 18–67 percentage points in one year2 |
| Training Cost Trend | 2.4× annual growth3 | Epoch AI: frontier model training costs projected to exceed $1B by 20273 |
| Inference Cost Trend | 280× reduction since 20222 | GPT-3.5-equivalent dropped from $10 to $1.07 per million tokens2 |
| Hallucination Rates | 8–45% depending on task | Vectara leaderboard: best models reportedly at ≈8%4; HalluLens: up to 45% on factual queries5 |
| Safety Maturity | Moderate | Constitutional AI and RLHF established; responsible scaling policies implemented by major labs6 |
| Open-Closed Gap | Narrowing | Gap reportedly shrunk from 8.04% to 1.70% on Chatbot Arena (Jan 2024 → Feb 2025)7 |
Key Links
| Source | Link |
|---|---|
| Official Website | learn.microsoft.com |
| Wikipedia | en.wikipedia.org |
| arXiv | arxiv.org |
Overview
Large Language Models (LLMs) are transformer-based neural networks trained on vast text corpora using next-token prediction. Despite this deceptively simple training objective, LLMs exhibit sophisticated emergent capabilities including reasoning, coding, scientific analysis, and complex task execution. These models have transformed abstract AI safety discussions into concrete, immediate concerns and are widely regarded as the clearest current path toward artificial general intelligence.
The core insight underlying LLMs is that training a model to predict the next word in a sequence—a task achievable without labeled data—produces internal representations useful for a wide range of downstream tasks. This approach was explored in early unsupervised pretraining work from OpenAI in 2018.1 OpenAI's GPT-2 (2019) then demonstrated coherent multi-paragraph generation at scale, showing that larger models trained on more data produced qualitatively stronger outputs.2 An earlier indicator that such models form interpretable internal structure was the discovery of an "unsupervised sentiment neuron" in 2017, which emerged without any sentiment-specific supervision.3
Current frontier models—including GPT-4o, Claude Opus 4.5, Gemini 2.5 Pro, and Llama 4—demonstrate near-human or superhuman performance across diverse cognitive domains. Training runs for leading frontier systems reportedly consume hundreds of millions of dollars,4 and model parameter counts have reached into the hundreds of billions to trillions.5 These substantial computational investments have shifted AI safety from theoretical to practical urgency. The late 2024–2025 period marked a paradigm shift toward inference-time compute scaling, with reasoning models such as OpenAI's o1 and o3 achieving higher performance on reasoning benchmarks by allocating more compute at inference rather than only at training time.6
A parallel development is the rapid growth of the open-weight ecosystem. Meta's Llama family has grown substantially since its initial release, with Meta reporting over a billion downloads and more than ten times the developer activity compared to 2023.7 Google's Gemma models—including Gemma 3 and the mobile-first Gemma 3n variants—have provided the safety research community with accessible architectures for mechanistic interpretability work.8 This open/closed model convergence has implications for both capability diffusion and the tractability of safety interventions.
Capability Architecture
The diagram below maps the flow from training through inference to observed capabilities. Capabilities are grouped by which inference regime produces them; the distinction between "standard" and "search-augmented" inference is a key axis along which safety-relevant behaviors (extended planning, autonomous task execution) emerge.
Note: Capabilities in the "Emergent Capabilities" subgraph are descriptive, not evaluative. Safety implications of each capability class are discussed in the Concerning Capabilities Assessment and Safety-Relevant Positive Capabilities sections below.
Risk Assessment
| Risk Category | Severity | Likelihood | Timeline | Trend |
|---|---|---|---|---|
| Deceptive Capabilities | High | Moderate | 1–3 years | Increasing |
| Persuasion & Manipulation | High | High | Current | Accelerating |
| Autonomous Cyber Operations | Moderate-High | Moderate | 2–4 years | Increasing |
| Scientific Research Acceleration | Mixed | High | Current | Accelerating |
| Economic Disruption | High | High | 2–5 years | Accelerating |
Capability Progression Timeline
| Model | Release | Parameters | Key Breakthrough | Performance Milestone |
|---|---|---|---|---|
| GPT-2 | Feb 20191 | 1.5B1 | Coherent text generation | Initially withheld for safety concerns; full 1.5B release Nov 20191 |
| GPT-3 | Jun 20202 | 175B2 | Few-shot learning emergence | Creative writing, basic coding2 |
| GPT-4 | Mar 20233 | Undisclosed3 | Multimodal reasoning | Reportedly ≈90th percentile SAT, bar exam passing3 |
| GPT-4o | May 20244 | Undisclosed | Multimodal speed/cost | Real-time audio-visual, described as 2x faster than GPT-4 Turbo4 |
| Claude 3.5 Sonnet | Jun 20245 | Undisclosed | Advanced tool use | 86.5% MMLU, leading SWE-bench at release5 |
| o1 | Sep 20246 | Undisclosed | Chain-of-thought reasoning | 77.3% GPQA Diamond, 74% AIME 20246 |
| o3 | Dec 20247 | Undisclosed | Inference-time search | 87.7% GPQA Diamond, 91.6% AIME 20247 |
| Gemini 2.5 Pro | Mar 20258 | Undisclosed | Long-context reasoning | 1M-token context, leading coding benchmarks at release8 |
| Llama 4 | Apr 20257 | Undisclosed | Natively multimodal open-weight | Mixture-of-experts architecture7 |
| GPT-5 | May 20259 | Undisclosed | Unified reasoning + tool use | Highest reported scores at release on GPQA Diamond and SWE-bench9 |
| Claude Opus 4.5 | Reportedly Nov 2025 | Undisclosed | Extended reasoning | Reportedly 80.9% SWE-bench Verified |
| GPT-5.2 | Reportedly late 2025 | Undisclosed | Deep thinking modes | Reportedly 93.2% GPQA Diamond, 90.5% ARC |
Primary sources: OpenAI model announcements, Anthropic model cards, Google DeepMind blog.
Benchmark Performance Comparison (2024–2025)
| Benchmark | Measures | GPT-4o (2024) | o1 (2024) | o3 (2024) | Human Expert |
|---|---|---|---|---|---|
| GPQA Diamond | PhD-level science | ≈50%6 | 77.3%6 | 87.7%7 | ≈69.7% (PhD experts)6 |
| AIME 2024 | Competition math | 13.4%6 | 74%6 | 91.6%7 | Top 500 US |
| MMLU | General knowledge | 84.2%5 | 90.8%6 | ≈92%7 | 89.8%6 |
| SWE-bench Verified | Real GitHub issues | 33.2%6 | 48.9%6 | 71.7%7 | N/A |
| ARC-AGI | Novel reasoning | ≈5%7 | 13.3%7 | 87.5%7 | ≈85%7 |
| Codeforces | Competitive coding | ≈11th percentile6 | 89th percentile6 | 99.8th percentile7 | N/A |
Sources: OpenAI o1 system card6, ARC Prize o3 analysis7, Anthropic Claude 3.5 model card5.
The o3 results represent a qualitative shift: o3 achieved nearly human-level performance on ARC-AGI (87.5%) versus a ~85% human baseline7, a benchmark specifically designed to test general reasoning rather than pattern matching. On FrontierMath, o3 reportedly solved 25.2% of problems compared to o1's 2%—a roughly 12x improvement that suggests reasoning capabilities may be scaling faster than expected7. However, on the harder ARC-AGI-2 benchmark, o3 scores only ~3% compared to ~60% for average humans7, revealing significant limitations in truly novel reasoning tasks.
Scaling Laws and Predictable Progress
Core Scaling Relationships
Research by Kaplan et al. (2020),1 refined by Hoffmann et al. (2022),2 demonstrates robust mathematical relationships governing LLM performance:
| Factor | Scaling Law | Implication |
|---|---|---|
| Model Size | Loss ∝ N^−0.076 | 10x parameters → ≈16% lower loss |
| Training Data | Loss ∝ D^−0.095 | 10x data → ≈20% lower loss |
| Compute | Loss ∝ C^−0.050 | 10x compute → ≈11% lower loss |
| Optimal Ratio | N and D scale roughly equally (≈20 tokens per parameter) | Chinchilla scaling for efficiency |
Sources: Chinchilla paper2; Scaling Laws for Neural Language Models1
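These power laws translate directly into loss-reduction arithmetic. A minimal sketch, using the Kaplan et al. exponents from the table above (note it models loss only; the mapping from loss to downstream benchmark scores is not part of the law):

```python
# Kaplan et al. (2020) power-law exponents: loss falls as a power law
# in parameters N, data D, and compute C.
ALPHA_N, ALPHA_D, ALPHA_C = 0.076, 0.095, 0.050

def loss_reduction_factor(scale_up: float, alpha: float) -> float:
    """Multiplicative change in loss when an input grows by `scale_up`."""
    return scale_up ** (-alpha)

# A 10x increase in each input lowers loss by roughly 16%, 20%, and 11%:
print(round(1 - loss_reduction_factor(10, ALPHA_N), 2))  # → 0.16
print(round(1 - loss_reduction_factor(10, ALPHA_D), 2))  # → 0.2
print(round(1 - loss_reduction_factor(10, ALPHA_C), 2))  # → 0.11
```

The small exponents are why capability gains require order-of-magnitude scale increases rather than incremental ones.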
According to Epoch AI research,3 approximately two-thirds of LLM performance improvements over the last decade are attributable to increases in model scale, with training techniques contributing roughly 0.4 orders of magnitude per year in compute efficiency.3 The cost of training frontier models has reportedly grown by 2.4x per year since 2016, with the largest models projected to exceed $1B by 2027, according to Epoch AI cost analyses.4
A related phenomenon is the emergence of new scaling regimes beyond training compute. Research published in 2025 finds that emergent capability thresholds are sensitive to random factors including data ordering and initialization seeds,5 suggesting that the apparent sharpness of emergence in aggregate curves may partly reflect averaging over many random runs with different thresholds rather than a clean phase transition.5
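A toy simulation illustrates the averaging effect. Assuming each training run acquires the capability as a sharp step at a run-specific threshold (the step function and the log-uniform threshold spread are illustrative assumptions, not fitted to any real model), the aggregate curve still rises gradually:

```python
import random

random.seed(0)

def run_capability(scale: float, threshold: float) -> float:
    """One training run: a sharp step — capability appears only past its threshold."""
    return 1.0 if scale >= threshold else 0.0

# Each run's emergence threshold varies with seed and data ordering
# (assumed log-uniform between 1e9 and 1e11 for illustration).
thresholds = [10 ** random.uniform(9, 11) for _ in range(1000)]

def mean_capability(scale: float) -> float:
    return sum(run_capability(scale, t) for t in thresholds) / len(thresholds)

# The aggregate climbs smoothly across two orders of magnitude even though
# every individual run is a step function.
for scale in (1e9, 1e10, 1e11):
    print(scale, round(mean_capability(scale), 2))
```

The smooth aggregate is an artifact of averaging over runs, not evidence that any single run improves gradually.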
The Shift to Inference-Time Scaling (2024–2025)
The o1 and o3 model families introduced a new paradigm: inference-time compute scaling. Rather than only scaling training compute, these models allocate additional computation at inference time through extended reasoning chains and search procedures.6
| Scaling Type | Mechanism | Trade-off | Example |
|---|---|---|---|
| Pre-training scaling | More parameters, data, training compute | High upfront cost, fast inference | GPT-4, Claude 3.5 |
| Inference-time scaling | Longer reasoning chains, search | Lower training cost, expensive inference | o1, o3 |
| Combined scaling | Both approaches | Maximum capability, maximum cost | GPT-5, Claude Opus 4.5 |
This shift has implications for AI safety: inference-time scaling allows models to "think longer" on hard problems, potentially achieving strong performance on specific tasks while maintaining manageable training costs. According to some sources, o1 is approximately 6x more expensive and 30x slower than GPT-4o per query.7 The Stanford AI Index 2025 cites the RE-Bench evaluation finding that in short time-horizon settings (2-hour budget), top AI systems score 4x higher than human experts, but as the time budget increases to 32 hours, human performance surpasses AI by roughly 2 to 1.8
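One simple, well-known inference-time scaling method is self-consistency: sample several independent reasoning chains and take a majority vote over their final answers. A hedged sketch (the `sample_answer` stand-in assumes 60% per-chain accuracy; a real system would call a model):

```python
import random
from collections import Counter

random.seed(1)

def sample_answer(question: str) -> str:
    """Stand-in for one stochastic reasoning chain (assumed 60% accurate)."""
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

def self_consistency(question: str, k: int) -> str:
    """Spend more inference compute: sample k chains, return the majority answer."""
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]

# Accuracy of the majority vote rises with the inference budget k.
trials = 200
for k in (1, 5, 25):
    acc = sum(self_consistency("q", k) == "42" for _ in range(trials)) / trials
    print(k, round(acc, 2))
```

The same question answered with a larger sampling budget becomes markedly more reliable, which is the economic trade-off of the table above in miniature.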
Emergent Capability Thresholds
| Capability | Emergence Scale | Evidence | Safety Relevance |
|---|---|---|---|
| Few-shot learning | ≈100B parameters | GPT-3 breakthrough | Tool use foundation |
| Chain-of-thought | ≈10B parameters | PaLM, GPT-3 variants | Complex reasoning |
| Code generation | ≈1B parameters | Codex, GitHub Copilot | Cyber capabilities |
| Instruction following | ≈10B parameters | InstructGPT | Human-AI interaction paradigm |
| PhD-level reasoning | o1+ scale | GPQA Diamond performance | Expert-level autonomy |
| Strategic planning | o3 scale | ARC-AGI performance | Deception potential |
Research documented in a 2025 emergent abilities survey7 finds that emergent abilities depend on multiple interacting factors: scaling up parameters or depth lowers the threshold for emergence but is neither necessary nor sufficient alone—data quality, diversity, training objectives, and architecture modifications also matter significantly.7 Emergence aligns more closely with pre-training loss landmarks than with sheer parameter count; smaller models can match larger ones if training loss is sufficiently reduced.7
According to the Stanford AI Index 2025,8 benchmark performance improved substantially in a single year: scores rose by 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively.8 The gap between leading US and Chinese models also narrowed—from 17.5 to 0.3 percentage points on MMLU—over the same period.8
Safety implication: As AI systems gain autonomous reasoning capabilities, they also develop behaviors relevant to safety evaluation, including goal persistence, strategic planning, and the capacity for deceptive alignment. OpenAI's o3-mini reportedly became the first AI model to receive a "Medium risk" classification for Model Autonomy under internal capability evaluation frameworks.9
RLHF and Alignment Training Foundations
A key technique underlying modern aligned LLMs is Reinforcement Learning from Human Feedback (RLHF), which trains a reward model on human preference comparisons and uses it to fine-tune the language model via RL. The foundational application of this approach to large language models was demonstrated in Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019), which showed that human feedback could steer model behavior on stylistic tasks. The approach was later scaled to instruction following via InstructGPT and extended in Anthropic's Constitutional AI framework.
| RLHF Component | Function | Safety Relevance |
|---|---|---|
| Reward model | Converts human preferences into a differentiable signal | Central to RLHF alignment |
| PPO fine-tuning | Updates language model to maximize reward | Can introduce reward hacking |
| Constitutional AI | Replaces human labelers with model self-critique against principles | Scales alignment oversight |
| Multilingual consistency | Enforces safety behaviors across languages | Align Once, Benefit Multilingually (2025) |
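The reward-model component can be made concrete. Under the standard Bradley–Terry formulation used in RLHF, the reward model is trained to minimize the negative log-probability that the human-preferred response outscores the rejected one. A minimal sketch of the loss (real implementations compute this over model outputs in a framework like PyTorch):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley–Terry loss on one preference pair:
    -log P(chosen beats rejected) = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Loss shrinks as the reward model scores the preferred response higher.
print(round(preference_loss(0.0, 0.0), 3))  # → 0.693  (no separation)
print(round(preference_loss(2.0, 0.0), 3))  # → 0.127  (correct ordering)
print(round(preference_loss(0.0, 2.0), 3))  # → 2.127  (wrong ordering penalized)
```

Because only the reward difference matters, the learned rewards are scale-free, which is one reason reward hacking can exploit arbitrary score inflation on out-of-distribution outputs.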
A significant limitation is that RLHF-trained models can exhibit sycophancy—systematically agreeing with user beliefs rather than providing accurate responses—because human raters often prefer confident, agreeable answers. Recent work on multi-objective alignment for psychotherapy contexts and policy-constrained alignment frameworks like PolicyPad (2025) explore how to balance multiple competing alignment objectives.
Multilingual alignment gap: A consistent finding across alignment research is that safety training applied primarily in English does not reliably transfer to other languages. Align Once, Benefit Multilingually (2025) proposes methods to enforce multilingual consistency in safety alignment by training on cross-lingual consistency losses. Related work on Helpful to a Fault (2025) measures illicit assistance rates across 40+ languages in multi-turn interactions, finding that multilingual agents provide more potentially harmful assistance than English-only evaluations would suggest, particularly in low-resource languages where safety fine-tuning data is sparse.
Major 2025 Model Releases
GPT-5 and the GPT-5 Family
GPT-5, released in May 2025, represents OpenAI's integration of reasoning and general capability into a single unified model, replacing the separate GPT-4o and o1 product lines. According to the GPT-5 System Card, GPT-5 achieves the highest scores to date on GPQA Diamond, SWE-bench Verified, and MMLU among OpenAI models at the time of release.
Key developments in the GPT-5 family:
- GPT-5.1: Released for developers in mid-2025, optimized for conversational applications and described as "smarter, more conversational" with improved instruction-following. Cursor and Tolan reported substantial gains in their agentic pipelines.
- gpt-oss-120b and gpt-oss-20b: Open-weight models released by OpenAI, with model cards published for both sizes. These represent a shift toward open-weight strategy alongside closed frontier models.
- GPT Realtime API: Extended with real-time audio dialog capabilities, building on the GPT-4o voice capabilities introduced in Hello GPT-4o (2024).
The GPT-5 System Card documents safety evaluations including assessments of CBRN uplift risk, cyberoffense capabilities, and persuasion. An Addendum on Sensitive Conversations addresses handling of mental health, self-harm, and politically contentious topics, noting both improvements in refusal precision and remaining cases where the model provides responses that require strengthening.
Gemini 2.5 Family
Google DeepMind's Gemini 2.5 family, released across March–June 2025, introduced several models:
| Model | Key Feature | Context Window | Primary Use Case |
|---|---|---|---|
| Gemini 2.5 Pro | Highest capability, coding-focused | 1M tokens | Complex reasoning, coding |
| Gemini 2.5 Flash | Speed-optimized frontier | 1M tokens | Scaled production use |
| Gemini 2.5 Flash-Lite | Cost/latency optimized | 1M tokens | High-volume inference |
Gemini 2.5: Our Most Intelligent AI Model describes the Pro variant as achieving leading performance on coding benchmarks including LiveCodeBench and outperforming GPT-4o on MMLU at release. Gemini 2.5 Flash-Lite was made production-ready in June 2025 with a focus on throughput-sensitive applications.
The 2.5 family also introduced native thinking models—models that produce explicit chain-of-thought reasoning before answering—across both Pro and Flash tiers. Advanced audio dialog and generation with Gemini 2.5 extended audio-native generation capabilities.
Gemma 3 and Open-Weight Models
Google's Gemma 3 family, released in 2025, provides open-weight models ranging from 1B to 27B parameters optimized for single-accelerator deployment. The Gemma 3 270M variant targets edge and mobile deployment. Gemma 3n introduced a mobile-first architecture with selective parameter activation for on-device inference.
MedGemma, released in mid-2025, provides open health-specific models demonstrating LLM application in clinical reasoning. A Gemma-based model contributed to discovery of a potential cancer therapy pathway, illustrating scientific research acceleration potential.
T5Gemma introduced encoder-decoder variants of the Gemma architecture, enabling use cases where separate encoding and decoding is beneficial (e.g., retrieval-augmented generation, classification tasks).
Llama 4 and the Open-Weight Ecosystem
Meta's Llama 4 Herd, announced at LlamaCon in April 2025, represents a shift to natively multimodal architecture using a mixture-of-experts design. Llama 4 Scout and Llama 4 Maverick support image, video, and text inputs from the base model level. Meta reported over 10x growth in Llama usage since 2023, with the model family becoming a reference implementation for open-weight AI development.
Key safety implications of the open-weight ecosystem:
- Fine-tuning safety guardrails out of open-weight models remains tractable for technically sophisticated users
- Mechanistic interpretability research benefits from open weights (e.g., Gemma Scope 2, described below)
- Governance frameworks targeting API access do not apply to locally deployed open-weight models
Mechanistic Interpretability Advances
Mechanistic interpretability research—which seeks to understand the internal computations of neural networks in human-interpretable terms—has accelerated substantially with the availability of open-weight models and new tooling.
Gemma Scope 2
Gemma Scope 2 is a suite of sparse autoencoders (SAEs) trained on Gemma 3 models, released by Google DeepMind to support interpretability research on complex language model behaviors. Building on the original Gemma Scope release for Gemma 2, Gemma Scope 2 provides SAEs at multiple layers and widths, enabling decomposition of model activations into human-interpretable features.
Gemma Scope 2 supports research into:
- Feature geometry and polysemanticity in larger models
- Cross-layer feature interactions and information flow
- Identification of features relevant to safety-relevant behaviors (deception, refusal, sycophancy)
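The core SAE computation is small enough to sketch: encode an activation vector into an overcomplete, non-negative feature vector, reconstruct the activation, and penalize reconstruction error plus an L1 sparsity term. The dimensions and weights below are toy illustrations, not values from Gemma Scope:

```python
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec):
    """Sparse autoencoder: decompose an activation vector into (hopefully)
    interpretable features, then reconstruct it."""
    f = relu([h + b for h, b in zip(matvec(W_enc, x), b_enc)])  # feature activations
    x_hat = matvec(W_dec, f)                                    # reconstruction
    l2 = sum((a - b) ** 2 for a, b in zip(x, x_hat))            # reconstruction loss
    l1 = sum(f)                                                 # sparsity penalty
    return f, x_hat, l2 + 0.01 * l1

# Toy 2-d activation decomposed over 4 candidate features.
x = [1.0, 0.5]
W_enc = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b_enc = [0.0] * 4
W_dec = [[1, 0, -1, 0], [0, 1, 0, -1]]
f, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec)
print(f)      # → [1.0, 0.5, 0.0, 0.0]  (only two features fire: a sparse code)
print(x_hat)  # → [1.0, 0.5]            (perfect reconstruction in this toy case)
```

Real SAEs do the same thing at transformer scale: tens of thousands of features per layer, trained so that each input activates only a handful of them.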
Language Models Explaining Neurons
Language Models Can Explain Neurons in Language Models (Bills et al., 2023) demonstrated that GPT-4 can generate natural language explanations of GPT-2 neurons at scale, with automated scoring indicating explanation quality approaching that of human-written explanations. This opened a scalable pathway for automated interpretability: using more capable models to explain less capable ones. Subsequent work has extended this to sparse autoencoder features and cross-model explanation transfer.
Activation Steering and Causal Intervention
Activation steering—injecting vectors into model residual streams to steer behavior—has become a primary tool for behavioral intervention research. Recent work has refined understanding of when and why steering succeeds:
- Surgical Activation Steering via Generative Causal Mediation (2025) demonstrates that steering effectiveness depends on correctly identifying the causal pathway through which a concept influences output, rather than simply finding directions in activation space. Steering at incorrect layers or attention heads produces unreliable or null effects.
- Mechanistic Indicators of Steering Effectiveness in Large Language Models (2025) identifies model-internal signals that predict whether a given steering vector will successfully alter behavior, enabling more principled selection of intervention targets.
- Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models (2025) finds that personality-related features form geometric structures in activation space that can interfere with unrelated steering directions, suggesting that alignment-relevant features are not always cleanly separable.
- Does GPT-2 Represent Controversy? A Small Mech Interp Investigation (2024) provides a case study of applying mechanistic interpretability methods to identify controversy-related representations in GPT-2, illustrating the workflow for small-scale interpretability research.
- Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability (2025) proposes using interpretability-discovered features as reward signals for RLHF, bypassing the need for human labeling of open-ended tasks.
- Causality is Key for Interpretability Claims to Generalise (2025) argues that interpretability claims must be evaluated using causal interventions rather than correlation alone, since correlational analyses can identify spurious structure that does not causally influence model behavior.
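Whatever the subtleties above, the basic mechanics of steering are simple: add a chosen vector to the residual stream at one layer during the forward pass. A toy sketch with hand-rolled "layers" on a 2-d stream (real work hooks a transformer's residual activations, e.g. via framework forward hooks):

```python
def steered_forward(layers, x, steer_layer, steer_vec, alpha=1.0):
    """Run a toy residual stream, injecting a steering vector after one layer.
    `layers` is a list of functions mapping the stream to its additive update."""
    for i, layer in enumerate(layers):
        x = [xv + uv for xv, uv in zip(x, layer(x))]  # residual update
        if i == steer_layer:
            x = [xv + alpha * sv for xv, sv in zip(x, steer_vec)]  # steering injection
    return x

# Toy 2-layer "model" on a 2-d stream (updates are illustrative).
layers = [lambda x: [0.1 * v for v in x], lambda x: [0.0 for _ in x]]
base = steered_forward(layers, [1.0, 0.0], steer_layer=-1, steer_vec=[0.0, 0.0])
steered = steered_forward(layers, [1.0, 0.0], steer_layer=0, steer_vec=[0.0, 1.0], alpha=2.0)
print(base)     # → [1.1, 0.0]
print(steered)  # → [1.1, 2.0]
```

The research cited above concerns which layer, which direction, and which scale `alpha` actually produce the intended behavioral change, not this injection step itself.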
Research Platform Value of Open-Weight Models
| Research Application | Open-Weight Benefit | Example |
|---|---|---|
| Mechanistic interpretability | Full activation access | Gemma Scope 2 on Gemma 3 |
| SAE training | Weight access for feature analysis | Gemma Scope, TranscoderLens |
| Activation steering | Residual stream intervention | Multiple labs using Llama |
| Fine-tuning safety | Rapid iteration | Constitutional AI variants |
| Neuron explanation | Cross-model explanation transfer | Bills et al. 2023 |
Hallucination and Factuality
LLM hallucination—the generation of plausible-sounding but factually incorrect or unsupported content—remains a central reliability and safety challenge. Hallucination rates vary substantially by task, model, and measurement methodology, with published estimates ranging from roughly 8% to 45% depending on benchmark design and model version.10
Factuality Benchmarking
| Benchmark | Scope | Key Finding | Link |
|---|---|---|---|
| FACTS Grounding | Long-form factuality against source documents | Measures supported vs. unsupported claims | FACTS Grounding |
| FACTS Benchmark Suite | Systematic factuality across task types | Decomposes factuality failures by error type | FACTS Suite |
| Vectara Hallucination Leaderboard | Summarization hallucination | Best models: ≈8% hallucination rate | Vectara |
| HalluLens | Factual query hallucination | Up to 45% on factual queries for GPT-4o | HalluLens (ACL 2025) |
| CheckIfExist | Citation hallucination in AI-generated content | Detects fabricated citations in RAG systems | CheckIfExist (2025) |
The FACTS Grounding benchmark11 introduced by Google DeepMind specifically addresses the challenge of evaluating long-form generation against reference documents, distinguishing between claims grounded in provided source material and claims introduced without basis. This is particularly relevant for retrieval-augmented generation (RAG) systems, where the model has access to retrieved context but may still generate unsupported claims.
CheckIfExist (2025)12 addresses citation hallucination specifically—the generation of plausible-looking but non-existent citations, which poses particular risks in legal, medical, and academic contexts. The benchmark finds that citation hallucination rates remain substantial even for frontier models, and that RAG systems can still hallucinate citations from retrieved documents by misattributing or confabulating specific reference details.
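Grounding evaluation of this kind reduces to claim-level verification: split a response into claims and score the fraction supported by the source. A hedged sketch of the scoring loop (`grounding_score` is a hypothetical helper, not any benchmark's API, and the default token-overlap check is only a stand-in for the NLI or judge model a benchmark like FACTS Grounding would use):

```python
def grounding_score(response: str, source: str, is_supported=None) -> float:
    """Fraction of claims in `response` supported by `source`."""
    if is_supported is None:
        # Crude stand-in for a judge model: majority token overlap with the source.
        src_tokens = set(source.lower().split())
        is_supported = lambda claim: (
            len(set(claim.lower().split()) & src_tokens) / max(1, len(claim.split())) > 0.5
        )
    claims = [c.strip() for c in response.split(".") if c.strip()]
    return sum(is_supported(c) for c in claims) / max(1, len(claims))

source = "The report was published in 2024. It covers 12 countries."
good = "The report covers 12 countries."
mixed = "The report covers 12 countries. It won a major award."
print(grounding_score(good, source))   # → 1.0
print(grounding_score(mixed, source))  # → 0.5  (the award claim is ungrounded)
```

The hard part in practice is the judge, not the loop: sentence splitting, claim decomposition, and entailment judgment each introduce their own error rates.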
Why Models Hallucinate
OpenAI's explainer on why language models hallucinate13 identifies several contributing mechanisms:
- Training objective mismatch: Next-token prediction rewards coherent text, not factual accuracy
- Knowledge compression: Models must compress world knowledge into fixed-weight representations, leading to lossy encoding
- Context-weight tension: Models may blend retrieved context with parametric knowledge, producing hybrid outputs that are faithful to neither
- Sycophancy pressure: RLHF can train models to confirm user beliefs rather than correct factual errors, since raters may prefer agreeable responses
- Calibration failure: Models often express high confidence in incorrect claims, reducing the signal value of expressed uncertainty
Hallucination rates are not monotonically improved by scale: models engineered for massive context windows do not consistently achieve lower hallucination rates than smaller counterparts, and increased model size can increase the confidence of hallucinated outputs without reducing their frequency. This suggests hallucination is partly an architectural and training-objective issue rather than purely a capacity limitation.
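The calibration failure mentioned above is commonly quantified with expected calibration error (ECE): bin predictions by stated confidence and average the gap between confidence and accuracy. A minimal sketch:

```python
def expected_calibration_error(preds, n_bins=10):
    """ECE: weighted average of |confidence − accuracy| across confidence bins.
    `preds` is a list of (confidence, was_correct) pairs."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        bins[min(n_bins - 1, int(conf * n_bins))].append((conf, correct))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / len(preds)) * abs(avg_conf - acc)
    return ece

# A model that says "90% sure" but is right only half the time is badly miscalibrated.
overconfident = [(0.9, True), (0.9, False)] * 50
print(round(expected_calibration_error(overconfident), 2))  # → 0.4
```

A well-calibrated model would have ECE near zero, making its expressed confidence a usable hallucination signal; the high ECE typical of LLMs is why stated confidence is a weak filter.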
Deception and Truthfulness
| Behavior Type | Frequency | Context | Mitigation |
|---|---|---|---|
| Hallucination | 8–45%10 | Varies by task and model | Training improvements, RAG |
| Citation hallucination | reportedly ≈17%14 | Legal, academic domain | CheckIfExist detection systems |
| Role-play deception | High | Prompted scenarios | Safety fine-tuning |
| Sycophancy | Moderate | Opinion questions | Constitutional AI, RLHF adjustment |
| Strategic deception | Low-Moderate | Evaluation scenarios | Ongoing research |
According to some sources, even recent frontier models have hallucination rates exceeding 15% when asked to analyze provided statements, and approximately 1 in 6 AI responses in legal contexts reportedly contain citation hallucinations.14 The wide variance across benchmarks reflects both genuine model differences and definitional variation in what constitutes a hallucination.
Safety-Relevant Positive Capabilities
Interpretability Research Platform
| Research Area | Progress Level | Key Findings | Organizations |
|---|---|---|---|
| Attention visualization | Advanced | Knowledge storage patterns | Anthropic, OpenAI |
| Activation patching | Moderate | Causal intervention methods | Redwood Research |
| Sparse autoencoders | Advancing | Feature decomposition in large models | Anthropic, Google DeepMind (Gemma Scope) |
| Neuron explanation | Moderate | LM-explained neurons via GPT-4 | Bills et al. 2023 |
| Mechanistic understanding | Early-Moderate | Transformer circuits | Anthropic Interpretability |
Constitutional AI and Value Learning
Anthropic's Constitutional AI demonstrates approaches to value alignment through self-critique, principle-following, and preference learning. The specific success rates for these techniques vary considerably across evaluations, tasks, and model versions; the figures below represent approximate reported ranges rather than settled benchmarks, and should be treated with caution.
| Technique | Reported Range | Application | Limitations |
|---|---|---|---|
| Self-critique | ≈70–85% (approximate) | Harmful content reduction | Requires good initial training |
| Principle following | ≈60–80% (approximate) | Consistent value application | Vulnerable to gaming |
| Preference learning | ≈65–75% (approximate) | Human value approximation | Distributional robustness |
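The self-critique technique in the table reduces to a draft-critique-revise loop against an explicit principle. The sketch below shows only that control flow; `generate` is a stub standing in for a real model call, and the principle and strings are invented:

```python
PRINCIPLE = "Avoid giving instructions that could enable harm."

def generate(prompt):
    # Placeholder for an actual LLM call; canned replies for illustration.
    if "Critique" in prompt:
        return "The draft names a specific harmful method; remove it."
    if "Revise" in prompt:
        return "I can't help with that, but here is safety information..."
    return "Step 1: obtain the restricted materials..."  # unsafe first draft

def constitutional_step(user_request):
    draft = generate(user_request)
    critique = generate(f"Critique this reply under the principle "
                        f"'{PRINCIPLE}':\n{draft}")
    revised = generate(f"Revise the reply to address the critique:\n"
                       f"{critique}\n{draft}")
    return draft, critique, revised

draft, critique, revised = constitutional_step("How do I do X dangerous thing?")
print("draft:  ", draft)
print("revised:", revised)
```

In the published method, the revised outputs then become preference data for RLAIF training, so the critique loop runs at training time rather than per request.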
Scalable Oversight Applications
Modern LLMs enable several approaches to AI safety through automated oversight:
- Output evaluation: AI systems critiquing other AI outputs. Agreement rates with human evaluators vary substantially by task and domain; no single well-characterized aggregate figure covers the literature.
- Red-teaming: Automated discovery of failure modes and adversarial inputs.
- Safety monitoring: Real-time analysis of AI system behavior patterns.
- Research acceleration: AI-assisted safety research and experimental design.
- Content moderation: OpenAI has published work on using GPT-4 for content moderation, describing how LLM-based moderation can operate at scale and reduce human labeler exposure to harmful content.
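For the output-evaluation item, agreement between an AI judge and human labelers is the standard measurement, and raw agreement overstates accord when one label dominates; Cohen's kappa corrects for chance agreement. The labels below are fabricated for illustration:

```python
from collections import Counter

def agreement_and_kappa(ai_labels, human_labels):
    n = len(ai_labels)
    agree = sum(a == h for a, h in zip(ai_labels, human_labels)) / n
    ai_freq = Counter(ai_labels)
    hu_freq = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label
    # independently, given their marginal label frequencies.
    labels = set(ai_labels) | set(human_labels)
    chance = sum((ai_freq[l] / n) * (hu_freq[l] / n) for l in labels)
    kappa = (agree - chance) / (1 - chance) if chance < 1 else 1.0
    return agree, kappa

ai    = ["safe", "safe", "unsafe", "safe",   "unsafe", "safe"]
human = ["safe", "safe", "unsafe", "unsafe", "unsafe", "safe"]
agree, kappa = agreement_and_kappa(ai, human)
print(f"raw agreement = {agree:.2f}, Cohen's kappa = {kappa:.2f}")
```

Reported kappa values vary widely by task, which is why no single aggregate agreement figure is quoted above.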
Concerning Capabilities Assessment
Persuasion and Manipulation
Modern LLMs demonstrate persuasion capabilities that raise concerns about democratic discourse and individual autonomy:
| Capability | Current State | Evidence | Risk Level |
|---|---|---|---|
| Audience adaptation | Advanced | Anthropic persuasion research | High |
| Persona consistency | Advanced | Extended roleplay studies | High |
| Emotional manipulation | Moderate | RLHF alignment research | Moderate |
| Debate performance | Advanced | Human preference studies | High |
According to some sources, frontier LLMs can substantially increase human agreement rates through targeted persuasion techniques, raising concerns about consensus manufacturing; however, the specific figure of 82% attributed to Anthropic research has not been independently verified against a primary source and should be treated with caution.15 ARC Evaluations (2025) specifically tested persuasion and resistance capabilities in LLMs through adversarial resource extraction games, finding that frontier models can maintain persuasive pressure across extended multi-turn interactions.16
Multilingual Safety Gaps
A consistent finding across the research literature is that safety alignment applied primarily in English does not reliably transfer to other languages. Helpful to a Fault (2025) measured illicit assistance rates for multi-turn, multilingual LLM agents across 40+ languages, finding that:17
- Models provide meaningfully more potentially harmful assistance in non-English languages
- The gap is largest for low-resource languages with sparse alignment training data
- Multi-turn interactions elicit more harmful assistance than single-turn, with each turn potentially eroding earlier refusals
This suggests that multilingual deployment of LLMs creates safety gaps that single-language evaluation would miss, a concern reportedly noted in the GPT-5 System Card.9
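The underlying measurement is straightforward: group eval records by language and compute the illicit-assistance rate per group. A sketch with fabricated records, assuming each record carries a language tag and an assisted/refused flag:

```python
from collections import defaultdict

def assistance_rate_by_language(records):
    counts = defaultdict(lambda: [0, 0])  # language -> [assisted, total]
    for lang, assisted in records:
        counts[lang][0] += int(assisted)
        counts[lang][1] += 1
    return {lang: a / t for lang, (a, t) in counts.items()}

# Invented records: "en" = English, "zu" = a low-resource language.
records = [("en", False), ("en", False), ("en", True),
           ("zu", True),  ("zu", True),  ("zu", False)]
rates = assistance_rate_by_language(records)
for lang, rate in sorted(rates.items()):
    print(f"{lang}: {rate:.0%} harmful-assistance rate")
```

The multi-turn finding means real evaluations must track this rate per conversation turn as well, since refusals can erode over the dialogue.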
Cybersecurity Capabilities and Refusal Frameworks
LLMs' code generation and vulnerability analysis capabilities create dual-use risks in cybersecurity:
- Models can reportedly assist with vulnerability identification, exploit development, and attack planning at varying levels depending on the specificity of the request
- A Content-Based Framework for Cybersecurity Refusal Decisions (2025) proposes taxonomizing cybersecurity requests by content type (reconnaissance vs. exploitation vs. malware development) rather than binary harmful/safe classification, enabling more precise refusal decisions that maintain legitimate educational and defensive use18
- SecCodeBench-V2 (2025) provides updated benchmarks for evaluating security-relevant code generation, reportedly finding continued improvement in both benign and potentially harmful code generation capabilities19
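The content-based refusal idea can be sketched as a two-stage decision: classify the request by content type, then apply a per-category policy instead of a binary harmful/safe gate. The categories, keywords, and policies below are illustrative inventions, not the taxonomy from the cited paper:

```python
POLICY = {
    "reconnaissance": "allow_with_context",   # e.g. explaining scan output
    "exploitation":   "require_justification",
    "malware_dev":    "refuse",
    "defense":        "allow",
}

KEYWORDS = {
    "reconnaissance": ["port scan", "enumerate", "fingerprint"],
    "exploitation":   ["exploit", "payload", "privilege escalation"],
    "malware_dev":    ["ransomware", "keylogger", "obfuscate malware"],
    "defense":        ["harden", "patch", "detect intrusion"],
}

def refusal_decision(request):
    text = request.lower()
    for category, words in KEYWORDS.items():
        if any(w in text for w in words):
            return category, POLICY[category]
    return "uncategorized", "escalate_to_review"

print(refusal_decision("How do I harden an SSH server?"))
print(refusal_decision("Write a keylogger for Windows"))
```

A deployed classifier would use a trained model rather than keyword matching, but the per-category policy table is what distinguishes this framework from a binary refusal gate.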
Autonomous Capabilities
Current LLMs demonstrate increasing levels of autonomous task execution:
- Web browsing: GPT-4 class models can reportedly navigate websites, extract information, and interact with web services
- Code execution: Models can write, debug, and iteratively improve software across extended sessions
- API integration: Tool use across multiple digital platforms; OpenAI's Shipping Smarter Agents with Every New Model describes the company's agent capability roadmap20
- Goal persistence: Basic ability to maintain objectives across extended multi-turn interactions, relevant to scheming risk evaluation
- Memory across sessions: MemoryArena (2025) benchmarks agent memory in interdependent multi-session tasks, finding that frontier models maintain task state imperfectly across sessions but with increasing reliability in newer model versions21
Fundamental Limitations
What Doesn't Scale Automatically
| Property | Scaling Behavior | Evidence | Implications |
|---|---|---|---|
| Truthfulness | No direct improvement | Larger models can be more confidently incorrect | Requires targeted training |
| Reliability | Inconsistent | High variance across similar prompts | Systematic evaluation needed |
| Novel reasoning | Limited progress | Pattern matching vs. genuine insight | May require architectural changes |
| Value alignment | No guarantee | Capability-alignment divergence | Alignment difficulty |
| Multilingual safety | Uneven transfer | Align Once, Benefit Multilingually | Requires cross-lingual training |
Current Performance Gaps
Despite strong benchmark performance, significant limitations remain:
- Hallucination rates: According to some sources, hallucination rates range from roughly 8–45% depending on task and model, with high variance across domains1
- Inconsistency: High variance in responses to equivalent prompts; Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance (2025)2 documents time-dependent performance variation that is difficult to explain and control for in evaluations
- Context limitations: Models struggle with very long-horizon reasoning despite large context windows (reportedly exceeding 1 million tokens in some deployments); long-context retrieval accuracy degrades for content positioned in the middle of the context window
- Novel problem solving: According to reporting at the time of its release, o3 achieved approximately 87.5% on ARC-AGI; however, this reportedly required high-compute settings, and real-world novel reasoning remains challenging3. On the harder ARC-AGI-2 benchmark, o3 reportedly scored around 3% while average humans scored approximately 60%4
- Benchmark vs. real-world gap: Research using the QUAKE benchmark reportedly found that frontier LLMs average just 28% pass rate on practical tasks, despite high scores on standard benchmarks5
- Long-tail knowledge: Long-Tail Knowledge in Large Language Models (2025)6 documents that rare facts and edge cases are disproportionately hallucinated, with performance degrading sharply below certain frequency thresholds in training data
- False belief reasoning: Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs (2025)7 finds that models rely heavily on surface language statistics rather than genuine theory-of-mind reasoning when solving false belief tasks, raising questions about claimed social cognition capabilities
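The long-tail finding is typically demonstrated by bucketing questions by how often their subject appears in training data and computing accuracy per bucket. The data below is fabricated to show the shape of the analysis, not real measurements:

```python
import math

def accuracy_by_frequency_bucket(items, base=10):
    """items: (training_frequency, correct) pairs; buckets on log10(freq)."""
    buckets = {}
    for freq, correct in items:
        key = int(math.log10(max(freq, 1)))
        hits, total = buckets.get(key, (0, 0))
        buckets[key] = (hits + int(correct), total + 1)
    # Report each bucket by its frequency floor (10^key).
    return {base ** k: hits / total for k, (hits, total) in sorted(buckets.items())}

items = [(5, False), (8, False), (9, True),    # rare facts
         (500, True), (700, False),            # mid-frequency facts
         (20000, True), (50000, True)]         # common facts
for freq_floor, acc in accuracy_by_frequency_bucket(items).items():
    print(f">= {freq_floor:>6} mentions: accuracy {acc:.2f}")
```

The cited paper's "sharp degradation below certain frequency thresholds" corresponds to the lowest buckets in this kind of table falling well below the rest.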
Current State and 2025-2030 Trajectory
Key 2024-2025 Developments
| Development | Status | Impact | Safety Relevance |
|---|---|---|---|
| Reasoning models (o1, o3) | Deployed | PhD-level reasoning achieved | Extended planning capabilities |
| Inference-time scaling | Established | New scaling paradigm | Potentially harder to predict capabilities |
| Agentic AI frameworks | Growing | Autonomous task completion | Autonomous systems concerns |
| 1M+ token context | Standard | Long-document reasoning | Extended goal persistence |
| Multi-model routing | Emerging | Task-optimized deployment | Complexity in governance |
| Open-weight frontier models | Established | Llama 4, Gemma 3, gpt-oss | Reduced API access as governance lever |
| Mechanistic interpretability tooling | Advancing | Gemma Scope 2, SAE ecosystem | Improved interpretability tractability |
| Multilingual alignment research | Early | Align Once, Benefit Multilingually | Safety gaps in non-English languages |
One of the most significant trends is the emergence of agentic AI—LLM-powered systems that can make decisions, interact with tools, and take actions without constant human input. This represents a qualitative shift from chat interfaces to autonomous systems capable of extended task execution. OpenAI's Spring Update and subsequent Shipping Smarter Agents posts describe investment in memory, tool use, and multi-model orchestration as the primary near-term product directions.
Near-term Outlook (2025-2026)
| Development | Likelihood | Timeline | Impact |
|---|---|---|---|
| GPT-5.x and successor models | High | 6-12 months | Further capability improvement |
| Improved reasoning (o3 successors) | High | 3-6 months | Enhanced scientific research |
| Multimodal integration | High | 6-12 months | Video, audio, sensor fusion |
| Robust agent frameworks | High | 12-18 months | Autonomous systems |
| Interpretability-informed alignment | Moderate | 12-24 months | Features as Rewards approach |
Medium-term Outlook (2026-2030)
Expected developments include potential architectural innovations beyond standard transformer attention, deeper integration with robotics platforms, and continued capability improvements. Key uncertainties include whether current scaling approaches will continue yielding improvements and the timeline for artificial general intelligence.
Data constraints: According to Epoch AI projections, high-quality training data could become a significant bottleneck this decade, particularly if models continue to be overtrained. For AI progress to continue into the 2030s, either new sources of data (synthetic data, multimodal data) or less data-hungry techniques must be developed. Can Generative AI Survive Data Contamination? (2025) analyzes the theoretical guarantees available under contaminated recursive training—where models are trained on AI-generated data—finding that certain forms of contamination lead to systematic performance degradation rather than catastrophic collapse, but that the long-run effects remain analytically uncertain.
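A stripped-down analogue of recursive-training degradation: a "model" that refits a Gaussian to n of its own samples each generation loses variance in expectation, because the maximum-likelihood estimate of the standard deviation is biased low by roughly a factor of sqrt((n-1)/n). This is a deliberate caricature of the contamination dynamics the cited paper analyzes formally:

```python
def expected_sigma_after(generations, n_samples, sigma0=1.0):
    # Each generation retrains on n_samples of the previous generation's
    # output; the fitted sigma shrinks by ~sqrt((n-1)/n) in expectation.
    shrink = ((n_samples - 1) / n_samples) ** 0.5
    return sigma0 * shrink ** generations

for gens in (10, 100, 1000):
    s = expected_sigma_after(gens, n_samples=100)
    print(f"after {gens:4d} generations on 100 samples: sigma ~ {s:.3f}")
```

The decay is gradual rather than catastrophic, matching the paper's finding that certain contamination regimes produce systematic degradation rather than sudden collapse; larger sample sizes slow but do not stop the drift in this toy model.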
Key Uncertainties and Research Cruxes
Fundamental Understanding Questions
- Intelligence vs. mimicry: Extent of genuine understanding vs. sophisticated pattern matching, with False Belief Reasoning research (2025) suggesting current models rely heavily on statistical correlates of reasoning rather than causal models
- Emergence predictability: Whether capability emergence can be reliably forecasted; Random Scaling of Emergent Capabilities (2025) suggests randomness in training may cause more gradual aggregate emergence than apparent
- Architectural limits: Whether transformers can scale to AGI or require fundamental innovations
- Alignment scalability: Whether current safety techniques work for superhuman systems
Safety Research Priorities
| Priority Area | Importance | Tractability | Neglectedness |
|---|---|---|---|
| Interpretability | High | Moderate | Moderate |
| Alignment techniques | Highest | Low | Low |
| Capability evaluation | High | High | Moderate |
| Governance frameworks | High | Moderate | High |
| Multilingual alignment | High | Moderate | High |
| Factuality and hallucination | Moderate | High | Low |
Timeline Uncertainties
Current expert surveys show wide disagreement on AGI timelines, with median estimates ranging from 2027 to 2045. This uncertainty stems from:
- Unpredictable capability emergence patterns
- Unknown scaling law continuation
- Potential architectural innovations
- Economic and resource constraints
- Data availability bottlenecks
The o3 results on ARC-AGI (87.5%, approaching human baseline of ~85%) have intensified debate about whether we are approaching AGI sooner than expected. Critics note that high-compute inference settings make this performance expensive and slow, and that benchmark performance may not translate to general real-world capability. The ARC-AGI-2 results (o3 at 3%, average humans at 60%) illustrate that performance on designed-to-be-robust benchmarks still reveals substantial gaps.
Sources & Resources
Academic Research
| Paper | Authors | Year | Key Contribution |
|---|---|---|---|
| Scaling Laws↗📄 paper★★★☆☆arXivKaplan et al. (2020)Jared Kaplan, Sam McCandlish, Tom Henighan et al. (2020)capabilitiestrainingcomputellm+1Source ↗ | Kaplan et al. | 2020 | Mathematical scaling relationships |
| Chinchilla↗📄 paper★★★☆☆arXivHoffmann et al. (2022)Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch et al. (2022)capabilitiestrainingevaluationcompute+1Source ↗ | Hoffmann et al. | 2022 | Optimal parameter-data ratios |
| Constitutional AI↗📄 paper★★★☆☆arXivConstitutional AI: Harmlessness from AI FeedbackBai, Yuntao, Kadavath, Saurav, Kundu, Sandipan et al. (2022)foundation-modelstransformersscalingagentic+1Source ↗ | Bai et al. | 2022 | Value-based training methods |
| Emergent Abilities↗📄 paper★★★☆☆arXivEmergent AbilitiesJason Wei, Yi Tay, Rishi Bommasani et al. (2022)capabilitiesllmfoundation-modelstransformers+1Source ↗ | Wei et al. | 2022 | Capability emergence documentation |
| Fine-Tuning GPT-2 from Human Preferences | Ziegler et al. | 2019 | RLHF foundations for language models |
| Language Models are Few-Shot Learners | Brown et al. | 2020 | GPT-3 and few-shot learning |
| Emergent Abilities Survey | Various | 2025 | Comprehensive emergence review |
| Scaling Laws for Precision | Kumar et al. | 2024 | Low-precision scaling extensions |
| HalluLens Benchmark | Various | 2025 | Hallucination measurement framework |
| FACTS Grounding | Google DeepMind | 2025 | Long-form factuality benchmark |
| Align Once, Benefit Multilingually | Various | 2025 | Multilingual safety alignment |
| Random Scaling of Emergent Capabilities | Various | 2025 | Randomness and emergence thresholds |
| Causality is Key for Interpretability | Various | 2025 | Causal evaluation of interpretability claims |
Model and System Cards
| Document | Organization | Year | Relevance |
|---|---|---|---|
| GPT-5 System Card | OpenAI | 2025 | Safety evaluations, capability assessments |
| GPT-5 System Card Addendum: Sensitive Conversations | OpenAI | 2025 | Mental health, self-harm, contentious topics |
| gpt-oss-120b & gpt-oss-20b Model Card | OpenAI | 2025 | Open-weight model documentation |
| GPT-4V(ision) System Card | OpenAI | 2023 | Multimodal safety evaluation |
| GPT-2: 1.5B Release | OpenAI | 2019 | Staged release and safety rationale |
Organizations and Research Groups
| Type | Organization | Focus Area | Key Resources |
|---|---|---|---|
| Industry | OpenAI | GPT series, safety research | Technical papers, safety docs |
| Industry | Anthropic | Constitutional AI, interpretability | Claude research, safety papers |
| Industry | Google DeepMind | Gemini, Gemma, interpretability | Gemma Scope 2, FACTS benchmarks |
| Academic | CHAI | AI alignment research | Technical alignment papers |
| Safety | Redwood Research | Interpretability, oversight | Mechanistic interpretability |
| Research | METR | Capability evaluation | Autonomous replication evaluations |
Policy and Governance Resources
| Resource | Organization | Focus | Link |
|---|---|---|---|
| AI Safety Guidelines | NIST | Federal standards | Risk management framework |
| Responsible AI Practices | Partnership on AI | Industry coordination | Best practices documentation |
| International Cooperation | UK AI Safety Institute | Global safety standards | International coordination |
| Responsible Scaling Policy | Anthropic | Capability thresholds | RSP documentation |
Footnotes
- OpenAI, "OpenAI o3 and o4-mini System Card / ARC-AGI results" (https://arcprize.org/blog/oai-o3-pub-breakthrough)
- Stanford HAI, "AI Index Report 2025" (https://hai.stanford.edu/ai-index/2025-ai-index-report)
- Citation rc-c587 (data unavailable)
- Vectara, "Hallucination Leaderboard" (https://github.com/vectara/hallucination-leaderboard)
- HalluLens, ACL 2025 (https://aclanthology.org/2025.acl-long.1176/)
- Anthropic, "Responsible Scaling Policy" (https://www.anthropic.com/news/anthropics-responsible-scaling-policy)
- According to some sources tracking Chatbot Arena open- vs. closed-model performance gaps, 2024–2025.
- Google DeepMind, "Gemma 3 Technical Report" (2025). https://ai.google.dev/gemma
- OpenAI, "GPT-5 System Card," May 2025. https://openai.com/index/gpt-5-system-card/
- Range drawn from the Vectara Hallucination Leaderboard (≈8% for best summarization models, https://github.com/vectara/hallucination-leaderboard) and HalluLens, ACL 2025 (up to 45% for GPT-4o on factual queries, https://aclanthology.org/2025.acl-long.1176/). Figures are not directly comparable across methodologies.
- Google DeepMind, "FACTS Grounding: A new benchmark for evaluating the factuality of large language models" (https://arxiv.org/abs/2501.03200)
- CheckIfExist authors, "CheckIfExist: Citation Hallucination Detection in RAG Systems" (https://arxiv.org/abs/2502.09802)
- OpenAI, "Why language models hallucinate" (https://openai.com/index/why-language-models-hallucinate/)
- The ≈17% citation hallucination rate and "1 in 6 responses" figure are attributed to AIMultiple research (https://research.aimultiple.com/ai-hallucination/) but could not be independently verified against a primary study; these figures should be treated as approximate and reported with caution.
- The specific claim that GPT-4 increases human agreement rates by 82% has been attributed to Anthropic research in various secondary sources, but no primary paper or report confirming this exact figure could be identified. The claim is therefore hedged here.
- Helpful to a Fault (2025). https://arxiv.org/abs/2502.09933
- A Content-Based Framework for Cybersecurity Refusal Decisions (2025). https://arxiv.org/abs/2502.09591
- Citation rc-95be (data unavailable)