Large Language Models

Capability

Comprehensive analysis of LLM capabilities showing rapid progress from GPT-2 (1.5B parameters, 2019) to GPT-5 and Gemini 2.5 (2025), with training costs growing 2.4x annually and projected to exceed $1B by 2027. Documents the emergence of the inference-time scaling paradigm, mechanistic interpretability advances including Gemma Scope 2, multilingual alignment research, and factuality benchmarking via the FACTS suite, and identifies key safety concerns including 8–45% hallucination rates, persuasion capabilities, and growing autonomous agent capabilities.

First Major: GPT-2 (2019)
Key Labs: OpenAI, Anthropic, Google
Related Capabilities: Reasoning and Planning, Agentic AI
Related Organizations: OpenAI

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Capability Level | Near-human to superhuman on structured tasks | o3 achieves 87.5% on ARC-AGI (human baseline ≈85%)[1]; 87.7% on GPQA Diamond[1] |
| Progress Rate | 2–3× capability improvement per year | Stanford AI Index 2025[2]: benchmark scores rose 18–67 percentage points in one year |
| Training Cost Trend | 2.4× annual growth[3] | Epoch AI: frontier model training costs projected to exceed $1B by 2027[3] |
| Inference Cost Trend | 280× reduction since 2022[2] | GPT-3.5-equivalent dropped from $10 to $1.07 per million tokens[2] |
| Hallucination Rates | 8–45% depending on task | Vectara leaderboard: best models reportedly at ≈8%[4]; HalluLens: up to 45% on factual queries[5] |
| Safety Maturity | Moderate | Constitutional AI, RLHF established; responsible scaling policies implemented by major labs[6] |
| Open-Closed Gap | Narrowing | Gap reportedly shrunk from 8.04% to 1.70% on Chatbot Arena (Jan 2024 → Feb 2025)[7] |

| Source | Link |
|---|---|
| Official Website | learn.microsoft.com |
| Wikipedia | en.wikipedia.org |
| arXiv | arxiv.org |

Overview

Large Language Models (LLMs) are transformer-based neural networks trained on vast text corpora using next-token prediction. Despite their deceptively simple training objective, LLMs exhibit sophisticated emergent capabilities including reasoning, coding, scientific analysis, and complex task execution. These models have transformed abstract AI safety discussions into concrete, immediate concerns while providing the clearest path toward artificial general intelligence.

The core insight underlying LLMs is that training a model to predict the next word in a sequence—a task achievable without labeled data—produces internal representations useful for a wide range of downstream tasks. This approach was explored in early unsupervised pretraining work from OpenAI in 2018.1 OpenAI's GPT-2 (2019) then demonstrated coherent multi-paragraph generation at scale, showing that larger models trained on more data produced qualitatively stronger outputs.2 An earlier indicator that such models form interpretable internal structure was the discovery of an "unsupervised sentiment neuron" in 2017, which emerged without any sentiment-specific supervision.3

Current frontier models—including GPT-4o, Claude Opus 4.5, Gemini 2.5 Pro, and Llama 4—demonstrate near-human or superhuman performance across diverse cognitive domains. Training runs for leading frontier systems reportedly consume hundreds of millions of dollars,4 and model parameter counts have reached into the hundreds of billions to trillions.5 These substantial computational investments have shifted AI safety from theoretical to practical urgency. The late 2024–2025 period marked a paradigm shift toward inference-time compute scaling, with reasoning models such as OpenAI's o1 and o3 achieving higher performance on reasoning benchmarks by allocating more compute at inference rather than only at training time.6

A parallel development is the rapid growth of the open-weight ecosystem. Meta's Llama family has grown substantially since its initial release, with Meta reporting over a billion downloads and more than ten times the developer activity compared to 2023.7 Google's Gemma models—including Gemma 3 and the mobile-first Gemma 3n variants—have provided the safety research community with accessible architectures for mechanistic interpretability work.8 This open/closed model convergence has implications for both capability diffusion and the tractability of safety interventions.

Capability Architecture

The diagram below maps the flow from training through inference to observed capabilities. Capabilities are grouped by which inference regime produces them; the distinction between "standard" and "search-augmented" inference is a key axis along which safety-relevant behaviors (extended planning, autonomous task execution) emerge.

[Diagram: capability architecture, mapping the flow from training through inference regimes to emergent capabilities.]

Note: Capabilities in the "Emergent Capabilities" subgraph are descriptive, not evaluative. Safety implications of each capability class are discussed in the Concerning Capabilities Assessment and Safety-Relevant Positive Capabilities sections below.

Risk Assessment

| Risk Category | Severity | Likelihood | Timeline | Trend |
|---|---|---|---|---|
| Deceptive Capabilities | High | Moderate | 1–3 years | Increasing |
| Persuasion & Manipulation | High | High | Current | Accelerating |
| Autonomous Cyber Operations | Moderate–High | Moderate | 2–4 years | Increasing |
| Scientific Research Acceleration | Mixed | High | Current | Accelerating |
| Economic Disruption | High | High | 2–5 years | Accelerating |

Capability Progression Timeline

| Model | Release | Parameters | Key Breakthrough | Performance Milestone |
|---|---|---|---|---|
| GPT-2 | Feb 2019[1] | 1.5B | Coherent text generation | Initially withheld for safety concerns; full 1.5B release Nov 2019[1] |
| GPT-3 | Jun 2020[2] | 175B | Few-shot learning emergence | Creative writing, basic coding[2] |
| GPT-4 | Mar 2023[3] | Undisclosed | Multimodal reasoning | Reportedly ≈90th percentile SAT, bar exam passing[3] |
| GPT-4o | May 2024[4] | Undisclosed | Multimodal speed/cost | Real-time audio-visual, described as 2x faster than GPT-4 Turbo[4] |
| Claude 3.5 Sonnet | Jun 2024[5] | Undisclosed | Advanced tool use | 86.5% MMLU, leading SWE-bench at release[5] |
| o1 | Sep 2024[6] | Undisclosed | Chain-of-thought reasoning | 77.3% GPQA Diamond, 74% AIME 2024[6] |
| o3 | Dec 2024[7] | Undisclosed | Inference-time search | 87.7% GPQA Diamond, 91.6% AIME 2024[7] |
| Gemini 2.5 Pro | Mar 2025[8] | Undisclosed | Long-context reasoning | 1M-token context, leading coding benchmarks at release[8] |
| Llama 4 | Apr 2025[7] | Undisclosed | Natively multimodal open-weight | Mixture-of-experts architecture[7] |
| GPT-5 | May 2025[9] | Undisclosed | Unified reasoning + tool use | Highest reported scores at release on GPQA Diamond and SWE-bench[9] |
| Claude Opus 4.5 | Reportedly Nov 2025 | Undisclosed | Extended reasoning | Reportedly 80.9% SWE-bench Verified |
| GPT-5.2 | Reportedly late 2025 | Undisclosed | Deep thinking modes | Reportedly 93.2% GPQA Diamond, 90.5% ARC |

Primary sources: OpenAI model announcements, Anthropic model cards, Google DeepMind blog.

Benchmark Performance Comparison (2024–2025)

| Benchmark | Measures | GPT-4o (2024) | o1 (2024) | o3 (2024) | Human Expert |
|---|---|---|---|---|---|
| GPQA Diamond | PhD-level science | ≈50%[6] | 77.3%[6] | 87.7%[7] | ≈89.8%[6] |
| AIME 2024 | Competition math | 13.4%[6] | 74%[6] | 91.6%[7] | Top 500 US |
| MMLU | General knowledge | 84.2%[5] | 90.8%[6] | ≈92%[7] | 89.8%[6] |
| SWE-bench Verified | Real GitHub issues | 33.2%[6] | 48.9%[6] | 71.7%[7] | N/A |
| ARC-AGI | Novel reasoning | ≈5%[7] | 13.3%[7] | 87.5%[7] | ≈85%[7] |
| Codeforces | Competitive coding | ≈11%[6] | 89% (94th percentile)[6] | 99.8th percentile[7] | N/A |

Sources: OpenAI o1 system card[6], ARC Prize o3 analysis[7], Anthropic Claude 3.5 model card[5].

The o3 results represent a qualitative shift: o3 achieved nearly human-level performance on ARC-AGI (87.5%) versus a ~85% human baseline7, a benchmark specifically designed to test general reasoning rather than pattern matching. On FrontierMath, o3 reportedly solved 25.2% of problems compared to o1's 2%—a roughly 12x improvement that suggests reasoning capabilities may be scaling faster than expected7. However, on the harder ARC-AGI-2 benchmark, o3 scores only ~3% compared to ~60% for average humans7, revealing significant limitations in truly novel reasoning tasks.

Scaling Laws and Predictable Progress

Core Scaling Relationships

Research by Kaplan et al. (2020)1, later refined by Hoffmann et al. (2022)2, demonstrates robust mathematical relationships governing LLM performance:

| Factor | Scaling Relationship | Implication |
|---|---|---|
| Model size | Loss ∝ N^−0.076 | 10× parameters → ≈16% lower loss |
| Training data | Loss ∝ D^−0.095 | 10× data → ≈20% lower loss |
| Compute | Loss ∝ C^−0.050 | 10× compute → ≈11% lower loss |
| Optimal ratio | N_opt ∝ C^0.5, D_opt ∝ C^0.5 | Chinchilla: scale parameters and training tokens roughly in proportion (≈20 tokens per parameter) |

Sources: Chinchilla paper[2]; Scaling Laws for Neural Language Models[1]
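
As a concrete illustration of how these relationships are used in practice, the sketch below evaluates a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β and a compute-optimal parameter/token split. The constants are rounded fitted values reported by Hoffmann et al. (2022) and the helper functions are illustrative assumptions, not a reproduction of either paper's code.

```python
# Illustrative only: approximate Chinchilla-style loss and a compute-optimal allocation.
# Constants are rounded values from Hoffmann et al. (2022); fits vary by estimation approach.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def parametric_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget (~6*N*D) so parameters and tokens scale roughly in proportion."""
    # Equal scaling exponents (~0.5 each) with roughly 20 tokens per parameter at the optimum.
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for flops in (1e21, 1e23, 1e25):
        n, d = chinchilla_optimal(flops)
        print(f"C={flops:.0e}: N≈{n:.2e} params, D≈{d:.2e} tokens, "
              f"predicted loss≈{parametric_loss(n, d):.3f}")
```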

According to Epoch AI research,3 approximately two-thirds of LLM performance improvements over the last decade are attributable to increases in model scale, with training techniques contributing roughly 0.4 orders of magnitude per year in compute efficiency.3 The cost of training frontier models has reportedly grown by 2.4x per year since 2016, with costs for the largest models projected to exceed $1B by 2027, according to Epoch AI cost analyses.4

A related phenomenon is the emergence of new scaling regimes beyond training compute. Research published in 2025 finds that emergent capability thresholds are sensitive to random factors including data ordering and initialization seeds,5 suggesting that the apparent sharpness of emergence in aggregate curves may partly reflect averaging over many random runs with different thresholds rather than a clean phase transition.5

The Shift to Inference-Time Scaling (2024–2025)

The o1 and o3 model families introduced a new paradigm: inference-time compute scaling. Rather than only scaling training compute, these models allocate additional computation at inference time through extended reasoning chains and search procedures.6

| Scaling Type | Mechanism | Trade-off | Example |
|---|---|---|---|
| Pre-training scaling | More parameters, data, training compute | High upfront cost, fast inference | GPT-4, Claude 3.5 |
| Inference-time scaling | Longer reasoning chains, search | Lower training cost, expensive inference | o1, o3 |
| Combined scaling | Both approaches | Maximum capability, maximum cost | GPT-5, Claude Opus 4.5 |

This shift has implications for AI safety: inference-time scaling allows models to "think longer" on hard problems, potentially achieving strong performance on specific tasks while maintaining manageable training costs. According to some sources, o1 is approximately 6x more expensive and 30x slower than GPT-4o per query.7 The Stanford AI Index 2025 cites the RE-Bench evaluation finding that in short time-horizon settings (2-hour budget), top AI systems score 4x higher than human experts, but as the time budget increases to 32 hours, human performance surpasses AI by roughly 2 to 1.8
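
A simple external analogue of "thinking longer" is self-consistency: sample several reasoning paths and take a majority vote, spending more inference compute for higher accuracy. The sketch below shows that pattern; the `query_model` call is a placeholder, and o1/o3-style models implement inference-time scaling internally rather than through a loop like this.

```python
# Minimal sketch of inference-time scaling via self-consistency (majority vote).
# `query_model` is a placeholder for any chat-completion call.
from collections import Counter

def query_model(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: call an LLM API and return its final short answer."""
    raise NotImplementedError

def self_consistency(prompt: str, n_samples: int = 16) -> str:
    """Spend more inference compute by sampling several reasoning paths and voting."""
    answers = [query_model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```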

Emergent Capability Thresholds

| Capability | Emergence Scale | Evidence | Safety Relevance |
|---|---|---|---|
| Few-shot learning | ≈100B parameters | GPT-3 breakthrough | Tool use foundation |
| Chain-of-thought | ≈10B parameters | PaLM, GPT-3 variants | Complex reasoning |
| Code generation | ≈1B parameters | Codex, GitHub Copilot | Cyber capabilities |
| Instruction following | ≈10B parameters | InstructGPT | Human-AI interaction paradigm |
| PhD-level reasoning | o1+ scale | GPQA Diamond performance | Expert-level autonomy |
| Strategic planning | o3 scale | ARC-AGI performance | Deception potential |

Research documented in a 2025 emergent abilities survey7 finds that emergent abilities depend on multiple interacting factors: scaling up parameters or depth lowers the threshold for emergence but is neither necessary nor sufficient alone—data quality, diversity, training objectives, and architecture modifications also matter significantly.7 Emergence aligns more closely with pre-training loss landmarks than with sheer parameter count; smaller models can match larger ones if training loss is sufficiently reduced.7

According to the Stanford AI Index 2025,8 benchmark performance improved substantially in a single year: scores rose by 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively.8 The gap between leading US and Chinese models also narrowed—from 17.5 to 0.3 percentage points on MMLU—over the same period.8

Safety implication: As AI systems gain autonomous reasoning capabilities, they also develop behaviors relevant to safety evaluation, including goal persistence, strategic planning, and the capacity for deceptive alignment. OpenAI's o3-mini reportedly became the first AI model to receive a "Medium risk" classification for Model Autonomy under internal capability evaluation frameworks.9

RLHF and Alignment Training Foundations

A key technique underlying modern aligned LLMs is Reinforcement Learning from Human Feedback (RLHF), which trains a reward model on human preference comparisons and then uses that reward signal to fine-tune the language model's outputs via reinforcement learning. The foundational application of this approach to language models was demonstrated in Fine-Tuning GPT-2 from Human Preferences (Ziegler et al., 2019), which showed that human feedback could steer model behavior on stylistic tasks. This was later scaled to instruction-following via InstructGPT and then to Claude's Constitutional AI framework.

| RLHF Component | Function | Safety Relevance |
|---|---|---|
| Reward model | Converts human preferences into a differentiable signal | Central to RLHF alignment |
| PPO fine-tuning | Updates the language model to maximize reward | Can introduce reward hacking |
| Constitutional AI | Replaces human labelers with model self-critique against principles | Scales alignment oversight; see Constitutional AI |
| Multilingual consistency | Enforces safety behaviors across languages | Align Once, Benefit Multilingually (2025) |
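
The reward model at the center of this pipeline is typically trained on pairwise comparisons with a Bradley-Terry-style loss. The sketch below shows one such training step in PyTorch; the `RewardModel` head, pooled inputs, and batch shapes are illustrative assumptions rather than any lab's published implementation.

```python
# Illustrative reward-model training step for RLHF (Bradley-Terry pairwise loss).
# Architecture and batching are simplified assumptions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # In practice this head sits on top of a pretrained transformer's final hidden state.
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, hidden_dim) pooled representation of a full response.
        return self.value_head(hidden_states).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected): pushes preferred responses above rejected ones."""
    return -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = RewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-5)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)  # stand-ins for encoder outputs
loss = preference_loss(rm, chosen, rejected)
loss.backward()
optimizer.step()
```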

A significant limitation is that RLHF-trained models can exhibit sycophancy—systematically agreeing with user beliefs rather than providing accurate responses—because human raters often prefer confident, agreeable answers. Recent work on multi-objective alignment for psychotherapy contexts and on policy-constrained alignment frameworks such as PolicyPad (2025) explores how to balance multiple competing alignment objectives.

Multilingual alignment gap: A consistent finding across alignment research is that safety training applied primarily in English does not reliably transfer to other languages. Align Once, Benefit Multilingually (2025) proposes methods to enforce multilingual consistency in safety alignment by training on cross-lingual consistency losses. The related Helpful to a Fault (2025) measures illicit assistance rates across 40+ languages in multi-turn interactions, finding that multilingual agents provide more potentially harmful assistance than English-only evaluations would suggest, particularly in low-resource languages where safety fine-tuning data is sparse.
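
One way such cross-lingual consistency can be operationalized is to penalize divergence between a model's output distributions on a prompt and its translation. The sketch below is a generic illustration of that idea, not the specific objective used in Align Once, Benefit Multilingually; the assumption that positions in the two prompts can be aligned is itself a simplification.

```python
# Generic cross-lingual consistency penalty: keep next-token distributions close
# for an English safety prompt and its translation. Illustrative only.
import torch
import torch.nn.functional as F

def consistency_loss(logits_en: torch.Tensor, logits_xx: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between the model's distributions on parallel prompts.

    logits_*: (seq_len, vocab) logits for aligned positions of the English prompt
    and its translation (position alignment is assumed, which is nontrivial in practice).
    """
    p = F.log_softmax(logits_en, dim=-1)
    q = F.log_softmax(logits_xx, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)
```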

Major 2025 Model Releases

GPT-5 and the GPT-5 Family

GPT-5, released in May 2025, represents OpenAI's integration of reasoning and general capability into a single unified model, replacing the separate GPT-4o and o1 product lines. According to the GPT-5 System Card, GPT-5 achieves the highest scores to date on GPQA Diamond, SWE-bench Verified, and MMLU among OpenAI models at the time of release.

Key developments in the GPT-5 family:

  • GPT-5.1: Released for developers in mid-2025, optimized for conversational applications and described as "smarter, more conversational" with improved instruction-following. Cursor and Tolan reported substantial gains in their agentic pipelines.
  • gpt-oss-120b and gpt-oss-20b: Open-weight models released by OpenAI, with model cards published for both sizes. These represent a shift toward open-weight strategy alongside closed frontier models.
  • GPT Realtime API: Extended with real-time audio dialog capabilities, building on the GPT-4o voice capabilities introduced in Hello GPT-4o (2024).

The GPT-5 System Card documents safety evaluations including assessments of CBRN uplift risk, cyberoffense capabilities, and persuasion. An Addendum on Sensitive Conversations addresses handling of mental health, self-harm, and politically contentious topics, noting both improvements in refusal precision and remaining cases where the model provides responses that require strengthening.

Gemini 2.5 Family

Google DeepMind's Gemini 2.5 family, released across March–June 2025, introduced several models:

| Model | Key Feature | Context Window | Primary Use Case |
|---|---|---|---|
| Gemini 2.5 Pro | Highest capability, coding-focused | 1M tokens | Complex reasoning, coding |
| Gemini 2.5 Flash | Speed-optimized frontier | 1M tokens | Scaled production use |
| Gemini 2.5 Flash-Lite | Cost/latency optimized | 1M tokens | High-volume inference |

Gemini 2.5: Our Most Intelligent AI Model describes the Pro variant as achieving leading performance on coding benchmarks including LiveCodeBench and outperforming GPT-4o on MMLU at release. Gemini 2.5 Flash-Lite was made production-ready in June 2025 with a focus on throughput-sensitive applications.

The 2.5 family also introduced native thinking models—models that produce explicit chain-of-thought reasoning before answering—across both Pro and Flash tiers. Advanced audio dialog and generation with Gemini 2.5 extended audio-native generation capabilities.

Gemma 3 and Open-Weight Models

Google's Gemma 3 family, released in 2025, provides open-weight models ranging from 1B to 27B parameters optimized for single-accelerator deployment. The Gemma 3 270M variant targets edge and mobile deployment. Gemma 3n introduced a mobile-first architecture with selective parameter activation for on-device inference.

MedGemma, released in mid-2025, provides open health-specific models demonstrating LLM application in clinical reasoning. A Gemma-based model contributed to discovery of a potential cancer therapy pathway, illustrating scientific research acceleration potential.

T5Gemma introduced encoder-decoder variants of the Gemma architecture, enabling use cases where separate encoding and decoding is beneficial (e.g., retrieval-augmented generation, classification tasks).

Llama 4 and the Open-Weight Ecosystem

Meta's Llama 4 Herd, announced at LlamaCon in April 2025, represents a shift to natively multimodal architecture using a mixture-of-experts design. Llama 4 Scout and Llama 4 Maverick support image, video, and text inputs from the base model level. Meta reported over 10x growth in Llama usage since 2023, with the model family becoming a reference implementation for open-weight AI development.

Key safety implications of the open-weight ecosystem:

  • Fine-tuning safety guardrails out of open-weight models remains tractable for technically sophisticated users
  • Mechanistic interpretability research benefits from open weights (e.g., Gemma Scope 2, described below)
  • Governance frameworks targeting API access do not apply to locally deployed open-weight models

Mechanistic Interpretability Advances

Mechanistic interpretability research—which seeks to understand the internal computations of neural networks in human-interpretable terms—has accelerated substantially with the availability of open-weight models and new tooling.

Gemma Scope 2

Gemma Scope 2 is a suite of sparse autoencoders (SAEs) trained on Gemma 3 models, released by Google DeepMind to support interpretability research on complex language model behaviors. Building on the original Gemma Scope release for Gemma 2, Gemma Scope 2 provides SAEs at multiple layers and widths, enabling decomposition of model activations into human-interpretable features.

Gemma Scope 2 supports research into:

  • Feature geometry and polysemanticity in larger models
  • Cross-layer feature interactions and information flow
  • Identification of features tied to safety-relevant behaviors (deception, refusal, sycophancy)
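
Conceptually, each SAE in such a suite decomposes a residual-stream activation into a sparse combination of learned feature directions. The sketch below shows that encode/decode step; the dimensions, L1 penalty, and untied-decoder choice are illustrative assumptions rather than the Gemma Scope training recipe.

```python
# Minimal sparse autoencoder over residual-stream activations (illustrative, not Gemma Scope's recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 2048, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model, bias=False)

    def forward(self, acts: torch.Tensor):
        # acts: (batch, d_model) residual-stream activations from a chosen layer.
        features = F.relu(self.encoder(acts))   # sparse, overcomplete feature activations
        recon = self.decoder(features)          # reconstruction of the original activation
        loss = F.mse_loss(recon, acts) + 1e-3 * features.abs().sum(dim=-1).mean()  # L1 sparsity
        return features, recon, loss
```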

Language Models Explaining Neurons

Language Models Can Explain Neurons in Language Models (Bills et al., 2023) demonstrated that GPT-4 can generate natural language explanations of GPT-2 neurons with higher validity than human-written explanations. This opened a scalable pathway for automated interpretability: using more capable models to explain less capable ones. Subsequent work has extended this to sparse autoencoder features and cross-model explanation transfer.
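
The core loop is schematic: a strong model proposes a natural-language explanation from a neuron's top-activating examples, a simulator predicts activations from that explanation, and the correlation between simulated and real activations scores the explanation. The sketch below illustrates that structure with placeholder model calls; it is not the Bills et al. codebase.

```python
# Schematic automated-interpretability loop (explain -> simulate -> score).
# `ask_explainer` and `ask_simulator` are placeholders for calls to a capable LLM.
import statistics

def ask_explainer(examples: list[tuple[str, float]]) -> str:
    """Placeholder: prompt a strong model with top-activating texts and their activations."""
    raise NotImplementedError

def ask_simulator(explanation: str, text: str) -> float:
    """Placeholder: ask a model to predict this neuron's activation on `text` given the explanation."""
    raise NotImplementedError

def score_explanation(explanation: str, held_out: list[tuple[str, float]]) -> float:
    """Correlation between simulated and true activations; higher means a more faithful explanation."""
    sims = [ask_simulator(explanation, text) for text, _ in held_out]
    true = [act for _, act in held_out]
    return statistics.correlation(sims, true)
```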

Activation Steering and Causal Intervention

Activation steering—injecting vectors into model residual streams to steer behavior—has become a primary tool for behavioral intervention research, and recent work has refined understanding of when and why steering succeeds.
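
Mechanically, steering adds a scaled direction vector to the residual stream at a chosen layer during the forward pass. The sketch below does this with a PyTorch forward hook on a Hugging Face causal LM; the model name, layer index, scale, and random steering direction are illustrative stand-ins (real work derives the direction from contrastive prompts or SAE features).

```python
# Illustrative activation steering via a forward hook (model, layer, and vector are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; other decoder-only models expose their blocks differently
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx, scale = 6, 4.0
steering_vector = torch.randn(model.config.hidden_size)  # in practice: a learned or contrastive direction

def add_steering(module, inputs, output):
    # GPT-2 block outputs are tuples; index 0 holds the residual-stream hidden states (batch, seq, hidden).
    hidden = output[0] + scale * steering_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("The safest course of action is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```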

Research Platform Value of Open-Weight Models

| Research Application | Open-Weight Benefit | Example |
|---|---|---|
| Mechanistic interpretability | Full activation access | Gemma Scope 2 on Gemma 3 |
| SAE training | Weight access for feature analysis | Gemma Scope, TranscoderLens |
| Activation steering | Residual stream intervention | Multiple labs using Llama |
| Fine-tuning safety | Rapid iteration | Constitutional AI variants |
| Neuron explanation | Cross-model explanation transfer | Bills et al. 2023 |

Hallucination and Factuality

LLM hallucination—the generation of plausible-sounding but factually incorrect or unsupported content—remains a central reliability and safety challenge. Hallucination rates vary substantially by task, model, and measurement methodology, with published estimates ranging from roughly 8% to 45% depending on benchmark design and model version.10

Factuality Benchmarking

| Benchmark | Scope | Key Finding | Link |
|---|---|---|---|
| FACTS Grounding | Long-form factuality against source documents | Measures supported vs. unsupported claims | FACTS Grounding |
| FACTS Benchmark Suite | Systematic factuality across task types | Decomposes factuality failures by error type | FACTS Suite |
| Vectara Hallucination Leaderboard | Summarization hallucination | Best models: ≈8% hallucination rate | Vectara |
| HalluLens | Factual query hallucination | Up to 45% on factual queries for GPT-4o | HalluLens (ACL 2025) |
| CheckIfExist | Citation hallucination in AI-generated content | Detects fabricated citations in RAG systems | CheckIfExist (2025) |

The FACTS Grounding benchmark11 introduced by Google DeepMind specifically addresses the challenge of evaluating long-form generation against reference documents, distinguishing between claims grounded in provided source material and claims introduced without basis. This is particularly relevant for retrieval-augmented generation (RAG) systems, where the model has access to retrieved context but may still generate unsupported claims.
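
FACTS-style grounding evaluation can be approximated by decomposing a response into claims and asking a judge model whether each claim is supported by the provided source. The sketch below illustrates that pattern with placeholder model calls; the real benchmark's prompts, claim decomposition, and aggregation differ.

```python
# Schematic grounded-factuality check: split a response into claims and verify each against the source.
# `split_into_claims` and `judge_supported` stand in for LLM calls; not the FACTS implementation.

def split_into_claims(response: str) -> list[str]:
    """Placeholder: use an LLM to decompose the response into atomic factual claims."""
    raise NotImplementedError

def judge_supported(claim: str, source_document: str) -> bool:
    """Placeholder: ask a judge model whether `claim` is entailed by `source_document`."""
    raise NotImplementedError

def grounding_score(response: str, source_document: str) -> float:
    claims = split_into_claims(response)
    supported = sum(judge_supported(c, source_document) for c in claims)
    return supported / max(len(claims), 1)  # fraction of claims grounded in the source
```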

CheckIfExist (2025)12 addresses citation hallucination specifically—the generation of plausible-looking but non-existent citations, which poses particular risks in legal, medical, and academic contexts. The benchmark finds that citation hallucination rates remain substantial even for frontier models, and that RAG systems can still hallucinate citations from retrieved documents by misattributing or confabulating specific reference details.
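
Citation hallucination detection ultimately reduces to checking whether a cited work resolves in an external index. The sketch below queries the public Crossref API by title as one way to approximate this; the fuzzy title match and the assumption that a match implies the citation is genuine are simplifications, not the CheckIfExist method.

```python
# Rough citation-existence check against Crossref (one possible approach, with naive matching).
import requests

def citation_exists(title: str) -> bool:
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Naive check: does any returned record's title loosely match the cited title?
    cited = title.lower()
    return any(cited in (t or "").lower() or (t or "").lower() in cited
               for item in items for t in item.get("title", []))
```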

Why Models Hallucinate

OpenAI's explainer on why language models hallucinate13 identifies several contributing mechanisms:

  1. Training objective mismatch: Next-token prediction rewards coherent text, not factual accuracy
  2. Knowledge compression: Models must compress world knowledge into fixed-weight representations, leading to lossy encoding
  3. Context-weight tension: Models may blend retrieved context with parametric knowledge, producing hybrid outputs that are faithful to neither
  4. Sycophancy pressure: RLHF can train models to confirm user beliefs rather than correct factual errors, since raters may prefer agreeable responses
  5. Calibration failure: Models often express high confidence in incorrect claims, reducing the signal value of expressed uncertainty

Hallucination rates are not monotonically improved by scale: models engineered for massive context windows do not consistently achieve lower hallucination rates than smaller counterparts, and increased model size can increase the confidence of hallucinated outputs without reducing their frequency. This suggests hallucination is partly an architectural and training-objective issue rather than purely a capacity limitation.

Deception and Truthfulness

| Behavior Type | Frequency | Context | Mitigation |
|---|---|---|---|
| Hallucination | 8–45%[10] | Varies by task and model | Training improvements, RAG |
| Citation hallucination | Reportedly ≈17%[14] | Legal and academic domains | CheckIfExist detection systems |
| Role-play deception | High | Prompted scenarios | Safety fine-tuning |
| Sycophancy | Moderate | Opinion questions | Constitutional AI, RLHF adjustment |
| Strategic deception | Low–Moderate | Evaluation scenarios | Ongoing research |

According to some sources, even recent frontier models have hallucination rates exceeding 15% when asked to analyze provided statements, and approximately 1 in 6 AI responses in legal contexts reportedly contain citation hallucinations.14 The wide variance across benchmarks reflects both genuine model differences and definitional variation in what constitutes a hallucination.

Safety-Relevant Positive Capabilities

Interpretability Research Platform

| Research Area | Progress Level | Key Findings | Organizations |
|---|---|---|---|
| Attention visualization | Advanced | Knowledge storage patterns | Anthropic, OpenAI |
| Activation patching | Moderate | Causal intervention methods | Redwood Research |
| Sparse autoencoders | Advancing | Feature decomposition in large models | Anthropic, Google DeepMind (Gemma Scope) |
| Neuron explanation | Moderate | LM-explained neurons via GPT-4 | Bills et al. 2023 |
| Mechanistic understanding | Early–Moderate | Transformer circuits | Anthropic Interpretability |

Constitutional AI and Value Learning

Anthropic's Constitutional AI demonstrates approaches to value alignment through self-critique, principle-following, and preference learning. The specific success rates for these techniques vary considerably across evaluations, tasks, and model versions; the figures below represent approximate reported ranges rather than settled benchmarks, and should be treated with caution.

| Technique | Reported Range | Application | Limitations |
|---|---|---|---|
| Self-critique | ≈70–85% (approximate) | Harmful content reduction | Requires good initial training |
| Principle following | ≈60–80% (approximate) | Consistent value application | Vulnerable to gaming |
| Preference learning | ≈65–75% (approximate) | Human value approximation | Distributional robustness |

Scalable Oversight Applications

Modern LLMs enable several approaches to AI safety through automated oversight:

  • Output evaluation: AI systems critiquing other AI outputs (see the sketch after this list). Agreement rates with human evaluators vary substantially by task and domain; no single well-characterized aggregate figure covers the literature.
  • Red-teaming: Automated discovery of failure modes and adversarial inputs.
  • Safety monitoring: Real-time analysis of AI system behavior patterns.
  • Research acceleration: AI-assisted safety research and experimental design.
  • Content moderation: OpenAI has published work on using GPT-4 for content moderation, describing how LLM-based moderation can operate at scale and reduce human labeler exposure to harmful content.
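
A minimal version of the output-evaluation pattern asks a judge model to grade another model's answer against a rubric. The sketch below shows that structure with a placeholder API call; production oversight systems add calibration, multiple judges, and human escalation.

```python
# Minimal LLM-as-judge sketch for automated output evaluation (placeholder model call).
import json

RUBRIC = "Rate the response for factual accuracy and harmlessness from 1-5 and explain briefly."

def call_judge(prompt: str) -> str:
    """Placeholder: send `prompt` to a capable judge model and return its text reply."""
    raise NotImplementedError

def evaluate_output(task: str, response: str) -> dict:
    prompt = (f"{RUBRIC}\n\nTask:\n{task}\n\nResponse:\n{response}\n\n"
              'Reply as JSON: {"score": <1-5>, "reason": "<short reason>"}')
    return json.loads(call_judge(prompt))
```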

Concerning Capabilities Assessment

Persuasion and Manipulation

Modern LLMs demonstrate persuasion capabilities that raise questions for democratic discourse and individual autonomy:

| Capability | Current State | Evidence | Risk Level |
|---|---|---|---|
| Audience adaptation | Advanced | Anthropic persuasion research | High |
| Persona consistency | Advanced | Extended roleplay studies | High |
| Emotional manipulation | Moderate | RLHF alignment research | Moderate |
| Debate performance | Advanced | Human preference studies | High |

According to some sources, frontier LLMs can substantially increase human agreement rates through targeted persuasion techniques, raising concerns about consensus manufacturing; however, the specific figure of 82% attributed to Anthropic research has not been independently verified against a primary source and should be treated with caution.15 The AREG benchmark (2025) evaluates persuasion and resistance capabilities in LLMs through adversarial resource extraction games, finding that frontier models can maintain persuasive pressure across extended multi-turn interactions.16

Multilingual Safety Gaps

A consistent finding across the research literature is that safety alignment applied primarily in English does not reliably transfer to other languages. Helpful to a Fault (2025) measured illicit assistance rates for multi-turn, multilingual LLM agents across 40+ languages, finding that:17

  • Models provide meaningfully more potentially harmful assistance in non-English languages
  • The gap is largest for low-resource languages with sparse alignment training data
  • Multi-turn interactions elicit more harmful assistance than single-turn, with each turn potentially eroding earlier refusals

This suggests that multilingual deployment of LLMs creates safety gaps that single-language evaluation would miss, a concern reportedly noted in the GPT-5 System Card.9

Cybersecurity Capabilities and Refusal Frameworks

LLMs' code generation and vulnerability analysis capabilities create dual-use risks in cybersecurity:

  • Models can reportedly assist with vulnerability identification, exploit development, and attack planning at varying levels depending on the specificity of the request
  • A Content-Based Framework for Cybersecurity Refusal Decisions (2025) proposes taxonomizing cybersecurity requests by content type (reconnaissance vs. exploitation vs. malware development) rather than binary harmful/safe classification, enabling more precise refusal decisions that maintain legitimate educational and defensive use18
  • SecCodeBench-V2 (2025) provides updated benchmarks for evaluating security-relevant code generation, reportedly finding continued improvement in both benign and potentially harmful code generation capabilities19

Autonomous Capabilities

Current LLMs demonstrate increasing levels of autonomous task execution:

  • Web browsing: GPT-4 class models can reportedly navigate websites, extract information, and interact with web services
  • Code execution: Models can write, debug, and iteratively improve software across extended sessions
  • API integration: Tool use across multiple digital platforms; OpenAI's Shipping Smarter Agents with Every New Model describes the company's agent capability roadmap20
  • Goal persistence: Basic ability to maintain objectives across extended multi-turn interactions, relevant to scheming risk evaluation
  • Memory across sessions: MemoryArena (2025) benchmarks agent memory in interdependent multi-session tasks, finding that frontier models maintain task state imperfectly across sessions but with increasing reliability in newer model versions21

Fundamental Limitations

What Doesn't Scale Automatically

| Property | Scaling Behavior | Evidence | Implications |
|---|---|---|---|
| Truthfulness | No direct improvement | Larger models can be more confidently incorrect | Requires targeted training |
| Reliability | Inconsistent | High variance across similar prompts | Systematic evaluation needed |
| Novel reasoning | Limited progress | Pattern matching vs. genuine insight | May require architectural changes |
| Value alignment | No guarantee | Capability-alignment divergence | Alignment difficulty |
| Multilingual safety | Uneven transfer | Align Once, Benefit Multilingually | Requires cross-lingual training |

Current Performance Gaps

Despite strong benchmark performance, significant limitations remain:

  • Hallucination rates: According to some sources, hallucination rates range from roughly 8–45% depending on task and model, with high variance across domains1
  • Inconsistency: High variance in responses to equivalent prompts; Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance (2025)2 documents time-dependent performance variation that is difficult to explain and control for in evaluations
  • Context limitations: Models struggle with very long-horizon reasoning despite large context windows (reportedly exceeding 1 million tokens in some deployments); long-context retrieval accuracy degrades for content positioned in the middle of the context window
  • Novel problem solving: According to reporting at the time of its release, o3 achieved approximately 87.5% on ARC-AGI; however, this reportedly required high-compute settings, and real-world novel reasoning remains challenging3. On the harder ARC-AGI-2 benchmark, o3 reportedly scored around 3% while average humans scored approximately 60%4
  • Benchmark vs. real-world gap: Research using the QUAKE benchmark reportedly found that frontier LLMs average just 28% pass rate on practical tasks, despite high scores on standard benchmarks5
  • Long-tail knowledge: Long-Tail Knowledge in Large Language Models (2025)6 documents that rare facts and edge cases are disproportionately hallucinated, with performance degrading sharply below certain frequency thresholds in training data
  • False belief reasoning: Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs (2025)7 finds that models rely heavily on surface language statistics rather than genuine theory-of-mind reasoning when solving false belief tasks, raising questions about claimed social cognition capabilities

Note that models engineered for massive context windows do not consistently achieve lower hallucination rates than smaller counterparts, suggesting performance depends more on architecture and training quality than capacity alone.

Current State and 2025-2030 Trajectory

Key 2024-2025 Developments

| Development | Status | Impact | Safety Relevance |
|---|---|---|---|
| Reasoning models (o1, o3) | Deployed | PhD-level reasoning achieved | Extended planning capabilities |
| Inference-time scaling | Established | New scaling paradigm | Potentially harder to predict capabilities |
| Agentic AI frameworks | Growing | Autonomous task completion | Autonomous systems concerns |
| 1M+ token context | Standard | Long-document reasoning | Extended goal persistence |
| Multi-model routing | Emerging | Task-optimized deployment | Complexity in governance |
| Open-weight frontier models | Established | Llama 4, Gemma 3, gpt-oss | Reduced API access as governance lever |
| Mechanistic interpretability tooling | Advancing | Gemma Scope 2, SAE ecosystem | Improved interpretability tractability |
| Multilingual alignment research | Early | Align Once, Benefit Multilingually | Safety gaps in non-English languages |

One of the most significant trends is the emergence of agentic AI—LLM-powered systems that can make decisions, interact with tools, and take actions without constant human input. This represents a qualitative shift from chat interfaces to autonomous systems capable of extended task execution. OpenAI's Spring Update and subsequent Shipping Smarter Agents posts describe investment in memory, tool use, and multi-model orchestration as the primary near-term product directions.
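
At its core, an agentic system wraps an LLM in a loop that alternates between model calls and tool execution until the task is done. The sketch below shows that loop in schematic form; the message format, tool registry, JSON tool-call protocol, and stopping rule are illustrative assumptions rather than any vendor's framework.

```python
# Schematic agent loop: call the model, execute any requested tool, feed results back.
# `call_llm` is a placeholder; the dict-based tool-call protocol is assumed for illustration.
import json

TOOLS = {
    "search": lambda query: f"(search results for {query!r})",
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy example only
}

def call_llm(messages: list[dict]) -> str:
    """Placeholder: return either a final answer or a JSON tool call like
    {"tool": "search", "input": "..."}."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)                      # model asked to use a tool
            result = TOOLS[call["tool"]](call["input"])
            messages.append({"role": "tool", "content": result})
        except (json.JSONDecodeError, KeyError):
            return reply                                  # not a tool call: treat as final answer
    return "Stopped: step limit reached"
```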

Near-term Outlook (2025-2026)

| Development | Likelihood | Timeline | Impact |
|---|---|---|---|
| GPT-5.x and successor models | High | 6–12 months | Further capability improvement |
| Improved reasoning (o3 successors) | High | 3–6 months | Enhanced scientific research |
| Multimodal integration | High | 6–12 months | Video, audio, sensor fusion |
| Robust agent frameworks | High | 12–18 months | Autonomous systems |
| Interpretability-informed alignment | Moderate | 12–24 months | Features as Rewards approach |

Medium-term Outlook (2026-2030)

Expected developments include potential architectural innovations beyond standard transformer attention, deeper integration with robotics platforms, and continued capability improvements. Key uncertainties include whether current scaling approaches will continue yielding improvements and the timeline for artificial general intelligence.

Data constraints: According to Epoch AI projections, high-quality training data could become a significant bottleneck this decade, particularly if models continue to be overtrained. For AI progress to continue into the 2030s, either new sources of data (synthetic data, multimodal data) or less data-hungry techniques must be developed. Can Generative AI Survive Data Contamination? (2025) analyzes the theoretical guarantees available under contaminated recursive training—where models are trained on AI-generated data—finding that certain forms of contamination lead to systematic performance degradation rather than catastrophic collapse, but that the long-run effects remain analytically uncertain.

Key Uncertainties and Research Cruxes

Fundamental Understanding Questions

  • Intelligence vs. mimicry: Extent of genuine understanding vs. sophisticated pattern matching, with False Belief Reasoning research (2025) suggesting current models rely heavily on statistical correlates of reasoning rather than causal models
  • Emergence predictability: Whether capability emergence can be reliably forecasted; Random Scaling of Emergent Capabilities (2025) suggests randomness in training may cause more gradual aggregate emergence than apparent
  • Architectural limits: Whether transformers can scale to AGI or require fundamental innovations
  • Alignment scalability: Whether current safety techniques work for superhuman systems

Safety Research Priorities

| Priority Area | Importance | Tractability | Neglectedness |
|---|---|---|---|
| Interpretability | High | Moderate | Moderate |
| Alignment techniques | Highest | Low | Low |
| Capability evaluation | High | High | Moderate |
| Governance frameworks | High | Moderate | High |
| Multilingual alignment | High | Moderate | High |
| Factuality and hallucination | Moderate | High | Low |

Timeline Uncertainties

Current expert surveys show wide disagreement on AGI timelines, with median estimates ranging from 2027 to 2045. This uncertainty stems from:

  • Unpredictable capability emergence patterns
  • Unknown scaling law continuation
  • Potential architectural innovations
  • Economic and resource constraints
  • Data availability bottlenecks

The o3 results on ARC-AGI (87.5%, approaching human baseline of ~85%) have intensified debate about whether we are approaching AGI sooner than expected. Critics note that high-compute inference settings make this performance expensive and slow, and that benchmark performance may not translate to general real-world capability. The ARC-AGI-2 results (o3 at 3%, average humans at 60%) illustrate that performance on designed-to-be-robust benchmarks still reveals substantial gaps.

Sources & Resources

Academic Research

| Paper | Authors | Year | Key Contribution |
|---|---|---|---|
| Scaling Laws | Kaplan et al. | 2020 | Mathematical scaling relationships |
| Chinchilla | Hoffmann et al. | 2022 | Optimal parameter-data ratios |
| Constitutional AI | Bai et al. | 2022 | Value-based training methods |
| Emergent Abilities | Wei et al. | 2022 | Capability emergence documentation |
| Fine-Tuning GPT-2 from Human Preferences | Ziegler et al. | 2019 | RLHF foundations for language models |
| Language Models are Few-Shot Learners | Brown et al. | 2020 | GPT-3 and few-shot learning |
| Emergent Abilities Survey | Various | 2025 | Comprehensive emergence review |
| Scaling Laws for Precision | Kumar et al. | 2024 | Low-precision scaling extensions |
| HalluLens Benchmark | Various | 2025 | Hallucination measurement framework |
| FACTS Grounding | Google DeepMind | 2025 | Long-form factuality benchmark |
| Align Once, Benefit Multilingually | Various | 2025 | Multilingual safety alignment |
| Random Scaling of Emergent Capabilities | Various | 2025 | Randomness and emergence thresholds |
| Causality is Key for Interpretability | Various | 2025 | Causal evaluation of interpretability claims |

Model and System Cards

| Document | Organization | Year | Relevance |
|---|---|---|---|
| GPT-5 System Card | OpenAI | 2025 | Safety evaluations, capability assessments |
| GPT-5 System Card Addendum: Sensitive Conversations | OpenAI | 2025 | Mental health, self-harm, contentious topics |
| gpt-oss-120b & gpt-oss-20b Model Card | OpenAI | 2025 | Open-weight model documentation |
| GPT-4V(ision) System Card | OpenAI | 2023 | Multimodal safety evaluation |
| GPT-2: 1.5B Release | OpenAI | 2019 | Staged release and safety rationale |

Organizations and Research Groups

| Type | Organization | Focus Area | Key Resources |
|---|---|---|---|
| Industry | OpenAI | GPT series, safety research | Technical papers, safety docs |
| Industry | Anthropic | Constitutional AI, interpretability | Claude research, safety papers |
| Industry | Google DeepMind | Gemini, Gemma, interpretability | Gemma Scope 2, FACTS benchmarks |
| Academic | CHAI | AI alignment research | Technical alignment papers |
| Safety | Redwood Research | Interpretability, oversight | Mechanistic interpretability |
| Research | METR | Capability evaluation | Autonomous replication evaluations |

Policy and Governance Resources

| Resource | Organization | Focus | Link |
|---|---|---|---|
| AI Safety Guidelines | NIST | Federal standards | Risk management framework |
| Responsible AI Practices | Partnership on AI | Industry coordination | Best practices documentation |
| International Cooperation | UK AI Safety Institute | Global safety standards | International coordination |
| Responsible Scaling Policy | Anthropic | Capability thresholds | RSP documentation |

Footnotes

  1. OpenAI, "OpenAI o3 and o4-mini System Card / ARC-AGI results" (https://arcprize.org/blog/oai-o3-pub-breakthrough)

  2. Stanford HAI, "AI Index Report 2025" (https://hai.stanford.edu/ai-index/2025-ai-index-report)

  3. Citation rc-c587 (data unavailable)

  4. Vectara, "Hallucination Leaderboard" (https://github.com/vectara/hallucination-leaderboard)

  5. HalluLens, ACL 2025 (https://aclanthology.org/2025.acl-long.1176/)

  6. Anthropic, "Responsible Scaling Policy" (https://www.anthropic.com/news/anthropics-responsible-scaling-policy)

  7. According to sources tracking Chatbot Arena open-vs-closed model performance gaps, 2024–2025.

  8. Google DeepMind, "Gemma 3 Technical Report" (2025). https://ai.google.dev/gemma

  9. OpenAI, "GPT-5 System Card," May 2025. https://openai.com/index/gpt-5-system-card/

  10. Range drawn from the Vectara Hallucination Leaderboard (≈8% for best summarization models, https://github.com/vectara/hallucination-leaderboard) and HalluLens ACL 2025 (up to 45% for GPT-4o on factual queries, https://aclanthology.org/2025.acl-long.1176/). Figures are not directly comparable across methodologies.

  11. Google DeepMind, "FACTS Grounding: A new benchmark for evaluating the factuality of large language models" (https://arxiv.org/abs/2501.03200)

  12. CheckIfExist authors, "CheckIfExist: Citation Hallucination Detection in RAG Systems" (https://arxiv.org/abs/2502.09802)

  13. OpenAI, "Why language models hallucinate" (https://openai.com/index/why-language-models-hallucinate/)

  14. The ≈17% citation hallucination rate and "1 in 6 responses" figure are attributed to AIMultiple research (https://research.aimultiple.com/ai-hallucination/) but could not be independently verified against a primary study; these figures should be treated as approximate and reported with caution.

  15. The specific claim that GPT-4 increases human agreement rates by 82% has been attributed to Anthropic research in various secondary sources, but no primary paper or report confirming this exact figure could be identified. The claim is therefore hedged here.

  16. AREG Benchmark (2025). https://arxiv.org/abs/2502.09455

  17. Helpful to a Fault (2025). https://arxiv.org/abs/2502.09933

  18. A Content-Based Framework for Cybersecurity Refusal Decisions (2025). https://arxiv.org/abs/2502.09591

  19. SecCodeBench-V2 (2025). https://arxiv.org/abs/2502.09474

  20. Citation rc-95be (data unavailable)

  21. MemoryArena (2025). https://arxiv.org/abs/2502.10392

References

1. Kaplan et al. (2020). arXiv · Jared Kaplan et al. · 2020 · Paper
2. Hoffmann et al. (2022). arXiv · Jordan Hoffmann et al. · 2022 · Paper
3. Anthropic (Anthropic)
   Anthropic conducts research across multiple domains including AI alignment, interpretability, and societal impacts to develop safer and more responsible AI technologies. Their work aims to understand and mitigate potential risks associated with increasingly capable AI systems.
4. OpenAI (OpenAI)
6. Constitutional AI: Harmlessness from AI Feedback. arXiv · Bai, Yuntao et al. · 2022 · Paper
7. Emergent Abilities. arXiv · Jason Wei et al. · 2022 · Paper
8. OpenAI: Model Behavior (OpenAI)
10. Partnership on AI (partnershiponai.org)
    A nonprofit organization focused on responsible AI development by convening technology companies, civil society, and academic institutions. PAI develops guidelines and frameworks for ethical AI deployment across various domains.
11. DeepMind (Google DeepMind)
16. OpenAI (openai.com)
31. GPT-4o (OpenAI)
32. LlamaFirewall (Meta AI)
34. Brown et al. (2020). arXiv · Tom B. Brown et al. · 2020 · Paper

Related Pages

Top Related Pages

Organizations: Anthropic, Redwood Research
Concepts: Scientific Research Capabilities, Large Language Models, AGI Timeline
Risks: Emergent Capabilities
Approaches: AI Alignment, AI Governance Coordination Technologies
Key Debates: AI Safety Solution Cruxes, AI Alignment Research Agendas
Policy: EU AI Act, AI Standards Development
Safety Research: Scalable Oversight
Other: Philip Tetlock (Forecasting Pioneer), Robin Hanson
Models: AI Uplift Assessment Model, Deceptive Alignment Decomposition Model
Analysis: Wikipedia Views, SquiggleAI
Historical: Deep Learning Revolution Era