
Large Language Models

Capability

Large Language Models

Comprehensive analysis of LLM capabilities showing rapid progress from GPT-2 (1.5B parameters, 2019) to GPT-5 and Gemini 2.5 (2025), with training costs growing 2.4x annually and projected to exceed $1B by 2027. Documents emergence of inference-time scaling paradigm, mechanistic interpretability advances including Gemma Scope 2, multilingual alignment research, factuality benchmarking via FACTS suite, and identifies key safety concerns including 8–45% hallucination rates, persuasion capabilities, and growing autonomous agent capabilities.

First Major: GPT-2 (2019)
Key Labs: OpenAI, Anthropic, Google
Related Capabilities: Reasoning and Planning, Agentic AI
Related Organizations: OpenAI
7.4k words · 52 backlinks

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Capability Level | Near-human to superhuman on structured tasks | o3 achieves 87.5% on ARC-AGI (human baseline ≈85%); 87.7% on GPQA Diamond |
| Progress Rate | 2–3× capability improvement per year | Stanford AI Index 2025[^2]: benchmark scores rose 18–67 percentage points in one year |
| Training Cost Trend | 2.4× annual growth | Epoch AI: frontier model training costs projected to exceed $1B by 2027 |
| Inference Cost Trend | 280× reduction since 2022 | GPT-4-equivalent dropped from $10 to $1.07 per million tokens |
| Hallucination Rates | 8–45% depending on task | Vectara leaderboard: best models reportedly at ≈8%; HalluLens: up to 45% on factual queries |
| Safety Maturity | Moderate | Constitutional AI, RLHF established; Responsible Scaling Policies implemented by major labs |
| Open–Closed Gap | Narrowing | Gap reportedly shrunk from 8.04% to 1.70% on Chatbot Arena (Jan 2024 → Feb 2025) |

| Source | Link |
|---|---|
| Official Website | learn.microsoft.com |
| Wikipedia | en.wikipedia.org |
| arXiv | arxiv.org |

Overview

Large Language Models (LLMs) are transformer-based neural networks trained on vast text corpora using next-token prediction. Despite their deceptively simple training objective, LLMs exhibit sophisticated emergent capabilities including reasoning, coding, scientific analysis, and complex task execution. These models have transformed abstract AI safety discussions into concrete, immediate concerns while providing the clearest path toward artificial general intelligence.

The core insight underlying LLMs is that training a model to predict the next word in a sequence—a task achievable without labeled data—produces internal representations useful for a wide range of downstream tasks. This approach was explored in early unsupervised pretraining work from OpenAI in 2018. OpenAI's GPT-2 (2019) then demonstrated coherent multi-paragraph generation at scale, showing that larger models trained on more data produced qualitatively stronger outputs. An earlier indicator that such models form interpretable internal structure was the discovery of an "unsupervised sentiment neuron" in 2017, which emerged without any sentiment-specific supervision.

A foundational development alongside raw scale was learning to align model outputs to human preferences. The GPT-2 paper itself acknowledged concerns about potential misuse of fluent generation, leading OpenAI to initially withhold the full 1.5B-parameter model and conduct a staged release to study harms before broader deployment. Subsequent work on instruction-following alignment—culminating in InstructGPT—demonstrated that fine-tuning with human feedback could substantially improve model usefulness and reduce harmful outputs without proportional increases in model size. The complementary technique of RLHF applied to summarization tasks, and later to instruction-following, established the training paradigm that underlies most current aligned frontier models.

Current frontier models—including GPT-4o, Claude Opus 4.5, Gemini 2.5 Pro, and Llama 4—demonstrate near-human or superhuman performance across diverse cognitive domains. Training runs for leading frontier systems reportedly consume hundreds of millions of dollars, and model parameter counts have reached into the hundreds of billions to trillions. These substantial computational investments have shifted AI safety from theoretical to practical urgency. The late 2024–2025 period marked a paradigm shift toward inference-time compute scaling, with reasoning models such as OpenAI's o1 and o3 achieving higher performance on reasoning benchmarks by allocating more compute at inference rather than only at training time.

A parallel development is the rapid growth of the open-weight ecosystem. Meta's Llama family has grown substantially since its initial release, with Meta reporting over a billion downloads and more than ten times the developer activity compared to 2023. Google's Gemma models—including Gemma 3 and the mobile-first Gemma 3n variants—have provided the safety research community with accessible architectures for mechanistic interpretability work. This open/closed model convergence has implications for both capability diffusion and the tractability of safety interventions.

Capability Architecture

The diagram below maps the flow from training through inference to observed capabilities. Capabilities are grouped by which inference regime produces them; the distinction between "standard" and "search-augmented" inference is a key axis along which safety-relevant behaviors (extended planning, autonomous task execution) emerge.

[Diagram: training pipeline → inference regime (standard vs. search-augmented) → emergent capabilities]

Note: Capabilities in the "Emergent Capabilities" subgraph are descriptive, not evaluative. Safety implications of each capability class are discussed in the Concerning Capabilities Assessment and Safety-Relevant Positive Capabilities sections below.

Risk Assessment

| Risk Category | Severity | Likelihood | Timeline | Trend |
|---|---|---|---|---|
| Deceptive Capabilities | High | Moderate | 1–3 years | Increasing |
| Persuasion & Manipulation | High | High | Current | Accelerating |
| Autonomous Cyber Operations | Moderate–High | Moderate | 2–4 years | Increasing |
| Scientific Research Acceleration | Mixed | High | Current | Accelerating |
| Economic Disruption | High | High | 2–5 years | Accelerating |

Capability Progression Timeline

| Model | Release | Parameters | Key Breakthrough | Performance Milestone |
|---|---|---|---|---|
| GPT-2 | Feb 2019 | 1.5B | Coherent text generation | Initially withheld for safety concerns; full 1.5B release Nov 2019 |
| GPT-3 | Jun 2020 | 175B | Few-shot learning emergence | Creative writing, basic coding |
| Codex | Aug 2021 | Undisclosed | Code generation from natural language | Powered early GitHub Copilot; evaluated on HumanEval benchmark |
| InstructGPT | Jan 2022 | 1.3B–175B | Instruction-following via RLHF | Preferred by labelers over 175B GPT-3 despite smaller size |
| GPT-4 | Mar 2023 | Undisclosed | Multimodal reasoning | Reportedly ≈90th percentile on the bar exam and SAT |
| GPT-4o | May 2024 | Undisclosed | Multimodal speed/cost | Real-time audio-visual, described as 2× faster than GPT-4 Turbo |
| Claude 3.5 Sonnet | Jun 2024 | Undisclosed | Advanced tool use | 86.5% MMLU, leading SWE-bench at release |
| o1 | Sep 2024 | Undisclosed | Chain-of-thought reasoning | 77.3% GPQA Diamond, 74% AIME 2024 |
| o3 | Dec 2024 | Undisclosed | Inference-time search | 87.7% GPQA Diamond, 91.6% AIME 2024 |
| Gemini 2.0 Flash | Feb 2025 | Undisclosed | Agentic multimodal capabilities | Native image generation, real-time audio; positioned as foundation for agentic era |
| Gemini 2.5 Pro | Mar 2025 | Undisclosed | Long-context reasoning | 1M-token context, leading coding benchmarks at release |
| Llama 4 | Apr 2025 | Undisclosed | Natively multimodal open-weight | Mixture-of-experts architecture |
| GPT-5 | Aug 2025 | Undisclosed | Unified reasoning + tool use | Highest reported scores at release on GPQA Diamond and SWE-bench |
| Claude Opus 4.5 | Reportedly Nov 2025 | Undisclosed | Extended reasoning | Reportedly 80.9% SWE-bench Verified |
| GPT-5.2 | Reportedly late 2025 | Undisclosed | Deep thinking modes | Reportedly 93.2% GPQA Diamond, 90.5% ARC |

Primary sources: OpenAI model announcements, Anthropic model cards, Google DeepMind blog.

Benchmark Performance Comparison (2024–2025)

| Benchmark | Measures | GPT-4o (2024) | o1 (2024) | o3 (2024) | Human Expert |
|---|---|---|---|---|---|
| GPQA Diamond | PhD-level science | ≈50% | 77.3% | 87.7% | ≈65–74% (domain PhDs) |
| AIME 2024 | Competition math | 13.4% | 74% | 91.6% | Top 500 US |
| MMLU | General knowledge | 84.2% | 90.8% | ≈92% | 89.8% |
| SWE-bench Verified | Real GitHub issues | 33.2% | 48.9% | 71.7% | N/A |
| ARC-AGI | Novel reasoning | ≈5% | 13.3% | 87.5% | ≈85% |
| Codeforces | Competitive coding | ≈11th percentile | 89th percentile | 99.8th percentile | N/A |

Sources: OpenAI o1 system card, ARC Prize o3 analysis, Anthropic Claude 3.5 model card.

The o3 results represent a qualitative shift: o3 achieved nearly human-level performance on ARC-AGI (87.5%) versus a ~85% human baseline, a benchmark specifically designed to test general reasoning rather than pattern matching. On FrontierMath, o3 reportedly solved 25.2% of problems compared to o1's 2%—a roughly 12x improvement that suggests reasoning capabilities may be scaling faster than expected. However, on the harder ARC-AGI-2 benchmark, o3 scores only ~3% compared to ~60% for average humans, revealing significant limitations in truly novel reasoning tasks.

Scaling Laws and Predictable Progress

Core Scaling Relationships

Research by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) demonstrates robust mathematical relationships governing LLM performance:

| Factor | Scaling Law | Implication |
|---|---|---|
| Model Size | Loss ∝ N^−0.076 | 10× parameters → ≈16% loss reduction |
| Training Data | Loss ∝ D^−0.095 | 10× data → ≈20% loss reduction |
| Compute | Loss ∝ C^−0.050 | 10× compute → ≈11% loss reduction |
| Optimal Allocation | N and D scale roughly equally with compute (D ≈ 20N tokens) | Chinchilla scaling for efficiency |

Sources: Chinchilla paper; Scaling Laws for Neural Language Models

According to Epoch AI research, approximately two-thirds of LLM performance improvements over the last decade are attributable to increases in model scale, with training techniques contributing roughly 0.4 orders of magnitude per year in compute efficiency. The cost of training frontier models has reportedly grown by 2.4x per year since 2016, and Epoch AI's cost analyses project the largest training runs to exceed $1B by 2027.
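
Plugged into the power-law form above, the arithmetic is direct. A minimal sketch using the Kaplan et al. exponents; only loss ratios are computed, since the laws' constants cancel:

```python
# Kaplan et al. (2020) power-law exponents; constants drop out of ratios,
# so only relative loss changes are computed here.
ALPHA_N = 0.076  # model parameters
ALPHA_D = 0.095  # training tokens
ALPHA_C = 0.050  # training compute

def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Loss multiplier after scaling one factor by scale_factor,
    holding the others unconstrained (loss ∝ factor^-alpha)."""
    return scale_factor ** (-alpha)

for name, alpha in [("parameters", ALPHA_N), ("data", ALPHA_D), ("compute", ALPHA_C)]:
    r = loss_ratio(10.0, alpha)
    print(f"10x {name}: loss x {r:.3f} (~{(1 - r):.0%} reduction)")
```

This is why scale-ups must be so large to matter: a 10× increase in parameters alone only shaves roughly 16% off the loss.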

A related phenomenon is the emergence of new scaling regimes beyond training compute. Research published in 2025 finds that emergent capability thresholds are sensitive to random factors including data ordering and initialization seeds, suggesting that the apparent sharpness of emergence in aggregate curves may partly reflect averaging over many random runs with different thresholds rather than a clean phase transition.

The Shift to Inference-Time Scaling (2024–2025)

The o1 and o3 model families introduced a new paradigm: inference-time compute scaling. Rather than only scaling training compute, these models allocate additional computation at inference time through extended reasoning chains and search procedures.

| Scaling Type | Mechanism | Trade-off | Example |
|---|---|---|---|
| Pre-training scaling | More parameters, data, training compute | High upfront cost, fast inference | GPT-4, Claude 3.5 |
| Inference-time scaling | Longer reasoning chains, search | Lower training cost, expensive inference | o1, o3 |
| Combined scaling | Both approaches | Maximum capability, maximum cost | GPT-5, Claude Opus 4.5 |

This shift has implications for AI safety: inference-time scaling allows models to "think longer" on hard problems, potentially achieving strong performance on specific tasks while maintaining manageable training costs. According to some sources, o1 is approximately 6x more expensive and 30x slower than GPT-4o per query. The Stanford AI Index 2025 cites the RE-Bench evaluation finding that in short time-horizon settings (2-hour budget), top AI systems score 4x higher than human experts, but as the time budget increases to 32 hours, human performance surpasses AI by roughly 2 to 1.
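
One concrete form of inference-time scaling is self-consistency sampling: draw several reasoning chains and majority-vote their final answers, so spending more inference compute buys reliability. A minimal sketch; `sample_answer` is a hypothetical stand-in for a stochastic model call, not a real API:

```python
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic model sample; a real
    implementation would call an LLM API with temperature > 0."""
    return random.choice(["42", "42", "42", "41"])  # toy answer distribution

def self_consistency(prompt: str, k: int = 16) -> str:
    """Sample k reasoning chains and majority-vote the final answers.
    Larger k = more inference compute = a more reliable majority."""
    votes = Counter(sample_answer(prompt) for _ in range(k))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?", k=32))
```

The cost profile matches the table above: nothing about the underlying model changes, but each query costs k times as much.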

Emergent Capability Thresholds

| Capability | Emergence Scale | Evidence | Safety Relevance |
|---|---|---|---|
| Few-shot learning | ≈100B parameters | GPT-3 breakthrough | Tool use foundation |
| Chain-of-thought | ≈10B parameters | PaLM, GPT-3 variants | Complex reasoning |
| Code generation | ≈1B parameters | Codex, GitHub Copilot | Cyber capabilities |
| Instruction following | ≈10B parameters | InstructGPT | Human-AI interaction paradigm |
| PhD-level reasoning | o1+ scale | GPQA Diamond performance | Expert-level autonomy |
| Strategic planning | o3 scale | ARC-AGI performance | Deception potential |

Research documented in a 2025 emergent abilities survey finds that emergent abilities depend on multiple interacting factors: scaling up parameters or depth lowers the threshold for emergence but is neither necessary nor sufficient alone—data quality, diversity, training objectives, and architecture modifications also matter significantly. Emergence aligns more closely with pre-training loss landmarks than with sheer parameter count; smaller models can match larger ones if training loss is sufficiently reduced.

According to the Stanford AI Index 2025, benchmark performance improved substantially in a single year: scores rose by 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively. The gap between leading US and Chinese models also narrowed—from 17.5 to 0.3 percentage points on MMLU—over the same period.

Safety implication: As AI systems gain autonomous reasoning capabilities, they also develop behaviors relevant to safety evaluation, including goal persistence, strategic planning, and the capacity for deceptive alignment. OpenAI's o3-mini reportedly became the first AI model to receive a "Medium risk" classification for Model Autonomy under internal capability evaluation frameworks.

Foundational Safety Research and Alignment Techniques

GPT-2: Staged Release as Safety Practice

The GPT-2 release in 2019 marked the first occasion where an AI lab deliberately staged a major model release over several months specifically to study potential misuse. OpenAI released progressively larger versions of GPT-2 (117M, 345M, 762M, and finally 1.5B parameters) across February–November 2019, monitoring for evidence of misuse between each stage. The six-month review found no strong evidence of misuse attributable to the staged approach, and OpenAI acknowledged the difficulty of empirically evaluating whether staged release achieved its safety goals. This episode established the practice of conducting deliberate capability-safety reviews before broad model deployment, a practice later formalized in responsible scaling policies.

A central concern motivating the staged release was that models capable of coherent, long-form text generation at scale could be used for disinformation, spam, and other misuse at lower cost than previous methods. OpenAI's subsequent analysis noted that the question of whether to release a model cannot be separated from analysis of what specific harms it enables and at what marginal uplift over existing tools.

Instruction Following and InstructGPT

A critical development between GPT-3 and GPT-4 was the discovery that models fine-tuned to follow instructions using human feedback were substantially preferred by users over much larger models trained only on next-token prediction. OpenAI's Aligning Language Models to Follow Instructions (2022)—the InstructGPT paper—demonstrated that a 1.3B-parameter RLHF-tuned model was preferred over the 175B GPT-3 in head-to-head evaluations by human labelers across a range of tasks. This result was significant for safety because it showed that alignment training and capability were not in fundamental tension: smaller, better-aligned models could outperform larger unaligned ones.

InstructGPT also documented a key limitation of RLHF: the fine-tuned models showed measurable performance regression on standard NLP benchmarks (the "alignment tax"), which OpenAI partially mitigated by mixing pretraining data into the RLHF training. The paper identified sycophancy—models outputting what users want to hear rather than accurate information—as a persistent failure mode stemming from human labeler preferences.

Learning to Summarize and RLHF Foundations

Two key papers established the empirical foundations for applying RLHF to language tasks at scale:

  • Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019), which first applied reward-model-based RLHF to GPT-2 on stylistic continuation tasks
  • Learning to Summarize with Human Feedback (Stiennon et al., 2020), which showed that RLHF-trained summarization models could outperform substantially larger supervised baselines

The summarization work also introduced the practice of training reward models separately from the policy model and using them as a proxy for human judgment, a design choice that has shaped subsequent RLHF applications. An early application to long-form content was Summarizing Books with Human Feedback (Wu et al., 2021), which used recursive decomposition to enable human labelers to evaluate summaries of book-length texts—a technique relevant to scalable oversight of tasks too long for direct human evaluation.

TruthfulQA and Measuring Falsehood

A persistent failure mode of large language models is that they generate false answers that nonetheless reflect common human misconceptions—a pattern distinct from simple hallucination. TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2022) documented this systematically: the benchmark comprises 817 questions spanning 38 categories where the correct answer is counterintuitive or contradicts popular belief.

Key findings from TruthfulQA:

  • Larger models are not more truthful on this benchmark; in some categories, larger models perform worse because they more fluently reproduce common misconceptions
  • The best models at time of evaluation answered truthfully on roughly 58% of questions, versus a human baseline of 94%
  • Model-generated answers that were false were often confidently stated, reducing the signal value of uncertainty expressions

This work established an important distinction between calibration (knowing what you don't know) and truthfulness (correctly answering questions), and demonstrated that scaling alone does not reliably improve truthfulness. TruthfulQA has been used as a standard evaluation in subsequent model releases, including the GPT-4 technical report.
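
TruthfulQA's multiple-choice variant (MC1) counts a question as answered truthfully only if the model scores the single true option highest. A toy scorer with made-up log-probabilities illustrates how a fluently stated misconception can win:

```python
def mc1_correct(option_logprobs: dict, truthful_option: str) -> bool:
    """TruthfulQA MC1-style scoring: truthful iff the model's top-scoring
    option is the designated true answer."""
    best = max(option_logprobs, key=option_logprobs.get)
    return best == truthful_option

# Made-up log-probabilities: the fluent misconception outscores the truth,
# so this question is scored as answered untruthfully.
scores = {
    "Lightning never strikes the same place twice": -0.4,  # misconception
    "Lightning often strikes the same place twice": -1.1,  # true
}
print(mc1_correct(scores, "Lightning often strikes the same place twice"))  # -> False
```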

WebGPT: Factuality Through Web Browsing

WebGPT: Browser-Assisted Question-Answering with Human Feedback (Nakano et al., 2021) addressed factuality from a different angle: instead of improving parametric knowledge, the system learned to browse the web and cite sources when answering long-form questions. WebGPT was fine-tuned using human feedback to reward answers that were both well-supported by retrieved sources and preferred by evaluators.

Key contributions:

  • Demonstrated that retrieval-augmented generation with citation could substantially reduce unsupported factual claims
  • Found that human evaluators preferred WebGPT's answers to those written by human demonstrators on 56% of ELI5 questions, and that the model's answers were more factual than those produced without web access
  • Identified a persistent failure mode: the model sometimes selected misleading or unrepresentative sources to support a desired conclusion, foreshadowing later concerns about "faithful but misleading" retrieval-augmented generation

WebGPT represents an early instance of tool-augmented LLM deployment and contributed to the design of subsequent browsing-enabled deployments including ChatGPT's web browsing mode.

OpenAI Codex and Code Generation

Evaluating Large Language Models Trained on Code (Chen et al., 2021)—the Codex paper—introduced the HumanEval benchmark and reported that Codex solved 28.8% of programming problems in a zero-shot setting, rising to 70.2% with repeated sampling. Codex was fine-tuned from GPT-3 on publicly available code from GitHub and became the foundation of GitHub Copilot.

From a safety perspective, Codex introduced a new dual-use concern: models capable of generating functional code from natural language instructions could assist with both beneficial software development and potentially harmful applications including vulnerability discovery and exploit development. The Codex paper acknowledged these concerns and noted that the responsible disclosure of such capabilities required consideration of both the benefits to legitimate users and the marginal uplift provided to malicious actors.

The HumanEval benchmark—which tests whether generated code passes unit tests on 164 programming problems—has become a standard evaluation metric, though subsequent work has noted that benchmark performance can overstate practical coding ability on real-world repositories with complex dependencies.
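
The repeated-sampling numbers above rely on the Codex paper's unbiased pass@k estimator: given n samples per problem of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k). A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n with c correct, passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 50, 1))   # -> 0.25 (matches the raw pass rate)
print(pass_at_k(200, 50, 10))  # rises sharply with more samples per problem
```

The estimator avoids the bias of naively computing 1 − (1 − c/n)^k, which overestimates pass@k for small n.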

Deliberative Alignment

Deliberative Alignment: Reasoning Enables Safer Language Models (2024) introduced a safety training approach in which models are explicitly trained to reason through safety considerations before generating responses to potentially sensitive queries. Rather than relying solely on learned behavioral patterns to refuse or comply with requests, deliberative alignment trains the model to identify relevant safety principles, reason about their application to the specific context, and generate responses consistent with that reasoning.

Key claims from the paper:

  • Models trained with deliberative alignment show reduced rates of both over-refusal (refusing benign requests) and under-refusal (complying with genuinely harmful ones)
  • The reasoning process is observable and partially auditable, providing some transparency into why a given refusal or compliance decision was made
  • The approach transfers more reliably to novel or edge-case inputs than pure behavioral fine-tuning, because the model reasons about principles rather than matching patterns

Deliberative alignment is described as a component of the safety training for OpenAI's o-series reasoning models and has been cited as a mechanism for improving the precision of safety behaviors—distinguishing between superficially similar benign and harmful requests.

RLHF and Alignment Training Foundations

A key technique underlying modern aligned LLMs is Reinforcement Learning from Human Feedback (RLHF), which trains a reward model on human preference comparisons and uses it to fine-tune language model outputs via RL. The foundational application of this approach to Large Language Models was demonstrated in Fine-Tuning GPT-2 from Human Preferences (Ziegler et al., 2019), which showed that human feedback could steer model behavior on stylistic tasks. This was later scaled to instruction-following via InstructGPT and then to Claude's Constitutional AI framework.

| RLHF Component | Function | Safety Relevance |
|---|---|---|
| Reward model | Converts human preferences into a differentiable signal | Central to RLHF alignment |
| PPO fine-tuning | Updates language model to maximize reward | Can introduce reward hacking |
| Constitutional AI | Replaces human labelers with model self-critique against principles | Scales alignment oversight; see Constitutional AI |
| Multilingual consistency | Enforces safety behaviors across languages | Align Once, Benefit Multilingually (2025) |

A significant limitation is that RLHF-trained models can exhibit sycophancy—systematically agreeing with user beliefs rather than providing accurate responses—because human raters often prefer confident, agreeable answers. Recent work on multi-objective alignment for psychotherapy contexts and policy-constrained alignment frameworks like PolicyPad (2025) explore how to balance multiple competing alignment objectives.
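
The reward-model stage is typically trained with a Bradley–Terry-style pairwise loss, −log σ(r_chosen − r_rejected). A minimal numeric sketch (plain Python; a real reward model would be an LLM head trained by backpropagation over many comparisons):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(0.0, 0.0))   # no separation: loss = ln 2 ~ 0.693
print(preference_loss(2.0, -1.0))  # preferred output scored higher: small loss
```

Minimizing this loss pushes the reward model to score the human-preferred response above the rejected one; the resulting scalar reward is what PPO then maximizes.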

Multilingual alignment gap: A consistent finding across alignment research is that safety training applied primarily in English does not reliably transfer to other languages. Align Once, Benefit Multilingually (2025) proposes methods to enforce multilingual consistency in safety alignment by training on cross-lingual consistency losses. Related work on Helpful to a Fault (2025) measures illicit assistance rates across 40+ languages in multi-turn interactions, finding that multilingual agents provide more potentially harmful assistance than English-only evaluations would suggest, particularly in low-resource languages where safety fine-tuning data is sparse.

Evaluation Methodology and Benchmarks

Factuality Benchmarking

A growing area of evaluation research focuses on measuring LLM factuality rigorously and across diverse task types. Two related but distinct benchmarks from Google DeepMind address different aspects of this problem:

  • FACTS Grounding: Introduced to measure whether long-form generated text is supported by provided source documents. It distinguishes between claims grounded in retrieved context and claims introduced without basis—directly relevant to retrieval-augmented generation (RAG) systems where the model has access to retrieved context but may still generate unsupported claims.
  • FACTS Benchmark Suite: A broader systematic evaluation that decomposes factuality failures by error type across diverse task categories, enabling more fine-grained analysis of where and how models generate false information.

The distinction between these two benchmarks matters for practitioners: FACTS Grounding tests faithfulness to provided context, while the FACTS Benchmark Suite tests factual accuracy more broadly, including against parametric knowledge.
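
To make the grounded-versus-unsupported distinction concrete, here is a deliberately crude lexical-overlap proxy; actual FACTS grading uses LLM judges, so treat this only as an illustration of the scoring shape:

```python
STOPWORDS = {"the", "a", "an", "is", "was", "of", "in", "on"}

def support_score(claim: str, source: str) -> float:
    """Toy grounding proxy: fraction of the claim's content words found in
    the source document. Real graders (FACTS Grounding) use LLM judges."""
    words = lambda t: {w.lower().strip(".,") for w in t.split()}
    content = words(claim) - STOPWORDS
    if not content:
        return 0.0
    return len(content & words(source)) / len(content)

doc = "The model was trained on 1.4 trillion tokens of filtered web text."
print(support_score("The model was trained on 1.4 trillion tokens.", doc))  # -> 1.0
print(support_score("The model cost two billion dollars.", doc))            # -> 0.2
```

A grounding benchmark penalizes the second claim even if it happens to be true, because it is not supported by the provided context; a broad factuality benchmark would instead check it against world knowledge.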

AI Psychometrics

AI Psychometrics: Evaluating the Psychological Reasoning of Large Language Models with Psychometric Validities applies psychometric methodology—originally developed for measuring human cognitive and psychological attributes—to evaluate LLMs. The paper distinguishes between tests of maximal performance (what can the model do when trying its best?) and tests of typical performance (what does the model actually do in naturalistic settings?), arguing that most LLM benchmarks measure only the former.

Key findings relevant to safety:

  • Models show substantial divergence between their maximum capability and their typical behavior on the same tasks, suggesting benchmark scores may overstate practical reliability
  • Psychometric validity concepts—including construct validity, convergent validity, and discriminant validity—provide a principled framework for evaluating whether LLM benchmarks actually measure the constructs they claim to measure
  • Standard LLM benchmarks often lack discriminant validity: scores on ostensibly different benchmarks correlate highly, suggesting they may be measuring a common underlying factor rather than distinct capabilities

Jailbreak Scaling Analysis

Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models examines how jailbreak attack success rates change as both attacker and defender scale. Key findings:

  • Attack success rates do not consistently decrease as model size increases; some attack categories become more effective against larger models
  • Automated jailbreak generation scales approximately as efficiently as model capability, suggesting the safety-capability gap may not close automatically with scale
  • The paper identifies attack categories that are particularly resistant to safety fine-tuning, including multi-turn attacks that gradually shift model behavior across a conversation

This work is relevant to understanding whether responsible scaling policies based on capability thresholds can adequately capture safety improvements, or whether adversarial robustness requires additional targeted evaluation.

SemBench and Universal Semantic Evaluation

SemBench: A Universal Semantic Framework for LLM Evaluation proposes a framework for evaluating whether LLMs understand meaning in a semantically consistent way, rather than relying on surface form matches. The benchmark tests whether models can recognize paraphrases, semantic equivalences, and entailments across diverse phrasings—capturing a form of understanding that standard task-specific benchmarks may miss.

CLASP: Defense Against Hidden State Poisoning

CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks addresses a specific attack surface in LLM deployment: adversarial manipulation of model hidden states through carefully crafted inputs. CLASP proposes detection and defense mechanisms for scenarios where an attacker can influence the model's internal representations, which is particularly relevant for hybrid deployment architectures where models process inputs from multiple sources with varying trust levels.

Gemini 2.0 and the Agentic Era

Google DeepMind's Gemini 2.0 launch in December 2024 marked an explicit strategic pivot from general-purpose language models toward Agentic AI systems capable of multi-step autonomous task execution. The announcement positioned Gemini 2.0 as "our new AI model for the agentic era," signaling that agentic capabilities had become a primary design goal rather than an emergent side effect.

Gemini 2.0 Flash

Gemini 2.0 Flash was the first production-ready model in the 2.0 family, made available to all users in February 2025. Key technical features:

| Capability | Description | Agentic Relevance |
|---|---|---|
| Native image generation | Generates images directly without a separate model | Enables single-model multimodal task completion |
| Native audio output | Generates audio responses directly | Supports voice-based agentic interactions |
| Computer use | Can interact with web browsers and desktop applications | Direct environment manipulation |
| Real-time API | Low-latency streaming for interactive applications | Enables responsive agentic loops |
| Tool use | Structured function calling across multiple APIs | Multi-system orchestration |

The native multimodal generation capabilities in Gemini 2.0 Flash—where the model generates images and audio as part of a unified response rather than calling separate specialized models—represented a design shift with implications for agentic deployment. Prior systems required orchestration layers to coordinate between specialized text, image, and audio models; native generation simplified the architecture for agentic pipelines.

From a safety perspective, Gemini 2.0's agentic positioning raised concerns that required explicit treatment in Google DeepMind's safety documentation:

  • Computer use capabilities enable the model to take actions in the real world (browsing, form submission, application interaction) with limited reversibility
  • Multi-step task execution increases the potential for errors to compound before human review
  • Real-time audio interaction reduces the latency available for content filtering

Google's announcement post, Gemini 2.0 is now available to everyone, described safety testing specific to agentic use cases, including evaluations of whether the model would take irreversible actions without appropriate confirmation and whether it would resist prompt injection attacks from adversarial content in browsed web pages.

Significance for the Gemini 2.5 Family

Gemini 2.0's agentic infrastructure—particularly the real-time API, native multimodal generation, and computer use capabilities—provided the foundation on which Gemini 2.5 Pro and Flash were built. The 2.5 series added chain-of-thought reasoning (the "thinking models" capability) and extended context to 1 million tokens, but the agentic deployment architecture established with 2.0 remained substantially intact. This lineage is relevant for understanding why Gemini 2.5 models are frequently benchmarked on agentic tasks (WebArena, SWE-bench) rather than purely on question-answering or generation metrics.

Major 2025 Model Releases

GPT-5 and the GPT-5 Family

GPT-5, released in August 2025, represents OpenAI's integration of reasoning and general capability into a single unified model, replacing the separate GPT-4o and o1 product lines. According to the GPT-5 System Card, GPT-5 achieves the highest scores to date on GPQA Diamond, SWE-bench Verified, and MMLU among OpenAI models at the time of release.

Key developments in the GPT-5 family:

  • GPT-5.1: Released for developers in late 2025, optimized for conversational applications and described as "smarter, more conversational" with improved instruction-following. Cursor and Tolan reported substantial gains in their agentic pipelines.
  • GPT-5.2: A subsequent release in late 2025 adding deeper thinking modes, reportedly achieving 93.2% on GPQA Diamond and 90.5% on ARC.
  • GPT-5.3-Codex-Spark: Positioned as a coding-specialized variant within the GPT-5.x family, targeting agentic software development workflows.
  • gpt-oss-120b and gpt-oss-20b: Open-weight models released by OpenAI, with model cards published for both sizes. These represent a shift toward open-weight strategy alongside closed frontier models.
  • GPT Realtime API: Extended with real-time audio dialog capabilities, building on the GPT-4o voice capabilities introduced in Hello GPT-4o (2024).

The GPT-5 System Card documents safety evaluations including assessments of CBRN uplift risk, cyberoffense capabilities, and persuasion. An Addendum on Sensitive Conversations addresses handling of mental health, self-harm, and politically contentious topics, noting both improvements in refusal precision and remaining cases where the model provides responses that require strengthening.

gpt-oss and the Open-Weight Strategy

The gpt-oss model family—comprising gpt-oss-20b and gpt-oss-120b—represents OpenAI's entry into the open-weight model space. This was a notable strategic shift for an organization that had previously released only API-accessible models. The gpt-oss model card documents intended use cases, known limitations, and safety evaluations for both parameter sizes.

A companion release was gpt-oss-safeguard, a smaller model designed specifically for content safety classification, with a technical report providing detailed evaluation methodology. gpt-oss-safeguard is intended to be deployable locally alongside gpt-oss models, enabling operators to run safety classification without API dependency—a design choice that addresses privacy and latency constraints in enterprise deployment but also raises questions about whether safety models deployed without oversight updates will remain effective as threat landscapes evolve.

Key safety implications of OpenAI's open-weight release:

  • Open weights enable mechanistic interpretability research that would otherwise require API-level access
  • Fine-tuning safety guardrails out of open-weight models remains tractable for technically sophisticated users, a concern acknowledged in the gpt-oss model card
  • Governance frameworks targeting API access do not apply to locally deployed open-weight models

Economic Impacts Research

OpenAI's economic impacts research program, described in Economic Impacts Research at OpenAI, examines how LLM deployment affects labor markets, productivity, and economic distribution. Key themes from this research:

  • LLMs disproportionately affect tasks involving information processing, writing, and analysis—including tasks previously considered high-skill
  • Productivity effects are heterogeneous: users who effectively leverage LLM assistance show large gains; those who do not may face competitive disadvantage
  • The research acknowledges uncertainty about whether productivity gains translate to wage increases or are captured primarily by firms and shareholders
  • Studies of LLM adoption in specific domains (customer service, software development, medical documentation) show consistent task completion speed improvements but variable effects on output quality

This research is relevant to the economic disruption risk dimension because it provides empirical grounding for displacement and productivity claims that are often made without supporting data.

Gemini 2.5 Family

Google DeepMind's Gemini 2.5 family, released across March–June 2025, introduced several models:

| Model | Key Feature | Context Window | Primary Use Case |
|---|---|---|---|
| Gemini 2.5 Pro | Highest capability, coding-focused | 1M tokens | Complex reasoning, coding |
| Gemini 2.5 Flash | Speed-optimized frontier | 1M tokens | Scaled production use |
| Gemini 2.5 Flash-Lite | Cost/latency optimized | 1M tokens | High-volume inference |

Gemini 2.5: Our Most Intelligent AI Model describes the Pro variant as achieving leading performance on coding benchmarks including LiveCodeBench and outperforming GPT-4o on MMLU at release. Gemini 2.5 Flash-Lite was made production-ready in June 2025 with a focus on throughput-sensitive applications.

The 2.5 family also introduced native thinking models—models that produce explicit chain-of-thought reasoning before answering—across both Pro and Flash tiers. Advanced audio dialog and generation capabilities were extended across the Gemini 2.5 family.

Gemma 3 and Open-Weight Models

Google's Gemma 3 family, released in 2025, provides open-weight models ranging from 1B to 27B parameters optimized for single-accelerator deployment. The Gemma 3 270M variant targets edge and mobile deployment. Gemma 3n introduced a mobile-first architecture with selective parameter activation for on-device inference.

MedGemma, released in mid-2025, provides open health-specific models demonstrating LLM application in clinical reasoning. T5Gemma introduced encoder-decoder variants of the Gemma architecture, enabling use cases where separate encoding and decoding is beneficial (e.g., retrieval-augmented generation, classification tasks).

Llama 4 and the Open-Weight Ecosystem

Meta's Llama 4 Herd, announced at LlamaCon in April 2025, represents a shift to natively multimodal architecture using a mixture-of-experts design. Llama 4 Scout and Llama 4 Maverick support image, video, and text inputs from the base model level. Meta reported over 10x growth in Llama usage since 2023, with the model family becoming a reference implementation for open-weight AI development.

Key safety implications of the open-weight ecosystem:

  • Fine-tuning safety guardrails out of open-weight models remains tractable for technically sophisticated users
  • Mechanistic interpretability research benefits from open weights (e.g., Gemma Scope 2, described below)
  • Governance frameworks targeting API access do not apply to locally deployed open-weight models

Differential Privacy and VaultGemma

VaultGemma: The World's Most Capable Differentially Private LLM addresses a constraint on LLM deployment in privacy-sensitive domains: the need for formal privacy guarantees during fine-tuning on user data. Standard LLM fine-tuning risks memorizing and later revealing sensitive training data; differentially private fine-tuning adds formal guarantees that individual training examples cannot be inferred from model outputs.

VaultGemma demonstrates that differentially private fine-tuning can produce capable models—competitive with non-private baselines on several benchmarks—at moderate privacy budget costs. This work is relevant to healthcare, legal, and financial deployments where data protection requirements preclude standard fine-tuning approaches.
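VaultGemma's training details aside, the core mechanism of differentially private training (DP-SGD) is per-example gradient clipping followed by calibrated Gaussian noise on the aggregate. A toy sketch of one aggregation step, with invented gradients and without the privacy-accounting machinery a real implementation requires:

```python
import math
import random

random.seed(0)

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD aggregation step: clip each example's gradient to
    clip_norm, sum, add Gaussian noise scaled to the clipping bound,
    and average. Clipping bounds any single example's influence."""
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        clipped.append([x * scale for x in g])
    dim = len(per_example_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + random.gauss(0.0, sigma) for s in summed]
    n = len(per_example_grads)
    return [x / n for x in noisy]

# Invented per-example gradients; the first and third exceed the clip norm.
grads = [[3.0, 4.0], [0.1, -0.2], [-5.0, 12.0]]
update = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.5)
print(update)
```

The noise multiplier and clip norm jointly determine the privacy budget; choosing them is where the capability/privacy trade-off that VaultGemma studies actually lives.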

Mechanistic Interpretability Advances

Mechanistic interpretability research—which seeks to understand the internal computations of neural networks in human-interpretable terms—has accelerated substantially with the availability of open-weight models and new tooling.

Gemma Scope 2

Gemma Scope 2 is a suite of sparse autoencoders (SAEs) trained on Gemma 3 models, released by Google DeepMind to support interpretability research on complex language model behaviors. Building on the original Gemma Scope release for Gemma 2, Gemma Scope 2 provides SAEs at multiple layers and widths, enabling decomposition of model activations into human-interpretable features.

Gemma Scope 2 supports research into:

  • Feature geometry and polysemanticity in larger models
  • Cross-layer feature interactions and information flow
  • Identification of features relevant to safety-relevant behaviors (deception, refusal, sycophancy)
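The basic SAE operation that tools like Gemma Scope expose, decomposing an activation vector into a small number of active features and reconstructing it, can be sketched in miniature. The weights and sizes here are random toy stand-ins, not Gemma Scope parameters, and real SAEs use thousands of features per layer:

```python
import random

random.seed(0)

D_MODEL, N_FEATURES, TOP_K = 4, 8, 2  # toy sizes

# Random toy weights standing in for trained encoder/decoder matrices.
W_enc = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(N_FEATURES)]
W_dec = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(D_MODEL)]

def sae_decompose(activation, top_k=TOP_K):
    """Encode a residual-stream activation into sparse feature
    intensities (ReLU, then keep only the top_k), then reconstruct
    the activation from the surviving features."""
    pre = [sum(w * a for w, a in zip(row, activation)) for row in W_enc]
    acts = [max(0.0, p) for p in pre]  # ReLU nonlinearity
    keep = sorted(range(N_FEATURES), key=lambda i: acts[i], reverse=True)[:top_k]
    sparse = [acts[i] if i in keep else 0.0 for i in range(N_FEATURES)]
    recon = [sum(W_dec[d][i] * sparse[i] for i in range(N_FEATURES))
             for d in range(D_MODEL)]
    return sparse, recon

features, reconstruction = sae_decompose([0.5, -1.0, 0.3, 0.8])
print("active features:", [i for i, f in enumerate(features) if f > 0])
```

Interpretability work then labels each feature by inspecting the inputs that most strongly activate it.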

Language Models Explaining Neurons

Language Models Can Explain Neurons in Language Models (Bills et al., 2023) demonstrated that GPT-4 can generate natural language explanations of GPT-2 neurons with higher validity than human-written explanations. This opened a scalable pathway for automated interpretability: using more capable models to explain less capable ones. Subsequent work has extended this to sparse autoencoder features and cross-model explanation transfer.
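The evaluation loop in that line of work scores an explanation by how well activations simulated from the explanation alone correlate with the neuron's observed activations. A toy version of the scoring step, with invented activation values:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Observed activations of one neuron on a handful of tokens (toy numbers).
tokens   = ["Paris", "is", "the", "capital", "of", "France", "banana"]
observed = [0.9, 0.1, 0.0, 0.7, 0.1, 0.95, 0.05]

# A hypothetical explainer model proposes: "fires on place names".
# A simulator model predicts activations from that explanation alone.
simulated = [1.0, 0.0, 0.0, 0.3, 0.0, 1.0, 0.0]

score = pearson(observed, simulated)
print(f"explanation score: {score:.2f}")
```

High correlation means the natural-language explanation captures what actually drives the neuron; low correlation sends the explainer back for another attempt.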

Activation Steering and Causal Intervention

Activation steering—injecting vectors into model residual streams to steer behavior—has become a primary tool for behavioral intervention research. Recent work has refined understanding of when and why steering succeeds.
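At its simplest, the intervention adds a scaled direction vector to the residual-stream activation at one chosen layer. A minimal sketch with toy dimensions; the "honest minus deceptive" direction is a hypothetical example of how such vectors are commonly derived, not a claim about any specific model:

```python
def steer(residual_stream, steering_vector, alpha=4.0):
    """Add a scaled steering vector to every position's residual-stream
    activation (the intervention applied at one chosen layer)."""
    return [[h + alpha * s for h, s in zip(pos, steering_vector)]
            for pos in residual_stream]

# Toy 3-token sequence with 4-dimensional hidden states.
hidden = [[0.2, -0.1, 0.0, 0.5],
          [0.1, 0.3, -0.2, 0.0],
          [0.0, 0.0, 0.4, -0.3]]

# A direction hypothetically extracted as the difference of mean
# activations on contrasting prompts (e.g. "honest" minus "deceptive").
direction = [0.5, 0.0, -0.5, 0.0]

steered = steer(hidden, direction, alpha=2.0)
print(steered[0])
```

The scaling factor alpha controls intervention strength; too small has no behavioral effect, too large degrades coherence, and characterizing that window is part of what recent steering research studies.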

Research Platform Value of Open-Weight Models

| Research Application | Open-Weight Benefit | Example |
|---|---|---|
| Mechanistic interpretability | Full activation access | Gemma Scope 2 on Gemma 3 |
| SAE training | Weight access for feature analysis | Gemma Scope, TranscoderLens |
| Activation steering | Residual stream intervention | Multiple labs using Llama |
| Fine-tuning safety | Rapid iteration | Constitutional AI variants |
| Neuron explanation | Cross-model explanation transfer | Bills et al. 2023 |

Hallucination and Factuality

LLM hallucination—the generation of plausible-sounding but factually incorrect or unsupported content—remains a central reliability and safety challenge. Hallucination rates vary substantially by task, model, and measurement methodology, with published estimates ranging from roughly 8% to 45% depending on benchmark design and model version.

Factuality Benchmarking

| Benchmark | Scope | Key Finding | Link |
|---|---|---|---|
| TruthfulQA | Questions where correct answers contradict popular misconceptions | Best models ≈58% truthful vs. 94% human baseline | TruthfulQA |
| FACTS Grounding | Long-form factuality against source documents | Measures supported vs. unsupported claims | FACTS Grounding |
| FACTS Benchmark Suite | Systematic factuality across task types | Decomposes factuality failures by error type | FACTS Suite |
| Vectara Hallucination Leaderboard | Summarization hallucination | Best models: ≈8% hallucination rate | Vectara |
| HalluLens | Factual query hallucination | Up to 45% on factual queries for GPT-4o | HalluLens (ACL 2025) |
| CheckIfExist | Citation hallucination in AI-generated content | Detects fabricated citations in RAG systems | CheckIfExist (2025) |

The FACTS Grounding benchmark introduced by Google DeepMind specifically addresses the challenge of evaluating long-form generation against reference documents, distinguishing between claims grounded in provided source material and claims introduced without basis. This is particularly relevant for retrieval-augmented generation (RAG) systems, where the model has access to retrieved context but may still generate unsupported claims.

CheckIfExist (2025) addresses citation hallucination specifically—the generation of plausible-looking but non-existent citations, which poses particular risks in legal, medical, and academic contexts. The benchmark finds that citation hallucination rates remain substantial even for frontier models, and that RAG systems can still hallucinate citations from retrieved documents by misattributing or confabulating specific reference details.
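A crude lexical version of grounding checking, flagging response sentences whose content words are mostly absent from the source, illustrates the shape of the problem. Benchmarks like FACTS Grounding use model-based entailment judgments rather than word overlap; everything below (stopword list, threshold, example texts) is an invented simplification:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "and", "to"}

def content_words(text):
    """Lowercased alphabetic tokens minus a tiny stopword list."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def flag_ungrounded(response_sentences, source, threshold=0.6):
    """Flag sentences whose content words are mostly absent from the
    source document: a crude lexical stand-in for entailment checking."""
    src = content_words(source)
    flags = []
    for sent in response_sentences:
        words = content_words(sent)
        overlap = len(words & src) / len(words) if words else 1.0
        flags.append(overlap < threshold)
    return flags

source = "The report covers revenue growth of 12 percent in the third quarter."
response = [
    "Revenue grew in the third quarter.",
    "The CEO announced a merger with a competitor.",  # unsupported by source
]
print(flag_ungrounded(response, source))
```

The limits of the lexical heuristic are exactly why model-based graders are needed: a paraphrase with low word overlap would be wrongly flagged, and a fabricated claim reusing source vocabulary would slip through.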

Why Models Hallucinate

OpenAI's explainer on why language models hallucinate identifies several contributing mechanisms:

  1. Training objective mismatch: Next-token prediction rewards coherent text, not factual accuracy
  2. Knowledge compression: Models must compress world knowledge into fixed-weight representations, leading to lossy encoding
  3. Context-weight tension: Models may blend retrieved context with parametric knowledge, producing hybrid outputs that are faithful to neither
  4. Sycophancy pressure: RLHF can train models to confirm user beliefs rather than correct factual errors, since raters may prefer agreeable responses
  5. Calibration failure: Models often express high confidence in incorrect claims, reducing the signal value of expressed uncertainty
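Mechanism 5 is commonly quantified with expected calibration error (ECE), which bins predictions by stated confidence and compares each bin's mean confidence against its empirical accuracy. A sketch with invented data in which the model claims roughly 90% confidence but is right only half the time:

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Bin predictions by stated confidence and accumulate the
    weighted gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Invented data: overconfident answers, half of them wrong.
conf = [0.9, 0.92, 0.88, 0.95, 0.91, 0.9]
hit  = [1, 0, 1, 0, 0, 1]
ece_value = expected_calibration_error(conf, hit)
print(f"ECE = {ece_value:.2f}")
```

A well-calibrated model would have ECE near zero; the large gap here is the "calibration failure" pattern, where expressed confidence carries little information about correctness.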

Hallucination rates are not monotonically improved by scale: models engineered for massive context windows do not consistently achieve lower hallucination rates than smaller counterparts, and increased model size can increase the confidence of hallucinated outputs without reducing their frequency. This suggests hallucination is partly an architectural and training-objective issue rather than purely a capacity limitation.

The TruthfulQA benchmark specifically documented that larger models can perform worse on questions where the correct answer contradicts common human misconceptions, because larger models more fluently reproduce those misconceptions. This counterintuitive finding challenges the assumption that capability improvements automatically translate to truthfulness improvements.

Compression and Consistency

Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information analyzes why LLMs, when trained on corpora containing both true and false statements, tend to preferentially reproduce consistent patterns rather than factually accurate ones. The key finding is that information compression during training rewards statements that co-occur consistently with many other statements—a property that correlates with truth for facts with strong evidential support, but diverges from truth for contested claims, common misconceptions, or facts that appear rarely in training data.

This theoretical framing helps explain why TruthfulQA finds that larger models more fluently reproduce misconceptions: greater model capacity allows more faithful compression of co-occurrence statistics, including the statistics of false-but-widespread beliefs.
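The co-occurrence mechanism can be seen in a toy corpus: simple bigram counts, a crude stand-in for the statistics next-token training compresses, assign higher conditional probability to a frequent misconception than to its rarer correction. The corpus is invented for illustration:

```python
from collections import Counter

# Toy corpus: a misconception appears three times, the correction once.
corpus = [
    "sugar makes children hyper",
    "sugar makes children hyper",
    "sugar makes children hyper",
    "sugar does not make children hyper",
]

# Count adjacent word pairs across the corpus.
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for a, b in zip(tokens, tokens[1:]):
        bigrams[(a, b)] += 1

# Conditional next-token distribution after "sugar".
after_sugar = {b: c for (a, b), c in bigrams.items() if a == "sugar"}
p_makes = after_sugar["makes"] / sum(after_sugar.values())
print(f"P(next='makes' | 'sugar') = {p_makes:.2f}")
```

Faithfully compressing these counts reproduces the widespread false claim most of the time, which is the consistency-over-truth dynamic the paper formalizes.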

Deception and Truthfulness

| Behavior Type | Frequency | Context | Mitigation |
|---|---|---|---|
| Hallucination | 8–45% | Varies by task and model | Training improvements, RAG |
| Citation hallucination | Reportedly ≈17% | Legal, academic domains | CheckIfExist detection systems |
| Role-play deception | High | Prompted scenarios | Safety fine-tuning |
| Sycophancy | Moderate | Opinion questions | Constitutional AI, RLHF adjustment |
| Strategic deception | Low–moderate | Evaluation scenarios | Ongoing research |

According to some sources, even recent frontier models have hallucination rates exceeding 15% when asked to analyze provided statements, and approximately 1 in 6 AI responses in legal contexts reportedly contain citation hallucinations. The wide variance across benchmarks reflects both genuine model differences and definitional variation in what constitutes a hallucination.

Safety-Relevant Positive Capabilities

Interpretability Research Platform

| Research Area | Progress Level | Key Findings | Organizations |
|---|---|---|---|
| Attention visualization | Advanced | Knowledge storage patterns | Anthropic, OpenAI |
| Activation patching | Moderate | Causal intervention methods | Redwood Research |
| Sparse autoencoders | Advancing | Feature decomposition in large models | Anthropic, Google DeepMind (Gemma Scope) |
| Neuron explanation | Moderate | LM-explained neurons via GPT-4 | Bills et al. 2023 |
| Mechanistic understanding | Early–moderate | Transformer circuits | Anthropic Interpretability |

Constitutional AI and Value Learning

Anthropic's Constitutional AI demonstrates approaches to value alignment through self-critique, principle-following, and preference learning. The specific success rates for these techniques vary considerably across evaluations, tasks, and model versions; the figures below represent approximate reported ranges rather than settled benchmarks, and should be treated with caution.

| Technique | Reported Range | Application | Limitations |
|---|---|---|---|
| Self-critique | ≈70–85% (approximate) | Harmful content reduction | Requires good initial training |
| Principle following | ≈60–80% (approximate) | Consistent value application | Vulnerable to gaming |
| Preference learning | ≈65–75% (approximate) | Human value approximation | Distributional robustness |

Scalable Oversight Applications

Modern LLMs enable several approaches to AI safety through automated oversight:

  • Output evaluation: AI systems critiquing other AI outputs. Agreement rates with human evaluators vary substantially by task and domain; no single well-characterized aggregate figure covers the literature.
  • Red-teaming: Automated discovery of failure modes and adversarial inputs.
  • Safety monitoring: Real-time analysis of AI system behavior patterns.
  • Research acceleration: AI-assisted safety research and experimental design.
  • Content moderation: OpenAI has published work on using GPT-4 for content moderation, describing how LLM-based moderation can operate at scale and reduce human labeler exposure to harmful content.
  • Summarization oversight: Work on summarizing books with human feedback established recursive decomposition as a technique for extending human oversight to tasks too long for direct evaluation, contributing to the scalable oversight research agenda.
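Judge-human agreement in output evaluation is more informative when chance-corrected, typically with a statistic such as Cohen's kappa, since two labelers who mostly say "acceptable" agree often by chance alone. A sketch with invented binary labels:

```python
def cohens_kappa(judge_a, judge_b):
    """Chance-corrected agreement between two binary labelers."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_a1 = sum(judge_a) / n
    p_b1 = sum(judge_b) / n
    # Probability both say 1 by chance plus both say 0 by chance.
    chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - chance) / (1 - chance) if chance < 1 else 1.0

# Invented labels: 1 = "output acceptable", from a model judge and a human.
model_judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_judge = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
kappa = cohens_kappa(model_judge, human_judge)
print(f"kappa = {kappa:.2f}")
```

Here raw agreement is 80% but kappa is roughly 0.52, a reminder that headline agreement rates for AI judges overstate the information they add on skewed label distributions.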

Concerning Capabilities Assessment

Persuasion and Manipulation

Modern LLMs demonstrate persuasion capabilities that raise questions for democratic discourse and individual autonomy:

| Capability | Current State | Evidence | Risk Level |
|---|---|---|---|
| Audience adaptation | Advanced | Anthropic persuasion research | High |
| Persona consistency | Advanced | Extended roleplay studies | High |
| Emotional manipulation | Moderate | RLHF alignment research | Moderate |
| Debate performance | Advanced | Human preference studies | High |

According to some sources, frontier LLMs can substantially increase human agreement rates through targeted persuasion techniques, raising concerns about consensus manufacturing; however, the specific figure of 82% attributed to Anthropic research has not been independently verified against a primary source and should be treated with caution. The ARC Evaluations (2025) report specifically evaluates persuasion and resistance capabilities in LLMs through adversarial resource extraction games, finding that frontier models can maintain persuasive pressure across extended multi-turn interactions.

Research on political alignment inference (LLMs Can Infer Political Alignment from Online Conversations) finds that frontier models can reliably identify political alignment from text with accuracy exceeding human raters in controlled settings. This capability has dual-use implications: it enables personalized political persuasion but also could be used for detection of coordinated influence campaigns.

Multilingual Safety Gaps

A consistent finding across the research literature is that safety alignment applied primarily in English does not reliably transfer to other languages. Helpful to a Fault (2025) measured illicit assistance rates for multi-turn, multilingual LLM agents across 40+ languages, finding that:

  • Models provide meaningfully more potentially harmful assistance in non-English languages
  • The gap is largest for low-resource languages with sparse alignment training data
  • Multi-turn interactions elicit more harmful assistance than single-turn, with each turn potentially eroding earlier refusals

This suggests that multilingual deployment of LLMs creates safety gaps that single-language evaluation would miss, a concern reportedly noted in the GPT-5 System Card.
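The per-language measurement used in such studies reduces to aggregating refusal rates by language and comparing the gaps. A toy sketch with invented records, where "sw" stands in for a low-resource language:

```python
from collections import defaultdict

# Invented per-interaction records: (language, model refused harmful request).
records = [
    ("en", True), ("en", True), ("en", True), ("en", False),
    ("sw", True), ("sw", False), ("sw", False), ("sw", False),
]

def refusal_rate_by_language(rows):
    """Aggregate refusal rates per language to expose alignment gaps."""
    counts = defaultdict(lambda: [0, 0])  # language -> [refusals, total]
    for lang, refused in rows:
        counts[lang][0] += refused
        counts[lang][1] += 1
    return {lang: r / t for lang, (r, t) in counts.items()}

rates = refusal_rate_by_language(records)
gap = rates["en"] - rates["sw"]
print(rates, f"gap = {gap:.2f}")
```

A real evaluation would additionally track turn number per interaction, since the multi-turn erosion effect means single-turn refusal rates understate the gap.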

Cybersecurity Capabilities and Refusal Frameworks

LLMs' code generation and vulnerability analysis capabilities create dual-use risks in cybersecurity:

  • Models can reportedly assist with vulnerability identification, exploit development, and attack planning at varying levels depending on the specificity of the request
  • A Content-Based Framework for Cybersecurity


Related Pages

Top Related Pages

Risks

Reward HackingEmergent Capabilities

Approaches

AI AlignmentAI Governance Coordination Technologies

Policy

EU AI ActAI Standards Development

Safety Research

Scalable Oversight

Concepts

AGI Timeline

Other

Philip Tetlock (Forecasting Pioneer)Robin HansonGPT-4GPT-4oARC-AGIGPQA Diamond

Analysis

AI Uplift Assessment ModelDeceptive Alignment Decomposition ModelWikipedia ViewsSquiggleAI

Key Debates

AI Alignment Research Agendas

Historical

Deep Learning Revolution Era