Large Language Models
Comprehensive analysis of LLM capabilities showing rapid progress from GPT-2 (1.5B parameters, 2019) to GPT-5 and Gemini 2.5 (2025), with training costs growing 2.4x annually and projected to exceed \$1B by 2027. Documents emergence of inference-time scaling paradigm, mechanistic interpretability advances including Gemma Scope 2, multilingual alignment research, factuality benchmarking via FACTS suite, and identifies key safety concerns including 8-45% hallucination rates, persuasion capabilities, and growing autonomous agent capabilities.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Capability Level | Near-human to superhuman on structured tasks | o3-mini achieves 87.5% on ARC-AGI (human baseline ≈85%); 87.7% on GPQA Diamond |
| Progress Rate | 2–3× capability improvement per year | Stanford AI Index 2025[^2]: benchmark scores rose 18–67 percentage points in one year |
| Training Cost Trend | 2.4× annual growth | Epoch AI: frontier model training costs projected to exceed $1B by 2027 |
| Inference Cost Trend | ≈280× reduction (Nov 2022–Oct 2024) | AI Index 2025: GPT-3.5-equivalent inference dropped from $20 to $0.07 per million tokens |
| Hallucination Rates | 8–45% depending on task | Vectara leaderboard: best models reportedly at ≈8%; HalluLens: up to 45% on factual queries |
| Safety Maturity | Moderate | Constitutional AI, RLHF established; Responsible Scaling Policies implemented by major labs |
| Open-Closed Gap | Narrowing | Gap reportedly shrank from 8.04% to 1.70% on Chatbot Arena (Jan 2024 → Feb 2025) |
Overview
Large Language Models (LLMs) are transformer-based neural networks trained on vast text corpora using next-token prediction. Despite their deceptively simple training objective, LLMs exhibit sophisticated emergent capabilities including reasoning, coding, scientific analysis, and complex task execution. These models have transformed abstract AI safety discussions into concrete, immediate concerns while providing the clearest path toward artificial general intelligence.
The core insight underlying LLMs is that training a model to predict the next word in a sequence—a task achievable without labeled data—produces internal representations useful for a wide range of downstream tasks. This approach was explored in early unsupervised pretraining work from OpenAI in 2018. OpenAI's GPT-2 (2019) then demonstrated coherent multi-paragraph generation at scale, showing that larger models trained on more data produced qualitatively stronger outputs. An earlier indicator that such models form interpretable internal structure was the discovery of an "unsupervised sentiment neuron" in 2017, which emerged without any sentiment-specific supervision.
A foundational development alongside raw scale was learning to align model outputs to human preferences. The GPT-2 paper itself acknowledged concerns about potential misuse of fluent generation, leading OpenAI to initially withhold the full 1.5B-parameter model and conduct a staged release to study harms before broader deployment. Subsequent work on instruction-following alignment—culminating in InstructGPT—demonstrated that fine-tuning with human feedback could substantially improve model usefulness and reduce harmful outputs without proportional increases in model size. The complementary technique of RLHF applied to summarization tasks, and later to instruction-following, established the training paradigm that underlies most current aligned frontier models.
Current frontier models—including GPT-4o, Claude Opus 4.5, Gemini 2.5 Pro, and Llama 4—demonstrate near-human or superhuman performance across diverse cognitive domains. Training runs for leading frontier systems reportedly consume hundreds of millions of dollars, and model parameter counts have reached into the hundreds of billions to trillions. These substantial computational investments have shifted AI safety from theoretical to practical urgency. The late 2024–2025 period marked a paradigm shift toward inference-time compute scaling, with reasoning models such as OpenAI's o1 and o3 achieving higher performance on reasoning benchmarks by allocating more compute at inference rather than only at training time.
A parallel development is the rapid growth of the open-weight ecosystem. Meta's Llama family has grown substantially since its initial release, with Meta reporting over a billion downloads and more than ten times the developer activity compared to 2023. Google's Gemma models—including Gemma 3 and the mobile-first Gemma 3n variants—have provided the safety research community with accessible architectures for mechanistic interpretability work. This open/closed model convergence has implications for both capability diffusion and the tractability of safety interventions.
Capability Architecture
The diagram below maps the flow from training through inference to observed capabilities. Capabilities are grouped by which inference regime produces them; the distinction between "standard" and "search-augmented" inference is a key axis along which safety-relevant behaviors (extended planning, autonomous task execution) emerge.
Note: Capabilities in the "Emergent Capabilities" subgraph are descriptive, not evaluative. Safety implications of each capability class are discussed in the Concerning Capabilities Assessment and Safety-Relevant Positive Capabilities sections below.
Risk Assessment
| Risk Category | Severity | Likelihood | Timeline | Trend |
|---|---|---|---|---|
| Deceptive Capabilities | High | Moderate | 1-3 years | Increasing |
| Persuasion & Manipulation | High | High | Current | Accelerating |
| Autonomous Cyber Operations | Moderate-High | Moderate | 2-4 years | Increasing |
| Scientific Research Acceleration | Mixed | High | Current | Accelerating |
| Economic Disruption | High | High | 2-5 years | Accelerating |
Capability Progression Timeline
| Model | Release | Parameters | Key Breakthrough | Performance Milestone |
|---|---|---|---|---|
| GPT-2 | Feb 2019 | 1.5B | Coherent text generation | Initially withheld for safety concerns; full 1.5B release Nov 2019 |
| GPT-3 | Jun 2020 | 175B | Few-shot learning emergence | Creative writing, basic coding |
| Codex | Aug 2021 | 12B (paper) | Code generation from natural language | Powered early GitHub Copilot; evaluated on HumanEval benchmark |
| InstructGPT | Jan 2022 | 1.3B–175B | Instruction-following via RLHF | Preferred by labelers over 175B GPT-3 despite smaller size |
| GPT-4 | Mar 2023 | Undisclosed | Multimodal reasoning | Reportedly ≈90th percentile SAT, bar exam passing |
| GPT-4o | May 2024 | Undisclosed | Multimodal speed/cost | Real-time audio-visual, described as 2x faster than GPT-4 Turbo |
| Claude 3.5 Sonnet | Jun 2024 | Undisclosed | Advanced tool use | 86.5% MMLU, leading SWE-bench at release |
| o1 | Sep 2024 | Undisclosed | Chain-of-thought reasoning | 77.3% GPQA Diamond, 74% MATH |
| o3 | Dec 2024 | Undisclosed | Inference-time search | 87.7% GPQA Diamond, 91.6% AIME 2024 |
| Gemini 2.0 Flash | Feb 2025 | Undisclosed | Agentic multimodal capabilities | Native image generation, real-time audio; positioned as foundation for agentic era |
| Gemini 2.5 Pro | Mar 2025 | Undisclosed | Long-context reasoning | 1M-token context, leading coding benchmarks at release |
| Llama 4 | Apr 2025 | Undisclosed | Natively multimodal open-weight | Mixture-of-experts architecture |
| GPT-5 | May 2025 | Undisclosed | Unified reasoning + tool use | Highest reported scores at release on GPQA Diamond and SWE-bench |
| Claude Opus 4.5 | Reportedly Nov 2025 | Undisclosed | Extended reasoning | Reportedly 80.9% SWE-bench Verified |
| GPT-5.2 | Reportedly late 2025 | Undisclosed | Deep thinking modes | Reportedly 93.2% GPQA Diamond, 90.5% ARC |
Primary sources: OpenAI model announcements, Anthropic model cards, Google DeepMind blog.
Benchmark Performance Comparison (2024–2025)
| Benchmark | Measures | GPT-4o (2024) | o1 (2024) | o3 (2024) | Human Expert |
|---|---|---|---|---|---|
| GPQA Diamond | PhD-level science | ≈50% | 77.3% | 87.7% | ≈69.7% |
| AIME 2024 | Competition math | 13.4% | 74% | 91.6% | Top 500 US |
| MMLU | General knowledge | 84.2% | 90.8% | ≈92% | 89.8% |
| SWE-bench Verified | Real GitHub issues | 33.2% | 48.9% | 71.7% | N/A |
| ARC-AGI | Novel reasoning | ≈5% | 13.3% | 87.5% | ≈85% |
| Codeforces | Competitive coding | ≈11% | 89% (94th percentile) | 99.8th percentile | N/A |
Sources: OpenAI o1 system card, ARC Prize o3 analysis, Anthropic Claude 3.5 model card.
The o3 results represent a qualitative shift: o3 achieved nearly human-level performance on ARC-AGI (87.5%) versus a ~85% human baseline, a benchmark specifically designed to test general reasoning rather than pattern matching. On FrontierMath, o3 reportedly solved 25.2% of problems compared to o1's 2%—a roughly 12x improvement that suggests reasoning capabilities may be scaling faster than expected. However, on the harder ARC-AGI-2 benchmark, o3 scores only ~3% compared to ~60% for average humans, revealing significant limitations in truly novel reasoning tasks.
Scaling Laws and Predictable Progress
Core Scaling Relationships
Research by Kaplan et al. (2020), refined by Hoffmann et al. (2022), demonstrates robust mathematical relationships governing LLM performance:
| Factor | Scaling Law | Implication |
|---|---|---|
| Model Size | Loss ∝ N^−0.076 | 10x parameters → ≈16% lower loss |
| Training Data | Loss ∝ D^−0.095 | 10x data → ≈20% lower loss |
| Compute | Loss ∝ C^−0.050 | 10x compute → ≈11% lower loss |
| Optimal Ratio | N and D scaled together | Chinchilla: compute-optimal training uses ≈20 tokens per parameter |
Sources: Chinchilla paper; Scaling Laws for Neural Language Models
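Kaplan et al. state these relationships as loss scaling (loss falling as a power of each resource), so the exponents translate into modest gains per order of magnitude. A toy sketch, assuming loss ∝ resource^(−exponent) with the exponents from the table:

```python
# Toy illustration of Kaplan-style power laws. Only the exponents are
# taken from the scaling-law papers; the absolute loss scale is arbitrary.

def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Factor by which loss changes when a resource grows by scale_factor,
    assuming loss is proportional to resource^(-exponent)."""
    return scale_factor ** (-exponent)

# 10x more parameters (exponent 0.076): loss drops to ~84% of its value
print(round(loss_ratio(10, 0.076), 3))  # 0.839
# 10x more data (exponent 0.095): loss drops to ~80%
print(round(loss_ratio(10, 0.095), 3))  # 0.804
# 10x more compute (exponent 0.050): loss drops to ~89%
print(round(loss_ratio(10, 0.050), 3))  # 0.891
```

The small exponents are why frontier progress has required multiplying compute by orders of magnitude rather than incremental increases.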
According to Epoch AI research, approximately two-thirds of LLM performance improvements over the last decade are attributable to increases in model scale, with training techniques contributing roughly 0.4 orders of magnitude per year in compute efficiency. The cost of training frontier models has reportedly grown by 2.4x per year since 2016, and the cost of the largest training runs is projected to exceed $1B by 2027, according to Epoch AI cost analyses.
A related phenomenon is the emergence of new scaling regimes beyond training compute. Research published in 2025 finds that emergent capability thresholds are sensitive to random factors including data ordering and initialization seeds, suggesting that the apparent sharpness of emergence in aggregate curves may partly reflect averaging over many random runs with different thresholds rather than a clean phase transition.
The Shift to Inference-Time Scaling (2024–2025)
The o1 and o3 model families introduced a new paradigm: inference-time compute scaling. Rather than only scaling training compute, these models allocate additional computation at inference time through extended reasoning chains and search procedures.
| Scaling Type | Mechanism | Trade-off | Example |
|---|---|---|---|
| Pre-training scaling | More parameters, data, training compute | High upfront cost, fast inference | GPT-4, Claude 3.5 |
| Inference-time scaling | Longer reasoning chains, search | Lower training cost, expensive inference | o1, o3 |
| Combined scaling | Both approaches | Maximum capability, maximum cost | GPT-5, Claude Opus 4.5 |
This shift has implications for AI safety: inference-time scaling allows models to "think longer" on hard problems, potentially achieving strong performance on specific tasks while maintaining manageable training costs. According to some sources, o1 is approximately 6x more expensive and 30x slower than GPT-4o per query. The Stanford AI Index 2025 cites the RE-Bench evaluation finding that in short time-horizon settings (2-hour budget), top AI systems score 4x higher than human experts, but as the time budget increases to 32 hours, human performance surpasses AI by roughly 2 to 1.
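One simple inference-time scaling strategy is self-consistency: sample many reasoning chains and take a majority vote over final answers. The sketch below is illustrative only; `sample_answer` is a hypothetical stub (the internal mechanisms of o1 and o3 are undisclosed), and the point is merely that spending more samples per query buys accuracy:

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    """Stand-in for one stochastic model sample (hypothetical stub).
    This toy 'model' answers correctly 60% of the time."""
    return "correct" if rng.random() < 0.6 else "wrong"

def self_consistency(question: str, n_samples: int, seed: int = 0) -> str:
    """Majority vote over n samples: allocating more inference compute
    (larger n) raises the chance the plurality answer is correct."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# A single sample is right 60% of the time; a 101-sample majority vote
# is almost always right, at ~100x the inference cost per query.
print(self_consistency("hard question", n_samples=101))
```

This cost structure is why inference-time scaling trades cheap training for expensive queries, as in the o1 cost figures cited above.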
Emergent Capability Thresholds
| Capability | Emergence Scale | Evidence | Safety Relevance |
|---|---|---|---|
| Few-shot learning | ≈100B parameters | GPT-3 breakthrough | Tool use foundation |
| Chain-of-thought | ≈10B parameters | PaLM, GPT-3 variants | Complex reasoning |
| Code generation | ≈1B parameters | Codex, GitHub Copilot | Cyber capabilities |
| Instruction following | ≈10B parameters | InstructGPT | Human-AI interaction paradigm |
| PhD-level reasoning | o1+ scale | GPQA Diamond performance | Expert-level autonomy |
| Strategic planning | o3 scale | ARC-AGI performance | Deception potential |
Research documented in a 2025 emergent abilities survey finds that emergent abilities depend on multiple interacting factors: scaling up parameters or depth lowers the threshold for emergence but is neither necessary nor sufficient alone—data quality, diversity, training objectives, and architecture modifications also matter significantly. Emergence aligns more closely with pre-training loss landmarks than with sheer parameter count; smaller models can match larger ones if training loss is sufficiently reduced.
According to the Stanford AI Index 2025, benchmark performance improved substantially in a single year: scores rose by 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively. The gap between leading US and Chinese models also narrowed—from 17.5 to 0.3 percentage points on MMLU—over the same period.
Safety implication: As AI systems gain autonomous reasoning capabilities, they also develop behaviors relevant to safety evaluation, including goal persistence, strategic planning, and the capacity for deceptive alignment. OpenAI's o3-mini reportedly became the first AI model to receive a "Medium risk" classification for Model Autonomy under internal capability evaluation frameworks.
Foundational Safety Research and Alignment Techniques
GPT-2: Staged Release as Safety Practice
The GPT-2 release in 2019 marked the first occasion where an AI lab deliberately staged a major model release over several months specifically to study potential misuse. OpenAI released progressively larger versions of GPT-2 (117M, 345M, 762M, and finally 1.5B parameters) across February–November 2019, monitoring for evidence of misuse between each stage. The six-month review found no strong evidence of misuse attributable to the staged approach, and OpenAI acknowledged the difficulty of empirically evaluating whether staged release achieved its safety goals. This episode established the practice of conducting deliberate capability-safety reviews before broad model deployment, a practice later formalized in responsible scaling policies.
A central concern motivating the staged release was that models capable of coherent, long-form text generation at scale could be used for disinformation, spam, and other misuse at lower cost than previous methods. OpenAI's subsequent analysis noted that the question of whether to release a model cannot be separated from analysis of what specific harms it enables and at what marginal uplift over existing tools.
Instruction Following and InstructGPT
A critical development between GPT-3 and GPT-4 was the discovery that models fine-tuned to follow instructions using human feedback were substantially preferred by users over much larger models trained only on next-token prediction. OpenAI's Aligning Language Models to Follow Instructions (2022)—the InstructGPT paper—demonstrated that a 1.3B-parameter RLHF-tuned model was preferred over the 175B GPT-3 in head-to-head evaluations by human labelers across a range of tasks. This result was significant for safety because it showed that alignment training and capability were not in fundamental tension: smaller, better-aligned models could outperform larger unaligned ones.
InstructGPT also documented a key limitation of RLHF: the fine-tuned models showed measurable performance regression on standard NLP benchmarks (the "alignment tax"), which OpenAI partially mitigated by mixing pretraining data into the RLHF training. The paper identified sycophancy—models outputting what users want to hear rather than accurate information—as a persistent failure mode stemming from human labeler preferences.
Learning to Summarize and RLHF Foundations
Two key papers established the empirical foundations for applying RLHF to language tasks at scale:
- Fine-Tuning GPT-2 from Human Preferences (Ziegler et al., 2019): Demonstrated that human feedback could steer model behavior on stylistic tasks including summarization and continuation, establishing the core RLHF pipeline for language models.
- Learning to Summarize with Human Feedback (Stiennon et al., 2020): Scaled RLHF to the task of summarizing Reddit posts, showing that reward models trained on human preferences could substantially improve summary quality beyond what supervised fine-tuning alone achieved, and that the gains persisted when evaluating on held-out humans.
The summarization work also introduced the practice of training reward models separately from the policy model and using them as a proxy for human judgment, a design choice that has shaped subsequent RLHF applications. An early application to long-form content was Summarizing Books with Human Feedback (Wu et al., 2021), which used recursive decomposition to enable human labelers to evaluate summaries of book-length texts—a technique relevant to scalable oversight of tasks too long for direct human evaluation.
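Reward models in this line of work are typically trained with a Bradley-Terry pairwise objective on human preference comparisons. A minimal sketch, with scalar rewards standing in for reward-model outputs:

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss used to train reward models:
    -log sigmoid(r_chosen - r_rejected). Low when the reward model
    scores the human-preferred completion higher than the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the labeler (chosen scored higher): small loss
print(round(pairwise_loss(2.0, -1.0), 4))  # 0.0486
# Reward model disagrees: large loss pushes the scores apart in training
print(round(pairwise_loss(-1.0, 2.0), 4))  # 3.0486
```

Minimizing this loss over many preference pairs yields the scalar proxy for human judgment that the policy model is then optimized against.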
TruthfulQA and Measuring Falsehood
A persistent failure mode of large language models is that they generate false answers that nonetheless reflect common human misconceptions—a pattern distinct from simple hallucination. TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2022) documented this systematically: the benchmark comprises 817 questions spanning 38 categories where the correct answer is counterintuitive or contradicts popular belief.
Key findings from TruthfulQA:
- Larger models are not more truthful on this benchmark; in some categories, larger models perform worse because they more fluently reproduce common misconceptions
- The best models at time of evaluation answered truthfully on roughly 58% of questions, versus a human baseline of 94%
- Model-generated answers that were false were often confidently stated, reducing the signal value of uncertainty expressions
This work established an important distinction between calibration (knowing what you don't know) and truthfulness (correctly answering questions), and demonstrated that scaling alone does not reliably improve truthfulness. TruthfulQA has been used as a standard evaluation in subsequent model releases, including the GPT-4 technical report.
WebGPT: Factuality Through Web Browsing
WebGPT: Browser-Assisted Question-Answering with Human Feedback (Nakano et al., 2021) addressed factuality from a different angle: instead of improving parametric knowledge, the system learned to browse the web and cite sources when answering long-form questions. WebGPT was fine-tuned using human feedback to reward answers that were both well-supported by retrieved sources and preferred by evaluators.
Key contributions:
- Demonstrated that retrieval-augmented generation with citation could substantially reduce unsupported factual claims
- Found that human evaluators preferred WebGPT's answers to those of the project's own human demonstrators on 56% of ELI5 questions, and that the model's answers were more factual than those produced without web access
- Identified a persistent failure mode: the model sometimes selected misleading or unrepresentative sources to support a desired conclusion, foreshadowing later concerns about "faithful but misleading" retrieval-augmented generation
WebGPT represents an early instance of tool-augmented LLM deployment and contributed to the design of subsequent browsing-enabled deployments including ChatGPT's web browsing mode.
OpenAI Codex and Code Generation
Evaluating Large Language Models Trained on Code (Chen et al., 2021)—the Codex paper—introduced the HumanEval benchmark and reported that Codex solved 28.8% of programming problems in a zero-shot setting, rising to 70.2% with repeated sampling. Codex was fine-tuned from GPT-3 on publicly available code from GitHub and became the foundation of GitHub Copilot.
From a safety perspective, Codex introduced a new dual-use concern: models capable of generating functional code from natural language instructions could assist with both beneficial software development and potentially harmful applications including vulnerability discovery and exploit development. The Codex paper acknowledged these concerns and noted that the responsible disclosure of such capabilities required consideration of both the benefits to legitimate users and the marginal uplift provided to malicious actors.
The HumanEval benchmark—which tests whether generated code passes unit tests on 164 programming problems—has become a standard evaluation metric, though subsequent work has noted that benchmark performance can overstate practical coding ability on real-world repositories with complex dependencies.
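The gap between single-sample and repeated-sampling performance is quantified by the unbiased pass@k estimator introduced in the Codex paper: given n samples per problem of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper: with n samples
    per problem and c passing samples, the probability that at least
    one of k randomly drawn samples passes the unit tests."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples with 60 correct: pass@1 equals the raw success rate...
print(round(pass_at_k(200, 60, 1), 3))    # 0.3
# ...while pass@100 is near-certain, mirroring the 28.8% -> 70.2% jump
print(round(pass_at_k(200, 60, 100), 3))
```

The estimator avoids the high variance of naively computing 1 − (1 − c/n)^k, which is why it became the standard way to report HumanEval results.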
Deliberative Alignment
Deliberative Alignment: Reasoning Enables Safer Language Models (2024) introduced a safety training approach in which models are explicitly trained to reason through safety considerations before generating responses to potentially sensitive queries. Rather than relying solely on learned behavioral patterns to refuse or comply with requests, deliberative alignment trains the model to identify relevant safety principles, reason about their application to the specific context, and generate responses consistent with that reasoning.
Key claims from the paper:
- Models trained with deliberative alignment show reduced rates of both over-refusal (refusing benign requests) and under-refusal (complying with genuinely harmful ones)
- The reasoning process is observable and partially auditable, providing some transparency into why a given refusal or compliance decision was made
- The approach transfers more reliably to novel or edge-case inputs than pure behavioral fine-tuning, because the model reasons about principles rather than matching patterns
Deliberative alignment is described as a component of GPT-4o's safety training and has been cited as a mechanism for improving the precision of safety behaviors—distinguishing between superficially similar benign and harmful requests.
RLHF and Alignment Training Foundations
A key technique underlying modern aligned LLMs is Reinforcement Learning from Human Feedback (RLHF), which trains a reward model on human preference comparisons and uses it to fine-tune the language model's outputs via reinforcement learning. The foundational application of this approach to large language models was demonstrated in Fine-Tuning GPT-2 from Human Preferences (Ziegler et al., 2019), which showed that human feedback could steer model behavior on stylistic tasks. The approach was later scaled to instruction-following via InstructGPT and extended in Anthropic's Constitutional AI framework.
| RLHF Component | Function | Safety Relevance |
|---|---|---|
| Reward model | Converts human preferences into a differentiable signal | Central to RLHF alignment |
| PPO fine-tuning | Updates language model to maximize reward | Can introduce reward hacking |
| Constitutional AI | Replaces human labelers with model self-critique against principles | Scales alignment oversight; see Constitutional AI |
| Multilingual consistency | Enforces safety behaviors across languages | Align Once, Benefit Multilingually (2025) |
A significant limitation is that RLHF-trained models can exhibit sycophancy—systematically agreeing with user beliefs rather than providing accurate responses—because human raters often prefer confident, agreeable answers. Recent work on multi-objective alignment for psychotherapy contexts and on policy-constrained alignment frameworks like PolicyPad (2025) explores how to balance multiple competing alignment objectives.
Multilingual alignment gap: A consistent finding across alignment research is that safety training applied primarily in English does not reliably transfer to other languages. Align Once, Benefit Multilingually (2025) proposes methods to enforce multilingual consistency in safety alignment by training on cross-lingual consistency losses. Relatedly, Helpful to a Fault (2025) measures illicit assistance rates across 40+ languages in multi-turn interactions, finding that multilingual agents provide more potentially harmful assistance than English-only evaluations would suggest, particularly in low-resource languages where safety fine-tuning data is sparse.
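The cited paper's exact training objective is not reproduced here, but a generic cross-lingual consistency penalty can be sketched as a symmetric KL divergence between the model's output distributions on an English safety prompt and on its translation (all numbers below are illustrative):

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(p_en, p_other):
    """Generic cross-lingual consistency penalty: symmetric KL between
    output distributions for a prompt and its translation. Illustrative
    sketch only, not the loss from the cited paper."""
    return 0.5 * (kl(p_en, p_other) + kl(p_other, p_en))

# Distributions over (refuse, comply): safety behavior is strong in
# English but weaker in a low-resource language, so the penalty is
# nonzero and training pressure pulls the two distributions together.
print(round(consistency_loss([0.9, 0.1], [0.6, 0.4]), 4))  # 0.2688
```

A loss of this shape is minimized exactly when safety behavior is language-invariant, which is the property the multilingual alignment literature is trying to enforce.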
Evaluation Methodology and Benchmarks
Factuality Benchmarking
A growing area of evaluation research focuses on measuring LLM factuality rigorously and across diverse task types. Two related but distinct benchmarks from Google DeepMind address different aspects of this problem:
- FACTS Grounding: Introduced to measure whether long-form generated text is supported by provided source documents. It distinguishes between claims grounded in retrieved context and claims introduced without basis—directly relevant to retrieval-augmented generation (RAG) systems where the model has access to retrieved context but may still generate unsupported claims.
- FACTS Benchmark Suite: A broader systematic evaluation that decomposes factuality failures by error type across diverse task categories, enabling more fine-grained analysis of where and how models generate false information.
The distinction between these two benchmarks matters for practitioners: FACTS Grounding tests faithfulness to provided context, while the FACTS Benchmark Suite tests factual accuracy more broadly, including against parametric knowledge.
AI Psychometrics
AI Psychometrics: Evaluating the Psychological Reasoning of Large Language Models with Psychometric Validities applies psychometric methodology—originally developed for measuring human cognitive and psychological attributes—to evaluate LLMs. The paper distinguishes between tests of maximal performance (what can the model do when trying its best?) and tests of typical performance (what does the model actually do in naturalistic settings?), arguing that most LLM benchmarks measure only the former.
Key findings relevant to safety:
- Models show substantial divergence between their maximum capability and their typical behavior on the same tasks, suggesting benchmark scores may overstate practical reliability
- Psychometric validity concepts—including construct validity, convergent validity, and discriminant validity—provide a principled framework for evaluating whether LLM benchmarks actually measure the constructs they claim to measure
- Standard LLM benchmarks often lack discriminant validity: scores on ostensibly different benchmarks correlate highly, suggesting they may be measuring a common underlying factor rather than distinct capabilities
Jailbreak Scaling Analysis
Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models examines how jailbreak attack success rates change as both attacker and defender scale. Key findings:
- Attack success rates do not consistently decrease as model size increases; some attack categories become more effective against larger models
- Automated jailbreak generation scales approximately as efficiently as model capability, suggesting the safety-capability gap may not close automatically with scale
- The paper identifies attack categories that are particularly resistant to safety fine-tuning, including multi-turn attacks that gradually shift model behavior across a conversation
This work is relevant to understanding whether responsible scaling policies based on capability thresholds can adequately capture safety improvements, or whether adversarial robustness requires additional targeted evaluation.
SemBench and Universal Semantic Evaluation
SemBench: A Universal Semantic Framework for LLM Evaluation proposes a framework for evaluating whether LLMs understand meaning in a semantically consistent way, rather than relying on surface form matches. The benchmark tests whether models can recognize paraphrases, semantic equivalences, and entailments across diverse phrasings—capturing a form of understanding that standard task-specific benchmarks may miss.
CLASP: Defense Against Hidden State Poisoning
CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks addresses a specific attack surface in LLM deployment: adversarial manipulation of model hidden states through carefully crafted inputs. CLASP proposes detection and defense mechanisms for scenarios where an attacker can influence the model's internal representations, which is particularly relevant for hybrid deployment architectures where models process inputs from multiple sources with varying trust levels.
Gemini 2.0 and the Agentic Era
Google DeepMind's Gemini 2.0 launch in December 2024 marked an explicit strategic pivot from general-purpose language models toward Agentic AI systems capable of multi-step autonomous task execution. The announcement positioned Gemini 2.0 as "our new AI model for the agentic era," signaling that agentic capabilities had become a primary design goal rather than an emergent side effect.
Gemini 2.0 Flash
Gemini 2.0 Flash was the first production-ready model in the 2.0 family, made available to all users in February 2025. Key technical features:
| Capability | Description | Agentic Relevance |
|---|---|---|
| Native image generation | Generates images directly without a separate model | Enables single-model multimodal task completion |
| Native audio output | Generates audio responses directly | Supports voice-based agentic interactions |
| Computer use | Can interact with web browsers and desktop applications | Direct environment manipulation |
| Real-time API | Low-latency streaming for interactive applications | Enables responsive agentic loops |
| Tool use | Structured function calling across multiple APIs | Multi-system orchestration |
The native multimodal generation capabilities in Gemini 2.0 Flash—where the model generates images and audio as part of a unified response rather than calling separate specialized models—represented a design shift with implications for agentic deployment. Prior systems required orchestration layers to coordinate between specialized text, image, and audio models; native generation simplified the architecture for agentic pipelines.
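Structured function calling of the kind listed in the table can be sketched generically as a dispatch loop over a tool registry. Everything below is a hypothetical illustration, not the Gemini API: the registry, the JSON call format, and the tool names are all invented for this sketch.

```python
import json
from typing import Callable

# Hypothetical tool registry; none of these names come from a real API.
TOOLS: dict[str, Callable[..., str]] = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: str(a + b),
}

def dispatch(tool_call_json: str) -> str:
    """Parse a structured tool call ({"name": ..., "args": {...}}),
    run the registered function, and return its result as a string
    to be fed back into the model's context."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["args"])

# A model that emits structured calls lets one orchestrator span many APIs.
print(dispatch('{"name": "get_weather", "args": {"city": "Paris"}}'))  # Sunny in Paris
print(dispatch('{"name": "add", "args": {"a": 2, "b": 3}}'))           # 5
```

The safety-relevant point is that each `dispatch` step is a real-world action, which is why agentic loops of this shape motivate the confirmation and injection-resistance evaluations discussed below.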
From a safety perspective, Gemini 2.0's agentic positioning raised concerns that required explicit treatment in Google DeepMind's safety documentation:
- Computer use capabilities enable the model to take actions in the real world (browsing, form submission, application interaction) with limited reversibility
- Multi-step task execution increases the potential for errors to compound before human review
- Real-time audio interaction reduces the latency available for content filtering
Google's announcement post, Gemini 2.0 is now available to everyone, described safety testing specific to agentic use cases, including evaluations of whether the model would take irreversible actions without appropriate confirmation and whether it would resist prompt injection attacks from adversarial content in browsed web pages.
Significance for the Gemini 2.5 Family
Gemini 2.0's agentic infrastructure—particularly the real-time API, native multimodal generation, and computer use capabilities—provided the foundation on which Gemini 2.5 Pro and Flash were built. The 2.5 series added chain-of-thought reasoning (the "thinking models" capability) and extended context to 1 million tokens, but the agentic deployment architecture established with 2.0 remained substantially intact. This lineage is relevant for understanding why Gemini 2.5 models are frequently benchmarked on agentic tasks (WebArena, SWE-bench) rather than purely on question-answering or generation metrics.
Major 2025 Model Releases
GPT-5 and the GPT-5 Family
GPT-5, released in August 2025, represents OpenAI's integration of reasoning and general capability into a single unified model, replacing the separate GPT-4o and o-series product lines. According to the GPT-5 System Card, GPT-5 achieves the highest scores to date on GPQA Diamond, SWE-bench Verified, and MMLU among OpenAI models at the time of release.
Key developments in the GPT-5 family:
- GPT-5.1: Released for developers in mid-2025, optimized for conversational applications and described as "smarter, more conversational" with improved instruction-following. Cursor and Tolan reported substantial gains in their agentic pipelines.
- GPT-5.2: A subsequent release in late 2025 adding deeper thinking modes, reportedly achieving 93.2% on GPQA Diamond and 90.5% on ARC.
- GPT-5.3-Codex-Spark: Positioned as a coding-specialized variant within the GPT-5.x family, targeting agentic software development workflows.
- gpt-oss-120b and gpt-oss-20b: Open-weight models released by OpenAI, with model cards published for both sizes. These represent a shift toward open-weight strategy alongside closed frontier models.
- GPT Realtime API: Extended with real-time audio dialog capabilities, building on the GPT-4o voice capabilities introduced in Hello GPT-4o (2024).
The GPT-5 System Card documents safety evaluations including assessments of CBRN uplift risk, cyberoffense capabilities, and persuasion. An Addendum on Sensitive Conversations addresses handling of mental health, self-harm, and politically contentious topics, noting both improvements in refusal precision and remaining cases where the model provides responses that require strengthening.
gpt-oss and the Open-Weight Strategy
The gpt-oss model family—comprising gpt-oss-20b and gpt-oss-120b—represents OpenAI's entry into the open-weight model space. This was a notable strategic shift for an organization that had previously released only API-accessible models. The gpt-oss model card documents intended use cases, known limitations, and safety evaluations for both parameter sizes.
A companion release was gpt-oss-safeguard, a smaller model designed specifically for content safety classification, with a technical report providing detailed evaluation methodology. gpt-oss-safeguard is intended to be deployable locally alongside gpt-oss models, enabling operators to run safety classification without API dependency—a design choice that addresses privacy and latency constraints in enterprise deployment but also raises questions about whether safety models deployed without oversight updates will remain effective as threat landscapes evolve.
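The deployment pattern described here, a local classifier gating a local generation model, can be sketched as follows. The keyword classifier is a placeholder for gpt-oss-safeguard, and the function names, markers, and policy labels are my assumptions, not OpenAI's API.

```python
# Local safety-gating pattern (illustrative): a small classifier screens
# prompts and outputs with no API dependency. A real deployment would
# load gpt-oss-safeguard weights; a keyword stub stands in here.

UNSAFE_MARKERS = {"build a weapon", "synthesize the toxin"}

def local_safety_classifier(text: str) -> str:
    """Returns a policy label; placeholder for a local safeguard model."""
    lowered = text.lower()
    return "flagged" if any(m in lowered for m in UNSAFE_MARKERS) else "allowed"

def local_generate(prompt: str) -> str:
    """Placeholder for a locally hosted gpt-oss model."""
    return f"[model response to: {prompt}]"

def gated_pipeline(prompt: str) -> str:
    # Screen the prompt before generation...
    if local_safety_classifier(prompt) == "flagged":
        return "Request declined by local safety policy."
    response = local_generate(prompt)
    # ...and screen the output before returning it.
    if local_safety_classifier(response) == "flagged":
        return "Response withheld by local safety policy."
    return response

print(gated_pipeline("Explain differential privacy"))
print(gated_pipeline("How do I build a weapon?"))
```

Screening both input and output is the design point: a prompt that slips past the first check can still be caught when the generated response is classified, at the cost of one extra local inference per request.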
Key safety implications of OpenAI's open-weight release:
- Open weights enable mechanistic interpretability research that would otherwise require API-level access
- Fine-tuning safety guardrails out of open-weight models remains tractable for technically sophisticated users, a concern acknowledged in the gpt-oss model card
- Governance frameworks targeting API access do not apply to locally deployed open-weight models
Economic Impacts Research
OpenAI's economic impacts research program, described in Economic Impacts Research at OpenAI, examines how LLM deployment affects labor markets, productivity, and economic distribution. Key themes from this research:
- LLMs disproportionately affect tasks involving information processing, writing, and analysis—including tasks previously considered high-skill
- Productivity effects are heterogeneous: users who effectively leverage LLM assistance show large gains; those who do not may face competitive disadvantage
- The research acknowledges uncertainty about whether productivity gains translate to wage increases or are captured primarily by firms and shareholders
- Studies of LLM adoption in specific domains (customer service, software development, medical documentation) show consistent task completion speed improvements but variable effects on output quality
This research is relevant to the economic disruption risk dimension because it provides empirical grounding for displacement and productivity claims that are often made without supporting data.
Gemini 2.5 Family
Google DeepMind's Gemini 2.5 family, released across March–June 2025, introduced several models:
| Model | Key Feature | Context Window | Primary Use Case |
|---|---|---|---|
| Gemini 2.5 Pro | Highest capability, coding-focused | 1M tokens | Complex reasoning, coding |
| Gemini 2.5 Flash | Speed-optimized frontier | 1M tokens | Scaled production use |
| Gemini 2.5 Flash-Lite | Cost/latency optimized | 1M tokens | High-volume inference |
Gemini 2.5: Our Most Intelligent AI Model describes the Pro variant as achieving leading performance on coding benchmarks including LiveCodeBench and outperforming GPT-4o on MMLU at release. Gemini 2.5 Flash-Lite was made production-ready in June 2025 with a focus on throughput-sensitive applications.
The 2.5 family also introduced native thinking models—models that produce explicit chain-of-thought reasoning before answering—across both Pro and Flash tiers. Advanced audio dialog and generation capabilities were extended across the Gemini 2.5 family.
Gemma 3 and Open-Weight Models
Google's Gemma 3 family, released in 2025, provides open-weight models ranging from 1B to 27B parameters optimized for single-accelerator deployment. The Gemma 3 270M variant targets edge and mobile deployment. Gemma 3n introduced a mobile-first architecture with selective parameter activation for on-device inference.
MedGemma, released in mid-2025, provides open health-specific models demonstrating LLM application in clinical reasoning. T5Gemma introduced encoder-decoder variants of the Gemma architecture, enabling use cases where separate encoding and decoding is beneficial (e.g., retrieval-augmented generation, classification tasks).
Llama 4 and the Open-Weight Ecosystem
Meta's Llama 4 Herd, announced at LlamaCon in April 2025, represents a shift to natively multimodal architecture using a mixture-of-experts design. Llama 4 Scout and Llama 4 Maverick support image, video, and text inputs from the base model level. Meta reported over 10x growth in Llama usage since 2023, with the model family becoming a reference implementation for open-weight AI development.
Key safety implications of the open-weight ecosystem:
- Fine-tuning safety guardrails out of open-weight models remains tractable for technically sophisticated users
- Mechanistic interpretability research benefits from open weights (e.g., Gemma Scope 2, described below)
- Governance frameworks targeting API access do not apply to locally deployed open-weight models
Differential Privacy and VaultGemma
VaultGemma: The World's Most Capable Differentially Private LLM addresses a constraint on LLM deployment in privacy-sensitive domains: the need for formal privacy guarantees during fine-tuning on user data. Standard LLM fine-tuning risks memorizing and later revealing sensitive training data; differentially private fine-tuning adds formal guarantees that individual training examples cannot be inferred from model outputs.
VaultGemma demonstrates that differentially private fine-tuning can produce capable models—competitive with non-private baselines on several benchmarks—at moderate privacy budget costs. This work is relevant to healthcare, legal, and financial deployments where data protection requirements preclude standard fine-tuning approaches.
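VaultGemma's training details aside, the core mechanism of differentially private fine-tuning is DP-SGD: clip each per-example gradient to a fixed norm bound, then add Gaussian noise calibrated to that bound before averaging. A minimal numpy sketch, with illustrative clip norm and noise multiplier (the privacy accounting that turns these into an (ε, δ) guarantee is omitted):

```python
import numpy as np

def dp_sgd_update(weights, per_example_grads, lr=0.1,
                  clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD step: per-example clipping plus Gaussian noise.

    per_example_grads has shape (batch, dim), one gradient per example.
    The noise scale is noise_multiplier * clip_norm, per the Gaussian
    mechanism; the formal guarantee follows from accounting over many
    such steps.
    """
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds the bound, so no
        # single example can dominate the update.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    summed = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    noisy_mean = (summed + noise) / len(per_example_grads)
    return weights - lr * noisy_mean

w = np.zeros(3)
grads = np.array([[3.0, 0.0, 0.0],   # exceeds clip_norm, will be scaled
                  [0.0, 0.5, 0.0]])
w_new = dp_sgd_update(w, grads)
print(w_new)
```

The clipping bound is what makes the sensitivity of the update to any single training example provably limited, which is precisely the property that prevents the fine-tuned model from memorizing and revealing individual records.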
Mechanistic Interpretability Advances
Mechanistic interpretability research—which seeks to understand the internal computations of neural networks in human-interpretable terms—has accelerated substantially with the availability of open-weight models and new tooling.
Gemma Scope 2
Gemma Scope 2 is a suite of sparse autoencoders (SAEs) trained on Gemma 3 models, released by Google DeepMind to support interpretability research on complex language model behaviors. Building on the original Gemma Scope release for Gemma 2, Gemma Scope 2 provides SAEs at multiple layers and widths, enabling decomposition of model activations into human-interpretable features.
Gemma Scope 2 supports research into:
- Feature geometry and polysemanticity in larger models
- Cross-layer feature interactions and information flow
- Identification of features relevant to safety-relevant behaviors (deception, refusal, sycophancy)
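The SAEs in Gemma Scope decompose an activation vector into a sparse combination of learned feature directions. A minimal numpy sketch of the standard ReLU sparse-autoencoder forward pass and training loss follows; the dimensions, initialization, and L1 coefficient are toy choices for illustration, not Gemma Scope 2's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 8, 32      # toy sizes; real SAEs use thousands of features

# Encoder/decoder weights and biases, as in a standard ReLU sparse autoencoder.
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    features = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps features sparse
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    features, recon = sae_forward(x)
    return np.mean((x - recon) ** 2) + l1_coeff * np.sum(np.abs(features))

x = rng.normal(0, 1, d_model)    # stand-in for a residual-stream activation
features, recon = sae_forward(x)
print("active features:", int(np.sum(features > 0)), "of", d_features)
print("loss:", round(float(sae_loss(x)), 4))
```

Trained at scale, the hope is that each active feature corresponds to a human-interpretable concept, which is what makes the decomposition useful for the safety-relevant behavior analyses listed above.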
Language Models Explaining Neurons
Language Models Can Explain Neurons in Language Models (Bills et al., 2023) demonstrated that GPT-4 can generate natural language explanations of GPT-2 neurons with higher validity than human-written explanations. This opened a scalable pathway for automated interpretability: using more capable models to explain less capable ones. Subsequent work has extended this to sparse autoencoder features and cross-model explanation transfer.
Activation Steering and Causal Intervention
Activation steering—injecting vectors into model residual streams to steer behavior—has become a primary tool for behavioral intervention research. Recent work has refined understanding of when and why steering succeeds:
- Surgical Activation Steering via Generative Causal Mediation (2025) demonstrates that steering effectiveness depends on correctly identifying the causal pathway through which a concept influences output, rather than simply finding directions in activation space. Steering at incorrect layers or attention heads produces unreliable or null effects.
- Mechanistic Indicators of Steering Effectiveness in Large Language Models (2025) identifies model-internal signals that predict whether a given steering vector will successfully alter behavior, enabling more principled selection of intervention targets.
- Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models (2025) finds that personality-related features form geometric structures in activation space that can interfere with unrelated steering directions, suggesting that alignment-relevant features are not always cleanly separable.
- Does GPT-2 Represent Controversy? A Small Mech Interp Investigation (2024) provides a case study of applying mechanistic interpretability methods to identify controversy-related representations in GPT-2, illustrating the workflow for small-scale interpretability research.
- Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability (2025) proposes using interpretability-discovered features as reward signals for RLHF, bypassing the need for human labeling of open-ended tasks.
- Causality is Key for Interpretability Claims to Generalise (2025) argues that interpretability claims must be evaluated using causal interventions rather than correlation alone, since correlational analyses can identify spurious structure that does not causally influence model behavior.
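The basic intervention these papers study can be shown in a few lines: add a scaled "steering vector" to the residual stream at one layer and observe how the output shifts. A toy numpy model stands in for a transformer here, and the random steering vector and layer choice are purely illustrative; the cited work is precisely about when such choices are causally valid.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 5
layers = [rng.normal(0, 0.2, (d_model, d_model)) for _ in range(3)]
unembed = rng.normal(0, 0.2, (d_model, vocab))

def forward(x, steer_at=None, steering_vec=None, alpha=4.0):
    """Toy residual-stream forward pass with an optional steering hook."""
    h = x.copy()
    for i, W in enumerate(layers):
        h = h + np.tanh(h @ W)               # residual block
        if steer_at == i and steering_vec is not None:
            h = h + alpha * steering_vec     # the intervention itself
    return h @ unembed                        # logits

x = rng.normal(0, 1, d_model)
# In practice the steering vector is derived from activations (e.g. a
# difference of means over contrastive prompts); here it is random.
v = rng.normal(0, 1, d_model)
v /= np.linalg.norm(v)

base = forward(x)
steered = forward(x, steer_at=1, steering_vec=v)
print("logit shift:", np.round(steered - base, 2))
```

The findings above amount to caveats on this recipe: the shift in logits is only meaningful if the chosen layer and direction lie on the causal pathway for the target concept, and nearby feature geometry can contaminate the effect.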
Research Platform Value of Open-Weight Models
| Research Application | Open-Weight Benefit | Example |
|---|---|---|
| Mechanistic interpretability | Full activation access | Gemma Scope 2 on Gemma 3 |
| SAE training | Weight access for feature analysis | Gemma Scope, TranscoderLens |
| Activation steering | Residual stream intervention | Multiple labs using Llama |
| Fine-tuning safety | Rapid iteration | Constitutional AI variants |
| Neuron explanation | Cross-model explanation transfer | Bills et al. 2023 |
Hallucination and Factuality
LLM hallucination—the generation of plausible-sounding but factually incorrect or unsupported content—remains a central reliability and safety challenge. Hallucination rates vary substantially by task, model, and measurement methodology, with published estimates ranging from roughly 8% to 45% depending on benchmark design and model version.
Factuality Benchmarking
| Benchmark | Scope | Key Finding | Link |
|---|---|---|---|
| TruthfulQA | Questions where correct answers contradict popular misconceptions | Best models ≈58% truthful vs. 94% human baseline | TruthfulQA |
| FACTS Grounding | Long-form factuality against source documents | Measures supported vs. unsupported claims | FACTS Grounding |
| FACTS Benchmark Suite | Systematic factuality across task types | Decomposes factuality failures by error type | FACTS Suite |
| Vectara Hallucination Leaderboard | Summarization hallucination | Best models: ≈8% hallucination rate | Vectara |
| HalluLens | Factual query hallucination | Up to 45% on factual queries for GPT-4o | HalluLens (ACL 2025) |
| CheckIfExist | Citation hallucination in AI-generated content | Detects fabricated citations in RAG systems | CheckIfExist (2025) |
The FACTS Grounding benchmark introduced by Google DeepMind specifically addresses the challenge of evaluating long-form generation against reference documents, distinguishing between claims grounded in provided source material and claims introduced without basis. This is particularly relevant for retrieval-augmented generation (RAG) systems, where the model has access to retrieved context but may still generate unsupported claims.
CheckIfExist (2025) addresses citation hallucination specifically—the generation of plausible-looking but non-existent citations, which poses particular risks in legal, medical, and academic contexts. The benchmark finds that citation hallucination rates remain substantial even for frontier models, and that RAG systems can still hallucinate citations from retrieved documents by misattributing or confabulating specific reference details.
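At its core, a CheckIfExist-style detector verifies each generated citation against an authoritative index. The sketch below uses exact-title lookup against a toy index; this simplification is mine, not the benchmark's method, and production systems need fuzzy matching on title, authors, venue, and year.

```python
# Toy citation-existence check: flag generated citations that do not
# resolve against a known-reference index.

KNOWN_REFERENCES = {
    "language models can explain neurons in language models",
    "truthfulqa: measuring how models mimic human falsehoods",
}

def check_citations(generated_citations):
    """Partition citations into verified and suspected-fabricated."""
    verified, fabricated = [], []
    for title in generated_citations:
        (verified if title.strip().lower() in KNOWN_REFERENCES
         else fabricated).append(title)
    return verified, fabricated

cites = [
    "TruthfulQA: Measuring How Models Mimic Human Falsehoods",
    "A Comprehensive Survey of Imaginary Results (2024)",  # fabricated
]
ok, fake = check_citations(cites)
print(f"{len(fake)} of {len(cites)} citations could not be verified")
```

The benchmark's harder finding survives even under perfect lookup: a model can cite a reference that exists while misattributing its content, so existence checking is necessary but not sufficient.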
Why Models Hallucinate
OpenAI's explainer on why language models hallucinate identifies several contributing mechanisms:
- Training objective mismatch: Next-token prediction rewards coherent text, not factual accuracy
- Knowledge compression: Models must compress world knowledge into fixed-weight representations, leading to lossy encoding
- Context-weight tension: Models may blend retrieved context with parametric knowledge, producing hybrid outputs that are faithful to neither
- Sycophancy pressure: RLHF can train models to confirm user beliefs rather than correct factual errors, since raters may prefer agreeable responses
- Calibration failure: Models often express high confidence in incorrect claims, reducing the signal value of expressed uncertainty
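The calibration failure above is directly measurable: expected calibration error (ECE) compares stated confidence with empirical accuracy across confidence bins. A minimal implementation (ten bins is a conventional choice):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted gap between stated confidence and accuracy.

    confidences: model-reported probabilities in (0, 1].
    correct: 1 if the corresponding answer was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in bin
    return ece

# A miscalibrated model: high confidence, mediocre accuracy.
conf = [0.95, 0.95, 0.95, 0.95]
hits = [1, 0, 1, 0]                   # 50% accuracy at 95% confidence
print(round(expected_calibration_error(conf, hits), 3))  # → 0.45
```

An ECE near zero means expressed uncertainty carries real signal; the large value here quantifies exactly the failure mode described in the last bullet.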
Hallucination rates do not fall monotonically with scale: models engineered for massive context windows do not consistently hallucinate less than smaller counterparts, and increased model size can raise the confidence of hallucinated outputs without reducing their frequency. This suggests hallucination is partly an architectural and training-objective issue rather than purely a capacity limitation.
The TruthfulQA benchmark specifically documented that larger models can perform worse on questions where the correct answer contradicts common human misconceptions, because larger models more fluently reproduce those misconceptions. This counterintuitive finding challenges the assumption that capability improvements automatically translate to truthfulness improvements.
Compression and Consistency
Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information analyzes why LLMs, when trained on corpora containing both true and false statements, tend to preferentially reproduce consistent patterns rather than factually accurate ones. The key finding is that information compression during training rewards statements that co-occur consistently with many other statements—a property that correlates with truth for facts with strong evidential support, but diverges from truth for contested claims, common misconceptions, or facts that appear rarely in training data.
This theoretical framing helps explain why TruthfulQA finds that larger models more fluently reproduce misconceptions: greater model capacity allows more faithful compression of co-occurrence statistics, including the statistics of false-but-widespread beliefs.
Deception and Truthfulness
| Behavior Type | Frequency | Context | Mitigation |
|---|---|---|---|
| Hallucination | 8–45% | Varies by task and model | Training improvements, RAG |
| Citation hallucination | reportedly ≈17% | Legal, academic domain | CheckIfExist detection systems |
| Role-play deception | High | Prompted scenarios | Safety fine-tuning |
| Sycophancy | Moderate | Opinion questions | Constitutional AI, RLHF adjustment |
| Strategic deception | Low-Moderate | Evaluation scenarios | Ongoing research |
According to some sources, even recent frontier models have hallucination rates exceeding 15% when asked to analyze provided statements, and approximately 1 in 6 AI responses in legal contexts reportedly contain citation hallucinations. The wide variance across benchmarks reflects both genuine model differences and definitional variation in what constitutes a hallucination.
Safety-Relevant Positive Capabilities
Interpretability Research Platform
| Research Area | Progress Level | Key Findings | Organizations |
|---|---|---|---|
| Attention visualization | Advanced | Knowledge storage patterns | Anthropic, OpenAI |
| Activation patching | Moderate | Causal intervention methods | Redwood Research |
| Sparse autoencoders | Advancing | Feature decomposition in large models | Anthropic, Google DeepMind (Gemma Scope) |
| Neuron explanation | Moderate | LM-explained neurons via GPT-4 | Bills et al. 2023 |
| Mechanistic understanding | Early-Moderate | Transformer circuits | Anthropic Interpretability |
Constitutional AI and Value Learning
Anthropic's Constitutional AI demonstrates approaches to value alignment through self-critique, principle-following, and preference learning. The specific success rates for these techniques vary considerably across evaluations, tasks, and model versions; the figures below represent approximate reported ranges rather than settled benchmarks, and should be treated with caution.
| Technique | Reported Range | Application | Limitations |
|---|---|---|---|
| Self-critique | ≈70–85% (approximate) | Harmful content reduction | Requires good initial training |
| Principle following | ≈60–80% (approximate) | Consistent value application | Vulnerable to gaming |
| Preference learning | ≈65–75% (approximate) | Human value approximation | Distributional robustness |
Scalable Oversight Applications
Modern LLMs enable several approaches to AI safety through automated oversight:
- Output evaluation: AI systems critiquing other AI outputs. Agreement rates with human evaluators vary substantially by task and domain; no single well-characterized aggregate figure covers the literature.
- Red-teaming: Automated discovery of failure modes and adversarial inputs.
- Safety monitoring: Real-time analysis of AI system behavior patterns.
- Research acceleration: AI-assisted safety research and experimental design.
- Content moderation: OpenAI has published work on using GPT-4 for content moderation, describing how LLM-based moderation can operate at scale and reduce human labeler exposure to harmful content.
- Summarization oversight: Work on summarizing books with human feedback established recursive decomposition as a technique for extending human oversight to tasks too long for direct evaluation, contributing to the scalable oversight research agenda.
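For the output-evaluation approach in particular, the basic validity check is agreement between the AI judge and human evaluators, corrected for chance. Cohen's kappa on paired labels is the standard measure; the labels below are toy data for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters (e.g. AI judge vs human)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    labels = set(labels_a) | set(labels_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0                    # both raters constant and identical
    return (observed - expected) / (1.0 - expected)

human    = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
ai_judge = ["safe", "safe", "unsafe", "unsafe", "unsafe", "safe"]
print(round(cohens_kappa(human, ai_judge), 3))  # → 0.667
```

Chance correction matters here because safety labels are typically imbalanced: a judge that always says "safe" can show high raw agreement while contributing no oversight signal.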
Concerning Capabilities Assessment
Persuasion and Manipulation
Modern LLMs demonstrate persuasion capabilities that raise questions for democratic discourse and individual autonomy:
| Capability | Current State | Evidence | Risk Level |
|---|---|---|---|
| Audience adaptation | Advanced | Anthropic persuasion research | High |
| Persona consistency | Advanced | Extended roleplay studies | High |
| Emotional manipulation | Moderate | RLHF alignment research | Moderate |
| Debate performance | Advanced | Human preference studies | High |
According to some sources, frontier LLMs can substantially increase human agreement rates through targeted persuasion techniques, raising concerns about consensus manufacturing; however, the specific figure of 82% attributed to Anthropic research has not been independently verified against a primary source and should be treated with caution. The ARC Evaluations (2025) work specifically assesses persuasion and resistance capabilities in LLMs through adversarial resource extraction games, finding that frontier models can maintain persuasive pressure across extended multi-turn interactions.
Research on political alignment inference (LLMs Can Infer Political Alignment from Online Conversations) finds that frontier models can reliably identify political alignment from text with accuracy exceeding human raters in controlled settings. This capability has dual-use implications: it enables personalized political persuasion but also could be used for detection of coordinated influence campaigns.
Multilingual Safety Gaps
A consistent finding across the research literature is that safety alignment applied primarily in English does not reliably transfer to other languages. Helpful to a Fault (2025) measured illicit assistance rates for multi-turn, multilingual LLM agents across 40+ languages, finding that:
- Models provide meaningfully more potentially harmful assistance in non-English languages
- The gap is largest for low-resource languages with sparse alignment training data
- Multi-turn interactions elicit more harmful assistance than single-turn, with each turn potentially eroding earlier refusals
This suggests that multilingual deployment of LLMs creates safety gaps that single-language evaluation would miss, a concern reportedly noted in the GPT-5 System Card.
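The measurement behind these findings reduces to running the same harmful-task suite per language and comparing graded outcomes. A sketch in the spirit of Helpful to a Fault, with toy episode outcomes standing in for graded transcripts (the languages and rates below are invented for illustration):

```python
# Per-language harmful-assistance gap measurement (illustrative).
# 1 = model provided harmful assistance, 0 = refused (toy outcomes).

def assistance_rate(outcomes):
    """Fraction of episodes graded as harmful assistance."""
    return sum(outcomes) / len(outcomes)

episodes = {
    "english":           [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "german":            [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    "low_resource_lang": [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
}

baseline = assistance_rate(episodes["english"])
for lang, outcomes in episodes.items():
    gap = assistance_rate(outcomes) - baseline
    print(f"{lang:18s} rate={assistance_rate(outcomes):.0%} gap={gap:+.0%}")
```

Reporting the gap against an English baseline, rather than absolute rates alone, is what surfaces the transfer failure: a model can look well aligned in aggregate while the per-language breakdown shows where refusals erode.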
Cybersecurity Capabilities and Refusal Frameworks
LLMs' code generation and vulnerability analysis capabilities create dual-use risks in cybersecurity:
- Models can reportedly assist with vulnerability identification, exploit development, and attack planning at varying levels depending on the specificity of the request
- *A Content-Based Framework for Cybersecurity*
References
- Anthropic: conducts research across multiple domains including AI alignment, interpretability, and societal impacts to develop safer and more responsible AI technologies. Their work aims to understand and mitigate potential risks associated with increasingly capable AI systems.
- Partnership on AI (PAI): a nonprofit organization focused on responsible AI development, convening technology companies, civil society, and academic institutions. PAI develops guidelines and frameworks for ethical AI deployment across various domains.