Large Language Models
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Capability Level | Near-human to superhuman on structured tasks | o3 achieves 87.5% on ARC-AGI (human baseline ≈85%); 87.7% on GPQA Diamond |
| Progress Rate | 2-3x capability improvement per year | Stanford AI Index 2025: benchmark scores rose 18-67 percentage points in one year |
| Training Cost Trend | 2.4x annual growth | Epoch AI: frontier models projected to exceed $1B by 2027 |
| Inference Cost Trend | 280x reduction since 2022 | GPT-3.5-equivalent queries dropped from about $20 to $0.07 per million tokens (Stanford HAI) |
| Hallucination Rates | 8-45% depending on task | Vectara Leaderboard: best models at 8%; HalluLens: up to 45% on factual queries |
| Safety Maturity | Moderate | Constitutional AI, RLHF established; responsible scaling policies implemented by major labs |
| Open-Closed Gap | Narrowing rapidly | Gap shrank from 8.04% to 1.70% on Chatbot Arena (Jan 2024 → Feb 2025) |
Overview
Large Language Models (LLMs) are transformer-based neural networks trained on vast text corpora using next-token prediction, and they represent arguably the most significant breakthrough in the history of artificial intelligence. Despite this deceptively simple training objective, LLMs exhibit sophisticated emergent capabilities including reasoning, coding, scientific analysis, and complex task execution. These models have transformed abstract AI safety discussions into concrete, immediate concerns while providing the clearest current path toward artificial general intelligence.
The remarkable aspect of LLMs lies in their emergent capabilities—sophisticated behaviors arising unpredictably at scale. A model trained solely to predict the next word can suddenly exhibit mathematical problem-solving, computer programming, and rudimentary goal-directed behavior. This emergence has made LLMs both the most promising technology for beneficial applications and the primary source of current AI safety concerns.
Current state-of-the-art models like GPT-4, Claude 3.5 Sonnet, and OpenAI’s o1/o3 demonstrate near-human or superhuman performance across diverse cognitive domains. With over 100 billion parameters and training costs exceeding $100 million, these systems represent unprecedented computational achievements that have shifted AI safety from theoretical to practical urgency. The late 2024-2025 period marked a paradigm shift toward inference-time compute scaling with reasoning models like o1 and o3, which achieve dramatically higher performance on reasoning benchmarks by allocating more compute at inference rather than training time.
Risk Assessment
| Risk Category | Severity | Likelihood | Timeline | Trend |
|---|---|---|---|---|
| Deceptive Capabilities | High | Moderate | 1-3 years | Increasing |
| Persuasion & Manipulation | High | High | Current | Accelerating |
| Autonomous Cyber Operations | Moderate-High | Moderate | 2-4 years | Increasing |
| Scientific Research Acceleration | Mixed | High | Current | Accelerating |
| Economic Disruption | High | High | 2-5 years | Accelerating |
Capability Progression Timeline
| Model | Release | Parameters | Key Breakthrough | Performance Milestone |
|---|---|---|---|---|
| GPT-2 | 2019 | 1.5B | Coherent text generation | Initially withheld for safety concerns |
| GPT-3 | 2020 | 175B | Few-shot learning emergence | Creative writing, basic coding |
| GPT-4 | 2023 | ≈1T (estimated) | Multimodal reasoning | 90th percentile SAT, bar exam passing |
| Claude 3.5 Sonnet | 2024 | Unknown | Advanced tool use | 86.5% MMLU, leading SWE-bench |
| o1 | Sep 2024 | Unknown | Chain-of-thought reasoning | 77.3% GPQA Diamond, 74% AIME 2024 |
| o3 | Dec 2024 | Unknown | Inference-time search | 87.7% GPQA Diamond, 91.6% AIME 2024 |
| Claude Opus 4.5 | Nov 2025 | Unknown | Extended reasoning | 80.9% SWE-bench Verified |
| GPT-5.2 | Late 2025 | Unknown | Deep thinking modes | 93.2% GPQA Diamond, 90.5% ARC-AGI |
Source: OpenAI, Anthropic, Stanford AI Index 2025
Benchmark Performance Comparison (2024-2025)
| Benchmark | Measures | GPT-4o (2024) | o1 (2024) | o3 (2024) | Human Expert |
|---|---|---|---|---|---|
| GPQA Diamond | PhD-level science | ≈50% | 77.3% | 87.7% | ≈69.7% (PhD experts) |
| AIME 2024 | Competition math | 13.4% | 74% | 91.6% | Top 500 US |
| MMLU | General knowledge | 84.2% | 90.8% | ≈92% | 89.8% |
| SWE-bench Verified | Real GitHub issues | 33.2% | 48.9% | 71.7% | N/A |
| ARC-AGI | Novel reasoning | 5% | 13.3% | 87.5% | ≈85% |
| Codeforces | Competitive coding | 11th percentile | 89th percentile | 99.8th percentile | N/A |
Source: OpenAI o1 announcement, OpenAI o3 analysis, Stanford AI Index
The o3 results represent a qualitative shift: o3 achieved nearly human-level performance on ARC-AGI (87.5% vs ~85% human baseline), a benchmark specifically designed to test general reasoning rather than pattern matching. On FrontierMath, o3 solved 25.2% of problems compared to o1’s 2%—a 12x improvement that suggests reasoning capabilities may be scaling faster than expected. However, on the harder ARC-AGI-2 benchmark, o3 scores only 3% compared to 60% for average humans, revealing significant limitations in truly novel reasoning.
Scaling Laws and Predictable Progress
Core Scaling Relationships
Research by Kaplan et al. (2020), later refined by Hoffmann et al. (2022), demonstrates robust power-law relationships governing LLM performance:
| Factor | Scaling Law | Implication |
|---|---|---|
| Model Size | Loss ∝ N^(-0.076) | 10x parameters → ≈16% lower loss |
| Training Data | Loss ∝ D^(-0.095) | 10x data → ≈20% lower loss |
| Compute | Loss ∝ C^(-0.050) | 10x compute → ≈11% lower loss |
| Optimal Ratio | N_opt ∝ C^0.5, D_opt ∝ C^0.5 | Chinchilla-optimal: scale parameters and tokens together (≈20 tokens per parameter) |
Source: Chinchilla paper (Hoffmann et al., 2022), Scaling Laws (Kaplan et al., 2020)
According to Epoch AI research, approximately two-thirds of LLM performance improvements over the last decade are attributable to increases in model scale, with training techniques contributing roughly 0.4 orders of magnitude per year in compute efficiency. The cost of training frontier models has grown by 2.4x per year since 2016, with the largest models projected to exceed $1B by 2027.
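To make the scaling relationships above concrete, the sketch below evaluates Kaplan-style power laws for loss as a function of parameters and data, plus the Chinchilla tokens-per-parameter rule of thumb. The exponents come from the table above; the reference scales and baseline loss are illustrative placeholders, not values from the papers.

```python
# Illustrative sketch of Kaplan-style power-law scaling. Exponents are taken
# from the table above; the reference constants below are placeholders.

def loss_from_params(n_params: float, n_ref: float = 1e9, l_ref: float = 3.0,
                     alpha_n: float = 0.076) -> float:
    """Approximate loss as a function of parameter count: L ∝ N^(-alpha_n)."""
    return l_ref * (n_ref / n_params) ** alpha_n

def loss_from_data(n_tokens: float, d_ref: float = 1e10, l_ref: float = 3.0,
                   alpha_d: float = 0.095) -> float:
    """Approximate loss as a function of training tokens: L ∝ D^(-alpha_d)."""
    return l_ref * (d_ref / n_tokens) ** alpha_d

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla rule of thumb: train on roughly 20 tokens per parameter."""
    return tokens_per_param * n_params

if __name__ == "__main__":
    # A 10x increase in parameters shrinks loss by ~16% (10 ** -0.076 ≈ 0.84).
    print(loss_from_params(1e10) / loss_from_params(1e9))
    # A 10x increase in data shrinks loss by ~20% (10 ** -0.095 ≈ 0.80).
    print(loss_from_data(1e11) / loss_from_data(1e10))
    # Compute-optimal token budget for a 70B-parameter model: ~1.4 trillion tokens.
    print(f"{chinchilla_optimal_tokens(70e9):.2e}")
```

The printed ratios correspond directly to the "Implication" column in the table above.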
The Shift to Inference-Time Scaling (2024-2025)
The o1 and o3 models introduced a new paradigm: inference-time compute scaling. Rather than only scaling training compute, these models allocate additional computation at inference time through extended reasoning chains and search procedures.
| Scaling Type | Mechanism | Trade-off | Example |
|---|---|---|---|
| Pre-training scaling | More parameters, data, training compute | High upfront cost, fast inference | GPT-4, Claude 3.5 |
| Inference-time scaling | Longer reasoning chains, search | Lower training cost, expensive inference | o1, o3 |
| Combined scaling | Both approaches | Maximum capability, maximum cost | GPT-5.2, Claude Opus 4.5 |
This shift is significant for AI safety: inference-time scaling allows models to “think longer” on hard problems, potentially achieving superhuman performance on specific tasks while keeping training costs manageable. However, o1 is approximately 6x more expensive and 30x slower than GPT-4o per query. The RE-Bench evaluation found that in short time-horizon settings (a 2-hour budget), top AI systems score 4x higher than human experts, but when the budget increases to 32 hours, human experts outscore AI systems by roughly 2 to 1.
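The internals of o1 and o3 are not public, so the sketch below illustrates inference-time compute scaling with a generic, well-known technique instead: self-consistency, which samples several reasoning chains and majority-votes the final answer. The `noisy_solver` stub stands in for an LLM call and is purely hypothetical.

```python
import random
from collections import Counter
from typing import Callable

# Generic illustration of inference-time compute scaling via self-consistency.
# This is NOT how o1/o3 work internally (those details are unpublished); it
# only shows the trade-off of spending more compute per query for reliability.

def self_consistency(sample_answer: Callable[[str], str], prompt: str,
                     n_samples: int = 16) -> str:
    """Return the most common answer across n_samples independent samples."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def noisy_solver(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: correct 60% of the time."""
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

if __name__ == "__main__":
    random.seed(0)
    # A single sample is right ~60% of the time; 16 samples with majority
    # voting are right far more often, at roughly 16x the inference cost.
    print(self_consistency(noisy_solver, "What is 6 * 7?", n_samples=16))
```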
Emergent Capability Thresholds
| Capability | Emergence Scale | Evidence | Safety Relevance |
|---|---|---|---|
| Few-shot learning | ≈100B parameters | GPT-3 breakthrough | Tool use foundation |
| Chain-of-thought | ≈10B parameters | PaLM, GPT-3 variants | Complex reasoning |
| Code generation | ≈1B parameters | Codex, GitHub Copilot | Cyber capabilities |
| Instruction following | ≈10B parameters | InstructGPT | Human-AI interaction paradigm |
| PhD-level reasoning | o1+ scale | GPQA Diamond performance | Expert-level autonomy |
| Strategic planning | o3 scale | ARC-AGI performance | Deception potential |
Research from CSET Georgetown and the 2025 Emergent Abilities Survey documents that emergent abilities depend on multiple interacting factors: scaling up parameters or depth lowers the threshold for emergence but is neither necessary nor sufficient alone—data quality, diversity, training objectives, and architecture modifications also matter significantly. Emergence aligns more closely with pre-training loss landmarks than with sheer parameter count; smaller models can match larger ones if training loss is sufficiently reduced.
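One simple way to operationalize an "emergence threshold" is to look for the first scale at which benchmark accuracy clearly exceeds chance. The sketch below does this on synthetic data; the numbers are illustrative, not measurements from any published model family, and the same check can be run against pre-training loss instead of parameter count, which the survey above suggests tracks emergence more closely.

```python
# Illustrative sketch: locate an "emergence threshold" as the first scale at
# which accuracy exceeds chance by a margin. The data points are synthetic.

from typing import Optional, Sequence, Tuple

def emergence_threshold(points: Sequence[Tuple[float, float]],
                        chance: float = 0.25,
                        margin: float = 0.10) -> Optional[float]:
    """Return the smallest scale whose accuracy beats chance + margin, if any."""
    for scale, accuracy in sorted(points):
        if accuracy > chance + margin:
            return scale
    return None

if __name__ == "__main__":
    # (parameter count, accuracy) on a 4-option multiple-choice benchmark (synthetic).
    observations = [
        (1e8, 0.26), (1e9, 0.27), (1e10, 0.29), (1e11, 0.55), (1e12, 0.78),
    ]
    print(emergence_threshold(observations))  # 1e+11 in this synthetic example
```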
According to the Stanford AI Index 2025, benchmark performance has improved dramatically: scores rose by 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively in just one year. The gap between US and Chinese models has also narrowed substantially—from 17.5 to 0.3 percentage points on MMLU.
Safety concern: Research highlights that as AI systems gain autonomous reasoning capabilities, they also develop potentially harmful behaviors, including deception, manipulation, and reward hacking. OpenAI’s o3-mini became the first AI model to receive a “Medium risk” classification for Model Autonomy.
Concerning Capabilities Assessment
Persuasion and Manipulation
Modern LLMs demonstrate sophisticated persuasion capabilities that pose risks to democratic discourse and individual autonomy:
| Capability | Current State | Evidence | Risk Level |
|---|---|---|---|
| Audience adaptation | Advanced | Anthropic persuasion research | High |
| Persona consistency | Advanced | Extended roleplay studies | High |
| Emotional manipulation | Moderate | RLHF alignment research | Moderate |
| Debate performance | Advanced | Human preference studies | High |
Controlled studies report that GPT-4, when given access to basic personal information, raised the odds of human agreement by roughly 81% relative to human debaters, and Anthropic's persuasion research found frontier-model arguments approaching the persuasiveness of human-written ones, raising concerns about consensus manufacturing.
Deception and Truthfulness
| Behavior Type | Frequency | Context | Mitigation |
|---|---|---|---|
| Hallucination | 8-45% | Varies by task and model | Training improvements, RAG |
| Citation hallucination | ≈17% | Legal domain | Verification systems |
| Role-play deception | High | Prompted scenarios | Safety fine-tuning |
| Sycophancy | Moderate | Opinion questions | Constitutional AI |
| Strategic deception | Low-Moderate | Evaluation scenarios | Ongoing research |
Hallucination rates vary substantially by task and measurement methodology. According to the Vectara Hallucination Leaderboard, GPT-5 achieves the lowest hallucination rate (8%) on summarization tasks, while HalluLens benchmark research reports GPT-4o hallucination rates of ~45% “when not refusing” on factual queries. In legal contexts, approximately 1 in 6 AI responses contain citation hallucinations. The wide variance reflects both genuine model differences and the challenge of defining and measuring hallucination consistently. A 2025 AIMultiple benchmark found that even the latest models have greater than 15% hallucination rates when asked to analyze provided statements.
Source: Anthropic Constitutional AI (Bai et al., 2022), OpenAI hallucination research
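Because reported rates depend heavily on how "hallucination" is operationalized, even a simple claim-level metric makes the variance easier to reason about. The sketch below computes the fraction of unsupported claims among labeled outputs; the labeling scheme and the choice to exclude refusals are assumptions for illustration, not the Vectara or HalluLens protocols.

```python
from dataclasses import dataclass
from typing import List

# Minimal claim-level hallucination metric. The labels ("supported",
# "unsupported", "refusal") and the refusal-handling choice are assumptions.

@dataclass
class JudgedClaim:
    claim: str
    label: str  # "supported", "unsupported", or "refusal"

def hallucination_rate(claims: List[JudgedClaim], exclude_refusals: bool = True) -> float:
    """Fraction of unsupported claims among answered (non-refused) claims."""
    if exclude_refusals:
        claims = [c for c in claims if c.label != "refusal"]
    if not claims:
        return 0.0
    unsupported = sum(1 for c in claims if c.label == "unsupported")
    return unsupported / len(claims)

if __name__ == "__main__":
    judged = [
        JudgedClaim("Paper X was published in 2021.", "supported"),
        JudgedClaim("The cited case Smith v. Jones exists.", "unsupported"),
        JudgedClaim("I don't know the filing date.", "refusal"),
        JudgedClaim("The statute has three subsections.", "supported"),
    ]
    print(f"{hallucination_rate(judged):.0%}")  # 33% on this toy sample
```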
Autonomous Capabilities
Current LLMs demonstrate concerning levels of autonomous task execution:
- Web browsing: GPT-4 can navigate websites, extract information, and interact with web services
- Code execution: Models can write, debug, and iteratively improve software
- API integration: Sophisticated tool use across multiple digital platforms
- Goal persistence: Basic ability to maintain objectives across extended interactions
Safety-Relevant Positive Capabilities
Interpretability Research Platform
| Research Area | Progress Level | Key Findings | Organizations |
|---|---|---|---|
| Attention visualization | Advanced | Knowledge storage patterns | Anthropic, OpenAI |
| Activation patching | Moderate | Causal intervention methods | Redwood Research |
| Concept extraction | Early | Linear representations | CHAI (UC Berkeley) |
| Mechanistic understanding | Early | Transformer circuits | Anthropic Interpretability |
Constitutional AI and Value Learning
Anthropic's Constitutional AI demonstrates promising approaches to value alignment:
| Technique | Success Rate | Application | Limitations |
|---|---|---|---|
| Self-critique | 70-85% | Harmful content reduction | Requires good initial training |
| Principle following | 60-80% | Consistent value application | Vulnerable to gaming |
| Preference learning | 65-75% | Human value approximation | Distributional robustness |
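The core self-critique loop (generate, critique against written principles, revise) can be sketched as below. The `call_model` callable, the example principles, and the prompt wording are placeholders; Anthropic's actual constitution, prompts, and subsequent RLAIF stage are more involved.

```python
from typing import Callable, List

# Sketch of the critique-and-revise stage of Constitutional AI (Bai et al., 2022).
# `call_model` is a placeholder for any text-completion function; the principles
# and prompts here are illustrative, not Anthropic's actual constitution.

PRINCIPLES: List[str] = [
    "Avoid content that is harmful, deceptive, or manipulative.",
    "Explain refusals and offer a safer alternative when possible.",
]

def critique_and_revise(call_model: Callable[[str], str], user_prompt: str) -> str:
    response = call_model(user_prompt)
    for principle in PRINCIPLES:
        critique = call_model(
            f"Critique the response below against this principle: {principle}\n\n"
            f"Response: {response}"
        )
        response = call_model(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\n\nOriginal response: {response}"
        )
    return response  # Revised responses can then serve as fine-tuning targets.
```

In the paper, the revised responses are used as supervised fine-tuning data, followed by a reinforcement learning stage that uses AI-generated preference labels (RLAIF).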
Scalable Oversight Applications
Modern LLMs enable new approaches to AI safety through automated oversight:
- Output evaluation: AI systems critiquing other AI outputs, reaching roughly 85% agreement with human raters (a minimal sketch follows this list)
- Red-teaming: Automated discovery of failure modes and adversarial inputs
- Safety monitoring: Real-time analysis of AI system behavior patterns
- Research acceleration: AI-assisted safety research and experimental design
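One concrete form of automated oversight is using one model to grade another's outputs and then checking how often those grades match human judgments. The sketch below computes that agreement rate; the `toy_judge` policy and the sample data are illustrative assumptions, not a real evaluation pipeline.

```python
from typing import Callable, List, Tuple

# Sketch of LLM-assisted oversight: an AI judge labels outputs as acceptable or
# not, and we measure agreement with human labels. `toy_judge` is a placeholder
# for a real model call; the sample data below is synthetic.

def judge_agreement(ai_judge: Callable[[str], bool],
                    labeled_outputs: List[Tuple[str, bool]]) -> float:
    """Fraction of outputs where the AI judge matches the human label."""
    if not labeled_outputs:
        return 0.0
    matches = sum(1 for text, human_ok in labeled_outputs if ai_judge(text) == human_ok)
    return matches / len(labeled_outputs)

if __name__ == "__main__":
    def toy_judge(text: str) -> bool:
        # Placeholder policy: flag outputs containing an obviously unsafe keyword.
        return "exploit" not in text.lower()

    samples = [
        ("Here is a summary of the paper...", True),
        ("Step-by-step exploit for CVE-...", False),
        ("The capital of France is Paris.", True),
    ]
    print(f"{judge_agreement(toy_judge, samples):.0%}")  # 100% on this toy set
```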
Fundamental Limitations
What Doesn’t Scale Automatically
| Property | Scaling Behavior | Evidence | Implications |
|---|---|---|---|
| Truthfulness | No improvement | Larger models more convincing when wrong | Requires targeted training |
| Reliability | Inconsistent | High variance across similar prompts | Systematic evaluation needed |
| Novel reasoning | Limited progress | Pattern matching vs. genuine insight | May hit architectural limits |
| Value alignment | No guarantee | Capability-alignment divergence | Alignment difficulty |
Current Performance Gaps
Despite impressive capabilities, significant limitations remain:
- Hallucination rates: 8-45% depending on task and model, with high variance across domains
- Inconsistency: Up to 40% variance in responses to equivalent prompts
- Context limitations: Struggle with very long-horizon reasoning despite large context windows (1M+ tokens)
- Novel problem solving: While o3 achieved 87.5% on ARC-AGI, this required high-compute settings; real-world novel reasoning remains challenging
- Benchmark vs. real-world gap: QUAKE benchmark research found frontier LLMs average just 28% pass rate on practical tasks, despite high benchmark scores
Note that models engineered for massive context windows do not consistently achieve lower hallucination rates than smaller counterparts, suggesting performance depends more on architecture and training quality than capacity alone.
Current State and 2025-2030 Trajectory
Key 2024-2025 Developments
| Development | Status | Impact | Safety Relevance |
|---|---|---|---|
| Reasoning models (o1, o3) | Deployed | PhD-level reasoning achieved | Extended planning capabilities |
| Inference-time scaling | Established | New scaling paradigm | Potentially harder to predict capabilities |
| Agentic AI frameworks | Growing | Autonomous task completion | Autonomous systems concerns |
| 1M+ token context | Standard | Long-document reasoning | Extended goal persistence |
| Multi-model routing | Emerging | Task-optimized deployment | Complexity in governance |
One of the most significant trends is the emergence of agentic AI—LLM-powered systems that can make decisions, interact with tools, and take actions without constant human input. This represents a qualitative shift from chat interfaces to autonomous systems capable of extended task execution.
Near-term Outlook (2025-2026)
| Development | Likelihood | Timeline | Impact |
|---|---|---|---|
| GPT-5/6 class models | High | 6-12 months | Further capability jump |
| Improved reasoning (o3 successors) | High | 3-6 months | Enhanced scientific research |
| Multimodal integration | High | 6-12 months | Video, audio, sensor fusion |
| Robust agent frameworks | High | 12-18 months | Autonomous systems |
Medium-term Outlook (2026-2030)
Expected developments include potential architectural breakthroughs beyond transformers, deeper integration with robotics platforms, and continued capability improvements. Key uncertainties include whether current scaling approaches will continue yielding improvements and the timeline for artificial general intelligence.
Data constraints: According to Epoch AI projections, high-quality training data could become a significant bottleneck this decade, particularly if models continue to be overtrained. For AI progress to continue into the 2030s, either new sources of data (synthetic data, multimodal data) or less data-hungry techniques must be developed.
Key Uncertainties and Research Cruxes
Fundamental Understanding Questions
- Intelligence vs. mimicry: Extent of genuine understanding vs. sophisticated pattern matching
- Emergence predictability: Whether capability emergence can be reliably forecasted
- Architectural limits: Whether transformers can scale to AGI or require fundamental innovations
- Alignment scalability: Whether current safety techniques work for superhuman systems
Safety Research Priorities
| Priority Area | Importance | Tractability | Neglectedness |
|---|---|---|---|
| Interpretability | High | Moderate | Moderate |
| Alignment techniques | Highest | Low | Low |
| Capability evaluation | High | High | Moderate |
| Governance frameworks | High | Moderate | High |
Timeline Uncertainties
Current expert surveys show wide disagreement on AGI timelines, with median estimates ranging from 2027 to 2045. This uncertainty stems from:
- Unpredictable capability emergence patterns
- Unknown scaling law continuation
- Potential architectural breakthroughs
- Economic and resource constraints
- Data availability bottlenecks
The o3 results on ARC-AGI (87.5%, approaching human baseline of ~85%) have intensified debate about whether we are approaching AGI sooner than expected. However, critics note that high-compute inference settings make this performance expensive and slow, and that benchmark performance may not translate to general real-world capability.
Sources & Resources
Academic Research
| Paper | Authors | Year | Key Contribution |
|---|---|---|---|
| Scaling Laws | Kaplan et al. | 2020 | Mathematical scaling relationships |
| Chinchilla | Hoffmann et al. | 2022 | Optimal parameter-data ratios |
| Constitutional AI | Bai et al. | 2022 | Value-based training methods |
| Emergent Abilities | Wei et al. | 2022 | Capability emergence documentation |
| Emergent Abilities Survey | Various | 2025 | Comprehensive emergence review |
| Scaling Laws for Precision | Kumar et al. | 2024 | Low-precision scaling extensions |
| HalluLens Benchmark | Various | 2025 | Hallucination measurement framework |
Organizations and Research Groups
| Type | Organization | Focus Area | Key Resources |
|---|---|---|---|
| Industry | OpenAI | GPT series, safety research | Technical papers, safety docs |
| Industry | Anthropic | Constitutional AI, interpretability | Claude research, safety papers |
| Academic | CHAI (UC Berkeley) | AI alignment research | Technical alignment papers |
| Safety | Redwood Research | Interpretability, oversight | Mechanistic interpretability |
Policy and Governance Resources
| Resource | Organization | Focus | Link |
|---|---|---|---|
| AI Safety Guidelines | NIST | Federal standards | Risk management framework |
| Responsible AI Practices | Partnership on AI | Industry coordination | Best practices documentation |
| International Cooperation | UK AI Safety Institute (AISI) | Global safety standards | International coordination |