
Dense Transformers

| Dimension | Assessment | Evidence |
|---|---|---|
| Dominance | Near-total (95%+ of frontier models) | GPT-4, Claude 3, Gemini, Llama 3 all use transformer base architecture |
| Scalability | Proven to 1.8T+ parameters | GPT-4 reportedly ≈1.8T params (SemiAnalysis 2023); Llama 3 405B trained on 3.8x10^25 FLOPs |
| Interpretability | Primitive despite open weights | Anthropic 2024: found millions of features in Claude 3 Sonnet, still cannot predict behavior |
| Training Efficiency | Well-understood pipeline | RLHF (InstructGPT 2022): 1.3B model preferred over 175B GPT-3 with alignment |
| Safety Tooling | Most developed but inadequate | Constitutional AI, RLHF, red-teaming exist; deception detection remains unsolved |
| Predictability | Low for emergent capabilities | Abilities appear abruptly at scale thresholds; GPT-4 shows phase transitions |
| Longevity | Dominant for 3-5+ years minimum | Infrastructure investment, training expertise, tooling all favor transformers |

Dense transformers are the dominant architecture for current frontier AI systems. “Dense” refers to the fact that all parameters are active for every token, in contrast to sparse/MoE architectures where only a subset activates. This architecture processes inputs through repeated transformer blocks, each containing attention mechanisms that learn relationships between all positions and feed-forward networks that process each position independently.
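To make the block structure concrete, here is a minimal sketch of a single pre-norm transformer block in PyTorch. It is illustrative only (real frontier models add causal masking, rotary position embeddings, grouped-query attention, and other refinements), but it shows the defining property of a dense layer: every weight participates in processing every token.

```python
import torch
import torch.nn as nn

class DenseTransformerBlock(nn.Module):
    """Minimal pre-norm transformer block: self-attention plus a position-wise FFN.
    Every parameter is used for every token ("dense"), unlike MoE layers."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(              # conventional 4x expansion
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention mixes information across positions (a real decoder block
        # would also apply a causal mask here)...
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # ...while the FFN transforms each position independently.
        return x + self.ffn(self.ln2(x))

x = torch.randn(1, 16, 512)                    # (batch, sequence, d_model)
print(DenseTransformerBlock()(x).shape)        # torch.Size([1, 16, 512])
```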

The transformer architecture was introduced in the seminal paper “Attention Is All You Need” by Vaswani et al. in 2017. Originally developed at Google Brain for machine translation (achieving 28.4 BLEU on English-to-German, improving over previous best by 2+ BLEU), the architecture has accumulated over 160,000 citations according to Semantic Scholar and now underlies virtually all frontier AI systems. Every major frontier model - GPT-4, Claude 3, Gemini, Llama 3 - uses this architecture with relatively minor variations.

Despite having “open weights” for some models (Llama, Mistral), interpretability remains fundamentally limited. Anthropic’s 2024 Scaling Monosemanticity research successfully extracted tens of millions of interpretable features from Claude 3 Sonnet using sparse autoencoders, identifying concepts like “Golden Gate Bridge” and “neuroscience” - yet researchers still cannot predict emergent capabilities or reliably detect deceptive reasoning. We can inspect billions of parameters but cannot meaningfully understand what the model “knows” or “intends.”

The table below summarizes the core components of each transformer block and how their parameter counts scale with model width (d_model):
| Component | Function | Parameters |
|---|---|---|
| Embeddings | Convert tokens to vectors | vocab_size × d_model |
| Attention | Learn relationships between positions | 4 × d_model² per layer |
| FFN | Process each position independently | 8 × d_model² per layer |
| Layer Norm | Stabilize training | 2 × d_model per layer |
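These per-layer counts can be combined into a rough total-parameter estimate. The sketch below ignores biases, positional embeddings, and the output head, and assumes the conventional 4x FFN expansion, so it is only a sanity check; plugging in Llama 3.1 405B's reported shape (126 layers, d_model = 16,384) lands close to the published size:

```python
def approx_params(n_layers: int, d_model: int, vocab_size: int) -> float:
    """Back-of-envelope dense transformer size from the per-layer counts above:
    4*d^2 (attention) + 8*d^2 (FFN, 4x expansion) + 2*d (layer norms)."""
    embeddings = vocab_size * d_model
    per_layer = 4 * d_model**2 + 8 * d_model**2 + 2 * d_model
    return embeddings + n_layers * per_layer

# Llama 3.1 405B's reported shape: 126 layers, d_model = 16,384, ~128K-token vocab.
# (The real model uses grouped-query attention and a non-4x FFN width, so the
# close agreement is partly coincidental.)
print(f"{approx_params(126, 16_384, 128_256) / 1e9:.0f}B parameters")  # ≈ 408B
```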
| Model | Parameters | Architecture | Context | Training Data | Estimated Cost | Source |
|---|---|---|---|---|---|---|
| GPT-4 | ≈1.8T (8×220B MoE) | MoE, 16 experts | 128K tokens | ≈13T tokens | >$100M (Altman) | SemiAnalysis |
| Claude 3 Opus | Undisclosed | Dense transformer | 200K tokens | Undisclosed | Undisclosed | Anthropic Model Card |
| Llama 3.1 405B | 405B | Dense, 126 layers, 16,384 dim | 128K tokens | 15.6T tokens | 39.3M H100 GPU-hours (≈72 days on 16K GPUs) | Meta Technical Report |
| Gemini 1.5 Pro | Undisclosed | MoE | 1M+ tokens | Undisclosed | Undisclosed | Google DeepMind |

Key observations:

  • Llama 3.1 405B used 3.8x10^25 FLOPs on 16,000 H100 GPUs at 380 teraFLOP/s each over ~72 days (Epoch AI), approximately 50x more compute than Llama 2; a back-of-envelope check of these figures follows this list
  • GPT-4 reportedly uses only ~280B parameters and 560 TFLOPs per forward pass (vs 3,700 TFLOPs for dense 1.8T model) due to MoE routing (SemiAnalysis)
  • Training reliability: during 54-day Llama 3 405B pre-training, 78% of interruptions were hardware-related; 57.3% of downtime from GPU failures; yet achieved greater than 90% effective training time
  • Context windows have expanded 64x since GPT-3 (2K to 128K+), with Gemini reaching 10M tokens in testing
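A back-of-envelope check of the Llama 3.1 figures, using the standard approximation of ≈6 FLOPs per parameter per training token (the 6ND rule is an assumption of this sketch, not a number from the Meta report):

```python
params = 405e9            # Llama 3.1 405B parameters
tokens = 15.6e12          # reported training tokens
flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")            # 3.79e+25, matching the reported 3.8x10^25

gpus, per_gpu = 16_000, 380e12         # H100 count and achieved FLOP/s from above
days = flops / (gpus * per_gpu) / 86_400
print(f"{days:.0f} days")              # ≈ 72 days
```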
| Property | Rating | Assessment |
|---|---|---|
| White-box Access | LOW | Weights exist but mechanistic interpretability is primitive |
| Trainability | HIGH | Well-understood pretraining + RLHF pipeline |
| Predictability | LOW-MEDIUM | Emergent capabilities, phase transitions, unpredictable failures |
| Modularity | LOW | Monolithic, end-to-end trained, no clear component boundaries |
| Formal Verifiability | LOW | Billions of parameters, no formal guarantees possible |

A key insight: “open weights” does not mean “interpretable.” Even with full access to Llama 3’s 405 billion parameters, we cannot determine what the model “believes” or whether it harbors deceptive reasoning patterns.

| What We Have | What We Don't Have |
|---|---|
| Full parameter values (405B+ weights) | Understanding of what concepts are encoded where |
| Activation patterns for any input | Why specific outputs are generated |
| Attention maps showing token relationships | How to predict novel emergent behaviors |
| Gradient information for fine-tuning | Whether deceptive reasoning exists |
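The asymmetry in this table is easy to demonstrate. With open weights, recording every intermediate activation takes a few lines of PyTorch; nothing in the result says what any of those numbers mean. The sketch below uses a generic torch module as a stand-in for an open-weights model (it is not Llama-specific code):

```python
import torch
import torch.nn as nn

# A generic torch module standing in for an open-weights model.
model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

activations = {}
def save(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Record the output of the block's second FFN projection on every forward pass.
model.linear2.register_forward_hook(save("ffn_output"))

model(torch.randn(1, 8, 64))
print(activations["ffn_output"].shape)   # torch.Size([1, 8, 64])
# We now "have" the activations; nothing here says which concepts they encode.
```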

State of Mechanistic Interpretability (2024-2025)


Anthropic’s interpretability research represents the frontier of the field. Their Scaling Monosemanticity work in 2024 and Circuit Tracing in 2025 achieved significant milestones:

| Capability | Status | Evidence | Limitations |
|---|---|---|---|
| Finding individual features | BREAKTHROUGH | Extracted tens of millions of features from Claude 3 Sonnet using sparse autoencoders; 70% map cleanly to single concepts | Only tested on smaller models; scaling to frontier systems uncertain |
| Understanding circuits | EARLY PROGRESS | 2025 circuit tracing reveals sequences of features from prompt to response | Limited to specific reasoning patterns; cannot trace arbitrary computations |
| Predicting behavior | POOR | GPT-4 tech report: can predict loss with <1/10,000 of final compute | Cannot predict which capabilities emerge or when |
| Detecting deception | UNKNOWN | Found features related to deception, sycophancy, bias | No validated method to detect strategic deception in deployment |
| Steering model behavior | DEMONSTRATED | Golden Gate Bridge feature addition forces mentions in all responses | Crude intervention; cannot reliably elicit or suppress complex behaviors |
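The sparse-autoencoder technique behind Scaling Monosemanticity is conceptually simple even though applying it at Claude scale is not: train an overcomplete autoencoder on a model's internal activations with a sparsity penalty, so that each activation is explained by a small number of candidate "features." A minimal sketch (a generic SAE with an L1 penalty; the dimensions and coefficient are illustrative choices, not Anthropic's setup):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Map d_model activations into a much wider, mostly-zero feature space and
    reconstruct them. Active features become candidates for interpretation."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative codes
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=512, n_features=16_384)   # heavily overcomplete
acts = torch.randn(1024, 512)            # stand-in for residual-stream activations
recon, features = sae(acts)

l1_coeff = 1e-3                          # illustrative sparsity weight
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()                          # an optimizer step would follow in training
```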

Key findings from Anthropic’s 2025 research:

  • Circuit tracing uncovered a “shared conceptual space where reasoning happens before being translated into language” - suggesting models may develop internal representations that transfer across languages and modalities
  • October 2025 update: discovered cross-modal features that recognize concepts from mouths/ears in ASCII art to full visual depictions of dogs/cats, present in models from Haiku 3.5 to Sonnet 4.5
  • Anthropic open-sourced their circuit tracer library for use with any open-weights model, with a frontend on Neuronpedia for visualization
| Challenge | Severity | Evidence | Explanation |
|---|---|---|---|
| Opaque internals | HIGH | Billions of parameters with unknown function | Cannot verify what a model "believes" or "intends" |
| Emergent capabilities | HIGH | GPT-4 shows abilities absent in GPT-3; Wei et al. 2022 | New abilities appear unpredictably at scale thresholds |
| Phase transitions | HIGH | Syntax acquisition shows sudden loss drops (Chen et al. 2024) | Performance can change discontinuously during training |
| Training data influence | MEDIUM | Models trained on 13-15T tokens of unknown provenance | Unknown biases, knowledge, and behaviors learned from pretraining |
| Deceptive alignment | UNKNOWN | Anthropic found deception-related features but cannot detect strategic use | No validated method to detect if a model is strategically deceiving |
| Approach | Effectiveness | Key Evidence | Limitation |
|---|---|---|---|
| RLHF | MEDIUM-HIGH | InstructGPT: 1.3B model preferred over 175B GPT-3 (100x smaller); users preferred outputs 85 ± 3% of the time vs base GPT-3 and 71% vs few-shot GPT-3; 2x truthfulness on TruthfulQA | Trains for stated preferences, not true values; training cost ≈60 petaflop/s-days vs 3,640 for GPT-3 pretraining |
| Constitutional AI | MEDIUM | Reduces toxic outputs without human feedback on each example | Still relies on the model correctly interpreting abstract principles |
| Red teaming | LOW-MEDIUM | Finds jailbreaks, harmful outputs; OpenAI employs 50+ red teamers | Only finds known failure modes; cannot discover unknown unknowns |
| Capability evals | MEDIUM | GPT-4 tech report predicted loss accurately | Can measure but not prevent dangerous capabilities from emerging |
| Sparse autoencoders | EARLY | 70% of extracted features map to interpretable concepts | Cannot yet scale to frontier models or detect complex behaviors |
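The RLHF row rests on a reward model trained from pairwise human preferences, typically with a Bradley-Terry style objective, which is then used to fine-tune the policy with reinforcement learning. The sketch below shows only that preference loss, with toy scores standing in for a real reward model's outputs:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style reward-model objective: maximize the log-sigmoid of the
    reward margin between the human-preferred and the rejected completion."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to (chosen, rejected) completion pairs.
chosen = torch.tensor([1.2, 0.4, 2.0], requires_grad=True)
rejected = torch.tensor([0.3, 0.9, 0.5], requires_grad=True)

loss = preference_loss(chosen, rejected)
loss.backward()
print(float(loss))   # smaller when the model already ranks chosen above rejected
```

The limitation listed in the table is visible in the objective itself: it optimizes agreement with rater judgments, not the raters' underlying values.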
| Paper | Year | Authors | Contribution | Impact |
|---|---|---|---|---|
| Attention Is All You Need | 2017 | Vaswani et al. (Google Brain) | Original transformer architecture | 161,000+ citations; foundation of all modern LLMs |
| Scaling Laws for Neural Language Models | 2020 | Kaplan et al. (OpenAI) | Predictable power-law relationship between scale and performance | Enabled multi-billion-dollar training investment decisions |
| Training language models to follow instructions with human feedback | 2022 | Ouyang et al. (OpenAI) | RLHF methodology; InstructGPT | 1.3B aligned model outperformed 175B base model; basis for ChatGPT |
| Constitutional AI | 2022 | Bai et al. (Anthropic) | Principle-based training without per-example human feedback | Reduced toxic outputs; scaled alignment supervision |
| Scaling Monosemanticity | 2024 | Anthropic Interpretability | Sparse autoencoders extract interpretable features at scale | First demonstration of interpretability on production models |
| Lab | Models | Estimated Frontier Spend | Focus | Key Technical Contributions |
|---|---|---|---|---|
| OpenAI | GPT-4, o1, o3 | >$1B annually | Capabilities + commercial deployment | Scaling laws, RLHF, reasoning models |
| Anthropic | Claude 3/4 series | ≈$500M-1B annually | Safety-focused development | Constitutional AI, interpretability research, RSP |
| Google DeepMind | Gemini 1.5/2.0 | >$1B annually | Multimodal, long context | MoE efficiency, 1M+ token context |
| Meta | Llama 3/4 series | ≈$500M annually | Open weights, research ecosystem | Largest open model (405B), full technical reports |
| xAI | Grok | ≈$500M annually | Rapid scaling | Trained Grok on 100K H100s |

Training costs for frontier models have grown approximately 2.5x per year since 2016, with the largest runs now exceeding $100M:

| Model | Year | Est. Training Cost | Cost Breakdown | Source |
|---|---|---|---|---|
| GPT-3 | 2020 | ≈$1.6M | Compute-dominated | Industry estimates |
| GPT-4 | 2023 | $100M-540M | Hardware: $10M amortized; staff: tens of millions | Sam Altman public statements |
| Llama 3.1 405B | 2024 | ≈$100M | 39.3M H100 GPU-hours at 700W TDP | Meta Technical Report |
| 2024 frontier models | 2024 | >$1B | Research + inference: ≈$1B total for OpenAI | Epoch AI analysis |
| 2025-2027 projected | 2025-27 | $1B-10B | Dario Amodei prediction | Entrepreneur |
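For models whose GPU-hours are public, the compute portion of these estimates can be roughly reproduced from rental prices. The sketch below uses assumed H100 rates of $2-4 per GPU-hour (an illustrative assumption, not a figure from the cited sources):

```python
gpu_hours = 39.3e6            # Llama 3.1 family total, per Meta's technical report
for rate in (2.0, 3.0, 4.0):  # assumed $/H100-hour rental rates
    print(f"${rate:.0f}/hr -> ${gpu_hours * rate / 1e6:.0f}M")
# $2/hr -> $79M, $3/hr -> $118M, $4/hr -> $157M: consistent with the ≈$100M estimate,
# before staff, storage, networking, and failed runs are counted.
```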

Cost breakdown for frontier models (Epoch AI 2024):

  • AI accelerator chips: 40-50% of total
  • Staff costs (researchers, engineers): 30-40%
  • Server components: 15-22%
  • Cluster interconnect: 9-13%
  • Energy consumption: 2-6%

Investment scale: Microsoft has invested over $13B in OpenAI since 2019, and Amazon and Google have each invested $1B in Anthropic. Microsoft has reported plans to invest approximately $10B in FY2025 on AI-enabled data centers.

Dense transformers are clearly dominant and will remain so for the foreseeable future.

The meaning of “scaling” has fundamentally changed, as noted by AI industry analysts:

| Era | Primary Scaling Axis | Key Driver | Example |
|---|---|---|---|
| 2020-2024 | Training compute | Larger models, more data | GPT-3 → GPT-4: 10x parameters |
| 2024-2025 | Inference compute | Test-time reasoning, search | o1/o3: deliberation at generation |
| Emerging | Architecture efficiency | Sparse attention, MoE | Gemini 1.5: 10x longer context at similar cost |

As OpenAI co-founder Ilya Sutskever noted: “The 2010s were the age of scaling, now we’re back in the age of wonder and discovery once again.” This suggests the era of pure parameter scaling may be giving way to architectural innovation and inference-time optimization.

Architecture saturation: 2020-2025 saw most frontier progress occur within a largely standardized Transformer paradigm, with gains increasingly driven by data pipelines, optimization recipes, and post-training rather than radical architectural innovation.

Key trends with quantified progress:

| Trend | 2022 State | 2025 State | 3-Year Change | Projection |
|---|---|---|---|---|
| Max parameters | 175B (GPT-3) | 405B dense / 1.8T MoE | 2.3x dense, 10x MoE | Approaching 10T parameters by 2027 |
| Context window | 4K tokens | 1M+ tokens (Gemini) | 250x increase | Memory-augmented systems for unlimited context |
| Training cost per run | ≈$10M | $100M-500M | 10-50x increase | $1B+ training runs by 2026 |
| Inference speed | 10-50 tokens/sec | 100-500 tokens/sec | 5-10x speedup | Real-time streaming ubiquitous |
| Modalities | Text only (mostly) | Text, vision, audio, video, code | Full multimodal | Embodied/robotic integration |
| Question | Bull Case | Bear Case | Relevance |
|---|---|---|---|
| Will scaling continue to improve capabilities? | Capabilities scale predictably to 10T+ params | Diminishing returns already visible; current scaling hits a wall | Determines transformer dominance duration |
| Will MoE/sparse variants replace pure dense? | GPT-4's MoE approach becomes standard; 10x efficiency gains | Complexity costs outweigh benefits; dense remains simpler | Affects training economics and accessibility |
| Can interpretability catch up to scale? | Anthropic's SAE approach scales; full circuits mapped by 2027 | Interpretability fundamentally intractable for capable systems | Determines whether safety research can keep pace |
| Will new architectures emerge? | State-space models (Mamba) or hybrid approaches challenge transformers | 8 years of infrastructure investment creates strong lock-in | Affects long-term AI development trajectory |
| Aspect | Dense Transformer | Sparse/MoE | SSM/Mamba | Hybrid Approaches |
|---|---|---|---|---|
| Maturity | HIGH (8 years) | MEDIUM (3 years at scale) | LOW (2 years) | EMERGING |
| Interpretability | LOW (but most studied) | LOW | UNKNOWN | UNKNOWN |
| Training efficiency | BASELINE | 2-5x better for same quality | 3-5x better for long sequences | Variable |
| Inference efficiency | BASELINE | 3-10x better (only active experts) | Linear in sequence length | Depends on design |
| Peak capabilities | HIGHEST (GPT-4, Claude 3) | HIGH (Gemini 1.5, Mixtral) | MODERATE but growing | Limited data |
| Safety tooling | MOST developed | SOME (inherits from dense) | LITTLE | Minimal |
| Infrastructure | Ubiquitous | Growing rapidly | Limited | Experimental |

Key tradeoff: Dense transformers remain the capability frontier, but MoE variants (used in GPT-4 and Gemini 1.5) offer 3-10x inference efficiency by activating only 10-20% of parameters per token. State-space models like Mamba offer linear-time attention but have not yet matched transformer performance on complex reasoning tasks.
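The efficiency claim can be made concrete with the rumored GPT-4 numbers from the table above, using the rough rule of ≈2 FLOPs per active parameter per generated token (both parameter figures are SemiAnalysis estimates, not confirmed by OpenAI):

```python
total_params = 1.8e12      # rumored GPT-4 total parameter count
active_params = 280e9      # rumored parameters active per token after MoE routing

dense_flops = 2 * total_params    # ≈ 3.6e12 FLOPs/token if every weight were used
moe_flops = 2 * active_params     # ≈ 5.6e11 FLOPs/token with expert routing
print(f"{dense_flops / moe_flops:.1f}x fewer forward-pass FLOPs per token")  # ≈ 6.4x
```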

The dense transformer architecture creates specific constraints and opportunities for safety research:

| Approach | Why It Works | Current Effectiveness | Key Examples |
|---|---|---|---|
| Behavioral evaluations | Only requires black-box access; scales with model capability | MEDIUM-HIGH | MMLU, HumanEval, TruthfulQA benchmarks |
| RLHF and variants | Leverages transformer trainability; proven at scale | HIGH | InstructGPT achieved 71% preference over few-shot GPT-3 |
| Prompt engineering | Works with any capable model; no retraining needed | MEDIUM | System prompts, jailbreak defenses |
| Red teaming | Identifies real failure modes before deployment | MEDIUM | OpenAI employs 50+ red teamers; finds jailbreaks |
| Constitutional AI | Reduces need for human feedback; principles scale | MEDIUM-HIGH | Anthropic's approach reduces toxic outputs |
| Approach | Why It Struggles | Current Status | Path Forward |
|---|---|---|---|
| Mechanistic interpretability | Billions of parameters; features are distributed | EARLY (70% feature interpretability on small models) | Sparse autoencoders may scale; uncertain |
| Formal verification | No tractable methods for neural networks at scale | NOT FEASIBLE | Fundamental theory needed; may be intractable |
| Detecting deception | Cannot distinguish genuine from strategic behavior | UNKNOWN | Found deception features; no detection method |
| Predicting emergence | Capabilities appear suddenly at scale thresholds | POOR | May require new theoretical frameworks |
| Guaranteed alignment | RLHF optimizes for stated, not true, preferences | FUNDAMENTAL CHALLENGE | No solution on horizon |

Will dense transformers hit fundamental limits, or continue improving indefinitely?

Evidence for continued scaling: Kaplan et al. (2020) showed smooth power-law improvements up to 175B parameters. GPT-4 and Claude 3 continue this trend. OpenAI’s technical report demonstrated loss prediction accuracy with less than 1/10000 of final compute.
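For reference, the Kaplan et al. (2020) power law expresses loss as a function of non-embedding parameter count, L(N) = (N_c / N)^α_N. A small sketch using the paper's approximate constants (treat the exact numbers as illustrative):

```python
def kaplan_loss(n_params: float) -> float:
    """Kaplan et al. (2020) fit for test loss vs. non-embedding parameter count,
    L(N) = (N_c / N) ** alpha_N, using the paper's approximate constants."""
    N_c, alpha_N = 8.8e13, 0.076
    return (N_c / n_params) ** alpha_N

for n in (1.3e9, 175e9, 405e9):
    print(f"{n:.3g} params -> loss ≈ {kaplan_loss(n):.2f}")
# A 10x increase in parameters reduces the predicted loss by only ~16%, which is
# why "smooth power law" and "diminishing returns" are both fair readings.
```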

Evidence for limits: Some researchers argue current scaling is hitting diminishing returns. The jump from GPT-3 to GPT-4 was smaller than GPT-2 to GPT-3. However, OpenAI’s o1/o3 reasoning models show capabilities can still improve dramatically through inference-time compute scaling.

Can mechanistic interpretability ever work at frontier scale?

Optimistic view: Anthropic’s 2024 sparse autoencoder work scaled from toy models to Claude 3 Sonnet (a production system), extracting tens of millions of interpretable features. Their 2025 circuit tracing revealed reasoning pathways.

Pessimistic view: Even with millions of features extracted, researchers cannot predict emergent capabilities or detect strategic deception. The gap between “identifying features” and “understanding computation” may be insurmountable.

Are we stuck with transformers due to infrastructure investment?

The transformer ecosystem represents approximately $10B+ in specialized hardware (H100s, TPUs optimized for attention), 8 years of tooling development (PyTorch, JAX, CUDA kernels), and thousands of trained researchers. Microsoft alone plans to invest approximately $10B in FY2025 for AI-enabled data centers. Alternative architectures like Mamba face a massive disadvantage even if theoretically superior.

Investment concentration: OpenAI has received over $13B from Microsoft since 2019, and Anthropic has received $1B each from Amazon and Google. This capital is overwhelmingly directed at transformer-based development.

Is there a fundamental limit to how safe dense transformers can be made?

The core challenge: RLHF and Constitutional AI train models to produce outputs humans rate highly, not to have genuinely aligned values. A sufficiently capable model could learn to satisfy human raters while pursuing different objectives, and we have no method to detect this. This may not be a limitation that more research simply closes; it could be a fundamental property of training on behavioral feedback.



  • Sparse/MoE Transformers - Efficiency-focused variant
  • Heavy Scaffolding - How transformers are deployed
  • Mechanistic Interpretability - Efforts to understand internals