
Sparse / MoE Transformers


Sparse and Mixture-of-Experts (MoE) architectures are transformer variants where only a subset of parameters activates for each token. Instead of every parameter contributing to every forward pass, a routing mechanism selects which “expert” sub-networks to use.

This offers dramatic efficiency gains - you can have a model with 8x the parameters but similar compute cost per token. Mixtral 8x7B (46B total, ~12B active) performs comparably to Llama 2 70B.

MoE is likely already used in GPT-4 (based on leaks) and may become the default architecture for frontier models due to efficiency advantages.

Dimension | Assessment | Evidence
Adoption Rate | Rapidly increasing | Mixtral, DeepSeek-V3, DBRX, Qwen3-MoE all released 2023-2025; GPT-4 rumored to be MoE since 2023
Efficiency Gains | 2-7x compute savings | Switch Transformer: 7x pre-training speedup; DBRX: 2x faster inference than a dense equivalent
Parameter Scaling | Trillion+ parameters feasible | DeepSeek-V3: 671B total / 37B active; GLaM: 1.2T total trained with 1/3 of GPT-3's training energy
Quality Parity | Matches dense models | Mixtral 8x7B (46.7B total, 12.9B active) matches Llama 2 70B across benchmarks
Safety Research | Underdeveloped | Expert-level interpretability tools lacking; routing behavior understudied
Open-Weight Availability | High | Mixtral and Qwen3 under Apache 2.0; DBRX and DeepSeek under their own open-weight licenses
Hardware Support | Improving | MegaBlocks library enables dropless MoE; inference optimization ongoing
Component | Function | Trainable
Router | Decides which experts process each token | Yes
Experts | Specialized FFN sub-networks | Yes
Load balancer | Ensures experts are used evenly | Auxiliary loss
Combiner | Merges expert outputs | Weighted by router
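To make the division of labor concrete, here is a minimal, illustrative sketch of a top-k routed MoE layer in PyTorch. It is not any particular production implementation: the per-expert loop is written for readability rather than efficiency, and all class names, dimensions, and hyperparameters are example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k routed MoE layer (router + experts + combiner)."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one score per expert for each token
        self.router = nn.Linear(d_model, num_experts)
        # Experts: independent FFN sub-networks
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                       # x: [tokens, d_model]
        logits = self.router(x)                                 # [tokens, num_experts]
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        # Combiner: weighted sum of the outputs of the selected experts
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                           # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Real systems batch tokens per expert and run experts in parallel across devices; the loop above only shows the routing logic.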
Model | Total Params | Active Params | Efficiency Ratio | Developer
Mixtral 8x7B | 46.7B | 12.9B | 3.6x | Mistral AI
Mixtral 8x22B | 141B | 39B | 3.6x | Mistral AI
DeepSeek-V3 | 671B | 37B | 18x | DeepSeek
DBRX | 132B | 36B | 3.7x | Databricks
GLaM | 1.2T | 96.6B | 12x | Google
Switch-C | 1.6T | 100B | 16x | Google
Qwen3-MoE | 235B | 22B | 10.7x | Alibaba
GPT-4 (rumored) | ≈1.76T | ≈220B | 8x | OpenAI
Property | Rating | Assessment
White-box Access | LOW | Same opacity as dense transformers, plus routing complexity
Trainability | HIGH | Standard training pipeline with added load-balancing losses
Predictability | LOW | Routing adds another layer of unpredictability
Modularity | MEDIUM | Expert boundaries exist but experts interact
Formal Verifiability | LOW | Combinatorial explosion of expert combinations
Advantage | Explanation
Expert analysis | Can study what individual experts learn
Efficiency = more testing | Same capability at lower cost → more safety-evaluation budget
Modular structure | Could potentially ablate or modify specific experts
Specialization visible | Routing patterns reveal some structure
Risk | Severity | Explanation
Routing unpredictability | MEDIUM | Which experts activate for a given input can be surprising
Combinatorial complexity | HIGH | Cannot test all expert combinations
Emergent routing | MEDIUM | Routing patterns may encode unexpected behaviors
Specialized deception | UNKNOWN | Could specific experts learn deceptive behavior?
Aspect | Dense | MoE
Overall opacity | HIGH | HIGH
Modular structure | NONE | SOME (experts)
Analysis tools | SOME | FEWER
Activation patterns | Complex | Complex + routing
Model | Developer | Release | Total Params | Active Params | Experts | Top-k | Context | Training Data
Mixtral 8x7B | Mistral AI | Dec 2023 | 46.7B | 12.9B | 8 | 2 | 32K | Unknown
Mixtral 8x22B | Mistral AI | Apr 2024 | 141B | 39B | 8 | 2 | 64K | Unknown
DeepSeek-V2 | DeepSeek | May 2024 | 236B | 21B | 160 | 6 | 128K | Unknown
DeepSeek-V3 | DeepSeek | Dec 2024 | 671B | 37B | 256 | 8 | 128K | 14.8T tokens
DBRX | Databricks | Mar 2024 | 132B | 36B | 16 | 4 | 32K | 12T tokens
Qwen3-MoE | Alibaba | Apr 2025 | 235B | 22B | 128 | 8 | 32K | 36T tokens
Qwen3-Next | Alibaba | Sep 2025 | 80B | 3B | 512 | Variable | 128K | Unknown
GLaM | Google | Dec 2021 | 1.2T | 96.6B | 64 | 2 | Unknown | 1.6T tokens
Switch-C | Google | Jan 2021 | 1.6T | 100B | 2048 | 1 | Unknown | C4 dataset
GPT-4 | OpenAI | Mar 2023 | ≈1.76T (rumored) | ≈220B | ≈16 | 2 | 128K | Unknown

Performance benchmarks (Mixtral 8x7B vs. dense models):

  • MMLU: 70.6% (vs. Llama 2 70B: 68.9%)
  • HellaSwag: 86.7% (vs. Llama 2 70B: 87.3%)
  • GSM8K: 74.4% (vs. Llama 2 70B: 56.8%)
  • HumanEval: 40.2% (vs. Llama 2 70B: 29.9%)
Mechanism | Description | Trade-offs | Used By
Top-k gating | Each token selects the k highest-scoring experts | Simple; may cause load imbalance; requires an auxiliary loss | Mixtral (k=2), DBRX (k=4)
Expert choice | Each expert selects its top-k tokens | Guaranteed load balance; variable number of experts per token | Google research models
Soft routing | Weighted combination of all experts | Fully differentiable; less sparse; higher compute | Early MoE research
Auxiliary-loss-free | Bias terms adjusted based on load monitoring | No auxiliary loss needed; adaptive balancing | DeepSeek-V3
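As an illustration of the first two mechanisms, the following sketch contrasts token-choice top-k gating with expert-choice routing using a random score matrix; the shapes and the capacity value are arbitrary example numbers, not taken from any specific model.

```python
import torch

tokens, experts, k, capacity = 16, 4, 2, 8         # capacity: tokens each expert may take (example values)
scores = torch.randn(tokens, experts).softmax(dim=-1)

# Token choice: each token selects its k best experts (load can be uneven).
_, token_choice = scores.topk(k, dim=-1)            # [tokens, k] expert indices
load = torch.bincount(token_choice.flatten(), minlength=experts)
print("token-choice load per expert:", load.tolist())

# Expert choice: each expert selects its `capacity` best tokens. Load is fixed
# by construction, but a given token may be picked by zero or several experts.
_, expert_choice = scores.t().topk(capacity, dim=-1)   # [experts, capacity] token indices
print("expert-choice picks:", expert_choice.tolist())
```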

MoE training requires mechanisms to prevent expert collapse (where the model uses only 1-2 experts):

Auxiliary loss approach (most common):

  • Total Loss = Task Loss + alpha x Load Balance Loss
  • Typical alpha values: 0.01-0.1
  • Balance loss penalizes uneven expert utilization

Without load balancing: Studies show models collapse to using 10-20% of experts, negating efficiency benefits.
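A minimal sketch of a Switch-Transformer-style balancing term, L_aux = alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. Top-1 dispatch is assumed for simplicity; the function name and example shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits, expert_index, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: penalizes uneven expert utilization."""
    probs = F.softmax(router_logits, dim=-1)           # [tokens, num_experts]
    # f_i: fraction of tokens whose (top-1) choice was expert i
    f = F.one_hot(expert_index, num_experts).float().mean(dim=0)
    # P_i: mean routing probability mass assigned to expert i
    p = probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * p)

logits = torch.randn(32, 8)                            # 32 tokens, 8 experts (example values)
aux = load_balance_loss(logits, logits.argmax(dim=-1), num_experts=8)
# total_loss = task_loss + aux
```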

DeepSeek innovation: Auxiliary-loss-free balancing adds learnable bias terms to routing scores, adjusted at each training step based on observed load. This avoids the training instability auxiliary losses can cause.
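A rough sketch of that idea as reported for DeepSeek-V3: a per-expert bias is added to the routing scores only when selecting the top-k experts (the gating weights themselves stay unbiased), and the bias is nudged after each step so overloaded experts become less likely to be chosen. The function names and the update step `gamma` are illustrative assumptions.

```python
import torch

def biased_topk(scores, bias, k):
    # Selection uses biased scores; gating weights are computed from the raw scores elsewhere.
    _, idx = (scores + bias).topk(k, dim=-1)            # scores: [tokens, E], bias: [E]
    return idx

def update_bias(bias, idx, num_experts, gamma=0.001):
    # Count how many token slots each expert received this step.
    load = torch.bincount(idx.flatten(), minlength=num_experts).float()
    target = idx.numel() / num_experts                  # perfectly balanced load
    # Decrease the bias of overloaded experts, increase it for underloaded ones.
    return bias - gamma * torch.sign(load - target)
```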

Each expert has a fixed capacity (the maximum number of tokens it can process per batch). When a token's selected experts are already full:

  • Token dropping: Overflow tokens skip MoE layer, passed via residual connection (GShard approach)
  • Dropless MoE: Dynamic allocation prevents dropping; used by DBRX via MegaBlocks library

Capacity factor: Typical values of 1.0-2.0x mean experts can handle 100-200% of perfectly balanced load.
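For example, a sketch of the capacity arithmetic and GShard-style overflow handling; the batch size, expert count, and capacity factor below are illustrative values, not taken from any specific system.

```python
import math

tokens_per_batch, num_experts, top_k, capacity_factor = 4096, 8, 2, 1.25
# Each expert gets a fixed number of token slots per batch.
expert_capacity = math.ceil(capacity_factor * tokens_per_batch * top_k / num_experts)
print(expert_capacity)   # 1280 with these example values
# Once an expert's slots are full, additional tokens routed to it are "dropped":
# they skip the MoE FFN and only the residual connection carries them forward.
```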

Paper | Year | Key Contribution | Citation
Outrageously Large Neural Networks | 2017 | Introduced sparsely-gated MoE with 137B parameters; ~1000x capacity increase with minor efficiency loss | Shazeer et al. (Google), ICLR 2017
GShard | 2020 | Scaled MoE to 600B parameters; automatic sharding for distributed training | Lepikhin et al. (Google)
Switch Transformers | 2021 | Simplified to single-expert routing; achieved 7x pre-training speedup; scaled to 1.6T parameters | Fedus, Zoph, Shazeer (Google), JMLR 2022
GLaM | 2021 | 1.2T-parameter model using 1/3 of GPT-3's training energy; demonstrated MoE efficiency at scale | Du et al. (Google), ICML 2022
Expert Choice Routing | 2022 | Experts select tokens instead of vice versa; 2x faster convergence; intrinsic load balancing | Zhou et al. (Google), NeurIPS 2022
Mixtral of Experts | 2024 | Open-weight MoE matching Llama 2 70B while activating only 12.9B parameters per token | Jiang et al. (Mistral AI)
DeepSeek-V3 Technical Report | 2024 | 671B/37B model with auxiliary-loss-free load balancing; 256 fine-grained experts | DeepSeek
Lab | Models | Focus | Notable Achievement
Google | Switch, GLaM, GShard | Research pioneering | First trillion-parameter MoE; routing innovations
Mistral AI | Mixtral 8x7B, 8x22B | Open-weight quality | Matched Llama 2 70B with 12.9B active params
DeepSeek | V2, V3 | Extreme sparsity | 671B total with only 37B active (5.5%)
Databricks | DBRX | Enterprise deployment | Fine-grained MoE (16 experts, choose 4)
Alibaba | Qwen3-MoE, Qwen3-Next | Ultra-sparse | 80B total with 3B active (3.7%)
OpenAI | GPT-4 (rumored) | Production frontier | Reportedly ≈1.76T params, 8x220B MoE
Factor | Evidence | Impact
Training efficiency | GLaM achieved GPT-3 quality with 1/3 of the training energy; DBRX training is ~2x more FLOP-efficient than a dense equivalent | Major cost reduction
Inference efficiency | DBRX: 2-3x higher throughput than a 132B dense model; Mixtral serves at roughly Llama 2 13B cost | Enables larger deployments
Quality parity | Mixtral 8x7B matches/exceeds Llama 2 70B on MMLU (+1.7%), GSM8K (+17.6%), HumanEval (+10.3%) | No capability penalty
Scaling ceiling | Dense models face diminishing returns; MoE enables continued scaling to 1T+ parameters | Extends scaling laws
Open ecosystem | Mixtral, DBRX, DeepSeek, and Qwen all ship open weights (Mixtral and Qwen3 under Apache 2.0), enabling research and fine-tuning | Accelerates adoption
Period | Development | Significance
2017 | Shazeer et al. introduce modern sparsely-gated MoE | Foundational architecture
2020-2021 | Google scales to 1T+ parameters | Proves viability at scale
2023 | GPT-4 launches (rumored MoE) | Frontier deployment
2024 | Mixtral, DBRX, DeepSeek-V2 | Open-weight MoE ecosystem emerges
2025 | DeepSeek-V3, Qwen3-Next | Ultra-sparse MoE (3-5% activation)
2026+ | Expected: MoE becomes the default | Dense models may become niche
Question | Current Evidence | Research Priority
Do MoE models have different emergent behaviors? | Limited study; routing adds complexity | HIGH
Is routing a new source of alignment risk? | Unknown; the router learns token-expert mappings | HIGH
Can we interpret expert specialization? | Some evidence experts specialize by domain/language | MEDIUM
Will ultra-sparse (<5% activation) MoE remain stable? | DeepSeek-V3 and Qwen3-Next suggest yes | MEDIUM
Does MoE training require different safety approaches? | Not yet studied systematically | HIGH
Safety techniques that carry over from dense models:

  • Behavioral evaluations - Testing still works
  • RLHF - Training approach unchanged
  • Red teaming - Can still probe for failures
  • Capability evals - Measuring what the model can do

MoE-specific tools that are still missing:

  • Expert-level interpretability - What does each expert learn?
  • Routing analysis - When and why does routing change?
  • Combinatorial testing - How to cover expert combinations?
  • Expert ablation - Can we remove problematic experts?

MoE structure enables some new safety approaches:

Approach | Description | Feasibility
Expert specialization analysis | Study what each expert learns | MEDIUM
Selective expert ablation | Remove experts that encode unwanted behavior | UNKNOWN
Routing intervention | Control which experts activate | POSSIBLE
Expert-level alignment | Train specific experts for safety | SPECULATIVE
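As one illustration, routing intervention or a crude form of expert ablation could in principle be implemented by masking an expert's router logits so it can never be selected. The sketch below is hypothetical and untested, not an established method; whether this removes a behavior cleanly is exactly the open question above, since other experts may encode redundant copies of it.

```python
import torch

def ablate_expert(router_logits, expert_id):
    """Return router logits with one expert masked out of top-k selection."""
    masked = router_logits.clone()
    masked[..., expert_id] = float("-inf")   # this expert can never win top-k selection
    return masked
```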
  • Dense Transformers - The simpler baseline
  • Heavy Scaffolding - How MoE models are deployed
  • SSM/Mamba - Alternative efficient architecture