State-Space Models / Mamba


State-Space Models (SSMs), particularly the Mamba architecture developed by Albert Gu (CMU) and Tri Dao (Princeton), represent a fundamentally different approach to sequence modeling than transformers. Instead of the pairwise attention mechanism (quadratic O(n^2) complexity), SSMs use structured state-space dynamics derived from continuous-time systems theory, achieving linear O(n) complexity in sequence length.

The efficiency gains are substantial: Mamba achieves 5x higher inference throughput than comparably-sized transformers and the Mamba-3B model matches Transformer-6B perplexity while being 40% cheaper to run. On the Long Range Arena benchmark, the foundational S4 model achieved 80.48% average accuracy—the first architecture to solve the Path-X task requiring reasoning over 16,384 tokens—compared to less than 60% for all transformer baselines.

However, pure SSMs exhibit consistent weaknesses on tasks requiring strong in-context learning or copying from context. NVIDIA research (2024) found that while Mamba and Mamba-2 match transformers on many benchmarks at 8B scale, they lag on five-shot MMLU and phonebook lookup tasks. This has driven increasing adoption of hybrid architectures: NVIDIA’s 8B hybrid (43% Mamba-2, 7% attention, 50% MLP layers) outperforms its pure-transformer baseline, and AI21’s Jamba 1.5 Large, which interleaves attention layers among Mamba layers, scored 65.4 on Arena Hard, outperforming Llama-3.1-70B and 405B.

Estimated probability of pure SSMs being dominant at transformative AI: 5-15%. Probability of SSM-transformer hybrids playing significant role: 25-40%.

The fundamental difference between transformers and SSMs lies in how they handle sequence dependencies. Transformers compute pairwise relationships between all tokens (quadratic), while SSMs compress history into a fixed-size state that evolves with each new token (linear).
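To make the contrast concrete, here is a minimal numpy sketch (illustrative only, with toy dimensions chosen for this example) of the two update patterns: attention materializes an n x n score matrix, while an SSM folds each token into a fixed-size state.

```python
import numpy as np

n, d, d_state = 1024, 64, 16
x = np.random.randn(n, d)

# Transformer-style attention: an explicit n x n score matrix -> O(n^2) work and memory.
scores = x @ x.T / np.sqrt(d)                                 # (n, n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ x                                        # every token mixes with every token

# SSM-style recurrence: history is compressed into a fixed-size state -> O(n) work,
# O(1) memory in sequence length. Dynamics here are fixed; Mamba makes them input-dependent.
A = 0.9 * np.eye(d_state)
B = 0.01 * np.random.randn(d_state, d)
C = 0.01 * np.random.randn(d, d_state)
h = np.zeros(d_state)
ssm_out = np.empty_like(x)
for t in range(n):                                            # one constant-cost update per token
    h = A @ h + B @ x[t]
    ssm_out[t] = C @ h
```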


The selection mechanism is Mamba’s key innovation. Unlike prior SSMs, where the state dynamics (the step size delta and the B and C matrices) were fixed regardless of input, Mamba computes them from each token. This allows the model to (see the sketch after the list below):

  • Remember important tokens by writing them strongly into the state (large delta)
  • Effectively forget irrelevant tokens by leaving the state almost unchanged when they arrive (small delta)
  • Focus on content-relevant patterns rather than purely positional ones
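
A hedged sketch of that selection mechanism for a single channel of a Mamba-style block: the step size delta and the B, C projections are computed from each token, so the update can write a token strongly into the state or pass over it. The shapes, the softplus for delta, and the names `W_delta`, `W_B`, `W_C` are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def selective_scan(x, A, W_delta, W_B, W_C):
    """Toy selective SSM scan for one channel.
    x: (seq_len, d_model) token features; A: (d_state,) negative diagonal transition;
    W_delta: (d_model,); W_B, W_C: (d_state, d_model). All names are illustrative."""
    h = np.zeros(A.shape[0])
    y = np.zeros(x.shape[0])
    for t in range(x.shape[0]):
        delta = np.log1p(np.exp(W_delta @ x[t]))   # softplus -> positive, per-token step size
        B_t = W_B @ x[t]                           # input-dependent write projection
        C_t = W_C @ x[t]                           # input-dependent readout
        A_bar = np.exp(delta * A)                  # discretized decay of the old state
        # Large delta: old state decays and the token is written strongly (remembered).
        # Small delta: state passes through nearly unchanged and the token is mostly ignored.
        h = A_bar * h + delta * B_t * x[t, 0]      # x[:, 0] stands in for this channel's input
        y[t] = C_t @ h
    return y

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 32, 16, 128
y = selective_scan(rng.normal(size=(seq_len, d_model)),
                   -np.arange(1.0, d_state + 1),              # stable (negative) transition
                   0.1 * rng.normal(size=d_model),
                   0.1 * rng.normal(size=(d_state, d_model)),
                   0.1 * rng.normal(size=(d_state, d_model)))
print(y.shape)
```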
| Aspect | Transformer | SSM/Mamba |
|---|---|---|
| Attention | Full pairwise attention | None (implicit in state) |
| Complexity | O(n^2) in sequence length | O(n) linear |
| Memory (inference) | O(n) KV cache | O(1) constant state |
| Parallelism | High (attention parallelizes) | Different (scan operations) |
| Long context | Expensive (memory/compute) | Efficient (linear scaling) |
| In-context learning | Strong | Weaker (stateful compression) |
| Proven scale | Yes (GPT-4, Claude level) | Emerging (14B max pure SSM) |

The SSM family has diversified rapidly since 2021. The following table compares major architectures:

| Architecture | Year | Developer | Key Innovation | Best Benchmark Result | Max Scale Trained |
|---|---|---|---|---|---|
| S4 | 2021 | Stanford (Gu, Goel, Ré) | Structured state space parameterization | 80.48% LRA (first to solve Path-X) | 1B parameters |
| H3 | 2022 | Stanford | SSM + short convolutions hybrid | Matched GPT-Neo on OpenWebText | 2.7B parameters |
| Hyena | 2023 | Stanford/Together AI | Implicit long convolutions + gating | Matched Transformer at 20% less compute | 1.4B parameters |
| RWKV | 2023 | Community (RWKV Foundation) | Linear attention + RNN hybrid | Eagle 7B: 3.36 Lambada perplexity | 14B parameters |
| Mamba | 2023 | CMU/Princeton (Gu & Dao) | Selective SSM (input-dependent dynamics) | Mamba-3B matches Transformer-6B | 2.8B parameters |
| Griffin | 2024 | Google DeepMind | Gated linear recurrence + local attention | Matches Llama-2 at 6x fewer tokens | 14B parameters |
| Mamba-2 | 2024 | CMU/Princeton (Gu & Dao) | State space duality (SSD) framework | 2-8x faster than Mamba-1, same quality | 8B parameters |
| Jamba | 2024 | AI21 Labs | SSM + Attention + MoE hybrid | Jamba 1.5 Large: 65.4 Arena Hard | 52B (12B active) |
| StripedHyena | 2023 | Together AI | Optimized Hyena + attention hybrid | Matches Llama-2-7B on OpenLLM | 7B parameters |
| RecurrentGemma | 2024 | Google DeepMind | Griffin-based production model | Matches Gemma with lower memory | 9B parameters |

Mamba (Gu & Dao, 2023) introduced key innovations:

| Innovation | Description | Benefit |
|---|---|---|
| Selective SSM | Input-dependent state dynamics | Better modeling of dependencies |
| Hardware-aware | Optimized for GPU memory hierarchy | Fast inference |
| Gated architecture | Similar to GRU/LSTM gating | Training stability |

Mathematically, an SSM layer is built on a continuous-time linear system:

h'(t) = A h(t) + B x(t)   # State evolution
y(t) = C h(t) + D x(t)   # Output

The key insight is that this continuous system can be discretized and computed efficiently using parallel scans. The matrices have interpretable roles: A (transition) controls how state information persists or decays, B (input) maps new tokens into state, C (output) maps state to predictions, and D provides skip connections. Mamba’s innovation is making these parameters input-dependent (selective), allowing the model to decide what to remember or forget based on content.
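Below is a minimal numpy sketch of that discretization-plus-recurrence pipeline, assuming a diagonal A and zero-order-hold discretization; real Mamba kernels fuse this into a hardware-aware parallel scan rather than a Python loop, so treat this as an illustration of the math, not the implementation.

```python
import numpy as np

def discretize(A, B, delta):
    """Zero-order-hold discretization for a diagonal continuous-time SSM.
    A: (d_state,) negative diagonal; B: (d_state,); delta: scalar step size."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_forward(x, A, B, C, D, delta):
    """Sequential recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t + D x_t."""
    A_bar, B_bar = discretize(A, B, delta)
    h = np.zeros_like(A)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t        # state evolution (discretized)
        y[t] = C @ h + D * x_t             # readout plus skip connection
    return y

d_state = 8
A = -np.arange(1.0, d_state + 1)           # negative reals -> stable decay
B = np.ones(d_state)
C = np.random.randn(d_state) / d_state
y = ssm_forward(np.random.randn(100), A, B, C, D=1.0, delta=0.1)
```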

The following tables compile benchmark results from peer-reviewed papers comparing SSMs against transformers at similar scales.

| Model | Parameters | Training Tokens | Perplexity (Pile / WikiText-103) | Source |
|---|---|---|---|---|
| GPT-3 (Transformer) | 2.7B | 300B | 7.50 | Brown et al. 2020 |
| Mamba | 2.8B | 300B | 6.22 | Gu & Dao 2023 |
| Mamba-2 | 2.7B | 300B | 6.09 | Dao & Gu 2024 |
| Pythia (Transformer) | 2.8B | 300B | 7.92 | Biderman et al. 2023 |
| RWKV-6 | 3B | 1.12T | 5.24 | Peng et al. 2024 |
| Llama-2 (Transformer) | 7B | 2T | 5.47 | Touvron et al. 2023 |
| Griffin | 7B | 300B | 5.83 | De et al. 2024 |

Lower perplexity is better. Mamba achieves superior perplexity at equivalent scale.

NVIDIA’s empirical study (2024) provides the most comprehensive head-to-head comparison at production scale:

| Model | Architecture | MMLU (5-shot) | HellaSwag | ARC-C | WinoGrande | Average |
|---|---|---|---|---|---|---|
| Transformer | Pure attention | 51.2% | 79.1% | 53.8% | 74.2% | 64.6% |
| Mamba | Pure SSM | 45.8% | 78.4% | 52.1% | 73.8% | 62.5% |
| Mamba-2 | Pure SSD | 46.3% | 78.9% | 52.6% | 74.0% | 62.9% |
| Mamba-2-Hybrid | 43% SSM + 7% Attn + 50% MLP | 52.4% | 80.2% | 55.1% | 75.8% | 65.9% |

Hybrid architecture outperforms pure transformer by +1.3 points average while offering 8x faster inference.

| Model | Context Length | Passkey Retrieval | SCROLLS | QuALITY | Source |
|---|---|---|---|---|---|
| GPT-3.5-Turbo | 16K | 100% | 78.2% | 61.3% | OpenAI |
| Mamba | 16K | 99.8% | 76.4% | 58.9% | Gu & Dao 2023 |
| Jamba 1.5 | 256K | 100% | 82.1% | 68.4% | AI21 2024 |
| Griffin | 32K | 99.5% | 77.8% | 62.1% | De et al. 2024 |
| RWKV-7 | 28K | 100% | 74.2% | 55.8% | RWKV Foundation |

SSMs excel at long context due to constant memory usage. RWKV-7 performance degrades rapidly beyond 28K.

| Model | Params | Throughput (tokens/sec) | Memory @ 8K ctx | Memory @ 64K ctx | Latency (ms/token) |
|---|---|---|---|---|---|
| Transformer-7B | 7B | 1,200 | 16 GB | 128 GB | 12.5 |
| Mamba-7B | 7B | 6,000 | 8 GB | 8 GB | 2.5 |
| Hybrid (Jamba) | 52B (12B active) | 4,800 | 10 GB | 14 GB | 3.1 |

Mamba achieves 5x throughput and constant memory regardless of context length.
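A back-of-envelope check on the memory scaling, under assumptions stated explicitly here (32 layers, 32 KV heads, head dimension 128, fp16, batch size 1); the table's larger figures presumably include model weights and batched serving, so this sketch only illustrates the scaling behaviour, not the exact numbers.

```python
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per=2):
    """Transformer KV cache: 2 tensors (K and V) per layer, each (seq_len, heads, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 2**30

def ssm_state_gib(n_layers=32, d_model=4096, d_state=16, expand=2, bytes_per=2):
    """Mamba-style recurrent state: (expand * d_model) channels x d_state per layer,
    independent of sequence length."""
    return n_layers * expand * d_model * d_state * bytes_per / 2**30

print(f"KV cache @ 8K ctx:  {kv_cache_gib(8_192):.1f} GiB")   # ~4 GiB, grows linearly with context
print(f"KV cache @ 64K ctx: {kv_cache_gib(65_536):.1f} GiB")  # ~32 GiB
print(f"SSM state:          {ssm_state_gib():.3f} GiB")       # ~0.008 GiB, constant
```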

| Property | Rating | Assessment |
|---|---|---|
| White-box Access | MEDIUM | Different internals than transformers, less studied |
| Trainability | HIGH | Still gradient-based training |
| Predictability | MEDIUM | Recurrence adds some complexity |
| Modularity | LOW | Similar to transformers |
| Formal Verifiability | UNKNOWN | Recurrent structure might help or hurt |

The shift from attention to state-space dynamics has significant implications for AI safety research. SSMs present both opportunities and challenges that differ fundamentally from transformer-based systems.

| Advantage | Mechanism | Quantified Benefit |
|---|---|---|
| Efficiency enables more testing | 5x throughput means 5x more red-teaming for same cost | 5x evaluation coverage at constant budget |
| Constant memory enables longer evals | No KV cache growth | Can test 100K+ token scenarios cheaply |
| Different failure modes | No attention-based adversarial attacks | May resist prompt injection techniques |
| Deterministic state evolution | Recurrent structure more predictable | Easier to trace information flow |
| Reduced context hijacking | State compression limits perfect recall | Harder to inject malicious instructions late in context |

| Risk Category | Severity | Evidence | Mitigation Status |
|---|---|---|---|
| Interpretability gap | HIGH | Attention visualizations don’t apply; state probing tools immature | Active research at Anthropic, Redwood |
| Unknown emergent behaviors | MEDIUM | No SSM at GPT-4 scale exists; scaling laws less understood | Jamba 1.6 (52B hybrid) is largest production model |
| State opacity | MEDIUM | Hidden state encodes compressed history; less interpretable than attention | Mamba Explained notes interpretability challenges |
| Safety research transfer | MEDIUM | RLHF works, but mechanistic interpretability doesn’t transfer | Need new SSM-specific probing methods |
| Selective mechanism manipulation | LOW-MEDIUM | Selection weights could be adversarially targeted | Not yet demonstrated in practice |

The Gradient’s analysis notes that while attention patterns in transformers provide intuitive visualizations of “what the model is looking at,” SSM interpretability is fundamentally different:

“The precise selection mechanism’s interpretability is less than that of attention visualizations, though selection weights can be probed.”

| Interpretability Method | Transformers | SSMs |
|---|---|---|
| Attention visualization | Direct, intuitive | N/A (no attention) |
| Activation patching | Well-developed | Requires adaptation |
| Circuit analysis | Mature tooling | Nascent |
| Probing classifiers | Works | Works (similar) |
| State analysis | N/A | Emerging method |
| Selection weight analysis | N/A | Possible but less interpretable |
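
Probing classifiers are one of the few methods in the table that carry over directly. The sketch below trains a linear probe on stand-in hidden states; in practice one would capture real state vectors from a Mamba checkpoint and label them with the property of interest. The random data and the scikit-learn dependency are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder "hidden states": replace with state vectors captured from an SSM forward pass.
rng = np.random.default_rng(0)
n_examples, d_state = 2_000, 256
states = rng.normal(size=(n_examples, d_state))
labels = (states[:, :8].sum(axis=1) > 0).astype(int)       # synthetic property to recover

X_tr, X_te, y_tr, y_te = train_test_split(states, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")     # high accuracy => property is linearly decodable
```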

Production and Research Models (2024-2025)

| Model | Developer | Architecture | Parameters | Status | Key Achievement |
|---|---|---|---|---|---|
| Mamba | Gu & Dao | Pure SSM | 130M - 2.8B | Research | First SSM competitive with Transformers |
| Mamba-2 | Gu & Dao | SSD | Up to 8B | Research | 2-8x faster training than Mamba-1 |
| Jamba 1.6 | AI21 Labs | SSM + Attention + MoE | 52B (12B active) | Production | Outperforms Llama-3.1-405B on RAG tasks |
| RecurrentGemma | Google DeepMind | Griffin-based | 2B, 9B | Production | Official Google SSM deployment |
| RWKV-7 | RWKV Foundation | RNN + Linear Attention | Up to 14B | Open Source | Strongest open-source pure SSM |
| Codestral Mamba | Mistral AI | Pure Mamba | 7B | Production | First commercial pure-Mamba for code |
| Granite 4.0 | IBM Research | Mamba-2 hybrid | Various | Production | Enterprise SSM deployment |
| StripedHyena | Together AI | Hyena + Attention | 7B | Research | Matches Llama-2-7B with 50% less memory |

The emergence of hybrid models reflects a growing consensus that pure SSMs and pure transformers each have fundamental limitations. Hybrids aim to capture the efficiency of SSMs with the in-context learning strength of attention.

| Hybrid Pattern | SSM Ratio | Attention Ratio | Example | Rationale |
|---|---|---|---|---|
| Interleaved | 87.5% | 12.5% | Jamba (1 attn per 8 layers) | Minimal attention for retrieval tasks |
| Block-based | 43% | 7% + 50% MLP | Mamba-2-Hybrid | Optimal ratio from scaling laws |
| Head-mixed | 50% | 50% | H3 | Early hybrid exploration |
| Local + Global | 75% | 25% local only | Griffin | Local attention for nearby context |

NVIDIA’s empirical study found the 43% SSM + 7% attention + 50% MLP configuration optimal at 8B scale, outperforming pure transformers by +2.65 points average while projecting 8x faster generation.
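As a concrete illustration of what those ratios mean for a layer stack, here is a small sketch that lays out a 48-layer hybrid in roughly 43/7/50 proportions, with the attention layers spread evenly. The layer count and placement rule are assumptions for illustration; the published hybrids choose their own interleaving.

```python
def hybrid_layer_plan(n_layers=48, ssm_frac=0.43, attn_frac=0.07):
    """Assign each layer a block type in roughly the given proportions (remainder = MLP)."""
    n_attn = max(1, round(n_layers * attn_frac))
    n_ssm = round(n_layers * ssm_frac)
    n_mlp = n_layers - n_attn - n_ssm
    attn_positions = {round(i * n_layers / n_attn) for i in range(n_attn)}  # spread attention out
    plan, ssm_left, mlp_left = [], n_ssm, n_mlp
    for i in range(n_layers):
        if i in attn_positions:
            plan.append("attention")
        elif ssm_left >= mlp_left:
            plan.append("ssm"); ssm_left -= 1
        else:
            plan.append("mlp"); mlp_left -= 1
    return plan

plan = hybrid_layer_plan()
print({kind: plan.count(kind) for kind in ("ssm", "attention", "mlp")})
# -> {'ssm': 21, 'attention': 3, 'mlp': 24} for 48 layers
```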

| Paper | Authors | Venue | Key Contribution | Citations |
|---|---|---|---|---|
| S4: Structured State Spaces for Sequence Modeling | Gu, Goel, Ré | ICLR 2022 | First efficient SSM parameterization | 1,500+ |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Gu, Dao | COLM 2024 | Input-dependent (selective) SSMs | 2,000+ |
| Transformers are SSMs (Mamba-2) | Dao, Gu | ICML 2024 | State Space Duality unifying SSMs and attention | 400+ |
| Hyena Hierarchy | Poli et al. | ICML 2023 (Oral) | Implicit convolutions as attention alternative | 600+ |
| RWKV: Reinventing RNNs for the Transformer Era | Peng et al. | EMNLP 2023 | Linear attention + RNN formulation | 500+ |
| Griffin: Mixing Gated Linear Recurrences | De et al. (Google) | ICML 2024 | Production-ready recurrent architecture | 200+ |
| An Empirical Study of Mamba-based Language Models | Waleffe et al. (NVIDIA) | 2024 | Definitive 8B-scale comparison | 100+ |

| Researcher/Lab | Affiliation | Contribution | Current Focus |
|---|---|---|---|
| Albert Gu | CMU → Cartesia AI | S4, Mamba, Mamba-2, SSM theory | Commercial SSM deployment |
| Tri Dao | Princeton → Together AI | FlashAttention, Mamba optimization | Hardware-efficient algorithms |
| Chris Ré | Stanford/Together AI | S4, Hyena, SAFARI project | Long-context architectures |
| Google DeepMind | | Griffin, RecurrentGemma, Hawk | Production recurrent models |
| AI21 Labs | | Jamba series | First production hybrid SSM |
| RWKV Foundation | Community | RWKV-4 through RWKV-7 | Open-source SSM ecosystem |
| IBM Research | | Bamba, Granite SSM collaboration | Enterprise SSM deployment |
| Mistral AI | | Codestral Mamba | Code-focused SSM models |

Tasks where SSMs are a strong fit:

| Task | Performance | Why |
|---|---|---|
| Long document processing | GOOD | Linear complexity |
| Audio/signal processing | EXCELLENT | Designed for continuous signals |
| Efficient inference | EXCELLENT | O(n) vs O(n²) |

Tasks where transformers retain the advantage:

| Task | Assessment | Reason |
|---|---|---|
| In-context learning | Transformers better | Attention enables direct comparison |
| Few-shot reasoning | Transformers better | Requires token-to-token reasoning |
| Frontier capabilities | Transformers | Simply more proven at scale |

Drivers shaping SSM adoption through 2027:

| Driver | Current Status | 2025-2027 Projection | Impact on SSM Adoption |
|---|---|---|---|
| Context length demand | 100K-200K standard | 1M+ contexts emerging | HIGH: Transformers hit memory walls |
| Inference cost pressure | $0.01-0.10/1K tokens | Cost competition intensifying | HIGH: SSM 5x cheaper inference |
| Memory bandwidth | H100: 3.35 TB/s | Scaling slower than compute | MEDIUM: Benefits SSM constant-memory |
| Agentic workloads | Emerging | 30-50% of enterprise AI by 2027 | HIGH: Long contexts, repeated inference |
| Edge deployment | Limited | Growing rapidly | HIGH: SSM memory efficiency critical |

Arguments for SSM/Hybrid Growth (60-70% probability of significant adoption)

  1. Efficiency becomes critical — At GPT-5+ scale, O(n^2) attention cost is $10-100M per training run. SSM efficiency offers 40-80% cost reduction.
  2. Long context is table stakes — Applications demand 100K-1M token contexts. Transformer KV cache hits memory limits; SSM scales linearly.
  3. Hybrid architectures validated — NVIDIA’s study and Jamba 1.5 demonstrate hybrids can outperform pure transformers with better efficiency.
  4. Production deployments expanding — Google (RecurrentGemma), AI21 (Jamba 1.6), Mistral (Codestral Mamba), IBM (Granite 4.0) all shipping SSM-based models.

Arguments Against (30-40% probability SSMs remain niche)

  1. In-context learning ceiling — Pure SSMs consistently underperform on MMLU, few-shot tasks. May be fundamental limit of stateful compression.
  2. Transformer ecosystem lock-in — PyTorch, TensorFlow, vLLM, TensorRT all optimized for attention. Switching costs are substantial.
  3. Investment momentum — >95% of frontier training compute goes to transformers. Network effects favor incumbents.
  4. Interpretability gap — Safety teams trained on attention analysis. SSM interpretability tools 3-5 years behind.

| Scenario | Probability | Key Indicators |
|---|---|---|
| Hybrids dominate (SSM + Attention) | 45% | Jamba/Griffin-style architectures become default |
| Transformers remain dominant | 35% | Pure attention with improved efficiency (e.g., FlashAttention-4) |
| Pure SSMs breakthrough | 10% | SSM solves in-context learning limitation |
| New architecture emerges | 10% | Neither SSM nor transformer (e.g., state-space diffusion) |

Safety techniques that transfer to SSMs:

  • RLHF - Training approach similar
  • Behavioral evals - Testing works the same
  • Red teaming - Adversarial testing still applies

Techniques that do not transfer:

  • Attention-based interpretability - No attention to analyze
  • Transformer-specific probes - Need new tools
  • Circuit analysis - Different computational structure

Research opportunities specific to SSMs:

| Opportunity | Description |
|---|---|
| State analysis | Understand what hidden states encode |
| Recurrence interpretability | New methods for recurrent systems |
| Efficiency-enabled safety | More evaluation for same cost |

| Question | Current Evidence | Resolution Timeline | Importance |
|---|---|---|---|
| Can pure SSMs match transformers at frontier scale? | No pure SSM >14B trained; hybrids close gap | 2025-2026 (if labs invest) | CRITICAL |
| Is in-context learning fundamentally limited by state compression? | Evidence suggests yes; hybrids mitigate | Ongoing theoretical research | HIGH |
| Do SSMs have different safety properties? | Unknown; less interpretability research | 2-3 years of safety research needed | HIGH |
| Will hybrids become standard architecture? | Strong evidence: Jamba, Griffin, NVIDIA study | 2025 (trend clear) | MEDIUM |
| Can SSM interpretability catch up? | Tools emerging but 3-5 years behind transformer tooling | 2026-2028 | MEDIUM |

The core uncertainty is whether the in-context learning limitation of pure SSMs is:

A. Fundamental — State compression inherently loses precise retrieval capability. Transformers’ O(n) KV cache stores exact tokens; SSMs’ O(1) state must compress. If true, hybrids will dominate.

B. Solvable — Better selection mechanisms, larger state dimensions, or architectural innovations could match transformer in-context learning. If true, pure SSMs could dominate due to efficiency.

Current evidence favors interpretation (A): NVIDIA’s empirical study found that even at 8B scale with extensive training, pure Mamba-2 lags on MMLU (46.3% vs 51.2%) and phonebook lookup tasks. The 43% SSM + 7% attention hybrid closes this gap completely, suggesting attention provides irreplaceable retrieval capability.
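
A toy numerical illustration of interpretation (A), using superposition into a fixed-size vector as a stand-in for state compression (the storage scheme and dimensions are assumptions of this sketch, not how any of the cited models work): an exact cache recalls every pair perfectly, while recall from a fixed-size state degrades once the number of stored items greatly exceeds the state dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 256                                               # fixed "state" size

def compressed_recall(n_pairs):
    """Store n key-value pairs by superposing them into one d_state-dim vector, then read back."""
    keys = rng.standard_normal((n_pairs, d_state)) / np.sqrt(d_state)
    values = rng.integers(0, 2, size=n_pairs) * 2 - 1       # +/-1 "phonebook digits"
    state = (keys * values[:, None]).sum(axis=0)            # write: superposition into fixed state
    readout = np.sign(keys @ state)                         # read: linear lookup + sign
    return (readout == values).mean()

for n in (16, 256, 4096):
    print(f"{n:5d} items in a {d_state}-dim state -> recall {compressed_recall(n):.2f}")
# An exact KV cache recalls 100% at any n; the compressed state degrades as n grows past d_state.
```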

Related:

  • Dense Transformers - The dominant alternative
  • Sparse/MoE - Another efficiency-focused approach
  • Heavy Scaffolding - Deployment pattern