Mamba-3: Improved Sequence Modeling using State Space Principles
Aakash Lahoti*¹, Kevin Y. Li*¹, Berlin Chen*², Caitlin Wang*², Aviv Bick¹, J. Zico Kolter¹, Tri Dao†²³, and Albert Gu†¹⁴

¹Carnegie Mellon University  ²Princeton University  ³Together AI  ⁴Cartesia AI
{alahoti, kyl2, abick, zkolter, agu}@cs.cmu.edu  {bc2188, caitlinwang, tridao}@princeton.edu

Abstract

Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that enables richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation for better model performance without increasing decode latency. Together with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points compared to the next best model (Gated DeltaNet), with Mamba-3's MIMO variant further improving accuracy by another 1.2 points for a total 1.8-point gain.
Across state-size experiments, Mamba-3 achieves comparable perplexity to Mamba-2 despite using half of its predecessor's state size. Our evaluations demonstrate Mamba-3's ability to advance the performance-efficiency Pareto frontier.

1 Introduction

Test-time compute has emerged as a key driver of progress in LLMs, with techniques like chain-of-thought reasoning and iterative refinement demonstrating that inference-time scaling can unlock new capabilities (Snell et al. 2024; Wu et al. 2025). The rapid rise of parallel, agentic workflows has only intensified the need for efficient inference and deployment of such models (Anthropic 2026; OpenAI 2026). This paradigm shift makes inference efficiency (Kwon et al. 2023; Li et al. 2024) paramount, as the practical impact of AI systems now depends critically on their ability to perform large-scale inference during deployment.

Model architecture design plays a fundamental role in determining inference efficiency, as architectural choices directly dictate the computational and memory requirements during generation. While Transformer-based models (Vaswani et al. 2017) are the current industry standard, they are fundamentally bottlenecked by linearly increasing memory demands through the KV cache and quadratically increasing compute requirements through the self-attention mechanism. These drawbacks have motivated recent lines of work on sub-quadratic models, e.g., state space models (SSMs) and linear attention, which retain constant memory and linear compute while attaining comparable or better performance than their Transformer counterparts. These models have made it into the mainstream, with layers such as Mamba-2 (Dao and Gu 2024) and Gated DeltaNet (GDN) (Schlag, Irie, and Schmidhuber 2021; S. Yang, B. Wang, Y. Zhang, et al. 2025) recently incorporated into large-scale hybrid models that match the performance of pure Transformer alternatives with much higher efficiency (Kimi Team et al. 2025; NVIDIA et al.
2025; Tencent Hunyuan Team et al. 2025; A. Yang et al. 2025).

* Equal contribution. † Equal advising.

Despite the success of linear models, significant progress remains in improving their performance, in particular on advancing the Pareto frontier between model quality and inference efficiency. For example, Mamba-2 was developed to improve training speed and simplicity over Mamba-1 (Gu and Dao 2024), by sacrificing some expressivity and thus performing worse for inference-matched models. In addition, such models have been shown to lack certain capabilities, exhibiting poor state-tracking abilities on tasks as simple as determining the parity of bit sequences (Grazzi, Siems, Zela, et al. 2025; Sarrof, Veitsman, and Hahn 2024). Finally, despite these sub-quadratic models being prized for theoretically efficient inference, and hence their widespread adoption, their inference algorithms are not hardware-efficient. In particular, because these algorithms were developed from a training perspective, their decoding phase has low arithmetic intensity (the ratio of FLOPs to memory traffic), resulting in large portions of hardware remaining idle.

To develop more performant models from an inference-first paradigm, we introduce three core methodological changes on top of Mamba-2, influenced by an SSM-centric viewpoint of sub-quadratic models.

Exponential-Trapezoidal Discretization. We provide a simple technique for discretizing time-varying, selective SSMs. Through our framework, we can derive several new discretization methods. One of our instantiations, referred to as "exponential-Euler," formalizes Mamba-1 and Mamba-2's heuristic discretization that previously lacked theoretical justification. Our new "exponential-trapezoidal" instantiation is a more expressive generalization of "exponential-Euler," where the recurrence can be expanded to reveal an implicit convolution applied to the SSM input.
Combined with explicit $B$, $C$ bias terms, Mamba-3 can empirically replace the short causal convolution in language model architectures, which was previously hypothesized to be essential for recurrent models.

Complex-Valued State Space Model. By viewing the underlying SSM of Mamba-3 as complex-valued, we enable a more expressive state update than Mamba-2's. This change in update rule, designed to be lightweight for training and inference, overcomes the lack of state-tracking ability common in many current linear models. We show that our complex-valued update rule is equivalent to a data-dependent rotary embedding (Su et al. 2023) and can be efficiently computed, and empirically demonstrate its ability to solve synthetic tasks outside the capabilities of prior linear models.

Multi-Input, Multi-Output (MIMO) SSM. To improve FLOP efficiency during decoding, we switch from an outer-product-based state update to a matrix-multiplication-based state update. From the view of the signal processing foundations of SSMs, this transition exactly coincides with the generalization from single-input single-output (SISO) sequence dynamics to multiple-input multiple-output (MIMO) dynamics. Here, we find that MIMO is particularly suitable for inference, as the extra expressivity enables more computation during the memory-bound state update during decoding, without increasing the state size or compromising speed.

Put together, these improvements form the core of our Mamba-3 layer. Methodologically, we note that these all arise naturally from an SSM-centric perspective but are not immediate from other popular viewpoints of modern linear layers such as linear attention or test-time regression; we discuss these connections further in Section 5. Empirically, we validate our new model's abilities and capabilities on a suite of synthetic state-tracking and language-modeling tasks.

• Better Quality.
At the 1.5B scale, Mamba-3 (MIMO) improves downstream language modeling accuracy by +2.2 points over Transformers, +1.9 points over Mamba-2, and +1.8 points over GDN, while Mamba-3 (SISO) improves over the next best model, GDN, by +0.6 points. Furthermore, across state-size experiments, Mamba-3 (MIMO) with state size 64 matches the perplexity of Mamba-2 with state size 128, effectively achieving the same language modeling performance with half the latency.

• New Capabilities. Mamba-3's complexification of the SSM state enables it to solve synthetic state-tracking tasks that Mamba-2 cannot. We empirically demonstrate that the efficient RoPE-like calculation is able to near-perfectly solve arithmetic tasks, while Mamba-3 without RoPE and Mamba-2 perform no better than random guessing.

• Inference Efficiency. Mamba-3 (MIMO) improves hardware utilization. It increases decoding FLOPs by up to 4× relative to Mamba-2 at fixed state size, while maintaining similar wall-clock decode latency, and simultaneously improving perplexity and downstream performance. We release fast training and inference kernels for Mamba-3 at https://github.com/state-spaces/mamba.

Mamba-3 (SISO) improves quality and capability over prior linear models, and Mamba-3 (MIMO) further improves performance over Mamba-3 (SISO) and other strong baselines while matching inference speed with Mamba-2. Both of our Mamba-3 variants advance the performance-latency Pareto frontier through their strong modeling capabilities and hardware-efficient design.

2 Preliminaries

2.1 Notation

Scalars are denoted by plain-text letters (e.g., $x$, $y$). Tensors, including vectors and matrices, are denoted by bold letters (e.g., $\mathbf{h}$, $\mathbf{C}$). The shape of a tensor can be inferred from context. We denote the input sequence length as $T$, the model dimension as $D$, and the SSM state size as $N$. For time indices, we use subscripts (e.g., $x_t$ for the input at time $t$). The Hadamard product between two tensors is denoted by $\odot$.
For a vector $\mathbf{v} \in \mathbb{R}^d$, we denote by $\mathrm{Diag}(\mathbf{v}) \in \mathbb{R}^{d \times d}$ the diagonal matrix with the vector $\mathbf{v}$ on the diagonal, and for products of scalars across time steps, we use the notation $\alpha_{t \cdots s} = \alpha^{\times}_{t:s} = \prod_{i=s}^{t} \alpha_i$.

2.2 SSM Preliminaries

State Space Models (SSMs) describe continuous-time linear dynamics via
$$\dot{\mathbf{h}}(t) = \mathbf{A}(t)\,\mathbf{h}(t) + \mathbf{B}(t)\,x(t), \qquad y(t) = \mathbf{C}(t)^\top \mathbf{h}(t),$$
where $\mathbf{h}(t) \in \mathbb{R}^N$ is the hidden state, $x(t) \in \mathbb{R}$ the input, and $\mathbf{A}(t) \in \mathbb{R}^{N \times N}$, $\mathbf{B}(t), \mathbf{C}(t) \in \mathbb{R}^N$. We will occasionally refer to $\mathbf{A}(t)$ as the state-transition and $\mathbf{B}(t)\,x(t)$ as the state-input; this terminology also extends to their discretized counterparts. For discrete sequences with step size $\Delta_t$, Mamba-1 and Mamba-2 discretized the system to the recurrence
$$\mathbf{h}_t = e^{\Delta_t \mathbf{A}_t}\,\mathbf{h}_{t-1} + \Delta_t \mathbf{B}_t x_t, \qquad y_t = \mathbf{C}_t^\top \mathbf{h}_t.$$

Mamba-2's Parameterization. The core of the Mamba-2 layer (Dao and Gu 2024) is a data-dependent and hardware-efficient SSM. Both the state-transition and state-input are made data-dependent through the projection of $\Delta_t \in \mathbb{R}_{>0}$ and $\mathbf{B}, \mathbf{C} \in \mathbb{R}^N$ from the current token. By parameterizing the state-transition $\mathbf{A}_t$ as a scalar times the identity ($\mathbf{A}_t = A_t \mathbf{I}_{N \times N}$, where $A_t \in \mathbb{R}_{<0}$), the SSM recurrence can be efficiently computed with the matrix-multiplication tensor cores of GPUs. Defining $\alpha_t := e^{\Delta_t A_t} \in (0, 1)$ and $\gamma_t := \Delta_t$, the update becomes
$$\mathbf{h}_t = \alpha_t \mathbf{h}_{t-1} + \gamma_t \mathbf{B}_t x_t, \qquad y_t = \mathbf{C}_t^\top \mathbf{h}_t. \tag{1}$$
The data-dependent state-transition $\alpha_t$ controls the memory horizon of each SSM within the layer. $\Delta_t$ in particular modulates both the state-transition and the state-input: a larger $\Delta_t$ forgets faster and up-weights the current token more strongly, while a smaller $\Delta_t$ retains the hidden state with minimal contributions from the current token.

Remark 1. In Mamba-2, $A_t$ is data-independent, since the overall discrete transition $\alpha_t := e^{\Delta_t A_t}$ is already data-dependent through $\Delta_t$.
In Mamba-3, we empirically found that data-dependent $A_t$ has similar performance to data-independent $A_t$, and chose the former as a default for consistency, so that all SSM parameters are data-dependent.

2.3 Structured Masked Representation and State Space Duality

Mamba-2 showed that a large class of SSMs admit a matrix form that vectorizes the time-step recurrence. Through the state space duality (SSD) framework, recurrent SSMs can be represented within a parallel form that incorporates an element-wise mask to model the state-transition decay. SSD provides a general framework for a duality between linear recurrences and parallelizable (matrix-multiplication-based) computational forms
$$\mathbf{Y} = (\mathbf{L} \odot \mathbf{C}\mathbf{B}^\top)\,\mathbf{X} \tag{2}$$
where $\mathbf{L} \in \mathbb{R}^{T \times T}$ is a structured mask, $\mathbf{B}, \mathbf{C} \in \mathbb{R}^{T \times N}$ and $\mathbf{X} \in \mathbb{R}^{T \times D}$ are the inputs to the SSM, and $\mathbf{Y} \in \mathbb{R}^{T \times D}$ is its output. Different structures on $\mathbf{L}$ give rise to various instantiations of SSD. Equation (2) also draws a general connection between recurrence and attention, by setting $\mathbf{Q} := \mathbf{C}$, $\mathbf{K} := \mathbf{B}$, $\mathbf{V} := \mathbf{X}$ and viewing $\mathbf{L}$ as a data-dependent mask. In fact, the simplest case of SSD is (causal) linear attention (Katharopoulos et al. 2020), where $\mathbf{L}$ is the causal triangular mask.

[Figure 1: Left: The structured mask induced by the exponential-trapezoidal rule (Section 3.1) is a product of the decay mask and a two-band convolutional mask. Right: Euler (hold endpoint) versus trapezoidal (average endpoints) integral approximation.]

Mamba-2 is a generalization where
$$\mathbf{L} = \begin{bmatrix} 1 & & & \\ \alpha_1 & 1 & & \\ \vdots & & \ddots & \\ \alpha_{T \cdots 1} & \cdots & \alpha_T & 1 \end{bmatrix} \cdot \mathrm{Diag}(\boldsymbol{\gamma}) \tag{3}$$
composed of terms $\alpha_t, \gamma_t$ from equation (1).² In Section 3.1.3, we show that Mamba-3 is a generalization of Mamba-2 with a more expressive $\mathbf{L}$, and hence also an instance of SSD.
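To make the duality concrete, the following NumPy sketch (toy sizes and variable names are ours, not from the paper) runs the Mamba-2 recurrence of equation (1) and checks that it matches the masked parallel form of equations (2) and (3):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D = 6, 4, 3  # toy sequence length, state size, model dimension

alpha = rng.uniform(0.5, 1.0, T)   # discrete state-transition scalars
gamma = rng.uniform(0.1, 1.0, T)   # input scaling (Delta_t)
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
X = rng.standard_normal((T, D))

# Recurrent form (eq. 1), broadcast over the model dimension:
# h_t = alpha_t h_{t-1} + gamma_t B_t x_t^T, y_t = C_t^T h_t
h = np.zeros((N, D))
Y_rec = np.zeros((T, D))
for t in range(T):
    h = alpha[t] * h + gamma[t] * np.outer(B[t], X[t])
    Y_rec[t] = C[t] @ h

# Parallel SSD form Y = (L ⊙ C B^T) X with the 1-semiseparable
# mask of eq. (3): L[t, s] = (prod_{i=s+1}^t alpha_i) * gamma_s
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(alpha[s + 1 : t + 1]) * gamma[s]
Y_par = (L * (C @ B.T)) @ X

assert np.allclose(Y_rec, Y_par)
```

The two forms agree to numerical precision; the parallel form is the one that maps onto batched matrix multiplications during training.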
3 Methodology

We introduce Mamba-3, a state space model with three new innovations: "exponential-trapezoidal" discretization for more expressive dynamics (Section 3.1), complex-valued state spaces for state tracking (Section 3.2), and multi-input, multi-output (MIMO) dynamics to improve modeling power and inference-time hardware utilization (Section 3.3). These advances address the quality, capability, and efficiency limitations of current sub-quadratic architectures. We combine them into an updated Mamba architecture block in Section 3.4.

3.1 Exponential-Trapezoidal Discretization

Structured SSMs are naturally defined as continuous-time dynamical systems that map input functions $x(t) \in \mathbb{R}$ to output functions $y(t) \in \mathbb{R}$ for time $t > 0$. The underlying continuous state space system is defined by a first-order ordinary differential equation (ODE) for the state $\dot{\mathbf{h}}(t)$ and an algebraic equation for the output $y(t)$. In sequence modeling, however, the data is only observed at discrete time steps, which requires applying a discretization step to the SSM to transform its continuous-time dynamics into a discrete recurrence.

Discretization methods are well studied in classical control theory, with several canonical formulas used in earlier SSM works in deep learning (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Smith, Warrington, and Linderman 2023). These mechanisms were traditionally stated and applied to linear time-invariant (LTI) systems, and their derivations do not directly apply to linear time-varying (LTV) systems. Additionally, while Mamba-1 adapted the zero-order hold (ZOH) method to LTV systems without proof, the complexity associated with selective SSMs prompted the use of an additional heuristic approximation that lacked theoretical justification and did not correspond to any established discretization technique.
In the following subsection, we formalize the previous heuristics used in current LTV SSMs through our discretization framework and use it to propose a more expressive discretization scheme.

² In the original Mamba-2 paper, $\gamma$ does not appear because it is viewed as folded into the $\mathbf{B}$ term. In this paper, $\mathbf{B}_t$ represents the continuous parameter, whereas in Mamba-2, $\mathbf{B}_t$ represents the discretized parameter, which is equivalent to $\gamma_t \mathbf{B}_t$.

³ While the Mamba-1 paper reports ZOH discretization, the implementation follows https://github.com/state-spaces/mamba/issues/129.

Table 1: Canonical linear time-invariant discretizations (top) and custom linear time-varying discretizations derived from our exponential-adjusted framework (bottom), along with their appearance in structured SSMs used in deep learning. Our theory formalizes the prior Mamba discretization as exponential-Euler and extends it with the more expressive exponential-trapezoidal method. The generalized discretization framework converts a continuous SSM $\dot{\mathbf{h}}(t) = \mathbf{A}(t)\mathbf{h}(t) + \mathbf{B}(t)x(t)$ into the discrete recurrence $\mathbf{h}_t = \alpha_t \mathbf{h}_{t-1} + \beta_t \mathbf{B}_{t-1} x_{t-1} + \gamma_t \mathbf{B}_t x_t$, where the various discretization methods yield different formulas for $\alpha_t$, $\beta_t$, $\gamma_t$.

| Discretization Method | $\alpha_t$ | $\beta_t$ | $\gamma_t$ | Appearance |
|---|---|---|---|---|
| Forward Euler | $I + \Delta A$ | — | $\Delta$ | — |
| Backward Euler | $(I - \Delta A)^{-1}$ | — | $(I - \Delta A)^{-1}\Delta$ | — |
| Trapezoidal | $(I - \frac{\Delta}{2}A)^{-1}(I + \frac{\Delta}{2}A)$ | — | $(I - \frac{\Delta}{2}A)^{-1}\Delta$ | S4 |
| Zero-Order Hold | $\exp(\Delta A)$ | — | $A^{-1}(\exp(\Delta A) - I)$ | S4D, S5 |
| Zero-Order Hold | $\exp(\Delta_t A_t)$ | — | $A_t^{-1}(\exp(\Delta_t A_t) - I)$ | — |
| Exponential-Euler | $\exp(\Delta_t A_t)$ | — | $\Delta_t$ | Mamba-1, -2³ |
| Exponential-Trapezoidal | $\exp(\Delta_t A_t)$ | $(1 - \lambda_t)\,\Delta_t \exp(\Delta_t A_t)$ | $\lambda_t \Delta_t$ | Mamba-3 |

3.1.1 Overview of Exponential-Adjusted Discretization

We introduce a simple derivation that leads to a class of new discretization methods for LTV state space models.
The method can be instantiated in various ways; we show that one instantiation recovers the heuristic used in Mamba-1/2, thereby theoretically justifying it (exponential-Euler). We also introduce a more powerful discretization (exponential-trapezoidal) used in Mamba-3.

The high-level intuition of our derivation originates from the closed-form solution $x(t) = e^{tA}x(0)$ of the simple linear ODE $x'(t) = Ax(t)$, which discretizes to $x_{t+1} = e^{\Delta A} x_t$. In this example, the exponential dominates the dynamics of the underlying first-order ODE, resulting in imprecise approximations when using low-order methods without significantly constraining $\Delta$. Thus, we analyze the dynamics of the exponential-adjusted system $e^{-At}x(t)$. The adjusted system yields a discrete recurrent form where the state-transition and state-input integrals are approximated separately: the state-transition integral is approximated by a right-hand approximation, i.e., $A(s) := A(\tau_t)$ for all $s \in [\tau_{t-1}, \tau_t]$, yielding
$$\mathbf{h}(\tau_t) = \underbrace{\exp\left(\int_{\tau_{t-1}}^{\tau_t} A(s)\, ds\right)}_{\text{via right-hand approximation}} \mathbf{h}(\tau_{t-1}) + \underbrace{\int_{\tau_{t-1}}^{\tau_t} \exp\left(\int_{\tau}^{\tau_t} A(s)\, ds\right) \mathbf{B}(\tau)\, x(\tau)\, d\tau}_{\text{via different discretization schemes}},$$
$$\mathbf{h}_t \approx \exp(\Delta_t A_t)\, \mathbf{h}_{t-1} + \int_{\tau_{t-1}}^{\tau_t} \exp\big((\tau_t - \tau) A_t\big)\, \mathbf{B}(\tau)\, x(\tau)\, d\tau,$$
which serves as the foundation for further discretization techniques for the state-input integral. The full derivation is detailed in Proposition 5.

ZOH. The classical zero-order hold discretization method can be derived from the foundation above with a specific approximation of the right-hand side integral. By treating $A_t$, $\mathbf{B}(\tau)$, $x(\tau)$ as constants over the interval $[\tau_{t-1}, \tau_t]$, with values fixed to the right endpoint $\tau_t$, the integral evaluates to $A_t^{-1}(\exp(\Delta_t A_t) - I)\,\mathbf{B}_t x_t$. We note that this formally proves that the classical ZOH formula for LTI systems applies to LTV systems by naively replacing the parameters $A$, $B$, $\Delta$ with their time-varying counterparts.
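The gap between ZOH and the simpler heuristic below can be seen on a scalar example. The following self-contained check (constants ours) verifies that the ZOH input coefficient $A^{-1}(e^{\Delta A} - 1)$ and the plain step size $\Delta$ agree to first order in $\Delta$, which is why substituting one for the other is a defensible approximation for small steps:

```python
import math

# For a scalar system, compare the ZOH input coefficient
# gamma_ZOH = A^{-1}(exp(Delta * A) - 1) against the
# exponential-Euler coefficient gamma_EE = Delta.
A = -2.0  # A_t < 0, as in Mamba's parameterization

def gamma_zoh(delta, a=A):
    return (math.exp(delta * a) - 1.0) / a

for delta in (0.1, 0.01, 0.001):
    err = abs(gamma_zoh(delta) - delta)
    # Taylor expansion gives gamma_ZOH = Delta + (A/2) Delta^2 + O(Delta^3),
    # so the local discrepancy shrinks like O(Delta^2).
    assert err < abs(A) * delta ** 2
```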
Exponential-Euler (Mamba-1/-2). While Mamba-1 stated that it used the time-varying ZOH formula above, Mamba-1 and Mamba-2 actually used an additional approximation in the released implementation. This discretization can be recovered by approximating the state-input integral with Euler's rule (Süli and Mayers 2003) and holding the (right) endpoint constant throughout the interval (Fig. 1):
$$\mathbf{h}_t \approx e^{\Delta_t A_t}\,\mathbf{h}_{t-1} + (\tau_t - \tau_{t-1})\, e^{(\tau_t - \tau_t) A_t}\,\mathbf{B}_t x_t = e^{\Delta_t A_t}\,\mathbf{h}_{t-1} + \Delta_t \mathbf{B}_t x_t. \tag{4}$$
We call equation (4) the exponential-Euler discretization method, stemming from the exponential integration followed by the Euler approximation. This derivation justifies the formulas used in Mamba-1/-2's implementations.

Exponential-Trapezoidal (Mamba-3). However, Euler's rule provides only a first-order approximation of the state-input integral, and its local truncation error scales as $O(\Delta_t^2)$. In contrast, we introduce a generalized trapezoidal rule, which provides a second-order accurate approximation of the integral, offering improved accuracy over Euler's rule. Specifically, it approximates the integral with a data-dependent, convex combination of both interval endpoints. This generalization extends the classical trapezoidal rule (Süli and Mayers 2003), which simply averages the interval endpoints (Figure 1).

Proposition 1 (Exponential-Trapezoidal Discretization). Approximating the state-input integral in equation (16) by the generalized trapezoidal rule yields the recurrence
$$\mathbf{h}_t = e^{\Delta_t A_t}\,\mathbf{h}_{t-1} + (1 - \lambda_t)\,\Delta_t\, e^{\Delta_t A_t}\,\mathbf{B}_{t-1} x_{t-1} + \lambda_t \Delta_t\,\mathbf{B}_t x_t \tag{5}$$
$$=: \alpha_t \mathbf{h}_{t-1} + \beta_t \mathbf{B}_{t-1} x_{t-1} + \gamma_t \mathbf{B}_t x_t, \tag{6}$$
where $\lambda_t \in [0, 1]$ is a data-dependent scalar, $\alpha_t := e^{\Delta_t A_t}$, $\beta_t := (1 - \lambda_t)\,\Delta_t\, e^{\Delta_t A_t}$, and $\gamma_t := \lambda_t \Delta_t$.

Remark 2 (Expressivity).
The exponential-trapezoidal rule is a generalization of (a) the classical trapezoidal rule, which is recovered when $\lambda_t = \frac{1}{2}$, and (b) Mamba-2's Euler rule, which is recovered when $\lambda_t = 1$.

Remark 3 (Error Rate). This is a second-order discretization of the state-input integral, and its error scales as $O(\Delta_t^3)$ under standard stability assumptions, provided that the trapezoidal parameter satisfies $\lambda_t = \frac{1}{2} + O(\Delta_t)$. However, our ablations indicate that not enforcing this constraint is better for empirical performance. See Appendix A.2 and A.3 for details.

Our new discretization framework and its two instantiations, exponential-Euler and exponential-trapezoidal, are, to the best of our knowledge, novel for structured SSMs used in deep learning. Table 1 compares and summarizes canonical and commonly used discretization schemes for state space models.

3.1.2 Exponential-Trapezoidal Recurrence as an Implicit Convolution

Our generalized exponential-trapezoidal discretization is equivalent to applying a data-dependent convolution of size two to the state-input of the SSM. In particular, a normal SSM in recurrent form materializes the state-input $\mathbf{v}_t = \mathbf{B}_t x_t$, then computes a linear recurrence $\mathbf{h}_t = \alpha_t \mathbf{h}_{t-1} + \gamma_t \mathbf{v}_t$. In equation (6) we instead first apply a width-2 convolution to $\mathbf{v}_t$ (weighted by $\beta$, $\gamma$) before passing it into the linear recurrence.

Remark 4 (Convolution Differences). There is a distinct difference between the "convolution" induced by exponential-trapezoidal discretization and the standard short convolutions used by sequence models such as Mamba and GDN. Standard short convolutions are independent operations applied to $x_t$ (and often $\mathbf{B}_t$, $\mathbf{C}_t$) outside the core recurrence, while our new discretization can be interpreted as a convolution on the state-input $\mathbf{B}_t x_t$ within the core recurrence.
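As a sanity check of Remark 2, the following NumPy sketch (toy parameters and names ours) implements the recurrence of equations (5) and (6) and verifies that setting $\lambda_t = 1$ recovers the exponential-Euler recurrence, while other settings genuinely change the states:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 8, 4
A = -rng.uniform(0.5, 2.0, T)       # negative scalar transitions, as in Mamba
Delta = rng.uniform(0.05, 0.5, T)
B = rng.standard_normal((T, N))
x = rng.standard_normal(T)
alpha = np.exp(Delta * A)           # alpha_t = exp(Delta_t A_t)

def trapezoidal_states(lam):
    """Exponential-trapezoidal recurrence of Proposition 1 (eqs. 5-6)."""
    h, hs = np.zeros(N), []
    for t in range(T):
        h_new = alpha[t] * h + lam[t] * Delta[t] * B[t] * x[t]        # gamma term
        if t > 0:                                                     # beta term
            h_new += (1 - lam[t]) * Delta[t] * alpha[t] * B[t - 1] * x[t - 1]
        h = h_new
        hs.append(h)
    return np.stack(hs)

def euler_states():
    """Exponential-Euler recurrence of Mamba-1/-2 (eq. 4)."""
    h, hs = np.zeros(N), []
    for t in range(T):
        h = alpha[t] * h + Delta[t] * B[t] * x[t]
        hs.append(h)
    return np.stack(hs)

# lambda_t = 1 recovers exponential-Euler; lambda_t = 1/2 is the
# classical trapezoid rule and produces different states.
assert np.allclose(trapezoidal_states(np.ones(T)), euler_states())
assert not np.allclose(trapezoidal_states(np.full(T, 0.5)), euler_states())
```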
3.1.3 Parallel Representation of the Exponential-Trapezoidal Recurrence

Our new recurrence can be instantiated as a case of SSD and has a parallel form corresponding to equation (2). Expanding the state recurrence from $\mathbf{h}_0 = \gamma_0 \mathbf{B}_0 x_0$ results in
$$\mathbf{h}_T = \alpha_{T \cdots 2}(\gamma_0 \alpha_1 + \beta_1)\,\mathbf{B}_0 x_0 + \cdots + \gamma_T \mathbf{B}_T x_T,$$
and the SSM output is
$$y_T = \alpha_{T \cdots 2}(\gamma_0 \alpha_1 + \beta_1)\,\mathbf{C}_T^\top \mathbf{B}_0 x_0 + \cdots + \gamma_T\,\mathbf{C}_T^\top \mathbf{B}_T x_T.$$
Unrolling these rows shows that the mask induced by the trapezoidal update is no longer a fixed averaging of endpoints (as in the classical trapezoidal rule), but a data-dependent convex combination of the two interval endpoints. Under the SSD framework (2) with parallel form $\mathbf{Y} = (\mathbf{L} \odot \mathbf{C}\mathbf{B}^\top)\mathbf{X}$, Mamba-3 corresponds to a mask $\mathbf{L}$ whose structure is a 1-semiseparable matrix composed with a 2-band matrix:⁴
$$\mathbf{L} = \begin{bmatrix} \gamma_0 & & & \\ \gamma_0\alpha_1 + \beta_1 & \gamma_1 & & \\ \alpha_2(\gamma_0\alpha_1 + \beta_1) & \gamma_1\alpha_2 + \beta_2 & \gamma_2 & \\ \vdots & & & \ddots \\ \alpha_{T\cdots 2}(\gamma_0\alpha_1 + \beta_1) & \cdots & & \gamma_T \end{bmatrix} = \begin{bmatrix} 1 & & & \\ \alpha_1 & 1 & & \\ \vdots & & \ddots & \\ \alpha_{T\cdots 1} & \cdots & \alpha_T & 1 \end{bmatrix} \cdot \begin{bmatrix} \gamma_0 & & & \\ \beta_1 & \gamma_1 & & \\ & \beta_2 & \gamma_2 & \\ & & \ddots & \ddots \\ & & \beta_T & \gamma_T \end{bmatrix}. \tag{7}$$
This parallel formulation enables the hardware-efficient, matmul-focused calculation of the SSM output for training. We note that the convolutional connection of Mamba-3 can also be seen through this parallel dual form, where multiplication by the 2-band matrix in equation (7) represents convolution with weights $\beta$, $\gamma$. In Appendix A.1, we use the SSD tensor-contraction machinery to prove that the parallel form is equivalent to a vanilla SSM with a state-input convolution.

⁴ Incidentally, this is a special case of a 2-semiseparable matrix.

Remark 5. The structured mask of Mamba-3 can be viewed as generalizing Mamba-2's, which instead of the 2-band matrix has a diagonal matrix with $\gamma_t$ only (equation (3)).

3.2 Complex-Valued SSMs

Modern SSMs are designed with efficiency as the central goal, motivated by the need to scale to larger models and longer sequences.
For instance, successive architectures have progressively simplified the state-transition matrix: S4 (Gu, Goel, and Ré 2022) used complex-valued Normal Plus Low-Rank (NPLR) matrices, Mamba (Gu and Dao 2024) reduced this to a diagonal of reals, and Mamba-2 (Dao and Gu 2024) further simplified it to a single scaled identity matrix. Although these simplifications largely maintain language modeling performance, recent works (Grazzi, Siems, Zela, et al. 2025; Merrill, Petty, and Sabharwal 2025; Sarrof, Veitsman, and Hahn 2024) have shown that the restriction to real, non-negative eigenvalue transitions degrades the capabilities of the model on simple state-tracking tasks (here referring primarily to the solvable-group regime ($\mathsf{TC}^0$), such as parity), which can be solved by a one-layer LSTM. This limitation, formalized in Theorem 1 of Grazzi, Siems, Schrodi, et al. (2024), arises from restricting the eigenvalues of the transition matrix to real numbers, which cannot represent "rotational" hidden-state dynamics. For instance, consider the parity function on binary inputs $\{0, 1\}$, defined as $\sum_t x_t \bmod 2$. This task can be performed using the update $\mathbf{h}_t = \mathbf{R}(\pi x_t)\,\mathbf{h}_{t-1}$, where $\mathbf{R}(\cdot)$ is a 2-D rotation matrix. Such rotational dynamics cannot be expressed with real eigenvalues.

3.2.1 Complex SSM with Exponential-Euler Discretization

To recover this capability, we begin with complex SSMs (equation (8)), which are capable of representing state-tracking dynamics. We show that, under discretization (Proposition 5), complex SSMs can be formulated as real SSMs with a block-diagonal transition matrix composed of $2 \times 2$ rotation matrices (Proposition 2). We then show that this is equivalent to applying data-dependent rotary embeddings to the input and output projections $\mathbf{B}$, $\mathbf{C}$, respectively. This result establishes a theoretical connection between complex SSMs and data-dependent RoPE embeddings (Proposition 3). Finally, the "RoPE trick" used in Su et al.
(2023) allows for an efficient implementation of complex-valued state-transition matrices with minimal computational overhead compared to real-valued SSMs.

Proposition 2 (Complex-to-Real SSM Equivalence). Consider a complex-valued SSM
$$\dot{\mathbf{h}}(t) = \mathrm{Diag}\big(A(t) + i\,\boldsymbol{\theta}(t)\big)\,\mathbf{h}(t) + \big(\mathbf{B}(t) + i\,\hat{\mathbf{B}}(t)\big)\,x(t), \tag{8}$$
$$y(t) = \mathrm{Re}\Big[\big(\mathbf{C}(t) + i\,\hat{\mathbf{C}}(t)\big)^\top \mathbf{h}(t)\Big],$$
where $\mathbf{h}(t) \in \mathbb{C}^{N/2}$; $\boldsymbol{\theta}(t), \mathbf{B}(t), \hat{\mathbf{B}}(t), \mathbf{C}(t), \hat{\mathbf{C}}(t) \in \mathbb{R}^{N/2}$; and $x(t), A(t) \in \mathbb{R}$. Under exponential-Euler discretization, this system is equivalent to a real-valued SSM
$$\mathbf{h}_t = e^{\Delta_t A_t}\,\mathbf{R}_t\,\mathbf{h}_{t-1} + \Delta_t \mathbf{B}_t x_t, \qquad y_t = \mathbf{C}_t^\top \mathbf{h}_t, \tag{9}$$
with state $\mathbf{h}_t \in \mathbb{R}^N$, projections $\mathbf{B}_t := \begin{bmatrix} \mathbf{B}_t \\ \hat{\mathbf{B}}_t \end{bmatrix} \in \mathbb{R}^N$, $\mathbf{C}_t := \begin{bmatrix} \mathbf{C}_t \\ -\hat{\mathbf{C}}_t \end{bmatrix} \in \mathbb{R}^N$, and a transition matrix
$$\mathbf{R}_t := \mathrm{Block}\big\{R(\Delta_t \boldsymbol{\theta}_t[i])\big\}_{i=1}^{N/2} \in \mathbb{R}^{N \times N}, \qquad R(\theta) := \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}.$$
The proof is given in Appendix B.1. Proposition 2 shows that the discretized complex SSM of state dimension $N/2$ has an equivalent real SSM with doubled state dimension ($N$), whose transition matrix is a scalar-decayed block-diagonal matrix of $2 \times 2$ data-dependent rotation matrices ($e^{\Delta_t A_t} \mathbf{R}_t$).

⁴ Incidentally, this is a special case of a 2-semiseparable matrix.

Proposition 3 (Complex SSM, Data-Dependent RoPE Equivalence). Under the notation established in Proposition 2, consider the real SSM defined in equation (9) unrolled for $T$ time steps. The output of this SSM is equivalent to that of a vanilla scalar-transition SSM (equation (4)) with a data-dependent rotary embedding applied to the $\mathbf{B}$, $\mathbf{C}$ components of the SSM, as defined by
$$\mathbf{h}_t = e^{\Delta_t A_t}\,\mathbf{h}_{t-1} + \Big(\prod_{i=0}^{t} \mathbf{R}_i^\top\Big)\,\Delta_t \mathbf{B}_t x_t, \qquad y_t = \Big(\Big(\prod_{i=0}^{t} \mathbf{R}_i^\top\Big)\,\mathbf{C}_t\Big)^\top \mathbf{h}_t, \tag{10}$$
where the matrix product represents right matrix multiplication, e.g., $\prod_{i=0}^{1} \mathbf{R}_i = \mathbf{R}_0 \mathbf{R}_1$. We refer to the use of a transformed real-valued SSM to compute the complex SSM as the "RoPE trick." The proof is given in Appendix B.2.
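Proposition 3 can be checked numerically for a single $2 \times 2$ rotation block. The NumPy sketch below (toy values and names ours) compares the direct rotated recurrence of equation (9) against the RoPE-trick form of equation (10); the two agree because $2 \times 2$ rotations commute, so the cumulative rotations applied to $\mathbf{B}$ and $\mathbf{C}$ cancel into the relative rotation between time steps:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(2)
T = 6
a = rng.uniform(0.7, 1.0, T)            # e^{Delta_t A_t}
d = rng.uniform(0.1, 1.0, T)            # Delta_t
theta = rng.uniform(-np.pi, np.pi, T)   # data-dependent rotation angles
B = rng.standard_normal((T, 2))
C = rng.standard_normal((T, 2))
x = rng.standard_normal(T)

# Direct form (eq. 9): rotation inside the state transition.
h = np.zeros(2)
y_direct = np.zeros(T)
for t in range(T):
    h = a[t] * rot(theta[t]) @ h + d[t] * B[t] * x[t]
    y_direct[t] = C[t] @ h

# "RoPE trick" (eq. 10): cumulative rotations moved onto B and C.
h = np.zeros(2)
y_rope = np.zeros(T)
P = np.eye(2)                           # P_t = R_0 R_1 ... R_t
for t in range(T):
    P = P @ rot(theta[t])
    h = a[t] * h + d[t] * (P.T @ B[t]) * x[t]
    y_rope[t] = (P.T @ C[t]) @ h

assert np.allclose(y_direct, y_rope)
```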
To observe the connection of complex SSMs to RoPE embeddings, note that in the above proposition, the data-dependent rotations $\mathbf{R}_i$ are aggregated across time steps and applied to $\mathbf{C}$, $\mathbf{B}$, which, by the state space duality framework, correspond to the query ($\mathbf{Q}$) and key ($\mathbf{K}$) components of attention (Section 2.3). Analogously, vanilla RoPE (Su et al. 2023) applies data-independent rotation matrices, where the rotation angles follow a fixed frequency schedule $\boldsymbol{\theta}[i] = 10000^{-2i/N}$.

3.2.2 Complex SSM with Exponential-Trapezoidal Discretization

After deriving the recurrence for complex SSMs with exponential-Euler discretization, the generalization to exponential-trapezoidal discretization is similar. Proposition 4 provides the full recurrence with the RoPE trick for Mamba-3.

Proposition 4 (Rotary Embedding Equivalence with Exponential-Trapezoidal Discretization). Discretizing a complex SSM with the exponential-trapezoidal rule (Proposition 1) yields the recurrence
$$\mathbf{h}_t = \alpha_t \mathbf{h}_{t-1} + \beta_t \Big(\prod_{i=0}^{t-1} \mathbf{R}_i^\top\Big)\,\mathbf{B}_{t-1} x_{t-1} + \gamma_t \Big(\prod_{i=0}^{t} \mathbf{R}_i^\top\Big)\,\mathbf{B}_t x_t, \qquad y_t = \Big(\Big(\prod_{i=0}^{t} \mathbf{R}_i^\top\Big)\,\mathbf{C}_t\Big)^\top \mathbf{h}_t. \tag{11}$$
Here, $\mathbf{R}_t$ is the block-diagonal rotation matrix defined in Proposition 2. The proof is in Appendix B.3. We empirically validate that our complex SSM, implemented via data-dependent RoPE, is capable of solving state-tracking tasks that real-valued SSMs, with and without standard RoPE, cannot (Table 5b), supporting our theoretical claims.

3.3 Multi-Input, Multi-Output

Scaling test-time compute has opened new frontiers in model capability, such as agentic workflows, where inference takes up an increasing share of the overall compute budget. This has placed a renewed focus on the inference efficiency of

Table 2: Arithmetic intensity for (a) SISO and (b) MIMO decoding. The batch and head dimensions cancel out. The arithmetic intensity of MIMO increases linearly with the rank $R$, enabling better hardware utilization during memory-bound phases like decode.
Here $N$ is the state size (expansion factor) and $P$ is the head dimension. For Mamba-3, typically $R \ll N, P$.

(a) SISO (2-byte data):

| Input | Output | FLOPs | Arithmetic Intensity |
|---|---|---|---|
| $\mathbf{H}_t{:}\,(N,P)$, $x_t{:}\,(P)$, $a_t{:}\,(1)$, $b_t{:}\,(N)$, $c_t{:}\,(N)$ | $y_t{:}\,(P)$ | $5NP - P$ | $\frac{5NP - P}{2(1 + 2N + P + NP)} \approx 2.5 = \Theta(1)$ |

(b) MIMO (2-byte data):

| Input | Output | FLOPs | Arithmetic Intensity |
|---|---|---|---|
| $\mathbf{H}_t{:}\,(N,P)$, $x_t{:}\,(P,R)$, $a_t{:}\,(1)$, $b_t{:}\,(N,R)$, $c_t{:}\,(N,R)$ | $y_t{:}\,(P,R)$ | $4NPR + NP - PR$ | $\frac{4NPR + NP - PR}{2(1 + 2NR + PR + NP)} = \Theta(\min(N,P,R)) = \Theta(R)$ for $R \ll N, P$ |

language models and spurred the adoption of SSMs and sub-quadratic layers, which feature fixed-size hidden states and thus offer lower compute and memory requirements. Although these new layers have lower wall-clock time than Transformers, their decoding is heavily memory-bound, resulting in low hardware utilization. In this section, we use the SSM perspective to introduce a methodological refinement to the Mamba-3 recurrence that allows for increased model FLOPs without increasing decoding wall-clock time, resulting in a better model with the same decoding speed.

Decoding Arithmetic Intensity. To improve hardware efficiency, we need to consider the arithmetic intensity of token generation, defined as FLOPs divided by the number of input-output bytes for a given op. Since SSM decoding saturates the memory bandwidth while compute sits idle (i.e., it is memory-bound), we would like to increase its arithmetic intensity to effectively overlap compute with memory I/O. More concretely, the arithmetic intensity for a single generation step in Mamba is around 2.5 ops per byte (Table 2a), while the arithmetic intensity for bfloat16 matmul is about 295 ops per byte on an NVIDIA H100-SXM5 (NVIDIA 2022). Consequently, SSM decoding falls far short of the compute-bound regime, and moreover it is not clear how one can adjust the existing parameters in Mamba to mitigate this lack of hardware efficiency.
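The arithmetic-intensity formulas in Table 2 can be transcribed directly. The sketch below (example sizes ours) shows SISO decode stuck near 2.5 ops per byte while the MIMO variant grows roughly linearly in the rank $R$:

```python
# Arithmetic-intensity formulas transcribed from Table 2 (2-byte elements).
def siso_ai(N, P):
    flops = 5 * N * P - P
    io_bytes = 2 * (1 + 2 * N + P + N * P)
    return flops / io_bytes

def mimo_ai(N, P, R):
    flops = 4 * N * P * R + N * P - P * R
    io_bytes = 2 * (1 + 2 * N * R + P * R + N * P)
    return flops / io_bytes

N, P = 128, 64  # example sizes (ours); the regime of interest is R << N, P
assert 2.0 < siso_ai(N, P) < 3.0               # ~2.5 ops/byte, far below compute-bound
assert mimo_ai(N, P, 8) > 4 * siso_ai(N, P)    # AI grows roughly linearly in R
assert mimo_ai(N, P, 16) > mimo_ai(N, P, 8)
```

Against the roughly 295 ops/byte that the paper cites for bfloat16 matmul on an H100, both numbers remain memory-bound, but MIMO closes much of the gap at negligible extra traffic.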
We note that this observation applies generally to other sub-quadratic models, such as causal linear attention.

From SISO to MIMO. Consider a single head of a typical SSM with head dimension P, which involves stacking the SISO recurrence h_t ← α_t h_{t−1} + Δ_t B_t x_t with P copies sharing the same α_t, Δ_t, and B_t. The resulting broadcasted recurrence h_t ← α_t h_{t−1} + Δ_t B_t x_t^⊤ takes vector inputs x_t ∈ R^P and has matrix-valued states h_t ∈ R^(N×P). Note that the memory traffic (input and output size) is dominated by the state h_t, while the computation mainly comprises the outer product B_t x_t^⊤, whose FLOPs are proportional to NP. By increasing the dimension of the latter terms, transforming B_t ∈ R^N → B_t ∈ R^(N×R) and x_t ∈ R^P → x_t ∈ R^(P×R), the memory traffic does not significantly increase (for small R), while the FLOPs consumed increase by a factor of R (Table 2b). Thus, this transformation increases the arithmetic intensity of the recurrence. Furthermore, the increase in arithmetic intensity translates into practical gains, since the outer product B_t x_t^⊤ becomes a hardware-efficient matrix-matrix product (matmul), which is computed on fast tensor cores, incurring only a marginal latency cost. As a result, the MIMO recurrence is more expressive than the original SISO recurrence, computing R× more FLOPs while practically preserving the decoding speed. For similar reasons, the computation of the output from the state, y_t ← C_t^⊤ h_t, acquires an extra rank R by modifying the output projection as C_t ∈ R^N → C_t ∈ R^(N×R). Overall, this transformation is equivalent to expanding the original single-input, single-output (SISO) recurrence to multi-input, multi-output (MIMO).

Training MIMO SSMs. While the MIMO formulation is motivated by inference efficiency, the training algorithms for SSMs (including our developments in Section 3.1 and Section 3.2) have typically been developed for SISO models.
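Before turning to training, the shape bookkeeping of the SISO-to-MIMO expansion above can be sketched with NumPy. The dimensions are illustrative; the point is that the state keeps its (N, P) shape (so memory traffic is roughly unchanged) while the rank-1 outer product becomes a rank-R matmul.

```python
import numpy as np

N, P, R = 16, 8, 4                      # state size, head dim, MIMO rank (illustrative)
rng = np.random.default_rng(0)
h = rng.standard_normal((N, P))         # matrix-valued state, shared by both variants
a_t, dt = 0.9, 0.1                      # decay and discretization step

# SISO decode step: rank-1 state write, then readout y_t = c_t^T h_t
b_t, c_t = rng.standard_normal(N), rng.standard_normal(N)
x_t = rng.standard_normal(P)
h_siso = a_t * h + dt * np.outer(b_t, x_t)
y_siso = c_t @ h_siso                   # shape (P,)

# MIMO decode step: B_t, x_t, C_t each gain a rank-R axis; the rank-1 outer
# product becomes an (N, R) @ (R, P) matmul, i.e. R x the FLOPs on the same state
B_t = rng.standard_normal((N, R))
X_t = rng.standard_normal((P, R))
C_t = rng.standard_normal((N, R))
h_mimo = a_t * h + dt * (B_t @ X_t.T)
Y_mimo = C_t.T @ h_mimo                 # shape (R, P)

assert h_siso.shape == h_mimo.shape == (N, P)   # state (memory traffic) unchanged
assert y_siso.shape == (P,) and Y_mimo.shape == (R, P)
```

The (N, R) @ (R, P) product is exactly the shape of work that maps onto tensor cores, which is why the extra FLOPs come at marginal decode latency.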
We begin with the observation that MIMO SSMs can be expressed in terms of R² SISO SSMs, where R SISO SSMs sharing the same recurrence are summed for each of the R MIMO outputs. In particular, define C_t^(i) ∈ R^N, B_t^(j) ∈ R^N, x_t^(j) ∈ R, Δ_t ∈ R, where i, j ∈ {0, ..., R−1}; then we have

    h_t^(j) ← α_t h_{t−1}^(j) + Δ_t B_t^(j) x_t^(j)    (12)
    h_t = Σ_{j=0}^{R−1} h_t^(j)                         (13)
    y_t^(i) ← C_t^(i)⊤ h_t                              (14)

Thus, y_t^(i) = Σ_j SSM(α, Δ, B^(j), C^(i), x^(j))_t, where SSM(α, Δ, B^(j), C^(i), x^(j))_t := C_t^(i)⊤ h_t^(j) with h_t^(j) from (12). Furthermore, improvements to standard SISO-based SSM models can be directly applied to MIMO models, since the underlying SISO training algorithms can be used as a black box. This observation allows a MIMO model to be trained by invoking the SISO algorithm R² times as a black box in parallel. In contrast, when computed in recurrent form, equations (12), (13), and (14) can be performed sequentially, incurring only an R-times overhead relative to SISO SSMs (recall the discussion on MIMO decoding FLOPs).

Chunked Algorithm for MIMO SSMs. Many modern SISO recurrent models, including Mamba-2, are computed using a chunked algorithm, where the sequence is divided into chunks of length C. Within each chunk, a parallel (but asymptotically slower) algorithm is applied, while a recurrence is computed across chunks. Chunked algorithms interpolate between two extremes: a fully parallel and a fully sequential algorithm. By exploiting this structure, we can reduce the training cost of MIMO SSMs to R times that of SISO SSMs. This idea also appears in the SSD framework: SSD applies a hardware-friendly quadratic algorithm within each chunk while using the recurrent form across chunks, and shows that when the state and head dimensions are comparable, setting the chunk size to this dimension yields an overall linear-time algorithm.
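The decomposition in (12)–(14) can be checked numerically. The sketch below (scalar input channels and illustrative sizes) runs the MIMO recurrence directly and reconstructs the same outputs from R² black-box SISO calls.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, R = 6, 5, 3                       # sequence length, state size, MIMO rank
alpha = rng.uniform(0.5, 1.0, T)        # shared decay alpha_t
delta = rng.uniform(0.1, 0.5, T)        # shared step size Delta_t
B = rng.standard_normal((T, N, R))      # B_t^(j) stacked over j
C = rng.standard_normal((T, N, R))      # C_t^(i) stacked over i
X = rng.standard_normal((T, R))         # scalar input x_t^(j) per rank channel

def siso_ssm(alpha, delta, b, c, x):
    """One black-box SISO SSM: h_t = alpha_t h_{t-1} + Delta_t b_t x_t, y_t = c_t . h_t."""
    h = np.zeros(b.shape[1])
    y = np.zeros(len(x))
    for t in range(len(x)):
        h = alpha[t] * h + delta[t] * b[t] * x[t]
        y[t] = c[t] @ h
    return y

# MIMO outputs reconstructed from R^2 SISO calls, as in (12)-(14)
y_sum = np.zeros((T, R))
for i in range(R):
    for j in range(R):
        y_sum[:, i] += siso_ssm(alpha, delta, B[:, :, j], C[:, :, i], X[:, j])

# Direct MIMO recurrence for comparison
h = np.zeros(N)
y_mimo = np.zeros((T, R))
for t in range(T):
    h = alpha[t] * h + delta[t] * (B[t] @ X[t])
    y_mimo[t] = C[t].T @ h

assert np.allclose(y_mimo, y_sum)
```

The equality holds by linearity of the recurrence in its inputs, which is exactly what lets SISO training kernels be reused unchanged.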
Specifically, SSD's intra-chunk computation incurs 2C²N + 2C²P FLOPs per chunk, giving a total of (T/C)(2C²N + 2C²P) = 2TC(N + P). The inter-chunk computation incurs 4NPC + 2NP FLOPs per chunk, for a total of (T/C)(4NPC + 2NP) = 4TNP + (T/C)·2NP (ignoring negligible terms). Setting C = P = N, the total FLOP count is 8TN², which is linear in T.

The chunked algorithm for SSD naturally generalizes to MIMO SSMs. In this case, the FLOP counts of the state projection B x^⊤ and the state emission C^⊤ h increase by R×, while the FLOP count of the intra-chunk component C^⊤ B increases by R²×. As a result, the intra-chunk computation incurs 2(T/C)(CR)²N + 2(T/C)(CR)²P FLOPs and the inter-chunk computation incurs 4(T/C)NP(CR) + 2(T/C)NP FLOPs. Thus, setting CR = N = P yields a total FLOP count of 8TRN², an R-fold increase over SISO. Intuitively, setting the MIMO chunk size to 1/R times the SISO chunk size, i.e., C_MIMO ← C_SISO / R, maintains the SISO intra-chunk FLOP count while increasing the number of chunks by a factor of R, resulting in an overall R-times increase in FLOP count instead of an R²-times increase, while keeping the algorithm hardware-friendly.

In practice, the training speed of these algorithms depends on details of the kernel implementation strategy, architectural choices such as how the MIMO parameters are instantiated, and problem dimensions, but should be no more than R times slower. Our released Triton Mamba-3 SISO kernels are roughly on par with the Triton Mamba-2 kernels, and the MIMO kernels incur only a 2× slowdown when R = 4, as compute latency can be overlapped with memory movement. Table 7 benchmarks the prefill speed of various kernels, which is equivalent to the forward pass of the training kernel.

MIMO Instantiation.
Among various choices of MIMO parameterization, Mamba-3's approach strikes a balance that preserves the state size and number of SSMs of its SISO counterpart while avoiding excessive growth in parameter count. A naive conversion of a SISO SSM to a rank-R MIMO SSM would incur an R× increase in parameters, as all projections that produce the inputs to the SSM (B, C, x) would grow. Block-level components, such as the gate z (which has so far been ignored for simplicity) and the output projection for y, would also be affected. This influx in parameter count would be intractable at larger model scales. To counteract this, we make the following change. Mamba's multi-value attention (MVA) head structure results in shared B, C across heads, so these components' projections can be directly converted to incorporate the new MIMO rank R with only a slight increase in parameter count, from DN to DNR for the entire layer (recall that D is the model dimension). However, the SSM input x_t, output y_t, and gate z_t are unique per head and therefore dominate the parameter count. Here, directly adjusting the projections would increase the parameter count from DP to DPR for each head. Instead, we keep the original SISO projection and element-wise scale each dimension of the projected output to size R with a learnable, data-independent vector, resulting in DP + PR parameters for each head. This mitigates the multiplicative increase to a more reasonable additive increase in parameter count.

Figure 2: Contrasting Mamba-2 and Mamba-3 architectures. Key updates include exponential-trapezoidal discretization, data-dependent RoPE embeddings, MIMO projections, QK normalization, and learnable biases.
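A sketch of this per-head bookkeeping follows; the shapes and names are illustrative assumptions (the exact parameterization is in Appendix C). Keeping the SISO projection and expanding to rank R with a learnable scale vector replaces a multiplicative D·P·R cost with an additive D·P + P·R one.

```python
import numpy as np

D, P, R = 2048, 64, 4                   # model dim, head dim, MIMO rank (illustrative)
rng = np.random.default_rng(0)

# Naive per-head MIMO input projection: a full D -> (P, R) map
params_naive = D * P * R                # multiplicative growth in parameters

# Mamba-3-style: shared SISO projection plus a learnable, data-independent scale
W_siso = rng.standard_normal((D, P)) / np.sqrt(D)
scale = rng.standard_normal((P, R))
params_scaled = D * P + P * R           # additive growth in parameters

u = rng.standard_normal(D)              # one token's hidden vector
x_siso = W_siso.T @ u                   # (P,) as in the SISO model
X_mimo = x_siso[:, None] * scale        # (P, R) rank-R input via element-wise scaling

assert X_mimo.shape == (P, R)
print(params_naive, params_scaled)      # 524288 vs 131328
```

At these illustrative sizes, the scaled variant uses about a quarter of the naive parameter count, and the gap widens with R.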
Appendix C details the parameterization, and all MIMO variants in our paper are parameter-matched to their SISO counterparts by reducing the MLP width.

Remark 6. For simplicity, all discussion in this section considered simpler 2-term recurrences such as the one arising from exponential-Euler discretization; the generalization to the 3-term exponential-trapezoidal recurrence is similar.

3.4 Mamba-3 Architecture

The overall architecture follows Llama (Grattafiori et al. 2024), alternating Mamba-3 and SwiGLU blocks with pre-norm. The Mamba-3 block retains the overall layout of its predecessor while introducing several key modifications.

Updated SSM Recurrence. The SSD layer is replaced with the more expressive complex-valued exponential-trapezoidal SSM defined in Proposition 4. Mamba-3 employs the SISO SSM by default to enable fair comparisons with other SISO-like models, but its MIMO variant can be trained and deployed as a stronger alternative to baseline Mamba-3 (Table 3). Our SSM A is complex, with both real and imaginary components produced by data-dependent projections. As shown in Figure 2, A is partitioned into the real-valued A and the imaginary-valued Θ; the former is passed into the SSD black box as in Mamba-2, while the latter is computed through the RoPE trick.

BC / QK Normalization. RMS normalizations are added following the B, C projections, mirroring the QKNorm commonly used in modern Transformers (Henry et al. 2020; Wortsman et al. 2023) and other recent linear models (Hu et al. 2025; S. Yang, Kautz, and Hatamizadeh 2025). We call this either BC normalization (BCNorm) or QK normalization (QKNorm) interchangeably. We find that BCNorm also stabilizes large-scale runs, allowing the removal of the post-gate RMSNorm layer (introduced in Mamba-2 for stability) in our pure Mamba-3 models. However, in hybrid models, the removed RMSNorm layer is crucial for long-context extrapolation (Table 4).

B, C Biases.
Similarly to Yu and Erichson (2025), who proved that adding channel-specific biases to B in a blockwise variant of Mamba-1 grants universal approximation capabilities, Mamba-3 incorporates learnable, head-specific, channel-wise biases into the B and C components after the BCNorm.

Table 3: Downstream language modeling evaluations on models trained with 100B FineWeb-Edu tokens. Best results for each size are bolded, and second best are underlined, excluding Mamba-3 MIMO variants. All models are trained with the same procedure. Mamba-3 SISO outperforms Mamba-2 and others at every model scale, and the MIMO variant with rank R = 4 further improves modeling capabilities.

Model                 FW-Edu  LAMB.  LAMB.  HellaS.  PIQA  Arc-E  Arc-C   WinoGr.  OBQA  Average
                      ppl↓    ppl↓   acc↑   acc_n↑   acc↑  acc↑   acc_n↑  acc↑     acc↑  acc↑
Transformer-180M      16.89   45.0   32.5   39.0     67.1  59.8   27.9    51.2     21.8  42.8
GDN-180M              16.52   40.8   31.3   40.2     66.3  62.3   28.2    51.7     22.0  43.2
Mamba-2-180M          16.76   41.8   30.9   40.1     66.8  60.1   27.3    52.0     23.2  42.9
Mamba-3-SISO-180M     16.59   37.7   32.5   40.8     66.1  61.5   27.9    52.0     22.8  43.4
Mamba-3-MIMO-180M     16.46   32.1   34.0   41.0     66.7  60.6   27.7    52.9     22.0  43.5
Transformer-440M      13.03   21.2   41.7   50.5     69.9  67.6   34.6    56.7     26.0  49.6
GDN-440M              13.01   18.0   41.9   50.9     70.0  67.0   34.6    56.1     27.6  49.7
Mamba-2-440M          13.00   19.6   40.8   51.7     70.6  68.8   35.0    54.1     26.0  49.6
Mamba-3-SISO-440M     12.87   19.6   40.2   51.7     71.9  68.9   34.4    55.8     26.0  49.8
Mamba-3-MIMO-440M     12.72   17.1   43.4   52.8     70.8  69.6   35.6    56.3     28.4  51.0
Transformer-880M      11.42   15.0   44.7   57.2     72.6  71.6   39.2    57.7     26.8  52.8
GDN-880M              11.37   12.9   47.6   57.3     73.3  71.4   38.7    58.8     28.6  53.7
Mamba-2-880M          11.35   13.8   45.0   58.1     72.5  72.3   38.7    56.8     30.2  53.4
Mamba-3-SISO-880M     11.23   12.9   47.2   58.8     73.6  72.7   40.2    58.4     30.0  54.4
Mamba-3-MIMO-880M     11.11   11.8   49.5   59.2     73.7  74.7   41.2    59.9     28.6  55.3
Transformer-1.5B      10.51   11.1   50.3   60.6     73.8  74.0   40.4    58.7     29.6  55.4
GDN-1.5B              10.45   10.9   49.2   61.3     74.3  75.3   41.2    58.0     31.6  55.8
Mamba-2-1.5B          10.47   12.0   47.8   61.4     73.6  75.3   41.8    57.5     32.6  55.7
Mamba-3-SISO-1.5B     10.35   10.9   49.4   61.9     73.6  75.9   42.7    59.4     32.0  56.4
Mamba-3-MIMO-1.5B     10.24   10.2   51.7   62.3     75.3  76.5   44.5    60.6     32.6  57.6

We hypothesize that these biases also induce convolution-like behavior in the model. Specifically, adding biases to B and C introduces data-independent components into the SSM that function similarly to convolutions. Ablations on the bias parameterization are located in Appendix F. The combination of data-independent bias parameters, together with exponential-trapezoidal discretization (which itself induces a convolution on the state input), is empirically able to obviate the short causal convolution and its accompanying activation function present in Mamba-2 and most modern recurrent models (Section 4.2).

4 Empirical Validation

We empirically validate our SSM-centric methodological changes through the Mamba-3 model on a host of synthetic and real-world tasks. Section 4.1 evaluates Mamba-3 on language modeling and retrieval-based tasks. Section 4.2 ablates the effect of our new SSM components, such as discretization and complex transitions. Section 4.3 explores the inference efficiency of the Mamba-3 family and MIMO Mamba-3's benefits over the SISO variant under fixed inference compute, and Section 4.4 benchmarks the performance of our Mamba-3 training and inference kernels.

4.1 Language Modeling

All models are pretrained on 100B tokens of the FineWeb-Edu dataset (Penedo et al. 2024) with the Llama-3.1 tokenizer (Grattafiori et al. 2024) at a 2K context length with the same standard training protocol.
Training and evaluation details can be found in Appendix D. Across all four model scales, Mamba-3 outperforms popular baselines on various downstream tasks (Table 3). We highlight that Mamba-3 does not utilize the external short convolution that has been empirically identified as an important component in many performant linear models (Allen-Zhu 2025; Gu and Dao 2024; S. Yang, Kautz, and Hatamizadeh 2025).

4.1.1 MIMO

We aim to further verify the gains from MIMO by investigating its language-modeling capabilities, training MIMO models with rank R = 4 under the same settings. To ensure that the total parameter count is comparable to SISO-based models, we decrease the inner dimension of the MLP layers in MIMO models to compensate for the increase due to the MIMO projections. In the 1.5B-parameter models, for instance, the MLP inner dimension is reduced by only 6.6%, from 4096 to 3824. See Appendix C for more details. On both validation perplexity and our suite of language evaluation tasks (Table 3), we see significant gains when moving from SISO to MIMO for our Mamba-3 models. Namely, we achieve a significant perplexity gain of 0.11 on the 1.5B models, and Figure 3 illustrates the downward shift in our validation loss. On the language evaluation front, we see gains on most tasks when compared to SISO, resulting in an average gain of 1.2 percentage points over SISO.

4.1.2 Retrieval Capabilities

Beyond standard language modeling, an important measure for linear models is their retrieval ability: how well they can recall information from earlier in the sequence (A. Arora et al. 2025; S. Arora, Eyuboglu, et al. 2025). Unlike attention models, which can freely revisit past context through the growing KV cache, linear models must compress context into a fixed-size state. This trade-off is reflected in the Transformer baseline's substantially stronger retrieval scores.
To evaluate Mamba-3 under this lens, Table 4 compares it against baselines on both real-world and synthetic needle-in-a-haystack (NIAH) tasks (Hsieh et al. 2024), using our pretrained 1.5B models from Section 4.1. We restrict the task sequence length to 2K tokens to match the training setup and adopt the cloze-style format for our real-world tasks to mirror the next-token-prediction objective, following S. Arora, Eyuboglu, et al. (2025) and S. Arora, Timalsina, et al. (2024). Mamba-3 is competitive on real-world associative recall and question answering (TQA, SQuAD) but struggles when extracting information from semi-structured or unstructured data (SWDE, FDA). On synthetic NIAH tasks, however, Mamba-3 surpasses or matches baselines in most cases and notably demonstrates markedly better out-of-distribution retrieval abilities than its Mamba-2 predecessor.

Improving Retrieval with Hybrid Models. Because of the natural retrieval weaknesses of a fixed state size, we predict that linear layers will be predominantly used in hybrid architectures that mitigate this downside with quadratic self-attention layers. To evaluate how Mamba-3 performs within this architectural paradigm, we train hybrid models at the same scale in an interleaving fashion with a 5:1 ratio of linear layers to NoPE self-attention (B. Yang et al. 2025). As seen in prior work (Waleffe et al. 2024), hybrid models outperform the Transformer baseline. We find that reintroducing the pre-output-projection RMSNorm (pre-gate, grouped RMSNorm in Table 4) to the Mamba-3 layer improves length-generalization retrieval abilities at a slight cost on in-context, real-world retrieval tasks, and that Mamba-3 is highly competitive as a linear sequence-mixing backbone when mixed with self-attention.
However, the ideal norm type (grouped vs. default) and its placement (pre- vs. post-gate) are still unclear due to competing tradeoffs (Appendix E, Table 9), as we find that hybrid models and their exact characteristics and dynamics are complex and oftentimes unintuitive, a point echoed in recent works such as Cabannes et al. (2025).

4.2 SSM Methodology Ablations

Table 5a ablates the changes that Mamba-3 introduces to core SSM components, mainly the introduction of the BC bias and exponential-trapezoidal discretization. We report the pretraining test perplexity of models at the 440M scale, trained for Chinchilla-optimal tokens. We find that the bias and the exponential-trapezoidal SSM synergize well and make the short convolution utilized by many current linear models redundant.

We empirically demonstrate that data-dependent RoPE in Mamba-3 enables state tracking. Following Grazzi, Siems, Zela, et al. (2025), we evaluate on tasks from the Chomsky hierarchy (Parity, Modular Arithmetic without brackets, and Modular Arithmetic with brackets) and report scaled accuracies in Table 5b. Mamba-3 solves Parity and Modular Arithmetic (without brackets), and nearly closes the accuracy gap on Modular Arithmetic (with brackets). In contrast, Mamba-3 without RoPE, Mamba-3 with standard RoPE (Su et al. 2023), and Mamba-2 fail to learn these tasks. We use the state-tracking-enabled variant of GDN and observe that Mamba-3 is competitive, matching it on Parity and approaching its performance on both modular-arithmetic tasks. Experimental settings are covered in Appendix D.

Table 4: Retrieval capabilities measured by a mixture of real-world and synthetic retrieval tasks. Real-world retrieval tasks utilize cloze variants of the original datasets and are truncated to 2K length. Mamba-3 demonstrates strong associative recall, question answering, and length generalization on needle-in-a-haystack (NIAH) tasks, but suffers with information extraction from semi-structured and unstructured data. The Transformer baseline uses RoPE, which may explain its length-generalization issues, and hybrid models utilize NoPE (no positional embeddings). We find a pre-gate, grouped RMSNorm can be added to Mamba-3 SISO hybrid models to improve the length generalization of the NIAH tasks at a slight decrease in real-world retrieval performance. Real-world tasks use a 2048-token context; NIAH tasks are reported at context lengths 1024/2048/4096.

Model (1.5B)           SWDE   SQD.   FDA    TQA    NQ     Drop   NIAH-S1 (1K/2K/4K)   NIAH-S2 (1K/2K/4K)   NIAH-S3 (1K/2K/4K)
Pure:
  Transformer          48.9   46.6   58.4   67.5   31.7   26.4   100.0/100.0/0.0      92.2/100.0/0.0       98.6/99.4/0.0
  GDN                  32.7   40.0   28.3   63.5   25.7   24.5   100.0/100.0/99.8     100.0/93.8/49.8      83.8/68.4/34.2
  Mamba-2              30.7   39.1   23.7   64.3   25.1   28.5   100.0/99.6/62.0      100.0/53.8/11.8      95.8/87.4/13.4
  Mamba-3 SISO         28.5   40.1   23.4   64.5   26.5   27.4   100.0/100.0/88.2     100.0/95.4/50.6      92.4/81.4/34.2
  Mamba-3 MIMO         36.3   41.7   29.3   64.5   26.2   26.3   100.0/100.0/93.0     100.0/86.0/40.4      95.8/84.4/25.6
Hybrid:
  GDN                  54.6   48.4   58.8   64.9   32.7   30.0   100.0/100.0/71.4     99.6/100.0/60.2      70.0/96.2/24.0
  Mamba-2              58.2   45.6   71.0   66.1   33.4   28.1   100.0/100.0/3.2      99.6/98.8/0.0        98.2/98.0/0.0
  Mamba-3 SISO         58.5   47.0   65.9   64.8   33.4   27.0   100.0/100.0/36.2     100.0/100.0/9.4      99.8/100.0/8.8
  Mamba-3 SISO Norm*   58.6   47.3   52.4   65.7   33.3   28.5   100.0/100.0/100.0    100.0/100.0/96.0     99.8/97.2/56.8

Table 5: Left: ablations on core modeling components of Mamba-3 SISO; results on the test split of the dataset. Right: formal language evaluation (scaled accuracy, %); higher is better. SISO models are trained on short sequences and evaluated on longer lengths to test length generalization. For GDN we report the variant with eigenvalue range [−1, 1].

(a) Component ablation at the 440M scale. A combination of our BC bias and exponential-trapezoidal discretization makes the ubiquitous short convolution optional.

Model Variant            ppl↓
Mamba-3 − bias − trap    16.68
Mamba-3 − bias           16.49
Mamba-3                  15.72
Mamba-3 + conv           15.85

(b) Performance comparison on formal language tasks. Results show that unlike Mamba-2, Mamba-3 features state-tracking ability stemming from data-dependent RoPE embeddings.

Model                     Parity ↑   Arith. w/o brackets ↑   Arith. w/ brackets ↑
Mamba-3                   100.00     98.51                   87.75
Mamba-3 (w/ Std. RoPE)    1.56       20.70                   2.62
Mamba-3 (w/o RoPE)        2.27       1.49                    0.72
Mamba-2                   0.90       47.81                   0.88
GDN [−1, 1]               100.00     99.25                   93.50

4.3 Inference Efficiency to Performance Tradeoff

As d_state governs the decode runtime for the sub-quadratic models considered in this paper (Section 3.3), we use it as a proxy for inference speed. By plotting the validation perplexity (a proxy for model performance) as a function of d_state, we aim to form a holistic picture of how sub-quadratic models can trade off performance against inference speed. Figure 3 shows such a Pareto frontier for the Mamba models considered in this paper. For each data point, we train a 440M-parameter model for 2× Chinchilla-optimal tokens on the FineWeb-Edu dataset, with the model configured with a d_state of {16, 32, 64, 128}. As expected, we observe an inverse correlation between validation loss and d_state. Moreover, there is a general downward shift of the Pareto frontier when moving from Mamba-2 to Mamba-3, indicating a stronger model: in this setting, Mamba-3 with a 2× smaller state size achieves better pretraining perplexity than its Mamba-2 counterpart, resulting in a faster model with the same quality, or a better model at the same speed.
A further downward shift is observed when moving from the SISO variant of Mamba-3 to the MIMO variant (where we set the MIMO rank R = 4 and decrease the MLP inner dimension to parameter-match the SISO variants). We expand the comparison to include the GDN baseline in Appendix E, Figure 6, which also shows Mamba-3 comparing favorably to GDN.

Figure 3: Exploration of state size (inference-speed proxy) versus pretraining perplexity (performance proxy) across different Mamba variants. Mamba-3 improves the Pareto frontier compared to previous recurrent SISO models, while incorporating MIMO further shifts the frontier through better modeling performance without increasing state size.

Table 6: Kernel latency (in milliseconds) comparison across models, precision, and d_state values. Mamba-3 introduces minimal overhead compared to Mamba-2 and features highly efficient practical implementations. Our Mamba-3 SISO kernels are faster than reference Mamba-2 and GDN kernels at the commonly used bf16, d_state = 128 setting. Mamba-3 MIMO (R = 4) incurs little additional cost compared to SISO.

Model              FP32                         BF16
                   d_state=64   d_state=128     d_state=64   d_state=128
Mamba-2            0.295        0.409           0.127        0.203
GDN                0.344        0.423           0.176        0.257
Mamba-3 (SISO)     0.310        0.399           0.110        0.156
Mamba-3 (MIMO)     0.333        0.431           0.137        0.179

Table 7: Prefill and Prefill+Decode latency across sequence lengths. Mamba-3 adds minimal overhead to its forward pass and retains competitive decode latencies. Details in Appendix G.
Model                  512 tokens       1024 tokens      2048 tokens      4096 tokens      16384 tokens
                       Prefill  +Dec    Prefill  +Dec    Prefill  +Dec    Prefill  +Dec    Prefill  +Dec
vLLM (Llama-3.2-1B)    0.26     4.45    0.52     9.60    1.08     20.37   2.08     58.64   12.17    976.50
Gated DeltaNet         0.51     4.56    1.01     9.11    2.01     18.22   4.00     36.41   16.21    145.87
Mamba-2                0.51     4.66    1.02     9.32    2.02     18.62   4.02     37.22   16.22    149.02
Mamba-3 (SISO)         0.51     4.39    1.01     8.78    2.02     17.57   4.01     35.11   16.22    140.61
Mamba-3 (MIMO R=4)     0.60     4.74    1.21     9.48    2.42     18.96   4.76     37.85   19.44    151.81

4.4 Fast Mamba-3 Kernels

We complement Mamba-3's methodological advances with optimized kernels that deliver fast inference in practical settings. We implement a new series of inference kernels for Mamba-3, using Triton for the forward (prefill) path and CuTe DSL for decode, and compare their per-token decode latency against the released Triton kernels for Mamba-2 and GDN in Table 6.⁵ The evaluation measures a single decode step at batch size 128 on a single H100 for both FP32 and BF16 datatypes; models are 1.5B parameters with model dimension 2048 and state dimension ∈ {64, 128}. Across all configurations, SISO achieves the lowest latency among baselines. MIMO, with its higher arithmetic intensity, increases the decoding FLOPs without significantly increasing decode runtime. Our benchmarks indicate that our CuTe DSL decode implementation is competitive and that the additional components of Mamba-3 (exponential-trapezoidal update, complex-valued state, and MIMO projections) are lightweight. This supports our overall inference-first perspective: Mamba-3 admits a simple, low-latency implementation while providing strong empirical performance. Table 7 benchmarks both end-to-end latency across different decoding sequence lengths and prefill time for the same sequence lengths.
The decode times are consistent with Table 6: Mamba-3 (SISO) is fastest, Mamba-3 (MIMO) is on par with Mamba-2, and all linear methods are faster than optimized attention as sequence length grows. We also see that MIMO incurs a moderate overhead for prefill, as discussed in Section 3.3. Details of the benchmark are in Appendix G.

⁵ Details on each kernel DSL and the exact kernel fusion structure are provided in Appendix G.

5 Related Work

5.1 Linear-Time Sequence Mixers

A growing body of work seeks to replace the quadratic softmax-based attention mechanism (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017) with linear-runtime alternatives. Prominent approaches can be categorized under three broad frameworks: linear attention, test-time training, and state space models.

Many nascent linear attention (LA) models aimed to approximate softmax attention through kernel feature maps (Choromanski et al. 2022; Katharopoulos et al. 2020), while recent models have discarded the feature maps in favor of raw dot products between queries and keys, modulated by decays or masks (Yutao Sun et al. 2023; S. Yang, B. Wang, Shen, et al. 2024). More recently, fast-weight programmers (Schlag, Irie, and Schmidhuber 2021) that modulate the state memory with key-value pairs have also fallen under the umbrella term "linear attention." S. Yang, Kautz, and Hatamizadeh (2025) and S. Yang, B. Wang, Y. Zhang, et al. (2025) originated from this line of work and enhanced traditional linear attention by replacing the additive memory update with a delta-rule recurrence. This has further spurred a host of work improving the efficiency and capabilities of linear models built on the delta rule (Hu et al. 2025; Kimi Team et al. 2025).

A parallel line of test-time training (TTT) or test-time regression (TTR) work views sequence modeling as an online learning task during inference.
Here, the recurrent state represents a compressed summary of past inputs, and recurrent steps update the state to memorize new information (Yu Sun et al. 2025; Tandon et al. 2025; T. Zhang et al. 2025). Equivalently, these methods can be viewed as optimizing a global regression objective, with recurrent state updates representing iterative optimization procedures such as variants of gradient descent (K. A. Wang, Shi, and Fox 2025).

Structured state space models (SSMs) are another view of modern recurrent models, inspired by classical signal processing and dynamical systems. Early versions of SSMs such as S4 (Gu, Goel, and Ré 2022; Gupta, Gu, and Berant 2022; Smith, Warrington, and Linderman 2023) used linear time-invariant (LTI) layers with structured state transition matrices, for example diagonal or low-rank-plus-diagonal, to facilitate efficient computation and stable learning on long-context tasks. The introduction of time-varying, input-dependent selectivity to SSMs in Mamba-1 (Gu and Dao 2024) reduced the disparity between self-attention and linear models on information-dense modalities, notably language modeling. Subsequently, Mamba-2 (Dao and Gu 2024) formalized the connection between SSMs and (linear) attention through the structured state space duality (SSD) that we build on in this work.

5.2 State Tracking and Complex State Space Models

Expressivity and State Tracking. Recent work characterizes the types of state that recurrent, constant-memory mixers can maintain, revealing algorithmic deficiencies in previous SSM-based models. Merrill, Petty, and Sabharwal (2025) show that under finite precision, practical SSMs collapse to TC⁰, leading to failures on tasks like permutation composition over S₅ unless the primitive is extended.
Similarly, Yu and Erichson (2025) prove that a single-layer Mamba is not a universal approximator. Several modifications have been proposed to improve expressivity. For instance, the same work shows that a block-biased variant regains the universal approximation property with only minor changes, through either block decomposition or a channel-specific bias. Allowing negative eigenvalues or non-triangular transitions enables linear RNNs, including diagonal and Householder/DeltaNet forms, to capture parity and, under mild assumptions, regular languages (Grazzi, Siems, Zela, et al. 2025). Complex-valued parameterizations provide another avenue for enhanced expressivity.

Complex State Space Models. Structured SSMs prior to Mamba were frequently complex-valued, rooted in traditional SSM theory. They also generally excelled in domains such as vision and audio, which have explicit frequency-based information content, rather than language. While some models such as H3 (Fu et al. 2023), RetNet (Yutao Sun et al. 2023), and Megalodon (Ma et al. 2024) kept complex-valued SSMs while targeting language modeling, they still noticeably underperformed Transformers. Additionally, because these models were LTI and were computed using very different algorithms (in particular, convolutions or explicit recurrence) than modern selective SSMs such as Mamba, they generally did not use the RoPE trick to handle the complex part. An exception is RetNet, which introduced a model in between linear attention and Mamba-2 that used constant scalar decays (as opposed to no decay in LA and data-dependent decay in Mamba-2) with an additional constant complex phase implemented through RoPE. In general, complex numbers had been empirically found to be unhelpful for language modeling, and hence were phased out in Mamba-1 and its successors, including parallel lines of work on linear attention and test-time training.
Mamba-3 represents the first modern recurrent model with complex-valued state transitions, which were introduced for the specific purposes of increasing expressivity and state-tracking ability. By incorporating the RoPE trick, this represents, to the best of our knowledge, the first usage of data-dependent RoPE grounded in theoretical motivations.

5.3 Multi-Input, Multi-Output

S4 (Gu, Goel, and Ré 2022) is a single-input, single-output (SISO) LTI system where each dimension of the input was assigned its own independent SSM. Such SISO models have a significantly larger recurrent state than classical RNNs, and necessitated more complicated mathematical machinery to compute them efficiently. Aiming to simplify the model, S5 (Smith, Warrington, and Linderman 2023) and LRU (Orvieto et al. 2023) replaced the set of SISO SSMs with a multi-input, multi-output (MIMO) SSM applied directly on the entire vectorized input. This change reduced the effective state capacity but enabled an alternate computation path by directly computing the recurrence with a parallel scan. While this trade-off between state capacity and modeling performance was less pronounced in LTI models, Mamba-1 (S6) (Gu and Dao 2024) and Mamba-2 (Dao and Gu 2024) returned to the SISO system due to the importance of a large state size in the time-varying setting. The computational bottleneck associated with the increased state size was addressed with a hardware-aware parallel scan algorithm for Mamba-1 and a matrix multiplication-based algorithm for Mamba-2.

The introduction of MIMO to Mamba-3 significantly diverges from prior work. Unlike previous MIMO models, which aimed to simplify training algorithms at the cost of slightly reduced expressivity, Mamba-3's MIMO structure is motivated to increase modeling power while preserving inference efficiency. Accordingly, its state expansion is kept at Mamba-1/-2 levels to maintain modeling capabilities while trading off additional training compute.
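To make the SISO/MIMO contrast concrete, the following NumPy sketch (our illustration; the shapes follow the MIMO formulation in Appendix C, and all variable names are ours) compares the rank-1 state increment of a SISO update with the rank-$R$ increment of a MIMO update at a single timestep. Both write into the same $(N, P)$ state, which is why decode-time memory is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, R = 4, 8, 2  # state size, head dimension, MIMO rank

# SISO (Mamba-1/2 style): the state increment is an outer product of a single
# B projection with the vector of per-channel scalar inputs -- always rank 1.
B_siso = rng.standard_normal(N)
x_t = rng.standard_normal(P)          # one scalar input per channel
H_siso = np.outer(B_siso, x_t)        # (N, P) increment, rank 1

# MIMO sketch: the increment is a product of matrix-valued projections
# B_t in (N, R) and X_t in (P, R), so it can write up to R directions at once.
B_mimo = rng.standard_normal((N, R))
X_t = rng.standard_normal((P, R))
H_mimo = B_mimo @ X_t.T               # (N, P) increment, rank <= R

assert H_siso.shape == H_mimo.shape == (N, P)   # same state, same decode cost
assert np.linalg.matrix_rank(H_siso) == 1
assert np.linalg.matrix_rank(H_mimo) == min(R, N, P)
```

The extra cost of MIMO is thus in the $R$-fold larger projections at training time, not in a larger recurrent state.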
5.4 The State Space Model Viewpoint

Although modern recurrent models have several different viewpoints that largely converge (Section 5.1), each framework has slightly different interpretations and motivations that can lead to different design spaces and extensions. In particular, linear attention and test-time training are more closely related and can perhaps be lumped together under a framework of associative memory that explicitly aims to memorize input data through "key-value" stores; either through approximations to the canonical KV method (i.e., quadratic attention) in LA, or by minimizing soft optimization objectives in TTT. On the other hand, state space models have a different lineage, as reflected both in terminology (e.g., $A$, $B$, $C$, $X$ instead of $Q$, $K$, $V$) and in their natural extensions. Notably, the methodological improvements in Mamba-3 are all associated with the SSM viewpoint specifically and are less motivated from associative memory frameworks.

1. Exponential-Trapezoidal Discretization. The SSM viewpoint entails the discretization of a continuous ODE governing the system; our exponential-trapezoidal discretization falls out of an improved discretization method. As associative memory methods do not use discretization, it is not obvious how to interpret a 3-term recurrence such as exponential-trapezoidal under alternate viewpoints.

2. Complex-Valued State Transitions. Complex SSMs have long been a staple of dynamical systems, and it is natural to consider complex values as an extension of selective SSMs. On the other hand, the associative memory framework interprets the $A$ state transition as a coefficient of an objective function, for example corresponding to the weight of an L2 regularization (or weight-decay) term in the optimization objective (K. A. Wang, Shi, and Fox 2025). However, complex values are meaningless as the coefficient of a regression objective; hence, Mamba-3 is not obviously interpretable within these frameworks.
3. Multi-Input, Multi-Output. MIMO is a classical concept from the state space model literature and does not naturally appear in associative memory (linear attention or test-time training) frameworks. However, we do note that the MIMO formulation introduced in this paper is not directly tied to SSM theory, and is instead motivated from a computational perspective; our techniques can be adapted to other modern recurrent models as well.

There continues to be vigorous progress in the development of linear-time sequence models, and the discussion here only captures a portion of them. We anticipate a growing space of unified frameworks, improved understanding, and new generalizations as the development of these models continually evolves.

6 Conclusion and Future Work

We introduce Mamba-3, a state space model with several methodological improvements over prior SSMs: a more powerful recurrence via exponential-trapezoidal discretization; improved expressivity through complex-valued state transitions; and higher inference efficiency and modeling abilities with a MIMO formulation. The base SISO version of Mamba-3 delivers strong language modeling results, both standalone and in interleaved hybrid architectures, and advances the Pareto frontier on the performance-efficiency tradeoff over prior linear sequence models. The MIMO version trades off slower training for even stronger modeling power, while maintaining competitive inference efficiency compared to Mamba-2. Put together, the techniques in Mamba-3 show simple and theoretically motivated improvements from the state space model viewpoint, and open up new directions and design principles for efficient sequence models.

Acknowledgments. We gratefully acknowledge the support of the Schmidt Sciences AI2050 fellowship, the Google ML and Systems Junior Faculty Awards, the Google Research Scholar program, Princeton Language and Intelligence (PLI), Together AI, and Cartesia AI.
KL is supported by the NSF GRFP under Grant DGE2140739. We also thank Sukjun Hwang and Gaurav Ghosal for helpful feedback and discussions.

References

[1] Zeyuan Allen-Zhu. "Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers". In: SSRN Electronic Journal (May 2025). https://ssrn.com/abstract=5240330.
[2] Anthropic. Introducing Claude Opus 4.6. Feb. 2026. url: https://www.anthropic.com/news/claude-opus-4-6 (visited on 02/17/2026).
[3] Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, and Christopher Potts. Mechanistic evaluation of Transformers and state space models. 2025. arXiv: 2505.15105 [cs.CL]. url: https://arxiv.org/abs/2505.15105.
[4] Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. 2025. arXiv: 2402.18668 [cs.CL]. url: https://arxiv.org/abs/2402.18668.
[5] Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, and Christopher Ré. Just read twice: closing the recall gap for recurrent language models. 2024. arXiv: 2407.05483 [cs.CL]. url: https://arxiv.org/abs/2407.05483.
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2014. arXiv: 1409.0473 [cs.CL].
[7] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about Physical Commonsense in Natural Language. 2019. arXiv: 1911.11641 [cs.CL].
[8] Loïc Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazaré, Gabriel Synnaeve, and Hervé Jégou. Short window attention enables long-term memorization. 2025. arXiv: 2509.24552 [cs.LG]. url: https://arxiv.org/abs/2509.24552.
[9] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking Attention with Performers. 2022. arXiv: 2009.14794 [cs.LG].
[10] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. 2018. arXiv: 1803.05457 [cs.AI].
[11] Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. 2024. arXiv: 2405.21060 [cs.LG].
[12] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. 2019. arXiv: 1903.00161 [cs.CL]. url: https://arxiv.org/abs/1903.00161.
[13] Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, and Christopher Ré. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. 2023. arXiv: 2212.14052 [cs.LG].
[14] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The Language Model Evaluation Harness. Version v0.4.3. July 2024. doi: 10.5281/zenodo.12608602. url: https://zenodo.org/records/12608602.
[15] Aaron Grattafiori et al. The Llama 3 Herd of Models. 2024. arXiv: 2407.21783 [cs.AI].
[16] Riccardo Grazzi, Julien Siems, Simon Schrodi, Thomas Brox, and Frank Hutter. Is Mamba Capable of In-Context Learning? 2024. arXiv: 2402.03170 [cs.LG].
[17] Riccardo Grazzi, Julien Siems, Arber Zela, Jörg K. H. Franke, Frank Hutter, and Massimiliano Pontil. Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues. 2025. arXiv: 2411.12537 [cs.LG]. url: https://arxiv.org/abs/2411.12537.
[18] Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. 2024. arXiv: 2312.00752 [cs.LG]. url: https://arxiv.org/abs/2312.00752.
[19] Albert Gu, Karan Goel, and Christopher Ré. Efficiently Modeling Long Sequences with Structured State Spaces. 2022. arXiv: 2111.00396 [cs.LG].
[20] Albert Gu, Ankit Gupta, Karan Goel, and Christopher Ré. "On the Parameterization and Initialization of Diagonal State Space Models". In: arXiv preprint arXiv:2206.11893 (2022).
[21] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal State Spaces are as Effective as Structured State Spaces. 2022. arXiv: 2203.14343 [cs.LG].
[22] Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-Key Normalization for Transformers. 2020. arXiv: 2010.04245 [cs.CL].
[23] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the Real Context Size of Your Long-Context Language Models? 2024. arXiv: 2404.06654 [cs.CL]. url: https://arxiv.org/abs/2404.06654.
[24] Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, and Weigao Sun. Comba: Improving Bilinear RNNs with Closed-loop Control. 2025. arXiv: 2506.02475 [cs.LG]. url: https://arxiv.org/abs/2506.02475.
[25] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. 2017. arXiv: 1705.03551 [cs.CL]. url: https://arxiv.org/abs/1705.03551.
[26] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret.
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. 2020. arXiv: 2006.16236 [cs.LG]. url: https://arxiv.org/abs/2006.16236.
[27] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, and Yulun Du. Kimi Linear: An Expressive, Efficient Attention Architecture. 2025. arXiv: 2510.26692 [cs.CL].
[28] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. "Natural Questions: A Benchmark for Question Answering Research". In: Transactions of the Association for Computational Linguistics 7 (2019). Ed. by Lillian Lee, Mark Johnson, Brian Roark, and Ani Nenkova, pp. 452–466. doi: 10.1162/tacl_a_00276. url: https://aclanthology.org/Q19-1026/.
[29] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. 2023. arXiv: 2309.06180 [cs.LG].
[30] Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari.
LLM Inference Serving: Survey of Recent Advances and Opportunities. 2024. arXiv: 2407.12391 [cs.DC].
[31] Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length. 2024. arXiv: 2404.08801 [cs.LG].
[32] William Merrill, Jackson Petty, and Ashish Sabharwal. The Illusion of State in State-Space Models. 2025. arXiv: 2404.08819 [cs.LG]. url: https://arxiv.org/abs/2404.08819.
[33] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. 2018. arXiv: 1809.02789 [cs.CL]. url: https://arxiv.org/abs/1809.02789.
[34] NVIDIA. NVIDIA H100 Tensor Core GPU White Paper. Tech. rep. NVIDIA, 2022. url: https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c.
[35] NVIDIA et al. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. 2025. arXiv: 2504.03624 [cs.CL]. url: https://arxiv.org/abs/2504.03624.
[36] OpenAI. Introducing GPT-5.3-Codex. Feb. 2026. url: https://openai.com/index/introducing-gpt-5-3-codex/ (visited on 02/17/2026).
[37] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. 2023. arXiv: 2303.06349 [cs.LG]. url: https://arxiv.org/abs/2303.06349.
[38] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. 2016. arXiv: 1606.06031 [cs.CL].
[39] Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. 2024. arXiv: 2406.17557 [cs.CL]. url: https://arxiv.org/abs/2406.17557.
[40] Pranav Rajpurkar, Jian Zhang, and Percy Liang. "Know What You Don't Know: Unanswerable Questions for SQuAD". In: ACL 2018. 2018.
[41] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. 2019. arXiv: 1907.10641 [cs.CL].
[42] Yash Sarrof, Yana Veitsman, and Michael Hahn. The Expressive Capacity of State Space Models: A Formal Language Perspective. 2024. arXiv: 2405.17394 [cs.CL].
[43] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. 2021. arXiv: 2102.11174 [cs.LG].
[44] Jimmy T. H. Smith, Andrew Warrington, and Scott W. Linderman. Simplified State Space Layers for Sequence Modeling. 2023. arXiv: 2208.04933 [cs.LG].
[45] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. 2024. arXiv: 2408.03314 [cs.LG]. url: https://arxiv.org/abs/2408.03314.
[46] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. arXiv: 2104.09864 [cs.CL].
[47] Endre Süli and David F. Mayers. An Introduction to Numerical Analysis. Cambridge University Press, 2003.
[48] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (Learn at Test Time): RNNs with Expressive Hidden States. 2025. arXiv: 2407.04620 [cs.LG].
[49] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei.
Retentive Network: A Successor to Transformer for Large Language Models. 2023. arXiv: 2307.08621 [cs.CL]. url: https://arxiv.org/abs/2307.08621.
[50] Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, and Yu Sun. End-to-End Test-Time Training for Long Context. 2025. arXiv: 2512.23675 [cs.LG].
[51] Tencent Hunyuan Team et al. Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought. 2025. arXiv: 2505.15431 [cs.CL]. url: https://arxiv.org/abs/2505.15431.
[52] M. Tenenbaum and H. Pollard. Ordinary Differential Equations: An Elementary Textbook for Students of Mathematics, Engineering, and the Sciences. Dover Books on Mathematics. Dover Publications, 1985. isbn: 9780486649405. url: https://books.google.com/books?id=iU4zDAAAQBAJ.
[53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need". In: Advances in neural information processing systems. 2017, pp. 5998–6008.
[54] Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, and Bryan Catanzaro. An Empirical Study of Mamba-based Language Models. 2024. arXiv: 2406.07887 [cs.LG]. url: https://arxiv.org/abs/2406.07887.
[55] Ke Alexander Wang, Jiaxin Shi, and Emily B. Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. 2025. arXiv: 2501.12352 [cs.LG].
[56] Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D.
Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Small-scale proxies for large-scale Transformer training instabilities. 2023. arXiv: 2309.14322 [cs.LG]. url: https://arxiv.org/abs/2309.14322.
[57] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models. 2025. arXiv: 2408.00724 [cs.AI]. url: https://arxiv.org/abs/2408.00724.
[58] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 Technical Report. 2025. arXiv: 2505.09388 [cs.CL]. url: https://arxiv.org/abs/2505.09388.
[59] Bowen Yang, Bharat Venkitesh, Dwarak Talupuru, Hangyu Lin, David Cairuz, Phil Blunsom, and Acyr Locatelli. Rope to Nope and Back Again: A New Hybrid Attention Strategy. 2025. arXiv: 2501.18795 [cs.CL]. url: https://arxiv.org/abs/2501.18795.
[60] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. 2025. arXiv: 2412.06464 [cs.CL].
[61] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated Linear Attention Transformers with Hardware-Efficient Training. 2024.
arXiv: 2312.06635 [cs.LG].
[62] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. 2025. arXiv: 2406.06484 [cs.LG].
[63] Annan Yu and N. Benjamin Erichson. Block-Biased Mamba for Long-Range Sequence Processing. 2025. arXiv: 2505.09022 [cs.LG]. url: https://arxiv.org/abs/2505.09022.
[64] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a Machine Really Finish Your Sentence? 2019. arXiv: 1905.07830 [cs.CL].
[65] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-Time Training Done Right. 2025. arXiv: 2505.23884 [cs.LG]. url: https://arxiv.org/abs/2505.23884.

A Exponential-Trapezoidal Discretization

Proposition 5 (Variation of Constants (Tenenbaum and Pollard 1985)). Consider the linear SSM $\dot{\mathbf{h}}(t) = A(t)\mathbf{h}(t) + \mathbf{B}(t)x(t)$, where $\mathbf{h}(t) \in \mathbb{R}^N$, $A(t) \in \mathbb{R}$ is a scalar decay, and $\mathbf{B}(t)x(t) \in \mathbb{R}^N$. For a discretized time grid $\tau_t = \tau_{t-1} + \Delta_t$, the hidden state satisfies equation (15), which can then be approximated by equation (16) with $O(\Delta_t^2)$ error. The approximation of the remaining integral on the state-input can have varying error bounds depending on the method used; an example can be found in Appendix A.2.

$$\mathbf{h}(\tau_t) = \exp\left(\int_{\tau_{t-1}}^{\tau_t} A(s)\,ds\right)\mathbf{h}(\tau_{t-1}) + \int_{\tau_{t-1}}^{\tau_t} \exp\left(\int_{\tau}^{\tau_t} A(s)\,ds\right)\mathbf{B}(\tau)x(\tau)\,d\tau, \quad (15)$$

$$\mathbf{h}_t \approx e^{\Delta_t A_t}\mathbf{h}_{t-1} + \int_{\tau_{t-1}}^{\tau_t} e^{(\tau_t - \tau)A_t}\mathbf{B}(\tau)x(\tau)\,d\tau. \quad (16)$$

Proof. Starting from the initial linear SSM, an integrating factor $z(t) := e^{-\int_0^t A(s)\,ds}$ is applied to facilitate integration.
$$z(t)\dot{\mathbf{h}}(t) = z(t)A(t)\mathbf{h}(t) + z(t)\mathbf{B}(t)x(t)$$

Considering $z'(t) = -A(t)z(t)$, rearranging the terms and integrating over the time grid $[\tau_{t-1}, \tau_t]$,

$$\int_{\tau_{t-1}}^{\tau_t} \frac{d}{d\tau}\left(z(\tau)\mathbf{h}(\tau)\right)d\tau = \int_{\tau_{t-1}}^{\tau_t} z(\tau)\mathbf{B}(\tau)x(\tau)\,d\tau$$

results in

$$z(\tau_t)\mathbf{h}(\tau_t) - z(\tau_{t-1})\mathbf{h}(\tau_{t-1}) = \int_{\tau_{t-1}}^{\tau_t} z(\tau)\mathbf{B}(\tau)x(\tau)\,d\tau,$$

which can be arranged in a more familiar form

$$\mathbf{h}(\tau_t) = z(\tau_t)^{-1}z(\tau_{t-1})\mathbf{h}(\tau_{t-1}) + \int_{\tau_{t-1}}^{\tau_t} z(\tau_t)^{-1}z(\tau)\mathbf{B}(\tau)x(\tau)\,d\tau.$$

Substituting the integrating factor $z(\tau)$ corresponds to

$$\mathbf{h}(\tau_t) = \exp\left(\int_{\tau_{t-1}}^{\tau_t} A(s)\,ds\right)\mathbf{h}(\tau_{t-1}) + \int_{\tau_{t-1}}^{\tau_t} \exp\left(\int_{\tau}^{\tau_t} A(s)\,ds\right)\mathbf{B}(\tau)x(\tau)\,d\tau.$$

We approximate the state-transition integral with a right-hand assumption where $\forall s \in [\tau_{t-1}, \tau_t]$, $A(s) := A(\tau_t)$, which we refer to as $A_t$:

$$\mathbf{h}_t \approx \underbrace{\exp(\Delta_t A_t)\mathbf{h}_{t-1}}_{\text{right-hand approximation}} + \underbrace{\int_{\tau_{t-1}}^{\tau_t} \exp\left((\tau_t - \tau)A_t\right)\mathbf{B}(\tau)x(\tau)\,d\tau}_{\text{to be approximated}},$$

incurring a local truncation error of order $O(\Delta_t^2)$. Thus, we have approximated the exponential dynamics of the adjusted underlying ODE and leave the state-input integral to be approximated with any host of methods. □

A.1 Exponential-Trapezoidal Discretization's Mask Matrix

Proof. When viewing the tensor contraction form, let us denote the shapes $C = (T, N)$, $B = (S, N)$, $L = (T, S)$, $X = (S, P)$, based on the Mamba-2 paper. With this decomposition of our mask, we can view $L = \text{contract}(TJ, JS \to TS)(L_1, L_2)$.
The original contraction can be seen as

$$\text{contract}(TN, SN, TS, SP \to TP)(C, B, L, X).$$

We can now view it as

$$\text{contract}(TN, SN, TJ, JS, SP \to TP)(C, B, L_1, L_2, X).$$

This can be broken into the following:

$$Z = \text{contract}(SN, SP \to SNP)(B, X)$$
$$Z' = \text{contract}(JS, SNP \to JNP)(L_2, Z)$$
$$H = \text{contract}(TJ, JNP \to TNP)(L_1, Z')$$
$$Y = \text{contract}(TN, TNP \to TP)(C, H)$$

We can view the step $Z' = \text{contract}(JS, SNP \to JNP)(L_2, Z)$ as a convolution of size two applied on the state-input ($B$, $X$ outer product) prior to the decay with the traditional SSD $L = L_1$ matrix. □

A.2 Exponential-Trapezoidal Discretization Error Rate

Standard assumptions. We assume that: $A(t)$, $\mathbf{B}(t)$, $x(t)$ are bounded and $C^3$ on each timestep, so that $g(\tau)$ has three bounded derivatives; the map $\mathbf{h} \mapsto A(t)\mathbf{h} + \mathbf{B}(t)x(t)$ is Lipschitz in $\mathbf{h}$, which is true for linear systems; and $\lambda_t$ lies in a bounded interval so that the update is zero-stable.

Proof. Let $g(\tau) := e^{(t_k - \tau)A_k}\mathbf{B}(\tau)x(\tau)$ denote the integrand in the second term of Proposition 5. Since $A(t)$, $\mathbf{B}(t)$, $x(t)$ are $C^3$ on $[t_{k-1}, t_k]$, the function $g$ has three bounded derivatives. A second-order Taylor expansion of $g$ around $t_{k-1}$ gives us

$$\int_{t_{k-1}}^{t_k} g(\tau)\,d\tau = \Delta_t g(t_{k-1}) + \frac{\Delta_t^2}{2}g'(t_{k-1}) + \frac{\Delta_t^3}{6}g''(t_{k-1}) + O(\Delta_t^4).$$

Recall that the trapezoidal approximation to this integral is given by

$$Q_\lambda = \Delta_t\left[(1 - \lambda_t)g(t_{k-1}) + \lambda_t g(t_k)\right].$$

Expanding $g(t_k)$ using a Taylor expansion:

$$g(t_k) = g(t_{k-1}) + \Delta_t g'(t_{k-1}) + \frac{\Delta_t^2}{2}g''(t_{k-1}) + O(\Delta_t^3).$$

Substituting this into $Q_\lambda$,

$$Q_\lambda = \Delta_t\left[(1 - \lambda_t)g(t_{k-1}) + \lambda_t g(t_k)\right] = \Delta_t g(t_{k-1}) + \lambda_t\Delta_t^2 g'(t_{k-1}) + \frac{\lambda_t\Delta_t^3}{2}g''(t_{k-1}) + O(\Delta_t^4).$$

Hence, the error is given by:

$$\int_{t_{k-1}}^{t_k} g(\tau)\,d\tau - Q_\lambda = \left(\frac{1}{2} - \lambda_t\right)\Delta_t^2 g'(t_{k-1}) + \left(\frac{1}{6} - \frac{\lambda_t}{2}\right)\Delta_t^3 g''(t_{k-1}) + O(\Delta_t^4).$$
Under the assumption that $\lambda_t = \frac{1}{2} + c_t\Delta_t$, where $c_t = O(1)$, we have $\frac{1}{2} - \lambda_t = -c_t\Delta_t = O(\Delta_t)$, and thus the $\Delta_t^2$ term is $O(\Delta_t^3)$. Therefore,

$$\int_{t_{k-1}}^{t_k} g(\tau)\,d\tau - Q_\lambda = O(\Delta_t^3),$$

which yields an $O(\Delta_t^3)$ local truncation error. □

A.3 Exponential-Trapezoidal Parameterization

Setting: All runs use the Mamba-3 (SISO) 440M model trained at Chinchilla scale, with the other architectural and optimization hyperparameters being the same as in Table 3. The default model uses a data-dependent gate $\lambda_t = \sigma(u_t)$, where $u_t$ is a learned projection of the current input token. In Table 8, we try different parameterizations for $\lambda_t$ and find that the default parameterization empirically performs the best. Hence, we choose the simpler default parameterization that does not enforce $\lambda_t = \frac{1}{2} + O(\Delta_t)$.

Table 8: Ablations on $\lambda_t$ parameterization in the exponential-trapezoidal update.

Parameterization          Form of $\lambda_t$    ppl ↓
Default                   $\sigma(u_t)$          15.72
Fixed 1/2                 $1/2$                  15.76
No trapezoidal (Euler)    $1$                    15.81

B Complex SSM Proofs

B.1 Proof of Proposition 2

Proposition 2 (Complex-to-Real SSM Equivalence). Consider a complex-valued SSM

$$\dot{\mathbf{h}}(t) = \mathrm{Diag}\left(A(t) + i\boldsymbol{\theta}(t)\right)\mathbf{h}(t) + \left(\mathbf{B}(t) + i\hat{\mathbf{B}}(t)\right)x(t), \quad (8)$$
$$y(t) = \mathrm{Re}\left[\left(\mathbf{C}(t) + i\hat{\mathbf{C}}(t)\right)^\top \mathbf{h}(t)\right],$$

where $\mathbf{h}(t) \in \mathbb{C}^{N/2}$, $\boldsymbol{\theta}(t), \mathbf{B}(t), \hat{\mathbf{B}}(t), \mathbf{C}(t), \hat{\mathbf{C}}(t) \in \mathbb{R}^{N/2}$, and $x(t), A(t) \in \mathbb{R}$. Under exponential-Euler discretization, this system is equivalent to a real-valued SSM

$$\mathbf{h}_t = e^{\Delta_t A_t}\mathbf{R}_t\mathbf{h}_{t-1} + \Delta_t\mathbf{B}_t x_t, \quad (9)$$
$$y_t = \mathbf{C}_t^\top\mathbf{h}_t,$$

with state $\mathbf{h}_t \in \mathbb{R}^N$, projections $\mathbf{B}_t := \begin{bmatrix}\mathbf{B}_t \\ \hat{\mathbf{B}}_t\end{bmatrix} \in \mathbb{R}^N$, $\mathbf{C}_t := \begin{bmatrix}\mathbf{C}_t \\ -\hat{\mathbf{C}}_t\end{bmatrix} \in \mathbb{R}^N$, and a transition matrix

$$\mathbf{R}_t := \mathrm{Block}\left\{R(\Delta_t\boldsymbol{\theta}_t[i])\right\}_{i=1}^{N/2} \in \mathbb{R}^{N\times N}, \qquad R(\theta) := \begin{bmatrix}\cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta)\end{bmatrix}.$$

Proof. We first present the derivation for $N = 2$; the block-diagonal structure for general even $N$ follows by grouping pairs of coordinates.
Let $h_t + i\hat{h}_t$ denote the complexified hidden state, with parameters $A(t) + i\theta(t)$ and $B(t) + i\hat{B}(t)$ for the transition and input, respectively. By the variation of constants formula (Proposition 5), applying a zero-order hold and Euler's rule over a step $[t_{k-1}, t_k]$ gives

$$h_k + i\hat{h}_k = e^{\Delta_t(A_t + i\theta_t)}\left(h_{k-1} + i\hat{h}_{k-1}\right) + \Delta_t\left(B_t + i\hat{B}_t\right)x_t.$$

Expanding the exponential,

$$e^{\Delta_t(A_t + i\theta_t)} = e^{\Delta_t A_t}\left(\cos(\Delta_t\theta_t) + i\sin(\Delta_t\theta_t)\right),$$

so in real coordinates $\mathbf{h}_t = \begin{bmatrix}h_t \\ \hat{h}_t\end{bmatrix} \in \mathbb{R}^2$ the recurrence becomes

$$\mathbf{h}_t = e^{\Delta_t A_t}\underbrace{\begin{bmatrix}\cos(\Delta_t\theta_t) & -\sin(\Delta_t\theta_t) \\ \sin(\Delta_t\theta_t) & \cos(\Delta_t\theta_t)\end{bmatrix}}_{R(\Delta_t\theta_t)}\mathbf{h}_{t-1} + \Delta_t\begin{bmatrix}B_t \\ \hat{B}_t\end{bmatrix}x_t.$$

Stacking across $N/2$ such pairs yields the block-diagonal transition

$$\mathbf{h}_t = e^{\Delta_t A_t}\,\mathrm{Block}\left\{R(\Delta_t\boldsymbol{\theta}_t[i])\right\}_{i=1}^{N/2}\mathbf{h}_{t-1} + \Delta_t\begin{bmatrix}\mathbf{B}_t \\ \hat{\mathbf{B}}_t\end{bmatrix}x_t.$$

For the output,

$$y_t = \mathrm{Re}\left[\left(\mathbf{C}_t + i\hat{\mathbf{C}}_t\right)^\top\left(h_t + i\hat{h}_t\right)\right] = \begin{bmatrix}\mathbf{C}_t \\ -\hat{\mathbf{C}}_t\end{bmatrix}^\top\mathbf{h}_t,$$

which defines the real projection $\mathbf{C}_t \in \mathbb{R}^N$ in the proposition. This proves the equivalence between the complex SSM and the real block-diagonal system with rotations. □

B.2 Proof of Proposition 3

Proposition 3 (Complex SSM, Data-Dependent RoPE Equivalence). Under the notation established in Proposition 2, consider the real SSM defined in equation (9) unrolled for $T$ time-steps. The output of the above SSM is equivalent to that of a vanilla scalar transition matrix-based SSM (4) with a data-dependent rotary embedding applied on the $\mathbf{B}$, $\mathbf{C}$ components of the SSM, as defined by:

$$\mathbf{h}_t = e^{\Delta_t A_t}\mathbf{h}_{t-1} + \left(\prod_{i=0}^{t}\mathbf{R}_i^\top\right)\Delta_t\mathbf{B}_t x_t, \qquad y_t = \left(\left(\prod_{i=0}^{t}\mathbf{R}_i^\top\right)\mathbf{C}_t\right)^\top\mathbf{h}_t \quad (10)$$

where the matrix product represents right matrix multiplication, e.g., $\prod_{i=0}^{1}\mathbf{R}_i = \mathbf{R}_0\mathbf{R}_1$. We refer to the usage of a transformed real-valued SSM to compute the complex SSM as the "RoPE trick."

Proof.
Consider the SSM

$$\mathbf{h}_t = e^{\Delta_t A_t}\mathbf{R}_t\mathbf{h}_{t-1} + \Delta_t\mathbf{B}_t x_t, \qquad y_t = \mathbf{C}_t^\top\mathbf{h}_t, \quad (17)$$

where (as in Proposition 3) $A_t \in \mathbb{R}$ is a scalar (so that $e^{\Delta_t A_t}$ is a scalar and commutes with rotations), and $\mathbf{R}_t$ is block-diagonal orthogonal/unitary, hence $\mathbf{R}_t^{-1} = \mathbf{R}_t^\top$, and the matrices $\mathbf{R}_i, \mathbf{R}_j$ commute, i.e., $\mathbf{R}_i\mathbf{R}_j = \mathbf{R}_j\mathbf{R}_i$. Unrolling the recurrence with the convention that an empty product is the identity,

$$\mathbf{h}_t = \sum_{i=0}^{t}\left(\prod_{s=i+1}^{t}e^{\Delta_s A_s}\mathbf{R}_s\right)\Delta_i\mathbf{B}_i x_i. \quad (18)$$

Thus

$$y_t = \mathbf{C}_t^\top\mathbf{h}_t = \sum_{i=0}^{t}\mathbf{C}_t^\top\left(\prod_{s=i+1}^{t}e^{\Delta_s A_s}\mathbf{R}_s\right)\Delta_i\mathbf{B}_i x_i. \quad (19)$$

Using its unitary property,

$$\prod_{s=i+1}^{t}\mathbf{R}_s = \left(\prod_{s=0}^{t}\mathbf{R}_s\right)\left(\prod_{s=0}^{i}\mathbf{R}_s\right)^{-1} = \left(\prod_{s=0}^{t}\mathbf{R}_s\right)\left(\prod_{s=0}^{i}\mathbf{R}_s^\top\right).$$

Since the $e^{\Delta_s A_s}$ are scalars, they commute with rotations; hence

$$y_t = \sum_{i=0}^{t}\mathbf{C}_t^\top\left(\prod_{s=0}^{t}\mathbf{R}_s\right)\left(\prod_{s=i+1}^{t}e^{\Delta_s A_s}\right)\left(\prod_{s=0}^{i}\mathbf{R}_s^\top\right)\Delta_i\mathbf{B}_i x_i \quad (20)$$

$$= \left(\left(\prod_{s=0}^{t}\mathbf{R}_s^\top\right)\mathbf{C}_t\right)^\top\sum_{i=0}^{t}\left(\prod_{s=i+1}^{t}e^{\Delta_s A_s}\right)\left(\prod_{s=0}^{i}\mathbf{R}_s^\top\right)\Delta_i\mathbf{B}_i x_i. \quad (21)$$

Define the rotated parameters $\tilde{\mathbf{C}}_t := \left(\prod_{s=0}^{t}\mathbf{R}_s^\top\right)\mathbf{C}_t$ and $\tilde{\mathbf{B}}_i := \left(\prod_{s=0}^{i}\mathbf{R}_s^\top\right)\mathbf{B}_i$. Then,

$$y_t = \tilde{\mathbf{C}}_t^\top\sum_{i=0}^{t}\left(\prod_{s=i+1}^{t}e^{\Delta_s A_s}\right)\Delta_i\tilde{\mathbf{B}}_i x_i. \quad (22)$$

Equivalently, introducing the rotated state $\tilde{\mathbf{h}}_t := \left(\prod_{s=0}^{t}\mathbf{R}_s^\top\right)\mathbf{h}_t$,

$$\tilde{\mathbf{h}}_t = e^{\Delta_t A_t}\tilde{\mathbf{h}}_{t-1} + \Delta_t\tilde{\mathbf{B}}_t x_t, \qquad y_t = \tilde{\mathbf{C}}_t^\top\tilde{\mathbf{h}}_t. \quad (23)$$

□

B.3 Proof of Proposition 4

Proposition 4 (Rotary Embedding Equivalence with Exponential-Trapezoidal Discretization). Discretizing a complex SSM with the exponential-trapezoidal rule (Proposition 1) yields the recurrence

$$\mathbf{h}_t = \alpha_t\mathbf{h}_{t-1} + \beta_t\left(\prod_{i=0}^{t-1}\mathbf{R}_i^\top\right)\mathbf{B}_{t-1}x_{t-1} + \gamma_t\left(\prod_{i=0}^{t}\mathbf{R}_i^\top\right)\mathbf{B}_t x_t, \qquad y_t = \left(\left(\prod_{i=0}^{t}\mathbf{R}_i^\top\right)\mathbf{C}_t\right)^\top\mathbf{h}_t. \quad (11)$$

Here, $\mathbf{R}_t$ is the block-diagonal rotation matrix defined in Proposition 2.

Proof. We begin from the complex SSM (as in Prop. 2)

$$\dot{\mathbf{h}}(t) = \mathrm{Diag}\left(A(t) + i\boldsymbol{\theta}(t)\right)\mathbf{h}(t) + \left(\mathbf{B}(t) + i\hat{\mathbf{B}}(t)\right)x(t), \qquad y(t) = \mathrm{Re}\left[\left(\mathbf{C}(t) + i\hat{\mathbf{C}}(t)\right)^\top\mathbf{h}(t)\right],$$

where $A(t) \in \mathbb{R}$ is a scalar and $\boldsymbol{\theta}(t), \mathbf{B}(t), \hat{\mathbf{B}}(t), \mathbf{C}(t), \hat{\mathbf{C}}(t) \in \mathbb{R}^{N/2}$. Recall from Prop.
5 that

$$\mathbf{h}_t \approx e^{\Delta_t (A_t + i\boldsymbol{\theta}_t)} \mathbf{h}_{t-1} + \int_{\tau_{t-1}}^{\tau_t} e^{(\tau_t - \tau)(A_t + i\boldsymbol{\theta}_t)} \left( \mathbf{B}(\tau) + i\hat{\mathbf{B}}(\tau) \right) x(\tau)\, d\tau.$$

Applying Prop. 1 to the above integral, we get

$$\mathbf{h}_t = e^{\Delta_t (A_t + i\boldsymbol{\theta}_t)} \mathbf{h}_{t-1} + \beta_t\, e^{i\Delta_t \boldsymbol{\theta}_t} \left( \mathbf{B}_{t-1} + i\hat{\mathbf{B}}_{t-1} \right) x_{t-1} + \gamma_t \left( \mathbf{B}_t + i\hat{\mathbf{B}}_t \right) x_t, \tag{24}$$

where $\alpha_t := e^{\Delta_t A_t}$, $\beta_t := (1 - \lambda_t)\, \Delta_t\, e^{\Delta_t A_t}$, and $\gamma_t := \lambda_t \Delta_t$. Since $e^{\Delta_t (A_t + i\boldsymbol{\theta}_t)} = \alpha_t\, e^{i\Delta_t \boldsymbol{\theta}_t}$ and, as shown in Prop. 2, multiplication by $e^{i\Delta_t \boldsymbol{\theta}_t}$ is a block-diagonal rotation in real coordinates, we get the real $N$-dimensional recurrence

$$\mathbf{h}_t = \alpha_t \mathbf{R}_t \mathbf{h}_{t-1} + \beta_t \mathbf{R}_t \mathbf{B}_{t-1} x_{t-1} + \gamma_t \mathbf{B}_t x_t, \qquad y_t = \mathbf{C}_t^\top \mathbf{h}_t, \tag{25}$$

where

$$\mathbf{R}_t := \mathrm{Block}\{ R(\Delta_t \boldsymbol{\theta}_t[i]) \}_{i=1}^{N/2}, \qquad R(\theta) := \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},$$

and the stacked projections are $\mathbf{B}_t := \begin{pmatrix} \mathbf{B}_t \\ \hat{\mathbf{B}}_t \end{pmatrix}$, $\mathbf{C}_t := \begin{pmatrix} \mathbf{C}_t \\ -\hat{\mathbf{C}}_t \end{pmatrix}$. Note that $\mathbf{R}_t$ is orthogonal, so $\mathbf{R}_t^{-1} = \mathbf{R}_t^\top$, and that $\mathbf{R}_i, \mathbf{R}_j$ commute, i.e., $\mathbf{R}_i \mathbf{R}_j = \mathbf{R}_j \mathbf{R}_i$. We define

$$\tilde{\mathbf{h}}_t := \left( \prod_{s=0}^{t} \mathbf{R}_s^\top \right) \mathbf{h}_t, \qquad \overline{\mathbf{B}}_t := \left( \prod_{s=0}^{t} \mathbf{R}_s^\top \right) \mathbf{B}_t, \qquad \overline{\mathbf{C}}_t := \left( \prod_{s=0}^{t} \mathbf{R}_s^\top \right) \mathbf{C}_t.$$

Left-multiplying equation (25) by $\prod_{s=0}^{t} \mathbf{R}_s^\top$ and using $\mathbf{R}_t^\top \mathbf{R}_t = I$,

$$\tilde{\mathbf{h}}_t = \alpha_t \tilde{\mathbf{h}}_{t-1} + \beta_t \overline{\mathbf{B}}_{t-1} x_{t-1} + \gamma_t \overline{\mathbf{B}}_t x_t, \qquad y_t = \overline{\mathbf{C}}_t^\top \tilde{\mathbf{h}}_t.$$

This is a vanilla scalar-transition SSM with data-dependent rotary embeddings absorbed into $\mathbf{B}, \mathbf{C}$ via cumulative products of $\mathbf{R}_s^\top$. $\square$

C MIMO for Mamba-3

Mamba with MIMO. For a given batch, head, and sequence position $t$, consider the input $\mathbf{U}_t \in \mathbb{R}^D$, and denote by $P, R \in \mathbb{N}$ the head dimension and MIMO rank, respectively. We first obtain the SSM parameters via a set of projections, written in tensor contraction notation as follows:

$$\mathbf{B}_t = \mathrm{contract}(DNR,\, D \to NR)(\mathbf{W}_B, \mathbf{U}_t), \qquad \mathbf{C}_t = \mathrm{contract}(DNR,\, D \to NR)(\mathbf{W}_C, \mathbf{U}_t),$$

$$\mathbf{X}'_t = \mathrm{contract}(PD,\, D \to P)(\mathbf{W}_{X'}, \mathbf{U}_t), \qquad \mathbf{X}_t = \mathrm{contract}(PR,\, P \to PR)(\mathbf{W}_X, \mathbf{X}'_t),$$

where $\mathbf{W}_B, \mathbf{W}_C, \mathbf{W}_{X'}, \mathbf{W}_X$ are model parameters. Additionally, we obtain the residual gate term $\mathbf{Z}_t$ in the same manner as $\mathbf{X}_t$, with weights $\mathbf{W}_{Z'}$ and $\mathbf{W}_Z$.
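The tensor contractions above map directly onto `einsum`. Below is a minimal NumPy sketch with toy dimensions — the weights and the scalar decay $a_t$ are random, hypothetical stand-ins, not trained parameters — showing one step of the projections together with the MIMO state update $\mathbf{H}_t = a_t \mathbf{H}_{t-1} + \mathbf{B}_t \mathbf{X}_t^\top$ and output $\mathbf{Y}_t = \mathbf{H}_t^\top \mathbf{C}_t$ described in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, P, R = 64, 16, 32, 4  # model dim, state size, head dim, MIMO rank (toy values)

# Hypothetical weights with the shapes implied by the contractions above.
W_B = rng.standard_normal((D, N, R))
W_C = rng.standard_normal((D, N, R))
W_Xp = rng.standard_normal((P, D))
W_X = rng.standard_normal((P, R))

u = rng.standard_normal(D)   # input U_t for one (batch, head, position)
H = np.zeros((N, P))         # recurrent MIMO state
a_t = 0.9                    # scalar decay (stand-in for e^{Delta_t A_t})

B = np.einsum("dnr,d->nr", W_B, u)    # contract(DNR, D -> NR)
C = np.einsum("dnr,d->nr", W_C, u)
Xp = np.einsum("pd,d->p", W_Xp, u)    # contract(PD, D -> P)
X = np.einsum("pr,p->pr", W_X, Xp)    # contract(PR, P -> PR): rank-R expansion

H = a_t * H + B @ X.T                 # H_t = a_t H_{t-1} + B_t X_t^T, shape (N, P)
Y = H.T @ C                           # Y_t = H_t^T C_t, shape (P, R)
print(Y.shape)                        # (32, 4)
```

Setting $R = 1$ recovers the SISO case, where $\mathbf{B}_t \mathbf{X}_t^\top$ degenerates to a rank-one outer product per head.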
This parameterization prevents the parameter count from increasing by a factor of $R$. The state update and the SSM output are then computed via the following MIMO SSM:

$$\mathbf{H}_t = a_t \mathbf{H}_{t-1} + \mathbf{B}_t \mathbf{X}_t^\top \in \mathbb{R}^{N \times P}, \qquad \mathbf{Y}_t = \mathbf{H}_t^\top \mathbf{C}_t \in \mathbb{R}^{P \times R}.$$

The intermediate output $\mathbf{Y}'_t$ is obtained via the residual gating function $\phi$, i.e., $\mathbf{Y}'_t \leftarrow \phi(\mathbf{Y}_t, \mathbf{Z}_t)$, where $\phi(\mathbf{Y}_t, \mathbf{Z}_t) := \mathbf{Y}_t \odot \mathrm{SiLU}(\mathbf{Z}_t)$ in our case. Finally, the layer output $\mathbf{O}_t \in \mathbb{R}^D$ is computed via the following down projections:

$$\mathbf{O}'_t = \mathrm{contract}(PR,\, PR \to P)(\mathbf{W}_{O'}, \mathbf{Y}'_t), \qquad \mathbf{O}_t = \mathrm{contract}(PD,\, P \to D)(\mathbf{W}_O, \mathbf{O}'_t).$$

This formulation enhances the existing Mamba-3 architecture with a lightweight parameterization that turns the set of independent SISO SSMs within each head into a set of MIMO SSMs.

MIMO Parameter Matching. The MIMO variant of Mamba-3 incurs additional parameters compared to its SISO counterpart. We therefore reduce the hidden dimension of the MLP layers to parameter-match the SISO variants as follows:

| Model | 180M | 440M | 880M | 1.5B |
| --- | --- | --- | --- | --- |
| SISO MLP dim | 1,500 | 2,048 | 3,072 | 4,096 |
| MIMO MLP dim ($R = 4$) | 1,264 | 1,792 | 2,800 | 3,824 |

D Experimental Details

Language Modeling. Our pretraining procedures follow those of Dao and Gu (2024), Section D.2. All models at each scale follow the same procedure and were trained in bfloat16. The Mamba family of models were trained with the standard expand factor of 2, a state size of 128, and a head dimension of 64. The Transformer baselines follow Dao and Gu (2024), and the GDN baselines follow S. Yang, Kautz, and Hatamizadeh (2025), with $q, k$ dim $= 128$ and $v$ dim $= 256$. We use the Llama-3.1 tokenizer (Grattafiori et al. 2024) for all models.

We use LM Evaluation Harness (Gao et al. 2024) to test the zero-shot language modeling capabilities of our pretrained models on LAMBADA (OpenAI version) (Paperno et al. 2016), HellaSwag (Zellers et al. 2019), PIQA (Bisk et al. 2019), ARC-Easy/ARC-Challenge (Clark et al.
2018), WinoGrande (Sakaguchi et al. 2019), and OpenBookQA (Mihaylov et al. 2018).

Real-World and Synthetic Retrieval. For our real-world retrieval tasks, we evaluate on the common suite consisting of SWDE (S. Arora, Eyuboglu, et al. 2025), SQuAD (Rajpurkar, J. Zhang, and Liang 2018), FDA (S. Arora, Eyuboglu, et al. 2025), TriviaQA (Joshi et al. 2017), NQ (Kwiatkowski et al. 2019), and DROP (Dua et al. 2019). We use the cloze-formatted versions of these tasks provided by S. Arora, Eyuboglu, et al. (2025) and S. Arora, Timalsina, et al. (2024), as the original datasets are in a question-answering format that is challenging for solely pretrained models. All tasks were truncated to match the training context length. The synthetic NIAH tasks (Hsieh et al. 2024) were also run with LM Evaluation Harness.

Table 9: Ablations of optional norm type (grouped vs. default) and placement (pre- vs. post-gate) on pretrained hybrid Mamba-3 SISO models at the 1.5B scale. All models have BCNorm. No additional norm demonstrates the strongest in-context retrieval performance on average, while pre-gate grouped RMS gives the best performance on synthetic retrieval, especially at lengths longer than the training context. LM Avg. and the retrieval tasks are evaluated at the 2048 training context; the NIAH sub-columns are context lengths 1024 / 2048 / 4096.

| Mamba-3 Norm Type | LM Avg. | SWDE | SQD. | FDA | TQA | NQ | Drop | NIAH-Single-1 | NIAH-Single-2 | NIAH-Single-3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No Norm | 56.4 | 58.5 | 47.0 | 65.9 | 64.8 | 33.4 | 27.0 | 100.0 / 100.0 / 36.2 | 100.0 / 100.0 / 9.4 | 99.8 / 100.0 / 8.8 |
| Post-Gate Default RMS | 56.5 | 54.5 | 46.6 | 61.9 | 65.4 | 31.9 | 29.2 | 100.0 / 100.0 / 100.0 | 100.0 / 99.8 / 49.2 | 87.6 / 94.0 / 62.0 |
| Pre-Gate Default RMS | 55.9 | 55.4 | 46.9 | 67.3 | 65.4 | 33.0 | 28.1 | 100.0 / 100.0 / 86.2 | 100.0 / 100.0 / 97.8 | 99.2 / 97.8 / 90.2 |
| Post-Gate Grouped RMS | 56.2 | 51.4 | 46.7 | 56.8 | 64.2 | 30.4 | 27.6 | 100.0 / 100.0 / 79.4 | 100.0 / 100.0 / 65.8 | 93.8 / 97.0 / 9.6 |
| Pre-Gate Grouped RMS | 56.1 | 58.6 | 47.3 | 52.4 | 65.7 | 33.3 | 28.5 | 100.0 / 100.0 / 100.0 | 100.0 / 100.0 / 96.0 | 99.8 / 97.2 / 56.8 |

State-Tracking Synthetics. Training follows a sequence-length curriculum that fixes the minimum length at 3 and progresses the maximum length from 40 to 160. Final models are evaluated at length 256. Each curriculum stage runs for $10^4$ steps with batch size 256. We use one-layer models for Parity and three-layer models for the modular-arithmetic tasks. The state size is chosen to be 64, and we sweep $d_{\text{model}} \in \{32, 64\}$ and 8 learning rates logarithmically spaced between $10^{-4}$ and $10^{-2}$, reporting the best validation accuracy.

E Additional Experimental Results

Figure 4 (context-length extrapolation; train length = 2K): Pretrained 1.5B models' perplexity on the held-out FineWeb-Edu test set at varying context lengths. Mamba-3 exhibits strong length extrapolation while Mamba-2 falters at longer contexts.

Figure 5 (validation perplexity over pretraining): Mamba-3 demonstrates better pretraining performance than strong baselines like Mamba-2 and Gated DeltaNet. These are the validation perplexities on FineWeb-Edu of our fully pretrained 1.5B models.

We also compare the effectiveness of state-size usage of the Mamba variants against a Gated DeltaNet baseline in Figure 6. We highlight the difficulty of directly comparing GDN and Mamba-style models due to their differing head structure (multi-head for GDN versus multi-value for Mamba). Our experiments hold GDN's $v_{\text{expand}}$ at 2 and decrease the head dimension accordingly to vary the relative total state size.
Similar to Figure 3, we train 440M models to $2\times$ Chinchilla tokens ($40\times$ token-to-parameter ratio) and sweep $d_{\text{state}} \in \{32, 64, 128\}$ for the Mamba models and head dimension $\in \{32, 64, 128\}$ for GDN. We parameter-match all models.

Figure 6 (relative total state size vs. pretraining perplexity): Exploration of state size (a proxy for inference speed) versus pretraining perplexity (a proxy for performance). Mamba-3 and Mamba-3 MIMO continue to set the Pareto frontier.

F Architecture Ablations

We explore our model architecture ablations in this section. All models are trained at the 440M scale to the Chinchilla-optimal number of tokens ($20\times$ tokens to parameters) with the same experimental procedures as our pretrained models, as covered in Appendix D, unless otherwise stated.

B, C Bias Parameterization. The Mamba-3 model's separate $B$ and $C$ biases are head-specific and channel-wise, and are added to both $\mathbf{B}$ and $\mathbf{C}$ after the QK-Norm. While the biases in the final Mamba-3 model are trainable, data-independent parameters initialized to all ones, we explore various bias parameterizations in Table 10a. We find that our models are not very sensitive to the initialization of the biases as long as they are positive, and we choose the all-ones initialization for its simplicity. We also explore the impact of removing the $B$ or $C$ bias in Table 10b (each bias, when used, is initialized with our default parameterization). Unlike Yu and Erichson (2025), which finds that the $B$ bias by itself improves performance on Mamba-1, our experiments find that having only the $B$ bias hurts performance slightly and that the $B$ and $C$ biases have synergistic properties.

| Bias Init. | Trainable | ppl ↓ |
| --- | --- | --- |
| 1.0 | ✓ | 15.72 |
| 0.0 | ✓ | 16.57 |
| 1.0 | × | 15.80 |
| U(0, 1) | ✓ | 15.76 |
| U(−1, 1) | ✓ | 16.07 |

(a) Effect of the parameterization of the $B$ and $C$ biases on model performance, measured by pretraining perplexity. Our default all-ones initialization (first row) provides the best performance, but performance is not sensitive to the initialization as long as the biases are positive.

| B Bias | C Bias | ppl ↓ |
| --- | --- | --- |
| × | × | 16.52 |
| ✓ | × | 16.68 |
| × | ✓ | 15.98 |
| ✓ | ✓ | 15.69 |

(b) Applying a bias to both $B$ and $C$ leads to the best performance. Applying only the $B$ bias (the Block-Biased (Yu and Erichson 2025) Mamba-3 variant) does not provide significant gains over the no-bias baseline.

Table 10: Ablations on $B, C$ bias initialization (left) and presence (right) for Mamba-3.

G Inference Kernel Latency Analysis

G.1 Kernel Implementations and Fusion Structure

In Table 6, we detail the DSL (Triton, TileLang, CuTe, PyTorch) and the fusion level of the kernels used in our latency analysis. For Mamba-2 and Gated DeltaNet (GDN), we directly use the publicly released Triton kernels from the respective authors. For Mamba-3, we implement new inference kernels with a comparable fusion structure: the forward SISO path uses a Triton kernel fused with rotary position embeddings, the forward MIMO path uses a TileLang kernel with the same fusion level, and the decode path uses a CuTe kernel fused with gating and the MIMO projection. In Tables 11 and 12, we abbreviate IP = input projection, Conv = 1D convolution, Gate = gating, OP = output projection. Colors indicate the implementation backend (Torch, Triton, TileLang, CuTe).

Table 11: Kernel DSL and fusion structure for forward (prefill) kernels.

| Model (Forward) | Kernel DSL | Fusion Level |
| --- | --- | --- |
| Mamba-2 | Triton | IP, Conv, SSM, Gate, OP |
| Gated DeltaNet | Triton | IP, Conv, Chunked Delta, Gate, OP |
| Mamba-3 (SISO) | Triton | IP, SSM+Rotary+Gate, OP |
| Mamba-3 (MIMO) | TileLang | IP, SSM+Rotary+Gate, OP |

G.2 Extended Prefill and Prefill+Decode Latency Measurements

Models.
We benchmark Mamba-3 1.5B (SISO), Mamba-2 1.5B, Gated DeltaNet 1.5B, and a strong Transformer baseline, Llama-3.2-1B (https://huggingface.co/meta-llama/Llama-3.2-1B), served via the vLLM engine (v0.11.0). All recurrent models are trained at the 1.5B scale with $d_{\text{model}} = 2048$ and 24 layers. For the Mamba variants we set the state size to 128 and the head dimension to 64; for GDN we use a QK head dimension of 128.

Table 12: Kernel DSL and fusion structure for decode kernels.

| Model (Decode) | Kernel DSL | Fusion Level |
| --- | --- | --- |
| Mamba-2 | Triton | IP, Conv, SSM, Gate, OP |
| Gated DeltaNet | Triton | IP, Conv, Recurrent Delta, Gate, OP |
| Mamba-3 (SISO) | CuTe + Triton | IP, Rotary, SSM+Gate, OP |
| Mamba-3 (MIMO) | CuTe + Triton | IP, Rotary, SSM+Gate, OP |

Setting. Sequence lengths are swept over $L \in \{512, 1024, 2048, 4096, 16384\}$ for prefill, with an equal number of tokens decoded. For all sequence lengths, we use a batch size of 128. To report vLLM numbers at sequence length 16384, we measure performance at that sequence length with batch size 16 and scale the result by a factor of 8 to approximate performance at batch size 128, since direct measurement at this setting exceeds GPU memory. This provides a reasonable estimate because batch elements are processed independently across the GPU's SMs, so we expect Transformer performance to scale linearly with batch size. For the recurrent models, when the input and output tensors exceed GPU memory at sequence length 16384, we use a state-passing approach that processes the sequence in two halves while propagating the hidden state between segments, avoiding materializing the entire sequence at once. We use a single H100-SXM 80GB GPU and report wall-clock times (in seconds) over three repetitions.
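Because the recurrence is linear in the state, this state-passing split is exact: scanning the two halves while carrying the final state of the first half reproduces the full-sequence scan. A minimal NumPy sketch of the idea on a toy diagonal SSM (our own illustration, not the CUDA kernels):

```python
import numpy as np

def ssm_scan(x, a, B, C, h0):
    """Toy diagonal SSM scan: h_t = a_t * h_{t-1} + B * x_t, y_t = C^T h_t.
    Returns all outputs plus the final state so a caller can resume the scan."""
    h = h0.copy()
    ys = []
    for t in range(len(x)):
        h = a[t] * h + B * x[t]
        ys.append(C @ h)
    return np.array(ys), h

rng = np.random.default_rng(0)
L, N = 16, 8
x = rng.standard_normal(L)
a = np.full(L, 0.9)          # constant decay for simplicity
B = rng.standard_normal(N)
C = rng.standard_normal(N)

# Full-sequence scan vs. two half-scans with the hidden state passed between them.
y_full, _ = ssm_scan(x, a, B, C, np.zeros(N))
y1, h_mid = ssm_scan(x[:L // 2], a[:L // 2], B, C, np.zeros(N))
y2, _ = ssm_scan(x[L // 2:], a[L // 2:], B, C, h_mid)
assert np.allclose(y_full, np.concatenate([y1, y2]))  # identical up to float error
```

The same argument extends to any number of segments, which is what makes segment-wise processing safe for the recurrent models.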
We observe that (i) Mamba-3 adds minimal forward-pass cost, showing that the exponential-trapezoidal update, complex state tracking, and MIMO parameterization remain lightweight; (ii) decode latency is competitive across the recurrent models; and (iii) recurrent mixers scale more gently with context length than vLLM Llama-3.2-1B, whose latency grows much faster with $L$ due to KV-cache overhead.
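The KV-cache effect can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses illustrative configuration numbers of our own choosing (a Llama-3.2-1B-like attention shape and a Mamba-3-1.5B-like state shape), not measured values: per sequence, the fp16 KV cache grows linearly in $L$, while the recurrent state is constant.

```python
# All configuration numbers below are illustrative assumptions, not measurements.
def kv_cache_bytes(seq_len, n_layers=16, n_kv_heads=8, head_dim=64, dtype_bytes=2):
    """fp16 K and V for one sequence: 2 * layers * kv_heads * head_dim * L * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def ssm_state_bytes(n_layers=24, n_heads=64, d_state=128, head_dim=64, dtype_bytes=2):
    """One d_state x head_dim state matrix per head per layer; independent of L."""
    return n_layers * n_heads * d_state * head_dim * dtype_bytes

for L in (512, 4096, 16384):
    print(f"L={L:6d}  KV cache {kv_cache_bytes(L) / 2**20:6.0f} MiB   "
          f"recurrent state {ssm_state_bytes() / 2**20:4.0f} MiB")
```

Under these assumptions the per-sequence KV cache grows from 16 MiB at $L = 512$ to 512 MiB at $L = 16384$, while the recurrent state stays fixed at 24 MiB — the qualitative gap behind observation (iii).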