Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning
Mamba, a recently proposed linear-time sequence model, has attracted significant attention for its computational efficiency and strong empirical performance. However, a rigorous theoretical understanding of its underlying mechanisms remains limited. In this work, we provide a theoretical analysis of Mamba’s in-context learning (ICL) capability by focusing on tasks defined by low-dimensional nonlinear target functions. Specifically, we study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$, which depends on only a single relevant direction $\boldsymbol{\beta}$, referred to as the feature. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning, extracting the relevant direction directly from context examples. Consequently, we establish a test-time sample complexity that improves upon that of linear Transformers, which have been analyzed to behave like kernel methods, and is comparable to that of nonlinear Transformers, which have been shown to surpass the Correlational Statistical Query (CSQ) lower bound and achieve a near information-theoretically optimal rate in previous works. Our analysis reveals the crucial role of the nonlinear gating mechanism in Mamba for feature extraction, highlighting it as the fundamental driver behind Mamba’s ability to achieve both computational efficiency and high performance.
💡 Research Summary
The paper provides a rigorous theoretical analysis of the in-context learning (ICL) capabilities of Mamba, a recently introduced linear-time state-space sequence model. The authors focus on a low-dimensional nonlinear target class, namely the single-index model $y \approx g_*(\langle \beta, x \rangle)$, where the unknown feature vector $\beta$ lies in an $r$-dimensional subspace of a high-dimensional ambient space. The data distribution consists of Gaussian inputs, a polynomial link function $g_*$, and small uniform label noise.
To study ICL, the authors construct prompts consisting of $N$ context pairs $(x_i, y_i)$ and a query $(x, y)$. Each input is embedded via a mapping $\phi$ that includes all degree-1 and degree-2 monomials, yielding an embedding dimension $\tilde d = 1 + d + d(d+1)/2$. This rich embedding enables the model to capture the quadratic interactions that are essential for learning the single-index structure.
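As a concrete illustration, such a degree-1/degree-2 embedding can be sketched as follows; the constant term and the ordering of monomials are our reading of the construction, not necessarily the paper's exact definition:

```python
import numpy as np

def phi(x):
    """Embed x in R^d into a constant, all degree-1 monomials, and all
    degree-2 monomials x[i] * x[j] with i <= j, giving dimension
    1 + d + d*(d+1)/2. Illustrative sketch; the paper's exact ordering
    of monomials may differ."""
    d = len(x)
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], x, quad))

# For d = 4 the embedding dimension is 1 + 4 + 10 = 15.
z = phi(np.random.randn(4))
```

The embedding grows quadratically in $d$, which is what allows a linear readout over $\phi(x)$ to express quadratic statistics of the input.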
Mamba’s core is a one-layer recurrent state-space block with parameters $A_\ell, B_\ell, C_\ell$ and a softplus-based gating scalar $\Delta_\ell$. The gating function $G_{j,\ell}(Z) = \sigma(w^\top z_j + b)$, where $\sigma$ is the sigmoid, modulates the contribution of each token at every time step. For analytical tractability the authors simplify the projection matrices to a diagonal form $W_B^\top W_C = \operatorname{diag}(\gamma, 0)$ and fix $w$ and $b$. Under these simplifications the final output fed to the downstream MLP can be written as a gated weighted sum of the context labels, multiplied by the inner product of the query embedding with a learned vector $\gamma$.
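The gated linear-time recurrence can be sketched schematically in a few lines. This is only a caricature of the mechanism, gated accumulation of context tokens followed by a readout against the query, with illustrative names rather than the paper's exact parameterization:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gated_ssm_output(Z, w, b, gamma):
    """Schematic one-layer gated recurrence (illustrative, not the
    paper's exact model).

    Z: (N, d_tilde) array of token embeddings; the last row is the query.
    Each context token is accumulated into the state weighted by its
    gate sigmoid(w . z_j + b); with W_B^T W_C = diag(gamma, 0) the
    readout reduces to an inner product with the query embedding."""
    state = np.zeros_like(Z[0])
    for z in Z[:-1]:                  # linear-time pass over the context
        gate = sigmoid(w @ z + b)     # nonlinear gating scalar
        state = state + gate * z      # gated state update
    query = Z[-1]
    return (gamma * state) @ query    # gated sum, then query readout
```

Without the `gate` factor the loop is a purely linear recurrence, which is the degenerate case the authors contrast against.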
The prediction head is a two-layer ReLU MLP applied to the normalized Mamba output, so the full parameter set is $(\gamma, u, v, a)$. The test-time ICL error is defined as the expected absolute deviation between the model’s prediction and the true label under the ICL data distribution.
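A minimal sketch of such a head, assuming the Mamba output has been reduced to a scalar $s$ and reading $(u, v, a)$ as first-layer weights, first-layer biases, and output weights; this is a plausible parameterization, not necessarily the paper's:

```python
import numpy as np

def mlp_head(s, u, v, a):
    """Two-layer ReLU head on a scalar input s (illustrative reading of
    the parameter set (u, v, a); the paper's exact formula may differ)."""
    hidden = np.maximum(u * s + v, 0.0)  # elementwise ReLU(u*s + v)
    return a @ hidden                    # linear output layer
```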
Training proceeds in two stages, mirroring prior work on feature learning. In Stage I only the gating-related vector $\gamma$ is updated, using a single gradient step from a suitable initialization. This step is shown to align $\gamma$ with the unknown feature $\beta$, effectively recovering the low-dimensional structure from the context. Proposition 4.1 proves that, after this stage, Mamba can estimate $\beta$ with error on the order of $1/\sqrt{N}$ solely from the context examples. In Stage II the Mamba parameters are frozen at $\gamma^\ast$ and the MLP weights $(u, v, a)$ are trained to approximate the link function $g_*$. The authors argue that this two-stage scheme isolates “feature recovery” from “link estimation,” making the overall dynamics analytically tractable.
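The $1/\sqrt{N}$ feature-recovery rate can be illustrated with a plain moment estimator. The paper's gated recurrence realizes a related correlation implicitly, so this is only a sketch of the statistical phenomenon; for simplicity we take a linear link $g_*(t) = t$ (what matters is a nonzero first Hermite coefficient, via Stein's lemma):

```python
import numpy as np

def estimate_feature(X, y):
    """Moment-based sketch of test-time feature learning: average
    y_i * x_i over the N context examples and normalize. For Gaussian
    inputs this concentrates around beta at rate ~ 1/sqrt(N).
    Illustrative only; not the paper's Proposition 4.1 argument."""
    v = (y[:, None] * X).mean(axis=0)
    return v / np.linalg.norm(v)

np.random.seed(1)
d, N = 20, 5000
beta = np.zeros(d); beta[0] = 1.0          # the hidden feature direction
X = np.random.randn(N, d)                  # Gaussian context inputs
y = X @ beta + 0.1 * np.random.randn(N)    # linear link, small noise
beta_hat = estimate_feature(X, y)
err = np.linalg.norm(beta_hat - beta)      # shrinks roughly as 1/sqrt(N)
```

Doubling $N$ shrinks `err` by roughly $\sqrt{2}$, matching the $1/\sqrt{N}$ rate stated above.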
The main theoretical contributions are captured in two results. First, Proposition 4.1 establishes test-time feature learning: the gated recurrent dynamics enable Mamba to extract the relevant direction $\beta$ without any explicit parameter updates at inference time. Second, Theorem 3.3 provides a sample-complexity bound for pretraining and for the number of context examples required at test time. Specifically, to achieve an ICL error $\epsilon$, the number of pretraining tasks and the context length both scale as $\tilde O(r/\epsilon^{2})$. This improves upon the linear-Transformer (kernel-like) bound, which scales with the ambient dimension $d$, and matches the near-optimal rates previously shown for nonlinear Transformers that surpass the Correlational Statistical Query (CSQ) lower bound.
A comparative discussion highlights that while Transformers rely on global quadratic‑time attention to implicitly perform feature learning, Mamba achieves the same effect through linear‑time recurrent updates combined with a nonlinear gating mechanism. The authors emphasize that without the gating, Mamba would reduce to a purely linear recurrent model and lose the ability to learn low‑dimensional features.
Although the paper does not present extensive empirical experiments, it references recent benchmark studies (e.g., Grazzi et al., 2024) showing that Mamba’s empirical ICL performance matches or exceeds that of linear Transformers and is comparable to state‑of‑the‑art nonlinear Transformers, thereby corroborating the theoretical predictions.
In conclusion, the work demonstrates that Mamba’s nonlinear gating is the key driver that enables test‑time feature learning for low‑dimensional target functions, while preserving the model’s linear‑time computational advantages. This bridges the gap between efficiency and adaptability, offering a solid theoretical foundation for the use of state‑space models in large‑scale language modeling and other sequential tasks where in‑context adaptation is essential.