Elastic Spectral State Space Models for Budgeted Inference
Foundation models are typically trained at a fixed computational capacity, yet real-world applications require deployment across platforms with widely varying resource constraints. Current approaches usually rely on training families of model variants or on model distillation, both of which require additional training and support only a pre-selected set of sizes rather than fine-grained adaptation at runtime. In this paper, we propose Elastic Spectral State Space Models (ES-SSM), which are trained only once at full capacity but can be directly truncated to arbitrary scales for budgeted inference at runtime, without retraining. ES-SSM builds on Hankel spectral filtering over a state space model (SSM), coupled with a lightweight input-adaptive gate trained under randomized spectral budgets. Using a shared masked normalization rule over the ordered spectral channels, we encourage predictive capability to concentrate in low-index components, while higher-index components act primarily as refinement. We evaluate our approach on long-sequence benchmarks spanning text, logic, retrieval, vision, and audio, and show that a single ES-SSM model trained once can be truncated to deliver competitive performance against modern Transformer and SSM baselines at similar parameter scales. Moreover, across a wide range of runtime budgets, we observe smooth and stable budget-performance curves.
💡 Research Summary
The paper introduces Elastic Spectral State Space Models (ES‑SSM), a novel approach that enables a single, fully‑trained long‑sequence model to be deployed at arbitrary computational budgets without any additional retraining or distillation. The method builds on the spectral representation of state‑space models (SSMs) derived from a Hankel matrix whose eigenvalues are naturally ordered in decreasing magnitude. By treating each eigenmode as a “spectral channel,” the authors exploit the fact that low‑index channels capture the bulk of the signal energy, while higher‑index channels provide finer‑grained refinements.
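The ordering property can be illustrated numerically. The sketch below uses a Hilbert-type Hankel matrix `H[i, j] = 1/(i + j + 1)` as a stand-in for the paper's (unspecified here) Hankel construction; the specific matrix and the variable names are assumptions, but the qualitative point — eigenvalues of such matrices decay rapidly, so the first few "spectral channels" dominate — carries over.

```python
import numpy as np

# Hypothetical stand-in for the paper's Hankel matrix: a Hilbert-type
# Hankel matrix H[i, j] = 1 / (i + j + 1). Its eigenvalue magnitudes
# decay rapidly, illustrating why low-index spectral channels capture
# the bulk of the signal energy.
n = 32
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
H = 1.0 / (i + j + 1)

# Symmetric eigendecomposition; sort channels by decreasing |eigenvalue|.
eigvals, eigvecs = np.linalg.eigh(H)
order = np.argsort(-np.abs(eigvals))
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total spectral energy captured by the first K channels.
energy = np.cumsum(np.abs(eigvals)) / np.sum(np.abs(eigvals))
# The first 8 of 32 channels already account for nearly all the energy.
```

The eigenvectors associated with the leading eigenvalues would then serve as the fixed spectral filters, with one "channel" per retained eigenmode.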
Two key mechanisms make this ordering useful for runtime elasticity. First, an input‑adaptive gating network computes time‑varying mixture weights αₖ(t) for each active channel. The gate is a lightweight two‑layer MLP that maps the current input representation to logits, which are RMS‑scaled and soft‑maxed over the set of channels permitted by the current budget K. This ensures that, at inference time, only the first K channels are activated and the softmax temperature remains comparable across different budgets. Second, during training the authors employ “budget dropout”: at each update a random budget K_train ≤ K_max is sampled, and the forward‑backward pass is performed using only the first K_train channels. Consequently, the model learns to concentrate essential predictive information in low‑index channels while allowing higher‑index channels to act as optional refinements. All parameters (including the direct term D and the gating network) are shared across budgets, but gradients flow only through the active channels for a given K_train.
The architecture is integrated into a standard pre‑norm residual block, and the overall model can be viewed as a spectral SSM layer followed by the adaptive gate. The authors set K_max = 32 in all experiments, a common choice in recent spectral SSM literature, and evaluate ES‑SSM on a diverse suite of long‑sequence benchmarks: language modeling with very long contexts, logical reasoning tasks, large‑scale retrieval, vision tasks that involve sequential frames, and streaming audio modeling.
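A block of this shape can be sketched as follows. Everything here is illustrative: the shapes, the RMS pre-normalization, the causal convolution with fixed filters, and the per-channel projections `proj` are assumptions standing in for the paper's layer, chosen to show how the gate weights `alpha` and budget `K` enter a pre-norm residual block.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Pre-norm step (RMS normalization over the feature dimension)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def causal_filter(h, phi):
    """Causal convolution of each feature of h (T, d) with filter phi (T,)."""
    T = h.shape[0]
    out = np.empty_like(h)
    for t in range(T):
        out[t] = phi[:t + 1][::-1] @ h[:t + 1]
    return out

def elastic_block(x, filters, proj, D, alpha, K):
    """Hypothetical pre-norm residual block around a spectral SSM layer.

    x:       (T, d) input sequence
    filters: (K_max, T) fixed spectral filters (Hankel eigenvectors)
    proj:    (K_max, d, d) per-channel output projections (assumed form)
    D:       (d,) direct term, shared across budgets
    alpha:   (K_max,) gate weights; alpha[K:] are assumed zero
    K:       runtime budget; only the first K channels are evaluated
    """
    h = rms_norm(x)                        # pre-norm
    y = np.zeros_like(x)
    for k in range(K):                     # only active channels run
        u = causal_filter(h, filters[k])   # (T, d) channel response
        y += alpha[k] * (u @ proj[k])      # gated per-channel contribution
    return x + y + h * D                   # residual + direct term
```

Note that a channel whose gate weight is zero contributes nothing, so evaluating the block at budget `K` or at `K_max` with the trailing weights zeroed gives identical outputs; the budget only controls how much work is skipped.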
Empirical results show that (1) at full capacity ES‑SSM matches or exceeds state‑of‑the‑art Transformers (e.g., Longformer, Performer) and prior spectral SSMs (e.g., S4, S5) of comparable parameter count; (2) when the model is truncated to smaller budgets (K = 24, 16, 8, etc.), performance degrades smoothly rather than catastrophically, revealing clear “sweet‑spot” budget ranges where the truncated model attains near‑optimal accuracy; (3) the budget‑performance curves are monotonic and stable across all domains, confirming that the training‑time budget dropout successfully aligns the model’s internal representation with the truncation rule. Importantly, the same set of learned weights can be deployed on cloud GPUs, edge CPUs, or low‑power accelerators simply by selecting an appropriate K, eliminating the need for multiple model families or separate distillation pipelines.
The paper’s contributions can be summarized as follows:
- A principled method for reliable spectral truncation of SSMs, achieved by coupling a masked softmax gating mechanism with randomized budget dropout during training.
- Demonstration that a single ES‑SSM model can be dynamically scaled at inference time, providing a practical solution for heterogeneous hardware environments.
- Extensive cross‑domain evaluation showing that elasticity does not sacrifice accuracy and that the model’s performance scales gracefully with computational budget.
Limitations noted by the authors include the fixed maximum channel count (K_max = 32), which may need to be increased for extremely large models, and a modest performance drop at very low budgets (K ≤ 4) where the gating network has limited capacity to compensate. Future work could explore larger spectral bases, more sophisticated gating architectures, or hybrid designs that combine ES‑SSM with attention‑based modules.
Overall, Elastic Spectral State Space Models represent a significant step toward “train‑once‑deploy‑anywhere” long‑sequence models, offering both theoretical elegance—through the ordered Hankel spectral basis—and practical utility for real‑world AI systems that must operate under fluctuating resource constraints.