Adaptive Normalization Mamba with Multi Scale Trend Decomposition and Patch MoE Encoding

Reading time: 5 minutes

📝 Abstract

Time-series forecasting in real-world environments faces significant challenges: non-stationarity, multi-scale temporal patterns, and distributional shifts that degrade model stability and accuracy. This study proposes AdaMamba, a unified forecasting architecture that integrates adaptive normalization, multi-scale trend extraction, and contextual sequence modeling to address these challenges. AdaMamba begins with an Adaptive Normalization Block that removes non-stationary components through multi-scale convolutional trend extraction and channel-wise recalibration, enabling consistent detrending and variance stabilization. The normalized sequence is then processed by a Context Encoder that combines patch-wise embeddings, positional encoding, and a Mamba-enhanced Transformer layer with a mixture-of-experts feed-forward module, allowing efficient modeling of both long-range dependencies and local temporal dynamics. A lightweight prediction head generates multi-horizon forecasts, and a denormalization mechanism reconstructs outputs by reintegrating local trends to ensure robustness under varying temporal conditions. AdaMamba provides strong representational capacity with modular extensibility, supporting deterministic prediction and compatibility with probabilistic extensions. Its design effectively mitigates covariate shift and enhances predictive reliability across heterogeneous datasets. Experimental evaluations demonstrate that AdaMamba’s combination of adaptive normalization and expert-augmented contextual modeling yields consistent improvements in stability and accuracy over conventional Transformer-based baselines.


📄 Content

Time-series forecasting is essential in many real-world domains such as energy systems, finance, healthcare, climate science, and traffic management. Modern forecasting models must operate under challenging conditions, including nonstationarity, multi-scale temporal structure, and distribution shifts, all of which frequently arise in practical environments. These characteristics distort temporal dependencies, destabilize model training, and degrade multi-step prediction accuracy, making robust forecasting a persistent research challenge [1,2,3].

To address long-range dependencies, recent neural architectures have shifted from recurrent models toward Transformer-based forecasters. Informer [3], Autoformer [4], FEDformer [2], and Time-Series Transformer variants [5] demonstrate improved sequence modeling through sparse attention, decomposition-based representations, and frequency-domain aggregation. Despite these advances, Transformers remain sensitive to non-stationary trends and global scale shifts, often overfitting short-term patterns while failing to disentangle long- and short-term temporal dynamics [1]. Their reliance on fixed normalization and global attention results in degraded performance when the input exhibits abrupt drift, trend changes, or inconsistent variance.

Parallel to these developments, state space model (SSM)-based approaches have emerged as promising alternatives for efficient long-range modeling. Early linear state space layers (LSSL) [6] achieve strong temporal modeling with reduced computational cost, and recent breakthroughs such as Mamba introduce selective state spaces that scale linearly with sequence length while retaining long-context modeling ability [7]. However, SSM-based models also suffer when exposed to strong non-stationarity or inconsistencies between local and global temporal patterns [8,9]. Without explicit mechanisms for trend removal or variance stabilization, sequence models may misinterpret long-term drift as bias, negatively impacting their predictive stability. These limitations highlight the need for a forecasting model that not only captures long-range dependencies efficiently but is also explicitly designed to handle multi-scale temporal structure, trend inconsistencies, and distributional drift. To this end, this study proposes AdaMamba, a unified forecasting architecture that integrates adaptive normalization, multi-scale trend extraction, and expert-augmented contextual sequence modeling.
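
To make the drift-sensitivity concrete, the sketch below runs a toy diagonal linear SSM recurrence (a stand-in for LSSL-style layers; it is not Mamba's selective mechanism, and all matrices and sizes here are illustrative assumptions) over a signal with and without an added linear trend, showing how the trend accumulates in the long-memory state:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Discrete diagonal linear SSM over a 1-D input:
    x[t] = A * x[t-1] + B * u[t],  y[t] = C @ x[t]."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A * x + B * u_t          # elementwise state update
        ys.append(float(C @ x))      # linear readout
    return np.array(ys)

rng = np.random.default_rng(0)
A = np.full(4, 0.99)                 # near-1 eigenvalues -> long memory
B = np.full(4, 0.1)
C = np.ones(4)
signal = rng.standard_normal(200)
drifted = signal + np.linspace(0, 5, 200)    # same signal plus linear trend
y_clean = ssm_scan(signal, A, B, C)
y_drift = ssm_scan(drifted, A, B, C)
# The accumulated trend dominates the final output difference.
print(abs(y_drift[-1] - y_clean[-1]))
```

Because the state decays slowly (|A| close to 1), the linear trend integrates into the state and swamps the signal-driven response, which is exactly the failure mode the paper's detrending stage is meant to prevent.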

AdaMamba begins with an Adaptive Normalization Block that performs multi-scale convolutional trend extraction followed by channel-wise recalibration, enabling robust detrending and variance stabilization under volatile temporal conditions. This approach aligns with recent findings that decomposition and normalization are crucial for stabilizing deep time-series models [1,2]. The model then embeds the normalized series using patch-wise tokenization and positional encoding before feeding it into a Context Encoder composed of a Mamba-enhanced temporal module and a Mixture-of-Experts (MoE) feed-forward layer. This design allows AdaMamba to jointly capture fine-grained short-term dynamics and broad long-range structure, overcoming the representational shortcomings of pure attention or pure state-space architectures. Finally, a lightweight prediction head produces multi-horizon forecasts, and a de-normalization stage restores the original scale and temporal trend.
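
As a rough illustration of this normalize/de-normalize round trip, the sketch below substitutes fixed moving averages for the learned multi-scale convolutions and uniform mixing weights for the SE recalibration (the window sizes and weighting scheme are our assumptions, not the paper's):

```python
import numpy as np

def moving_avg_trend(x, k):
    """Centered moving average with edge padding: one trend 'scale'."""
    pad = k // 2
    xp = np.pad(x, (pad, k - 1 - pad), mode="edge")
    return np.convolve(xp, np.ones(k) / k, mode="valid")

def adaptive_normalize(x, kernel_sizes=(5, 25), weights=None):
    """Blend trends from several window sizes, then z-normalize the residual."""
    weights = weights or [1.0 / len(kernel_sizes)] * len(kernel_sizes)
    trend = sum(w * moving_avg_trend(x, k)
                for w, k in zip(weights, kernel_sizes))
    resid = x - trend
    mu, sigma = resid.mean(), resid.std() + 1e-8
    return (resid - mu) / sigma, (trend, mu, sigma)

def denormalize(y, stats):
    """Restore scale and reintegrate the extracted local trend."""
    trend, mu, sigma = stats
    return y * sigma + mu + trend

# Seasonal signal riding on a linear trend.
x = np.sin(np.linspace(0, 12, 200)) + np.linspace(0, 3, 200)
z, stats = adaptive_normalize(x)
x_rec = denormalize(z, stats)
print(np.allclose(x, x_rec))  # round trip restores scale and trend
```

In the actual model the forecast head operates on the normalized sequence `z`, and the de-normalization step is applied to the predictions rather than to the input itself.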

Our contributions are summarized as follows:

• SE-enhanced Multi-Scale Trend Normalization. This work proposes a novel normalization mechanism that couples multi-scale convolutional trend extraction with Squeeze-and-Excitation (SE) [10] channel recalibration, providing stable detrending and variance correction under strong non-stationarity.

• Hybrid Patch-Mamba-MoE Temporal Encoder. A new encoder design unifies patch tokenization, selective state-space modeling (Mamba), and Mixture-of-Experts feed-forward layers, enabling efficient modeling of both global temporal structure and fine-grained local dynamics.

• Trend-Consistent Normalization-Denormalization for SSMs. This study identifies a key weakness of state-space forecasters, namely their sensitivity to drift and scale inconsistency, and proposes a trend-aware normalization-denormalization pipeline that stabilizes SSM-based sequence modeling.

• Unified End-to-End Forecasting Pipeline. This paper develops the first integrated forecasting framework that jointly handles adaptive detrending, multi-scale representation learning, contextual encoding, and trend-consistent output reconstruction.
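
The encoder ingredients above can be sketched in miniature: the snippet below slices a series into overlapping patch tokens and routes each token through a top-1 mixture-of-experts feed-forward layer. All shapes, the hard top-1 gate, and the random expert weights are illustrative assumptions; the paper's encoder additionally interleaves the Mamba temporal module, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(x, patch_len, stride):
    """Slice a 1-D series into overlapping patches (one token per patch)."""
    starts = range(0, len(x) - patch_len + 1, stride)
    return np.stack([x[s:s + patch_len] for s in starts])

def moe_ffn(tokens, experts, gate_w):
    """Top-1 MoE feed-forward: a linear gate picks one expert MLP per token."""
    choice = (tokens @ gate_w).argmax(axis=-1)    # hard top-1 routing
    out = np.empty_like(tokens)
    for e, (w1, w2) in enumerate(experts):
        mask = choice == e                        # tokens routed to expert e
        h = np.maximum(tokens[mask] @ w1, 0.0)    # ReLU hidden layer
        out[mask] = h @ w2
    return out

series = rng.standard_normal(96)
tokens = patchify(series, patch_len=16, stride=8)    # 11 tokens of length 16
experts = [(rng.standard_normal((16, 32)) * 0.1,
            rng.standard_normal((32, 16)) * 0.1) for _ in range(4)]
gate_w = rng.standard_normal((16, 4))
out = moe_ffn(tokens, experts, gate_w)
print(tokens.shape, out.shape)
```

Because only one expert runs per token, compute stays close to a single feed-forward layer while total parameter capacity grows with the number of experts, which is the usual motivation for MoE feed-forward blocks.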

2 Related Works

Transformer-based architectures have become a major foundation for long-horizon time-series forecasting. Informer [3] introduces ProbSparse attention to reduce the cost of global self-attention, while Autoformer [4] employs decomposition-driven auto-correlation to model periodicity. FEDformer [2] operates in the frequency domain to improve efficiency, and PatchTST [

This content is AI-processed based on ArXiv data.
