Sparse partial least squares for on-line variable selection in multivariate data streams


In this paper we propose a computationally efficient algorithm for on-line variable selection in multivariate regression problems involving high-dimensional data streams. The algorithm recursively extracts all the latent factors of a partial least squares solution and selects the most important variables for each factor. This is achieved by means of only one sparse singular value decomposition, which can be efficiently updated on-line and in an adaptive fashion. Simulation results based on artificial data streams demonstrate that the algorithm is able to select important variables in dynamic settings where the correlation structure among the observed streams is governed by a few hidden components and the importance of each variable changes over time. We also report on an application of our algorithm to a multivariate version of the “enhanced index tracking” problem using financial data streams. The application consists of performing on-line asset allocation with the objective of outperforming two benchmark indices simultaneously.

💡 Research Summary

The paper introduces a novel online algorithm for variable selection in high‑dimensional multivariate data streams, built on a sparse partial least squares (sPLS) framework. Traditional partial least squares (PLS) efficiently extracts latent factors that capture the covariance between predictor and response matrices, but it lacks built‑in sparsity and is not readily adaptable to streaming contexts. The authors overcome these limitations by embedding a single sparse singular value decomposition (SSVD) within a recursive update scheme. At each time step the algorithm updates an exponentially weighted estimate of the predictor‑response covariance, performs an SSVD to obtain the leading singular vectors, and imposes sparsity on the predictor loading vector via an L1 penalty or hard thresholding. The nonzero entries of the resulting sparse loading identify the most informative variables for the current latent factor, while the corresponding score vectors are used to update the regression coefficients through a recursive least‑squares step. Because only the leading rank‑1 component is refreshed, both the computational cost per update and the memory usage are linear in the number of variables, O(p), making the method suitable for real‑time applications.
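
The update described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the forgetting factor `forget`, the penalty `lam`, and the alternating soft‑thresholding scheme (a common heuristic for a rank‑1 sparse SVD) are all assumptions made for illustration, and the response dimension q is assumed to be small.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding: shrink each entry toward zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def update_cross_cov(M, x, y, forget=0.99):
    """Exponentially weighted update of the p x q cross-covariance estimate.

    `forget` is a hypothetical forgetting factor in (0, 1]; values below 1
    down-weight old observations so the estimate can track drift.
    """
    return forget * M + np.outer(x, y)

def sparse_rank1(M, lam, n_iter=50):
    """Rank-1 sparse SVD of M by alternating updates.

    The L1 penalty (soft-thresholding) is applied to the left vector u only,
    so zeros in u mark predictors that are dropped for this latent factor.
    """
    # Warm start from the ordinary leading singular pair.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0].copy(), Vt[0].copy()
    for _ in range(n_iter):
        u = soft_threshold(M @ v, lam)
        norm_u = np.linalg.norm(u)
        if norm_u == 0.0:
            break  # penalty too large: every variable was dropped
        u = u / norm_u
        v_new = M.T @ u
        norm_v = np.linalg.norm(v_new)
        if norm_v == 0.0:
            break
        v = v_new / norm_v
    return u, v
```

In a streaming loop one would call `update_cross_cov` for each new observation pair and then `sparse_rank1` on the refreshed estimate; the indices returned by `np.flatnonzero(u)` are the currently selected variables.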

Theoretical analysis shows that, under standard diminishing‑step‑size conditions, the online SSVD converges to the true dominant singular vectors of the underlying stationary process. The added sparsity constraint does not impair convergence; instead, an adaptive regularisation schedule can be employed to track changes in variable importance over time. The authors also discuss stability with respect to learning‑rate selection and demonstrate that the algorithm’s convergence speed rivals that of classic Oja‑type online SVD methods while offering the additional benefit of variable selection.
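
For reference, the classic Oja rule used as a point of comparison can be sketched as follows. This is the textbook update for the leading eigenvector of a covariance matrix, not the paper's sparse algorithm; the step schedule `c / t` is one assumed example of a diminishing step size, and `oja_step` and `oja_stream` are hypothetical names.

```python
import numpy as np

def oja_step(w, x, step):
    """One Oja update toward the leading eigenvector of E[x x^T]."""
    w = w + step * x * (x @ w)      # move w along the direction x (x . w)
    return w / np.linalg.norm(w)    # renormalise to unit length

def oja_stream(X, c=1.0, seed=0):
    """Run Oja's rule over the rows of X with diminishing steps c / t."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for t, x in enumerate(X, start=1):
        w = oja_step(w, x, c / t)
    return w
```

Each update costs O(p) operations and stores only the current vector, which is the cost profile the summary attributes to the online sparse method as well.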

Empirical evaluation proceeds in two parts. First, synthetic streams are generated from a model with five hidden components, each influencing a distinct subset of variables. The hidden component weights and the active variable sets are allowed to drift over time. The proposed online sPLS accurately detects these drifts, achieving higher F1‑scores for variable selection and lower root‑mean‑square error (RMSE) for prediction than competing baselines such as online PLS, online LASSO, and batch‑mode sPLS. The performance gain is on the order of 10–15% across a range of signal‑to‑noise ratios.
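
The two evaluation metrics named above can be computed as in the following sketch. The function names and the convention of returning 0.0 when no selected variable is truly active are assumptions; the summary does not specify how the paper handles that edge case.

```python
import math

def selection_f1(selected, truth):
    """F1 score of a selected variable set against the truly active set."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)          # correctly selected variables
    if tp == 0:
        return 0.0                      # assumed convention for no overlap
    precision = tp / len(selected)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    sq = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))
```

For example, selecting variables {0, 1, 2, 5} when the active set is {0, 1, 2, 3} gives precision and recall of 0.75 each, hence an F1 score of 0.75.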

Second, the method is applied to a financial “enhanced index tracking” problem. The goal is to construct a portfolio that simultaneously outperforms two benchmark indices (e.g., S&P 500 and MSCI World) while keeping transaction costs low. The predictor matrix consists of real‑time returns of roughly 200 individual assets, and the response matrix contains the returns of the two benchmarks. On each trading day the algorithm selects a sparse set of assets, updates their weights, and rebalances the portfolio. Results show that the online sPLS‑based strategy consistently exceeds the benchmarks by about 0.8 percentage points per year, with fewer trades and lower turnover than a naïve equal‑weight or static‑selection approach. Moreover, the selected asset set adapts to market regime changes (e.g., volatility spikes, sector rotations), providing interpretable insights for portfolio managers.
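
The daily rebalancing step and the turnover measure can be illustrated with a small sketch. The mapping from sparse per‑asset scores to long‑only weights is a hypothetical choice made here for illustration, not the paper's allocation rule; turnover is computed with the common one‑way definition (half the sum of absolute weight changes).

```python
import numpy as np

def to_weights(scores):
    """Map per-asset scores to long-only weights that sum to 1.

    Negative scores are clipped to zero (an assumed long-only constraint).
    """
    s = np.clip(np.asarray(scores, dtype=float), 0.0, None)
    total = s.sum()
    if total == 0.0:
        return np.full(s.shape, 1.0 / s.size)  # fall back to equal weights
    return s / total

def turnover(w_old, w_new):
    """One-way turnover: half the sum of absolute weight changes."""
    diff = np.asarray(w_new, dtype=float) - np.asarray(w_old, dtype=float)
    return 0.5 * np.abs(diff).sum()
```

Because unselected assets receive a weight of exactly zero, a sparse selection that changes slowly keeps the turnover, and therefore the transaction costs, low.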

In summary, the paper delivers a computationally efficient, memory‑light solution for simultaneous latent‑factor extraction and variable selection in streaming multivariate settings. By leveraging a single, recursively updated sparse SVD, it sidesteps the need for repeated full‑matrix decompositions and enables real‑time adaptation to evolving correlation structures. The approach is validated on both synthetic and real‑world financial data, demonstrating superior predictive accuracy and practical utility. Future work is outlined to extend the framework to multiple latent components, incorporate non‑linear kernel mappings, and explore distributed implementations for massive sensor‑network streams.
