From STLS to Projection-based Dictionary Selection in Sparse Regression for System Identification
In this work, we revisit dictionary-based sparse regression, in particular Sequential Threshold Least Squares (STLS), and propose score-guided library selection to provide practical guidance for data-driven modeling, with emphasis on SINDy-type algorithms. STLS solves the $\ell_0$ sparse least-squares problem by splitting it: the least-squares portion is solved efficiently, while the sparsity term is handled via proximal methods. The resulting coefficient vectors have components that depend on both the projected reconstruction errors, here referred to as scores, and the mutual coherence of the dictionary terms. The first contribution of this work is a theoretical analysis of the scores and of the dictionary-selection strategy they induce, which applies in both the original and the weak SINDy regimes. Second, numerical experiments on ordinary and partial differential equations demonstrate the effectiveness of score-based screening, which improves both accuracy and interpretability in dynamical system identification. These results suggest that score-guided refinement of the dictionary can, in some cases, help SINDy users improve the robustness of data-driven discovery of governing equations.
💡 Research Summary
This paper revisits the core of sparse regression for system identification, focusing on the Sequential Threshold Least Squares (STLS) algorithm that underlies many SINDy implementations. The authors first derive an explicit expression for the initial least‑squares coefficients ξ⁰ = D†y and show that the absolute value of each coefficient can be written as a ratio involving the projected reconstruction error when a single dictionary column is removed. They call this quantity a “score” and demonstrate that it quantifies how much the removal of a particular library term degrades the fit. Small scores indicate that the corresponding term will be eliminated early by STLS, while large scores signal essential contributions.
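The leave-one-column-out idea can be sketched numerically. The snippet below computes the initial least-squares fit ξ⁰ = D†y and, for each dictionary column, the degradation in the projected reconstruction error when that single column is removed; the paper's exact score formula is a ratio built from these quantities and may differ in normalization, so this is an illustrative sketch only, with the function name `loo_scores` being our own.

```python
import numpy as np

def loo_scores(D, y):
    """Leave-one-column-out projection scores (illustrative sketch;
    the paper's exact score formula may differ in normalization)."""
    xi0 = np.linalg.pinv(D) @ y           # initial least-squares fit, xi^0 = D^+ y
    r_full = np.linalg.norm(y - D @ xi0)  # residual with the full dictionary
    scores = np.empty(D.shape[1])
    for i in range(D.shape[1]):
        D_i = np.delete(D, i, axis=1)     # dictionary with column i removed
        xi_i = np.linalg.pinv(D_i) @ y
        r_i = np.linalg.norm(y - D_i @ xi_i)
        scores[i] = r_i - r_full          # fit degradation caused by dropping term i
    return xi0, scores
```

In this toy setting a redundant column (one the data never uses) gets a near-zero score, matching the paper's observation that small scores flag terms STLS will eliminate early.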
A theoretical analysis links the scores to the mutual coherence of the dictionary, revealing that even a term with a modest coefficient can survive if it is highly coherent with other columns, and conversely, a large‑coefficient term may be discarded when its removal causes only a small increase in the projection error. The authors extend the notion of scores to the weak‑SINDy regime, where time derivatives are approximated indirectly, showing that the same score formula applies after embedding the derivative operator into the library matrix.
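Mutual coherence, the quantity the analysis ties the scores to, is the largest absolute normalized inner product between two distinct dictionary columns. A minimal computation:

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute cosine similarity between distinct columns of D."""
    Dn = D / np.linalg.norm(D, axis=0)  # normalize each column
    G = np.abs(Dn.T @ Dn)               # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)            # ignore self-correlations
    return G.max()
```

An orthogonal dictionary has coherence 0, while two nearly parallel columns push it toward 1, which is exactly the regime where a term can survive on coherence rather than coefficient size.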
Based on these insights, two score‑guided library selection strategies are proposed. The exhaustive version (ESR) evaluates every possible sub‑dictionary, computes the minimal score at each sparsity level, and selects the sub‑dictionary with the lowest score. Although ESR guarantees a globally optimal selection, its combinatorial cost limits it to modest library sizes. The greedy bounded version (GBSR) iteratively removes the term with the smallest score, monitors the evolution of the minimal score, and stops when a sharp increase (a “knee” in the score curve) is observed. This provides an automatic, data‑driven way to choose the sparsity level without manual threshold tuning.
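The greedy loop can be sketched as follows. Assumptions are flagged in the comments: the scores are recomputed as leave-one-column-out residual increases (as in the sketch above, not necessarily the paper's exact formula), and the "knee" is detected here by a simple relative-jump rule with hypothetical parameters `jump` and `tol`; the paper's stopping criterion may be more refined.

```python
import numpy as np

def greedy_score_pruning(D, y, jump=5.0, tol=1e-8):
    """Sketch of greedy bounded score-guided pruning (GBSR idea).
    Repeatedly drops the lowest-score term; stops when the minimal
    score jumps sharply (a crude stand-in for the knee criterion)."""
    keep = list(range(D.shape[1]))
    prev_min = None
    while len(keep) > 1:
        base = np.linalg.norm(y - D[:, keep] @ (np.linalg.pinv(D[:, keep]) @ y))
        scores = []
        for i in keep:
            sub = [j for j in keep if j != i]
            r = np.linalg.norm(y - D[:, sub] @ (np.linalg.pinv(D[:, sub]) @ y))
            scores.append(r - base)       # cost of removing term i
        m = min(scores)
        if prev_min is not None and m > jump * max(prev_min, tol):
            break                         # sharp increase: every remaining term is essential
        keep.pop(int(np.argmin(scores)))  # drop the least informative term
        prev_min = m
    return keep
```

On clean data generated from a two-term model, the loop strips the spurious library terms one by one and halts as soon as removing either true term would cause a large residual jump, with no manual threshold λ.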
The authors integrate the score‑based selection with the weak‑SINDy formulation and test the combined approach on several benchmark problems: the Lorenz system, Burgers’ equation, and a two‑dimensional wave equation. In all cases, the score‑guided methods outperform standard STLS and LASSO‑based SINDy in terms of reconstruction error and model parsimony. Notably, when the true dynamics contain small‑magnitude but dynamically critical terms (e.g., near a Hopf bifurcation), the score‑based approach retains those terms while discarding spurious high‑order monomials that would otherwise be kept by a naïve threshold. Noise robustness experiments (Gaussian noise up to 15 % of signal amplitude) show that GBSR maintains stable library selection and low error, whereas conventional STLS becomes sensitive to the choice of λ.
Limitations are acknowledged. In highly over‑complete libraries the scores of many terms become similar, reducing discriminative power. Moreover, computing the projection operators P_D and P_{D∖{i}} (the dictionary with the i‑th column removed) for each candidate term can be computationally intensive for large data sets; the authors suggest dimensionality‑reduction techniques (random projections) and GPU‑accelerated linear algebra as possible mitigations, but further work is needed.
In summary, the paper provides a clear theoretical link between STLS’s thresholding behavior and a projection‑error‑based score, leverages this link to design practical library‑selection algorithms, and demonstrates through extensive numerical experiments that score‑guided dictionary pruning improves accuracy, interpretability, and noise robustness of SINDy models. Future directions include extending the framework to neural‑network‑based libraries, online/real‑time identification, and rigorous statistical guarantees for the greedy selection process.