A Sparse and Adaptive Prior for Time-Dependent Model Parameters
We consider the scenario where the parameters of a probabilistic model are expected to vary over time. We construct a novel prior distribution that promotes sparsity and adapts the strength of correlation between parameters at successive timesteps, based on the data. We derive approximate variational inference procedures for learning and prediction with this prior. We test the approach on two tasks: forecasting financial quantities from relevant text, and modeling language contingent on time-varying financial measurements.
💡 Research Summary
The paper addresses the problem of learning probabilistic models whose parameters evolve over time. Standard approaches either ignore temporal information (treating data as i.i.d.) or impose a single global temporal regularizer, both of which have drawbacks: the former discards useful trends, the latter can over‑smooth and requires many hyper‑parameters. The authors propose a novel Bayesian prior that simultaneously encourages sparsity at the feature level and adapts the strength of temporal correlation for each feature individually.
For each base feature i, the coefficients across T discrete time steps are collected into a vector β_i = (β_i^{(1)}, …, β_i^{(T)}). This vector is assumed to be drawn from a zero‑mean multivariate normal distribution with precision matrix Λ_i. Λ_i has a tridiagonal structure: the diagonal entries are a scalar λ_i (controlling overall scale and thus sparsity) and the first off‑diagonal entries are λ_i α_i, where the single scalar α_i controls the correlation between adjacent time steps. Both λ_i and α_i are given hyper‑priors: λ_i follows an improper Jeffreys prior (p(λ) ∝ λ⁻¹) and α_i follows a truncated exponential distribution on (−C, 0] with rate τ, where C ≈ 0.5. This construction guarantees, by strict diagonal dominance, that Λ_i is positive‑definite for α_i ∈ (−0.5, 0.5); restricting α_i to non‑positive values corresponds to non‑negative correlation between adjacent time steps, and each feature learns its own autocorrelation strength.
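A minimal NumPy sketch of this construction, assuming (as the positive‑definiteness condition suggests) that the off‑diagonal entries scale as λ_i·α_i; all names and numeric values are illustrative:

```python
import numpy as np

def make_precision(lam, alpha, T):
    """Tridiagonal precision for one feature's T coefficients.

    Diagonal entries are lam, first off-diagonals are lam * alpha;
    strict diagonal dominance makes the matrix positive definite
    whenever |alpha| < 0.5.
    """
    return lam * (np.eye(T) + alpha * (np.eye(T, k=1) + np.eye(T, k=-1)))

rng = np.random.default_rng(0)
T = 10
Lam = make_precision(lam=2.0, alpha=-0.4, T=T)   # alpha drawn from (-0.5, 0]
assert np.all(np.linalg.eigvalsh(Lam) > 0)       # positive definite

# A negative off-diagonal in the precision induces positive correlation
# between adjacent timesteps in the implied covariance.
beta = rng.multivariate_normal(np.zeros(T), np.linalg.inv(Lam))
```

With α near 0 the coefficients are nearly independent across time; as α approaches −0.5 the adjacent-timestep correlation strengthens.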
The prior’s design yields two major computational benefits. First, the tridiagonal precision matrix permits O(T) operations for each feature, leading to overall time and space complexity O(I(N+T)) where I is the number of features and N the number of observations. This is a substantial improvement over a full Wishart prior, which would incur O(I(N+T²)). Second, because only one α_i per feature is needed, the number of hyper‑parameters does not explode with the number of time steps.
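The O(T) claim can be illustrated with banded linear algebra: a symmetric tridiagonal system solves in linear time, and the log‑determinant follows from a scalar recurrence. The parameter values below are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.linalg import solveh_banded

T, lam, alpha = 1000, 2.0, -0.4

# Banded storage (upper form): row 0 = superdiagonal, row 1 = diagonal.
ab = np.zeros((2, T))
ab[0, 1:] = lam * alpha
ab[1, :] = lam

b = np.ones(T)
x = solveh_banded(ab, b)       # O(T) time and memory, vs O(T^3) dense

# O(T) log-determinant via the LDL^T recurrence for tridiagonal matrices:
# d_1 = lam, d_t = lam - (lam * alpha)^2 / d_{t-1}, logdet = sum(log d_t).
d = np.empty(T)
d[0] = lam
for t in range(1, T):
    d[t] = lam - (lam * alpha) ** 2 / d[t - 1]
logdet = np.log(d).sum()
```

Both quantities are exactly what the dense computation would give, at a fraction of the cost.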
Exact posterior inference is intractable, so the authors employ mean‑field variational inference. They introduce factorized variational distributions: a Gamma distribution for each λ_i (parameters a_i, b_i) and a truncated exponential for each α_i (parameter κ_i). The evidence lower bound (ELBO) can be written analytically in terms of expectations under these variational distributions and the model log‑likelihood L(β). The ELBO contains terms that resemble an ℓ₂ penalty weighted by the expected precision, plus entropy terms for the variational factors. Optimization proceeds by jointly updating β (MAP estimate) and the variational parameters, using standard gradient‑based methods or coordinate ascent. The resulting β retains sparsity (many coefficients are driven to zero across all time steps) while allowing the non‑zero coefficients to vary smoothly according to the learned α_i.
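The alternating updates can be sketched on a toy one‑feature problem. This is an illustrative sketch under stated assumptions, not the paper's exact procedure: the likelihood is Gaussian with unit noise variance, α is held fixed rather than updated, and the Gamma update for λ is the one implied by the Jeffreys hyper‑prior combined with the Gaussian prior on β:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, alpha = 50, 20, -0.4
# Ltil is the tridiagonal "shape" matrix; the full precision is lam * Ltil.
Ltil = np.eye(T) + alpha * (np.eye(T, k=1) + np.eye(T, k=-1))

# Toy data: one feature observed n times per timestep, with a smoothly
# drifting true coefficient.
beta_true = np.sin(np.linspace(0, 3, T))
X = rng.normal(size=(T, n))
Y = beta_true[:, None] * X + 0.1 * rng.normal(size=(T, n))

E_lam = 1.0                                  # initial E_q[lambda]
for _ in range(100):
    # MAP update of beta: ridge-like solve with an l2 penalty weighted
    # by the expected precision E[lambda] * Ltil.
    A = np.diag((X ** 2).sum(axis=1)) + E_lam * Ltil
    beta = np.linalg.solve(A, (X * Y).sum(axis=1))
    # Variational Gamma update for lambda under the Jeffreys hyper-prior:
    # q(lambda) = Gamma(a, b) with a = T/2, b = beta' Ltil beta / 2.
    a, b = T / 2.0, beta @ Ltil @ beta / 2.0
    E_lam = a / b
```

The learned β tracks the drifting true coefficient while E[λ] settles at a data-driven regularization strength; for an irrelevant feature the same update drives E[λ] up and β toward zero, which is the sparsity mechanism the text describes.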
The method is evaluated on two distinct tasks. The first is a financial forecasting problem: predicting the log‑volatility of stock returns from annual 10‑K reports (1996‑2005). Features consist of the 101 most frequent words (binary) plus the previous year’s volatility. The proposed model is compared against ridge, lasso, and a time‑series ridge variant (ridge‑ts). Across all years, the new prior yields lower mean‑squared error, demonstrating that it automatically balances the use of long‑term historical data versus recent observations. The second task involves language modeling conditioned on time‑varying economic variables. Here the prior is used to regularize a dynamic language model, and it achieves state‑of‑the‑art perplexity, showing that the adaptive temporal smoothing improves predictive word probabilities.
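The per‑year evaluation protocol can be sketched as a rolling forecast: train on all past years, test on the next, and average the per‑year MSE. The data below is synthetic and the ridge baseline stands in for the paper's comparisons; names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for yearly batches of documents with numeric
# features and slowly drifting true coefficients.
def make_year(t, n=200, d=20):
    X = rng.normal(size=(n, d))
    w = np.sin(t / 2.0 + np.arange(d))       # coefficients drift over years
    y = X @ w + rng.normal(size=n)
    return X, y

def ridge_fit(X, y, reg=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

# Rolling evaluation: train on all past years, test on the next year.
years = range(1999, 2006)
errors, history_X, history_y = [], [], []
for t in years:
    X, y = make_year(t)
    if history_X:
        w = ridge_fit(np.vstack(history_X), np.concatenate(history_y))
        errors.append(np.mean((X @ w - y) ** 2))
    history_X.append(X)
    history_y.append(y)
mse = float(np.mean(errors))
```

A static model fit this way must trade off long history against recency by hand; the adaptive prior's per-feature α_i is what lets it strike that balance automatically.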
In summary, the contributions are:
- A sparse, adaptive prior that places a tridiagonal precision matrix on each feature’s time‑series coefficients, allowing per‑feature autocorrelation learning.
- An efficient variational inference scheme that yields MAP estimates for β while marginalizing λ and α.
- Empirical validation on financial risk prediction and temporally conditioned language modeling, both showing consistent improvements over strong baselines.
Limitations noted include the reliance on MAP estimates (no full posterior uncertainty for β) and the restriction to first‑order temporal dependencies. Future work could extend the prior to higher‑order correlations, incorporate empirical Bayes for τ, and integrate the prior into deep neural architectures for large‑scale non‑linear problems.