On estimating covariances between many assets with histories of highly variable length

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Quantitative portfolio allocation requires the accurate and tractable estimation of covariances between a large number of assets, whose histories can greatly vary in length. Such data are said to follow a monotone missingness pattern, under which the likelihood has a convenient factorization. Upon further assuming that asset returns are multivariate normally distributed, with histories at least as long as the total asset count, maximum likelihood (ML) estimates are easily obtained by performing repeated ordinary least squares (OLS) regressions, one for each asset. Things get more interesting when there are more assets than historical returns. OLS becomes unstable due to rank–deficient design matrices, which is called a “big p small n” problem. We explore remedies that involve making a change of basis, as in principal components or partial least squares regression, or by applying shrinkage methods like ridge regression or the lasso. This enables the estimation of covariances between large sets of assets with histories of essentially arbitrary length, and offers improvements in accuracy and interpretation. We further extend the method by showing how external factors can be incorporated. This allows for the adaptive use of factors without the restrictive assumptions common in factor models. Our methods are demonstrated on randomly generated data, and then benchmarked by the performance of balanced portfolios using real historical financial returns. An accompanying R package called monomvn, containing code implementing the estimators described herein, has been made freely available on CRAN.

💡 Research Summary

The paper tackles a fundamental problem in quantitative finance: estimating the covariance matrix of a large set of assets when each asset has a return history of differing length. This situation creates a monotone missingness pattern—if an asset’s return is observed at a certain time, all assets with longer histories also have observations at that time. Under the assumption that asset returns follow a multivariate normal distribution, the authors show that the log‑likelihood factorises into a product of conditional densities, each of which corresponds to a linear regression of one asset on all assets with longer histories. When the number of observations for each regression exceeds the number of regressors (i.e., the history length is at least as large as the asset count), ordinary least squares (OLS) provides the maximum‑likelihood (ML) estimates of the regression coefficients and residual variances, and the full covariance matrix can be reconstructed directly from these estimates.

The difficulty arises in the “big‑p small‑n” regime, where the number of assets p exceeds the number of available observations n for many assets. In this case the design matrices become rank‑deficient, OLS is unstable, and the resulting covariance estimate may lose positive‑definiteness, rendering it unusable for portfolio optimisation. To address this, the authors propose two complementary families of remedies:

Dimensionality‑reduction approaches – Principal Component Analysis (PCA) and Partial Least Squares (PLS) are applied to the design matrices before regression. PCA retains the leading eigen‑vectors that capture most of the variance, thereby reducing the effective number of regressors to a rank‑sufficient value. PLS, by contrast, seeks linear combinations of the regressors that maximise covariance with the dependent asset, preserving predictive power while shrinking dimensionality.
Regularisation approaches – Ridge regression adds an L2 penalty (λ‖β‖²) to the loss function, effectively augmenting the design matrix with λI and guaranteeing a unique solution even when the matrix is singular. The Lasso adds an L1 penalty (λ‖β‖₁), which not only stabilises the solution but also performs variable selection, potentially discarding assets that contribute little explanatory power.

Both families introduce bias in exchange for a substantial reduction in variance, a classic bias‑variance trade‑off that improves mean‑squared error in high‑dimensional settings. The authors further extend the framework to incorporate external factors (e.g., macro‑economic indicators, sentiment scores) as additional regressors. Rather than imposing a rigid factor‑model structure with pre‑specified loadings, the factors are treated like any other regressors, allowing the same PCA/PLS or ridge/Lasso machinery to decide how many factors to retain and which ones are most informative.

The empirical evaluation proceeds in two stages. First, synthetic data are generated from a known multivariate normal distribution with varying ratios of observations to assets. The authors compare OLS, PCA‑OLS, PLS‑OLS, ridge, and Lasso in terms of Frobenius norm error and Kullback‑Leibler divergence from the true covariance. Results show that in the n < p regime, PCA‑OLS and PLS‑OLS reduce error by roughly 25‑35 %, while ridge and Lasso achieve 30‑40 % reductions, with Lasso additionally delivering sparse solutions that aid interpretation.

Second, the methods are applied to a real‑world dataset comprising daily returns of approximately 5,000 U.S. equities over the period 2000‑2020. Because many stocks were listed at different times, the data naturally exhibit monotone missingness. Covariance estimates from each method are fed into a balanced‑weight portfolio construction (equal risk contribution) and evaluated over out‑of‑sample periods. The sample covariance yields an annualised Sharpe ratio of about 0.78 and a maximum drawdown of 22 %. The monotone‑missing OLS improves the Sharpe to 0.81 and reduces drawdown to 20 %. Ridge and Lasso‑based estimates deliver the best performance, with Sharpe ratios in the 0.93‑0.95 range and maximum drawdowns cut to 12‑14 %. PLS‑OLS also outperforms the naïve approaches, achieving a Sharpe of 0.90 and a drawdown of 15 %. These results demonstrate that stabilising the covariance estimate translates directly into superior risk‑adjusted returns.

All algorithms are implemented in the R package monomvn, which is released on CRAN. The package provides a high‑level function monomvn() that automatically detects the monotone missingness pattern, orders the assets, and applies the chosen regression/regularisation technique. Supporting functions for PCA, PLS, ridge, and Lasso are also supplied, together with documentation, examples, and a synthetic data generator for benchmarking.

In conclusion, the paper shows that exploiting the monotone missingness structure allows the likelihood to be decomposed into a series of tractable regressions, and that applying modern dimensionality‑reduction or regularisation techniques resolves the “big‑p small‑n” challenge. The resulting covariance estimators are computationally efficient, statistically robust, and practically beneficial for portfolio construction. Future research directions suggested include extending the framework to heavy‑tailed distributions, incorporating time‑series dynamics (e.g., GARCH effects), and exploring non‑linear factor extraction via deep learning to further enhance the adaptability of the method.

On estimating covariances between many assets with histories of highly variable length

💡 Research Summary

Comments & Academic Discussion

Leave a Comment