Recently there has been much interest in data that, in statistical language, may be described as having a large crossed and severely unbalanced random effects structure. Such data sets arise for recommender engines and information retrieval problems. Many large bipartite weighted graphs have this structure too. We would like to assess the stability of algorithms fit to such data. Even for linear statistics, a naive form of bootstrap sampling can be seriously misleading and McCullagh [Bernoulli 6 (2000) 285--301] has shown that no bootstrap method is exact. We show that an alternative bootstrap separately resampling rows and columns of the data matrix satisfies a mean consistency property even in heteroscedastic crossed unbalanced random effects models. This alternative does not require the user to fit a crossed random effects model to the data.
1. Introduction. Many important statistical problems feature two interlocking sets of entities, customarily arranged as rows and columns. Unlike the usual cases-by-variables layout, these data fit better into a cases-by-cases interpretation. Examples include books and customers for a web site, movies and raters for a recommender engine, and terms and documents in information retrieval. Historically, data with this structure have been studied with crossed random effects models. The new data sets are very large and haphazardly structured, a far cry from the setting for which normal theory random effects models were developed. It can be hard to estimate the variance of features fit to data of this kind.
Parametric likelihood and Bayesian methods typically come with their own internally valid methods of estimating variances. However, the crossed random effects setting can be more complicated than our models anticipate. If in IID sampling we suspect that our model is inadequate, then we can make a simple and direct check on it via bootstrap resampling of cases. We can even judge sampling uncertainty for computations that were not derived from any explicit model.
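For concreteness, a minimal sketch of such a check in the IID case follows. It is an illustration under assumptions made here, not code from this paper: the data are held in a NumPy array, cases are resampled with replacement, and an arbitrary statistic is recomputed, so the spread of the replicates estimates the sampling uncertainty even when the statistic comes from no explicit model.

    import numpy as np

    def iid_bootstrap(data, statistic, n_boot=1000, seed=None):
        # Bootstrap replicates of statistic(data) under IID resampling of cases.
        rng = np.random.default_rng(seed)
        n = len(data)
        reps = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)   # draw n cases with replacement
            reps[b] = statistic(data[idx])
        return reps

    # Example (hypothetical data array y): sampling uncertainty of a trimmed mean,
    # a statistic derived from no explicit model.
    # reps = iid_bootstrap(y, lambda d: np.sort(d)[10:-10].mean(), n_boot=2000)
    # reps.std() estimates the standard error; a histogram of reps shows the spread.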
We would like to have a version of the bootstrap suitable for large unbalanced crossed random effects data sets. Unfortunately for those hopes, McCullagh (2000) has proved that no such bootstrap can exist, even for the basic problem of finding the variance of the grand mean of the data in a balanced setting with no missing values and homoscedastic variables. McCullagh (2000) included two reasonably well performing approximate methods for balanced data sets. They yielded a variance that was nearly correct under reasonable assumptions about the problem. One approach was to fit the random effects model and then resample from it. That option is not attractive for the kind of data set considered here. Even an oversimplified model can be hard to fit to unbalanced data, and the results would lack the face value validity that we get from the bootstrap in the IID case. The second method resampled rows and columns independently. This approach imitates the pigeonhole sampling model of Cornfield and Tukey (1956) and is operationally preferable. We call it the pigeonhole bootstrap, and show that it remains a reasonable estimator of variance even for seriously unbalanced data sets and inhomogeneous (nonexchangeable) random effects models.
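The resampling scheme itself is easy to state in code. The sketch below is one natural implementation under assumptions made here, not code from the paper: the observed cells are stored as parallel NumPy arrays of row indices, column indices and values, with rows coded 0..R-1 and columns 0..C-1, and the statistic is the mean of the observed values. Rows and columns are resampled independently with replacement, and each observed cell is reweighted by the product of its row's and its column's resampling counts.

    import numpy as np

    def pigeonhole_bootstrap(rows, cols, x, n_boot=1000, seed=None):
        # rows, cols, x: parallel arrays giving the observed cells (i, j) and values X_ij.
        rng = np.random.default_rng(seed)
        R, C, N = rows.max() + 1, cols.max() + 1, len(x)
        reps = np.empty(n_boot)
        for b in range(n_boot):
            # Tallying a with-replacement sample of R rows (C columns) is equivalent to
            # drawing multinomial counts with equal cell probabilities.
            rcount = rng.multinomial(R, np.full(R, 1.0 / R))
            ccount = rng.multinomial(C, np.full(C, 1.0 / C))
            w = rcount[rows] * ccount[cols]    # weight on each observed cell
            reps[b] = np.sum(w * x) / N        # reweighted version of the sample mean
        return reps

Dividing by the original N, rather than by the resampled total weight, keeps the statistic linear. Comparing the variance of these replicates with that of a naive bootstrap that resamples the N observed cells directly gives the contrast quantified below.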
In notation to be explained further below, we find that the true variance of our statistic takes the form $(\nu_A\sigma_A^2 + \nu_B\sigma_B^2 + \sigma_E^2)/N$, where $\nu_A$ and $\nu_B$ can be calculated from the data and satisfy $1 \ll \nu \ll N$ in our motivating applications. A naive bootstrap (resampling cases) will produce a variance estimate close to $(\sigma_A^2 + \sigma_B^2 + \sigma_E^2)/N$ and thus be seriously misleading. The pigeonhole bootstrap will produce a variance estimate close to $((\nu_A + 2)\sigma_A^2 + (\nu_B + 2)\sigma_B^2 + 3\sigma_E^2)/N$. It is thus mildly conservative, but not unduly so in cases where each $\nu \gg 2$ and $\sigma_E^2$ does not dominate. McCullagh (2000) leaves open the possibility that a linear combination of several bootstrap methods will be suitable. In the present setting the pigeonhole bootstrap overestimates the variance by twice the amount of the naive bootstrap. One could therefore bootstrap both ways and subtract twice the naive variance from the pigeonhole variance. That approach, of course, brings the usual difficulties of possibly negative variance estimates. Also, sometimes we do not want the variance per se, just a histogram that we think has approximately the right width; the variance is then only a convenient way to judge whether the width is roughly right. Simply accepting a bootstrap histogram that is slightly too wide may be preferable to trying to narrow it by an amount based on the naive variance.
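To spell out the overestimation noted above, the excess of the pigeonhole variance over the true variance is
\[
\frac{(\nu_A+2)\sigma_A^2 + (\nu_B+2)\sigma_B^2 + 3\sigma_E^2}{N}
- \frac{\nu_A\sigma_A^2 + \nu_B\sigma_B^2 + \sigma_E^2}{N}
= \frac{2(\sigma_A^2 + \sigma_B^2 + \sigma_E^2)}{N},
\]
which is twice the naive bootstrap variance; this is what makes the subtraction described above exact at this level of approximation.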
Many of the motivating problems come from e-commerce. There one may have to decide where on a web page to place an ad or which book to recommend. Because the data sets are so large, coarse granularity statistics can be estimated with essentially negligible sampling uncertainty. For example, the Netflix data set has over 100 million movie ratings, and the average movie rating is very well determined. Finer, more subtle points, such as whether classical music lovers are more likely to purchase a Harry Potter book on a Tuesday, are a different matter. Some of these may be well determined and some will not be. An e-commerce application can keep track of millions of subtle rules, and the small advantages so obtained can add up to something commercially valuable. Thus, the dividing line between noise artifacts and real signals is worth finding, even in problems with large data sets.
The outline of this paper is as follows. Section 2 introduces the notation for row and column entities and sample sizes, including the critical quantities $\nu_A$ and $\nu_B$, as well as the random effects model we consider and the linear statistics we investigate.