Random projections for Bayesian regression

Leo N. Geppert¹, Katja Ickstadt¹, Alexander Munteanu², Jens Quedenfeld², and Christian Sohler²
¹ Department of Statistics, {geppert, ickstadt}@statistik.uni-dortmund.de
² Department of Computer Science, {alexander.munteanu, jens.quedenfeld, christian.sohler}@tu-dortmund.de
Technische Universität Dortmund, 44221 Dortmund, Germany

November 30, 2015

Abstract

This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire d-dimensional distribution is approximately preserved under random projections by reducing the number of data points from n to k ∈ O(poly(d/ε)) in the case n ≫ d. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a (1 + O(ε))-approximation in terms of the ℓ₂ Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an ε-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over R^d for β. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time.

1 Introduction

In this paper we consider linear regression. Using a linear map Π ∈ R^{k×n} whose choice is still to be defined, we transform the original data set [X, Y] ∈ R^{n×(d+1)} into a sketch, i.e., a substitute data set [ΠX, ΠY] ∈ R^{k×(d+1)} that is considerably smaller. Therefore, the likelihood function can be evaluated faster than on the original data. Moreover, we will show that the likelihood is very similar to the original one. In the context of Bayesian regression we have the likelihood L(β | X, Y) and additional prior information p_pre(β) given in the form of a prior distribution over the parameters β ∈ R^d which we would like to estimate. Our main result is to show that the resulting posterior distribution

p_post(β | X, Y) ∝ f(Y | β, X) · p_pre(β)    (1)

will also be well approximated within a small error. Please note that Bayes' Theorem (1) contains the probability density function f(Y | β, X). From now on, we will concentrate on its interpretation as the likelihood function L(β | X, Y), i.e., as a function of the unknown parameter vector β, as given in (2):

p_post(β | X, Y) ∝ L(β | X, Y) · p_pre(β).    (2)

The main idea of our approach is given in the following scheme:

[X, Y]  --Π-->  [ΠX, ΠY]
   ↓                ↓
p_post(β | X, Y)  ≈_ε  p_post(β | ΠX, ΠY).

More specifically, we can reduce the number of observations from the number of input points n to a target dimension k ∈ O(poly(d/ε)), which, in particular, is independent of n. Thus, the running time of all subsequent calculations does not further depend on n. For instance, a Markov Chain Monte Carlo (MCMC) sampling algorithm may be used to obtain samples from the unknown distribution. Using the reduced data set will speed up the computations considerably.
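As a first illustration of this scheme, the following small R sketch applies a generic random projection (a rescaled Rademacher matrix, one of the constructions discussed in Section 3.4) to a simulated data set and evaluates the quadratic form that drives the Gaussian likelihood on the original and on the reduced data. The sizes n, d and k are illustrative toy choices, not the settings used later in the paper.

    set.seed(1)
    n <- 10000; d <- 10; k <- 500
    X    <- matrix(rnorm(n * d), n, d)
    beta <- rnorm(d)
    Y    <- drop(X %*% beta + rnorm(n, sd = 2))

    Pi <- matrix(sample(c(-1, 1), k * n, replace = TRUE), k, n) / sqrt(k)  # random projection
    PX <- Pi %*% X                      # sketched data [Pi X, Pi Y]: k rows instead of n
    PY <- drop(Pi %*% Y)

    b <- beta + 0.1                     # some candidate parameter value
    # the quadratic form ||Xb - Y||^2 that drives the Gaussian likelihood is nearly preserved
    c(original = sum((X %*% b - Y)^2), sketched = sum((PX %*% b - PY)^2))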
The samples remain sufficiently accurate to resemble the original distribution and also to make statistical predictions that are nearly indistinguishable from the predictions that would have been made based on the original sample.

Note that, mathematically, it is possible to achieve a similar reduction by setting Π = X^T without incurring any error. From the resulting matrix [X^T X, X^T Y] it would be possible to compute or evaluate the exact likelihood, respectively the exact posterior, in some analytically tractable cases, which is the standard textbook approach for classical linear regression and Bayesian linear regression with Gaussian prior and independent normal error (cf. Bishop 2006; Hastie et al. 2009). However, this approach is not numerically stable and leads to ill-conditioned covariance matrices, which is known from the literature (cf. Golub 1965; Lawson and Hanson 1995; Golub and van Loan 2013) and evident from our experiments. For that reason, the exact calculation is not an option in general. Instead, approximation algorithms are an alternative, where the error can be controlled. Using no reduction technique is also not an option, since then even likelihood evaluations depend at least linearly on n. This does not pose a problem for small data sets. For larger n this is also still possible, but employing reduction techniques can already be beneficial to reduce the running time. For data sets that do not fit into the working memory, intelligent solutions are needed to avoid frequent swapping to slower secondary memory. However, for really massive data or infinite data streams, already the much simpler problem of computing the exact ordinary least squares estimator in one pass over the data requires at least Ω(n) space (Clarkson and Woodruff 2009). This makes the task impossible on finite memory machines when n grows large enough.

There are different computational models to deal with massive data sets in streaming and distributed environments. We focus on the streaming model, formally introduced in Muthukrishnan (2005). A data stream algorithm is given an input stream of items, like numerical values, points in R^d or edges of a graph, at a high rate. As the items arrive one by one, it maintains some sort of summary of the data observed so far. This can be a subsample or a linear sketch as described above. The linearity allows for flexible dynamic updates of the sketch, as we will discuss later. At any time, the memory used by the algorithm is restricted to be sublinear, usually at most polylogarithmic in the number of items. For geometric problems the dependence on the dimension d is often restricted to be at most a small polynomial. Also, the algorithm is allowed to make only one single pass over the data. With these restrictions in mind, it is clear that the sketching matrix Π ∈ R^{k×n} cannot be explicitly stored in memory. In fact, it has to fulfill the following criteria to allow for a streaming algorithm:

1. Π[X, Y] approximates [X, Y] well in the above sense.
2. Π can be stored succinctly.
3. We can efficiently generate the entries of Π.

We will see that the structured randomized constructions of Π can provably be achieved using random bits of only limited independence. This means that the entries of Π need not be fully independent. However, if we choose a small number of entries, they behave as if they were independent.
In particular, the entries can be stored implicitly, meeting the independence requirements by using hash functions. These can be evaluated very efficiently and make the memory dependency on n only logarithmic. Although the techniques presented in the following are employed to make the computations possible in a streaming setting, the results are also of interest in the non-streaming setting whenever large data sets have to be reduced to meet time and memory constraints.

2 Background and Related Work

Dimensionality reduction techniques like principal component analysis (PCA) (Jolliffe 2002) and random projections have been widely used in Statistics and Computer Science. However, their focus is usually on reducing the number of variables. Our method aims to reduce the number of observations while keeping the algebraic structure of the data. This leads to a speed-up in the subsequent (frequentist or Bayesian) regression analysis, because the running times of commonly used algorithms heavily depend on n. Basic techniques based on PCA include principal component regression and partial least squares (Hastie et al. 2009). More recent results using PCA stem from the theory of core-sets for the k-means clustering problem and address the problem of computing a small set of data that approximates the original point set with respect to the given objective up to little, say (1 ± ε), error (Feldman et al. 2013). The concept of core-sets is related to the early works of Madigan et al. (2002) and DuMouchel et al. (1999) on data-squashing, which seeks data compression based on the likelihood of the observations. One of the more recent contributions is Quiroz et al. (2015). They suggest inclusion probabilities proportional to the observations' contribution to the likelihood, which is approximated by a Gaussian process or a thin-plate spline approximation. Data-squashing can lead to a considerable reduction in the necessary number of observations; however, it lacks approximation guarantees. These references show that, with the advent of massive data sets, besides the effort in reducing dimensionality there is also a need to reduce the number of observations without losing too much statistical information.

Random projections have been studied in the context of low-rank approximation (Cohen et al. 2015), least squares regression (Sarlós 2006; Clarkson and Woodruff 2009), Gaussian process regression (Banerjee et al. 2013), clustering problems (Boutsidis et al. 2010; Kerber and Raghvendra 2014; Cohen et al. 2015), classification tasks (Paul et al. 2014) and compressed sensing (Candès et al. 2006; Donoho 2006). Random projections are also used, similarly to our work, to approximate a collection of subspaces consisting only of sparse vectors (Baraniuk et al. 2007). Also, Bayesian inference has been proposed for efficient computation in compressed sensing (Ji and Carin 2007). Recently there has been a series of works dealing with the statistical aspects of randomized linear algebra algorithms. In Raskutti and Mahoney (2015) and Ma et al. (2014), the statistical properties of subsampling approaches based on the statistical leverage scores of the data are investigated in detail. Deviating from the worst case algorithmic perspective, it was shown in Ma et al. (2014) that on average the leverage scores behave quite uniformly if the data is generated following a standard linear regression model.
In Yang et al. (2015), several sketching and subsampling methods are used for fast preconditioning before solving the ordinary least squares (OLS) estimators on the subsampled data using state-of-the-art OLS solvers. Moreover, they give parallel and distributed algorithms and extensive empirical evaluations on large scale data for this task. Our work, while still providing worst case guarantees, extends the discussion of statistical properties to the Bayesian setting.

Bayesian regression analysis for large scale data sets has been considered before. Guhaniyogi and Dunson (2014) proposed reducing the number of variables via random projections as a preprocessing step in the large d, small n scenario. They show that under several assumptions the approximation converges to the desired posterior distribution, which is not possible in general, since it was shown in Boutsidis and Magdon-Ismail (2014) that such dimensionality reduction, being oblivious to the target variable, causes additive error in the worst case. For the large n, small d case, Balakrishnan and Madigan (2006) proposed a one-pass algorithm that reads the data block-wise and performs a certain number of MCMC steps. When the next block is read, the algorithm keeps or replaces some of the data points based on weights that keep track of the importance of the data. The selection rule is justified empirically but lacks theoretical guarantees. Theoretical support is only given in the univariate case based on central limit theorems for Sequential Monte Carlo methods. When allowing more passes over the data, TSQR (Demmel et al. 2012) is a QR decomposition which works especially well in the large n, small d case and can easily be parallelized in the MapReduce setting, as studied by Constantine and Gleich (2011) and Benson et al. (2013). TSQR provides a numerically stable decomposition, which can be used as a preprocessing step prior to MCMC inference, which can then be conducted with high accuracy. Depending on the computational setting, this method has a number of limitations, however. It only works when the data is given row by row, and the expensive decomposition has to be carried out a linear number of times, resulting in a total lower bound on the running time of Ω(nd²) (cf. Demmel et al. 2012). The method is restricted to ℓ₂ regression. Our method of random projections is capable of going beyond these limitations. In particular, it can be extended to ℓ_p regression and more generally to robust M-estimators (Clarkson and Woodruff 2015). This flexibility comes at the price of a loss in accuracy; however, this loss is controllable by bounding parameters and does not lead to invalid inference. Another approach, by Bardenet et al. (2014), subsamples the data to approximate the acceptance rule in each iteration of an MCMC sampler. The decision is shown to be similar to the original with high probability in each step. The number of samples is highly dependent on the variance of the logarithm of the likelihood ratios. The method may be useful for interesting and intractable cases when this variance can be bounded.
While frequentist linear regression can be solved straightforwardly by computing the projection of the target variable onto the subspace spanned by the data, Bayesian regression is typically computationally more demanding. In some cases it is possible to calculate the posterior distribution analytically, but in general this is not true. For that reason, an approximation of the posterior distribution is needed. MCMC methods are one possibility and standard in Bayesian analysis. They are reliable, but can take considerable time before they converge and sample from the desired posterior distribution. Moreover, the running time grows with the number of observations in the data set. The main bottleneck of many Bayesian analysis methods, including MCMC, is the repeated evaluation of the likelihood. The running time of each evaluation grows linearly with the number of observations in the data set. There are several approaches that try to reduce the computational effort for Bayesian regression analysis by employing different algorithms that may perform more efficiently in certain settings. Approximate Bayesian Computation (ABC) and Integrated Nested Laplace Approximations (INLA) both fall into this category. The main idea behind ABC is to avoid the exact evaluations by approximating the likelihood function using simulations (Csillery et al. 2010). INLA (Rue et al. 2009; Martins et al. 2013), on the other hand, is an approximation of the posterior distribution that is applicable to models in the class of so-called latent Gaussian models. Both methods can lead to a considerable speed-up compared to standard MCMC algorithms. Note, however, that the speed-up is achieved by changing the algorithm used to conduct the analysis. This is different from our approach, which reduces the number of observations in the data set while approximately retaining its statistical properties. The running times of many algorithms, including MCMC algorithms, highly depend on the number of observations, which means that our proposed method also results in a speed-up of the analysis. In this article we focus on MCMC methods for the analysis, but in principle, as our method provably approximates the posterior, all algorithms that assess the posterior can be employed. We did not use ABC since it is only suitable for summary statistics of very low dimension (d < 10) (Beaumont et al. 2002; Csillery et al. 2010). However, we have tried INLA on a small scale, achieving results comparable to MCMC while making the running time of the analysis independent of n. Likewise, one could consider calculating the exact formulae for analytically tractable cases of the posterior. However, we concentrate on MCMC methods because of their general applicability and reliability. New directions in Bayesian data analysis in the context of Big Data are surveyed in Welling et al. (2014). Our work directly suits the criteria proposed in that reference for the large n case in streaming as well as distributed computation environments.

3 Preliminaries

3.1 General notation

For the sake of a brief presentation, we introduce some notation. We denote by [n] = {1, ..., n} the set of all positive integers up to n. For a probability measure λ, let E_λ[X] be the expected value of X with respect to λ. We skip the subscript in E[X] if the probability measure is clear from the context.
For a matrix M ∈ R^{n×d} we let M = UΣV^T denote its singular value decomposition (SVD), where U ∈ R^{n×d} and V ∈ R^{d×d} are matrices with orthonormal columns spanning the columnspace and rowspace of M, respectively. Σ ∈ R^{d×d} is a diagonal matrix whose elements σ₁ ≥ ... ≥ σ_d are the singular values of M. We denote by σ_max = σ₁ the largest and by σ_min = σ_d the smallest singular value of M, and write σ_i(M) to make clear that the σ_i belong to M. The trace of M^T M equals the sum of the squared singular values, i.e., tr(M^T M) = Σ_{i=1}^d σ_i²(M). We assume w.l.o.g. that all matrices have full rank and stress that all our proofs carry over similarly if the matrices are of lower rank. One might even use knowledge about lower rank to reduce the space and time complexities to bounds that only depend on the rank rather than on the number of variables.

3.2 Bayesian regression

A linear regression model is given by the following equation:

Y = Xβ + ξ.

Y ∈ R^n is a random variable containing the values of the response, where n is the number of observations in the data set. X ∈ R^{n×d} is a matrix containing the values of the d independent variables. We denote by ξ ∼ N(0, ς²I_n) an n-dimensional random vector that models the unobservable error term. The dependent variable Y then follows a Gaussian distribution, Y ∼ N(Xβ, ς²I_n). The corresponding probability density function is

f(y | Xβ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/(2ς²)) ||Xβ − y||₂²),

where Σ = ς²I_n. In a Bayesian setting, β ∈ R^d is the unknown parameter vector, which is assumed to follow an unknown distribution p(β | X, Y) called the posterior distribution. Prior knowledge about β can be modeled using the prior distribution p(β). The posterior distribution is a compromise between the prior distribution and the observed data. In general, the posterior distribution cannot be calculated analytically. In this paper we determine the posterior distribution employing Markov Chain Monte Carlo methods, even though the posterior is explicitly known. Regardless of the computational problems related to these explicit formulae (cf. Section 1), we focus on MCMC because this work forms the basis for further research on more complex models where analytical solutions are not obtainable. Possible extensions are hierarchical models and mixtures of normal distributions (cf. Section 7). Furthermore, we follow this strategy to rule out possible interaction effects between sketching and MCMC that might occur even in these basic cases. However, our empirical evaluation indicates that there are none.

3.3 Norms and metrics

Before going into details about random projections and subspace embeddings, let us first define the matrix and vector norms used in this paper as well as the metric that we are going to use for quantifying the distance between distributions.

Definition 1 (spectral norm). The spectral or operator norm of a matrix A ∈ R^{n×d} is defined as

||A||₂ = sup_{x ∈ R^d \ {0}} ||Ax||₂ / ||x||₂,

where ||y||₂ = (Σ_{i=1}^m y_i²)^{1/2} denotes the Euclidean vector norm for any y ∈ R^m.

A useful fact that follows directly from Definition 1 is that the spectral norm of a matrix M equals its largest singular value, i.e., we have ||M||₂ = σ_max(M) (cf. Horn and Johnson 1990).
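As a quick numerical illustration of these facts about singular values (a toy check in R, not part of the original derivation; the matrix and the sampled directions are arbitrary):

    set.seed(2)
    M <- matrix(rnorm(200 * 5), 200, 5)
    s <- svd(M)$d                               # singular values sigma_1 >= ... >= sigma_d

    # trace of M^T M equals the sum of the squared singular values
    all.equal(sum(diag(t(M) %*% M)), sum(s^2))

    # the spectral norm equals the largest singular value: the ratio ||Mx||/||x||
    # over random directions never exceeds sigma_1
    x <- matrix(rnorm(5 * 1000), 5, 1000)
    max(sqrt(colSums((M %*% x)^2)) / sqrt(colSums(x^2))) <= s[1]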
In order to quantify the distance between probability measures, and in particular between the original posterior and its approximated counterpart, we need some further definitions. To this end, given two probability measures γ, ν over R^d, let Λ(γ, ν) denote the set of all joint probability measures on R^d × R^d with marginals γ and ν, respectively.

Definition 2 (Wasserstein distance, cf. Villani (2009)). Given two probability measures γ, ν on R^d, the ℓ₂ Wasserstein distance between γ and ν is defined as

W₂(γ, ν) = inf_{λ ∈ Λ(γ,ν)} ( ∫_{R^d × R^d} ||x − y||₂² dλ(x, y) )^{1/2} = inf_{λ ∈ Λ(γ,ν)} ( E_λ[ ||x − y||₂² ] )^{1/2}.

From the definition of the Wasserstein distance we can derive a measure of how much points drawn from a given distribution spread from the origin. The Wasserstein weight can be thought of as a norm of a probability measure.

Definition 3 (Wasserstein weight). We define the ℓ₂ Wasserstein weight of a probability measure γ as

W₂(γ) = W₂(γ, δ) = ( ∫_{R^d} ||x||₂² dγ )^{1/2} = ( E_γ[ ||x||₂² ] )^{1/2},

where δ denotes the Dirac delta function.

3.4 Random projections and ε-subspace embeddings

The following definition of so-called ε-subspace embeddings will be central to our work. Such an embedding can be used to reduce the size of a given data matrix while preserving the algebraic structure of its spanned subspace up to (1 ± ε) distortion. Before we summarize several methods to construct a subspace embedding for a given input matrix, we give a formal definition. Here and in the rest of the paper we assume 0 < ε ≤ 1/2.

Definition 4 (ε-subspace embedding). Given a matrix U ∈ R^{n×d} with orthonormal columns, an integer k ≤ n and an approximation parameter 0 < ε ≤ 1/2, an ε-subspace embedding for U is a map Π : R^n → R^k such that

(1 − ε) ||Ux||₂² ≤ ||ΠUx||₂² ≤ (1 + ε) ||Ux||₂²    (3)

holds for all x ∈ R^d, or, equivalently,

||U^T Π^T Π U − I_d||₂ ≤ ε.    (4)

Inequality (3) is mainly used in this paper, but Inequality (4) is more instructive in the sense that it makes clear that the embedded subspace is close to the identity, not involving much scale or rotation. Note that an ε-subspace embedding Π for the columnspace of a matrix M preserves its squared singular values up to (1 ± ε) distortion, which in particular means that it also preserves its rank. We prove this claim for completeness.

Observation 1. Let Π be an ε-subspace embedding for the columnspace of M ∈ R^{n×d}. Then

(1 − ε) σ_i²(M) ≤ σ_i²(ΠM) ≤ (1 + ε) σ_i²(M)   and   (1 − 2ε) σ_i^{−2}(M) ≤ σ_i^{−2}(ΠM) ≤ (1 + 2ε) σ_i^{−2}(M).

Proof. For the first claim, we make use of a min-max representation of the singular values that is known as the Courant-Fischer theorem (cf. Horn and Johnson 1990). In the following derivation we choose x* to be the maximizer of (5) and S* the minimizer of (6).

σ_i²(ΠM) = min_{S ∈ R^{(i−1)×d}} max_{Sx=0, ||x||₂=1} ||ΠMx||₂² ≤ max_{S*x=0, ||x||₂=1} ||ΠMx||₂²    (5)
         = ||ΠMx*||₂² ≤ (1 + ε) ||Mx*||₂² ≤ (1 + ε) max_{S*x=0, ||x||₂=1} ||Mx||₂²
         = (1 + ε) min_{S ∈ R^{(i−1)×d}} max_{Sx=0, ||x||₂=1} ||Mx||₂²    (6)
         = (1 + ε) σ_i²(M).

The lower bound can be derived analogously using the lower bound of (3). Now we use the first claim to prove the second.
To this end, we bound the difference

| 1/σ_i²(M) − 1/σ_i²(ΠM) | = |σ_i²(M) − σ_i²(ΠM)| / (σ_i²(M) σ_i²(ΠM)) ≤ ε σ_i²(M) / ((1 − ε) σ_i⁴(M)) = (ε/(1 − ε)) σ_i^{−2}(M) ≤ 2ε σ_i^{−2}(M).

There are several ways to construct an ε-subspace embedding. One of the more recent methods uses a so-called graph sparsifier, which was initially introduced for the efficient construction of sparse sub-graphs with good expansion properties (Batson et al. 2012). The work of Boutsidis et al. (2013) adapted the technique to work for ordinary least squares regression. While the initial construction was deterministic, they also gave alternative constructions combining the deterministic decision rules with non-uniform random sampling techniques. Another approach is subspace preserving sampling of rows from the data matrix. This technique was introduced by Drineas et al. (2006) for ℓ₂ regression and generalized to more general subspace sampling for the p-norm (Dasgupta et al. 2009). The sampling is done proportionally to the so-called statistical leverage scores. These techniques have recently been analyzed and extended in a statistical setting as opposed to the algorithmic worst case setting (Raskutti and Mahoney 2015; Ma et al. 2014). All the aforementioned methods are in principle applicable whenever it is possible to read the input multiple times. For instance, one needs two passes over the data to perform the subspace sampling procedures, one for preprocessing the input matrix and another for computing the probabilities and for the actual sampling. This way one might reach a stronger reduction or better statistical properties, since their (possibly random) construction depends on the input itself and therefore uses more information. In principle our approximation results are independent of the actual method used to calculate the embedding, as long as the property given in Definition 4 is fulfilled. However, when the number of observations grows really massive or we deal with an infinite stream, the data can only be read once, given time and space constraints. In order to use ε-subspace embeddings in a single-pass streaming algorithm, we consider the approach of so-called oblivious subspace embeddings in this paper. These can be viewed as distributions over appropriately structured k × n matrices from which we can draw a realization Π independent of the input matrix. It is then guaranteed that for any fixed matrix U as in Definition 4 and failure probability 0 < α ≤ 1/2, Π is an ε-subspace embedding with probability at least 1 − α. The results of our work are always conditioned on the event that the map Π is an ε-subspace embedding, omitting further mention of the error probability α. The reader should keep in mind that there is the aforementioned possibility of failure during the phase of sketching the data.

Instead of a subspace embedding, one might consider Π = X^T a suitable choice for a sketching matrix, leading to a sketch of size d × (d + 1) from which the exact likelihood and posterior can be characterized in some analytically tractable cases. However, it is well known that using the resulting matrix X^T[X, Y] may result in numerical instabilities leading to bad conditioning of the covariance matrix X^T X.
This effect can occur in the presence of collinearities or slight deviations from orthogonality, independent of the size of the data, and may lead to highly inaccurate frequentist estimators (cf. Lawson and Hanson 1995). In a Bayesian setting, as we consider in this work, the instabilities result in extremely large variances of the MCMC sample, leading to simulations that do not converge. We will observe this behavior in Section 5.

We therefore consider the following approaches for obtaining oblivious ε-subspace embeddings (a small illustrative implementation of all three constructions follows after Table 1):

1. The Rademacher Matrix (RAD): Π is obtained by choosing each entry independently from {−1, 1} with equal probability. The matrix is then rescaled by 1/√k. This method has been shown by Sarlós (2006) to form an ε-subspace embedding with probability at least 1 − α when choosing essentially k = O(d log(d/α)/ε²). This was later improved to k = O((d + log(1/α))/ε²) in Clarkson and Woodruff (2009), which was recently shown to be optimal by Nelson and Nguyên (2014). While this method yields the best reduction among the different constructions that we consider in the present work, the RAD embedding has the disadvantage that we need Θ(ndk) time to apply it to an n × d matrix when streaming the input in general. If the input is given row by row or at least block by block, our implementation applies a fast matrix multiplication algorithm to each block. We remark that it is provably sufficient that the {−1, 1}-entries in each row of the RAD matrix are basically four-wise independent, i.e., when considering up to four entries of the same row, these behave as if they were fully independent. Such random numbers can be generated using a hashing scheme that generates BCH codes using a seed of size O(log n). This was first noticed by Alon et al. (1999). In our implementation we have used the four-wise independent BCH scheme as described in Rusu and Dobra (2007).

2. The Subsampled Randomized Hadamard Transform (SRHT) (originally from Ailon and Liberty (2009)) is an embedding of the form Π = RH_m D, where D is an m × m diagonal matrix whose entries are independently chosen from {−1, 1} with equal probability. The value of m is assumed to be a power of two. It is convenient to choose the smallest such number that is not smaller than n. H_m is the Hadamard matrix of order m and R is a k × m row sampling matrix. That is, each row of R contains exactly one 1-entry and is 0 everywhere else. The index of the 1-entry is chosen uniformly from [m], i.i.d. for every row. The matrix is then rescaled by 1/√k. Since m is often larger than n, the input data must be padded with 0-entries to compute the product ΠX. Of course, it is not necessary to do that explicitly, since all multiplications by zero can be omitted. The target dimension needed to form an ε-subspace embedding with probability at least 1 − α using this family of matrices was shown by Boutsidis and Gittens (2013) to be k = O((√d + √(log n))² log(d/α)/ε²), which improved upon previous results from Drineas et al. (2011). Using this method, we have a small dependency on n, which is negligible whenever n = O(exp(d)). This is often true in practice when d is reasonably large. Compared to the RAD method, the dependency on the dimension d is worse by essentially a factor of O(log d).
It is known that k = Ω(d log d) is necessary due to the sampling based approach. This was shown by reduction from the coupon collector's problem, i.e., solving one problem can be reduced to solving the other. See Halko et al. (2011) for details. The benefit that we get is that, due to the recursive structure of the Hadamard matrix, the embedding can be applied in O(nd log k) time, which is considerably faster. It was noticed in the original paper (Ailon and Liberty 2009) that the construction is closely related to four-wise independent BCH codes. To our knowledge, there is no explicit proof that it is sufficient to use random bits of little independence. However, we again use the four-wise BCH scheme for the implicit construction of the matrix D, and the linear congruential generator from the standard library of C++ 11 for the uniform subsampling matrix R. We will see in the empirical evaluation that this works well in practice.

3. The Clarkson-Woodruff (CW) sketch (Clarkson and Woodruff 2013) is the most recent construction that we consider in this article. In this case the embedding is obtained as Π = ΦD. The n × n matrix D is constructed in the same way as the diagonal matrix in the SRHT case. Given a random map h : [n] → [k] such that for every i ∈ [n] its image is chosen to be h(i) = t ∈ [k] with probability 1/k, Φ is a binary matrix whose 1-entries are defined by Φ_{h(i),i} = 1. All other entries are 0. This is obviously the fastest embedding, due to its sparse construction. It can be applied to any matrix X ∈ R^{n×d} in O(nnz(X)) = O(nd) time, where nnz(X) denotes the number of non-zero entries in X. This is referred to as input sparsity time and is clearly optimal up to small constants, since this is the time needed to actually read the input from a data stream or external memory, which dominates the sketching phase. However, its disadvantage is that the target dimension is k = Ω(d²) (Nelson and Nguyen 2013b). Roughly speaking, this is necessary due to the need to obliviously and perfectly hash d of the standard basis vectors spanning R^n. Improved upper bounds over the original ones of Clarkson and Woodruff (2013) show that k = O(d²/(ε²α)) is sufficient to draw an ε-subspace embedding from this distribution of matrices with probability at least 1 − α (Nelson and Nguyen 2013a). This reference also shows that it is sufficient to use only four-wise independent random bits to generate the diagonal matrix D. Again, in our implementation we use the four-wise independent BCH scheme from Rusu and Dobra (2007). Moreover, Φ can be constructed using only pairwise independent entries. This can be achieved very efficiently using the fast universal hashing scheme introduced by Dietzfelbinger et al. (1997), which we have employed in our implementation. The space requirement is only O(log n) for a hash function from this class. For a really fast implementation using bit-wise operations, the actual size parameters of the sketch are chosen to be the smallest powers of two that are larger than the required sizes n and k.

sketching method | target dimension        | running time
RAD              | O((d + log(1/α)) / ε²)  | O(ndk)
SRHT             | O(d · log(d/α) / ε²)    | O(nd log k)
CW               | O(d² / (ε²α))           | O(nnz(X)) = O(nd)

Table 1: Comparison of the three considered ε-subspace embeddings; nnz(X) denotes the number of non-zero entries in X, α denotes the failure probability.
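To make the three constructions concrete, the following R sketch builds each embedding explicitly as a k × n matrix. For readability it uses fully independent random entries drawn with R's default generator instead of the limited-independence BCH and universal hashing schemes described above, it materializes the matrices (including the Hadamard matrix) although the actual implementations never do, and the target dimension k is left as a free parameter rather than set to one of the theoretical bounds of Table 1.

    # Dense Rademacher embedding (RAD): i.i.d. +/-1 entries, rescaled by 1/sqrt(k)
    rad_sketch <- function(n, k) {
      matrix(sample(c(-1, 1), k * n, replace = TRUE), k, n) / sqrt(k)
    }

    # Subsampled Randomized Hadamard Transform (SRHT): Pi = R H_m D / sqrt(k), with m the
    # smallest power of two >= n; the input is implicitly zero-padded to length m.
    # The Hadamard matrix is materialized here for illustration only.
    srht_sketch <- function(n, k) {
      m <- 2^ceiling(log2(n))
      H <- matrix(1, 1, 1)
      while (nrow(H) < m)                                   # Sylvester's recursive construction
        H <- rbind(cbind(H, H), cbind(H, -H))
      D <- diag(sample(c(-1, 1), m, replace = TRUE))
      R <- matrix(0, k, m)
      R[cbind(1:k, sample(m, k, replace = TRUE))] <- 1      # uniform row sampling
      (R %*% H %*% D / sqrt(k))[, 1:n]                      # drop the zero-padding columns
    }

    # Clarkson-Woodruff sketch (CW): Pi = Phi D, one random +/-1 entry per column
    cw_sketch <- function(n, k) {
      Pi <- matrix(0, k, n)
      Pi[cbind(sample(k, n, replace = TRUE), 1:n)] <- sample(c(-1, 1), n, replace = TRUE)
      Pi
    }

    # toy check of property (4): ||U^T Pi^T Pi U - I_d||_2 should be small for suitable k
    n <- 4096; d <- 5; k <- 500
    U  <- qr.Q(qr(matrix(rnorm(n * d), n, d)))              # orthonormal basis of a random subspace
    Pi <- cw_sketch(n, k)
    norm(crossprod(Pi %*% U) - diag(d), type = "2")

In practice none of these matrices is formed explicitly: RAD and SRHT are applied block-wise or via the fast transform, and CW is applied in a single pass over the data using its hash representation. By linearity, sketches of parts of the data can simply be summed, ΠX = Σ_i ΠX^(i), which is what enables the streaming and distributed use discussed in Section 3.5.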
Table 1 summarizes the above discussion, in particular the trade-off between time and space complexity of the presented sketching methods. While in general one is interested in the fastest possible application time, memory constraints might make it impossible to apply the CW sketch due to its quadratic dependency on d. Conversely, for a fixed sketching size, CW gives the weakest approximation guarantee (cf. Yang et al. 2015). For really large d, even the O(d log d) factor of SRHT might be too large, so that we have to use the slowest RAD sketching method.

3.5 Extension to the streaming model

The presented reduction techniques are of interest whenever we deal with medium to large sized data and want to reduce time and space requirements. However, when the data grows massive, we need to put more emphasis on the computational requirements. We therefore briefly discuss, and give references for, some of these technical details. For example, while the dimensions of the resulting sketches do not depend on n, this is not true for the embedding matrices Π ∈ R^{k×n}. However, due to the structured constructions that we have surveyed above, the sketching matrices can be stored implicitly by using the different hash functions of limited independence. The hash functions used in our implementations are the four-wise independent BCH scheme used in the seminal work of Alon et al. (1999) and the universal hashing scheme by Dietzfelbinger et al. (1997). These can be evaluated very efficiently using bit-wise operations and can be stored using a seed whose size is only O(log n). Note that even this small dependency on n is only needed in the sketching phase. After the sketch has been computed, the space requirements are independent of n. A survey and evaluation of alternative hashing schemes can be found in Rusu and Dobra (2007). The linearity of the embeddings allows for efficient application in sequential streaming and in distributed environments, see e.g. Clarkson and Woodruff (2009); Woodruff and Zhang (2013); Kannan et al. (2014). The sketches can be updated in the most flexible dynamic setting, which is commonly referred to as the turnstile model (Muthukrishnan 2005). In this model, think of an initial matrix of all zero values. The stream consists of updates of the form (i, j, u), meaning that the entry X_ij is updated to X_ij + u. A single entry can be defined by one single update or by a sequence of not necessarily consecutive updates. For example, a stream S = {..., (i, j, +5), ..., (i, j, −3), ...} results in X_ij = 2. Even deletions are possible in this setting by using negative updates. Clearly this also allows for additive updates of rows or columns, each consisting of consecutive single updates to all the entries in the same row or column. At first sight this model might seem very technical and unnatural. But the usual form of storing data in a table is not appropriate or performant for massive data sets. The data is rather stored as a sequence of (key, value) pairs in arbitrary order. For dealing with such unstructured data, the design of algorithms working in the turnstile model is of high importance.

For distributed computations, note that the embedding matrices can be communicated efficiently to every machine in a computing cluster of l machines.
This is due to the small implicit representation by hash functions. Now, suppose the data is given as X = Σ_{i=1}^l X^(i), where X^(i) is stored on the machine with index i ∈ [l]. Note that by the above data representation in the form of updates, X^(i) can consist of rows, columns or single entries of X. Again, multiple updates to the same entry are possible and may be distributed to different machines. Every machine i ∈ [l] can compute a small sketch of its own share X^(i) of the data and efficiently communicate it to one dedicated central server. A sketch of the entire data set can then be obtained by summing up the single sketches, since ΠX = Σ_{i=1}^l ΠX^(i). More details can be found in Kannan et al. (2014). Recent implementations of similar distributed and parallel approaches for OLS are given by Yang et al. (2015). The above discussion makes clear that our methods suit the criteria that need to be satisfied when dealing with Big Data (cf. Welling et al. 2014): the number of data items that need to be accessed at a time is only a small subset of the whole data set, and in particular independent of n, and the algorithms are amenable to distributed computing environments like MapReduce.

4 Theory

In this section we introduce and develop the theoretical foundations of our approach and combine them with existing results on ordinary least squares regression to bound the Wasserstein distance between the original likelihood function and its counterpart defined only on the considerably smaller sketch. Empirical evaluations supporting and complementing our theoretical results are conducted in the subsequent section.

4.1 Embedding the likelihood

The following observation is standard (cf. Givens and Shortt (1984); Kannan and Vempala (2009)) and will be helpful in bounding the ℓ₂ Wasserstein distance of two Gaussian measures. It allows us to derive such a bound by inspecting their means and their covariances separately.

Observation 2. Let Z₁, Z₂ ∈ R^d be random variables with finite first moments m₁, m₂ < ∞ and let Z_{m₁} = Z₁ − m₁, respectively Z_{m₂} = Z₂ − m₂, be their mean-centered counterparts. Then it holds that

E||Z₁ − Z₂||₂² = ||m₁ − m₂||₂² + E||Z_{m₁} − Z_{m₂}||₂².

Proof.

E||Z₁ − Z₂||₂² = E||Z_{m₁} − Z_{m₂} + m₁ − m₂||₂²
             = E||Z_{m₁} − Z_{m₂}||₂² + ||m₁ − m₂||₂² + 2(m₁ − m₂)^T E[Z_{m₁} − Z_{m₂}]
             = E||Z_{m₁} − Z_{m₂}||₂² + ||m₁ − m₂||₂²,

since E[Z_{m₁} − Z_{m₂}] = 0.

In our first lemma we show that using an ε-subspace embedding Π for the columnspace of [X, Y], we can approximate the least squares regression problem up to a factor of 1 + ε. That is, we can find a solution ν by projecting ΠY onto the columnspace of ΠX such that

||Xν − Y||₂ ≤ (1 + ε) min_{β ∈ R^d} ||Xβ − Y||₂.

Similar proofs can be found in Clarkson et al. (2013) and Boutsidis et al. (2013). We repeat the result here for completeness.

Lemma 5. Given X ∈ R^{n×d}, Y ∈ R^n, let Π be an (ε/3)-subspace embedding for the columnspace of [X, Y]. Now let γ = argmin_{β ∈ R^d} ||Xβ − Y||₂² and similarly define ν = argmin_{β ∈ R^d} ||Π(Xβ − Y)||₂². Then

||Xν − Y||₂² ≤ (1 + ε) ||Xγ − Y||₂².

Proof. Let [X, Y] = UΣV^T denote the SVD of [X, Y]. Now define η₁ = ΣV^T[γ^T, −1]^T and η₂ = ΣV^T[ν^T, −1]^T. Using this notation we can rewrite Uη₁ = Xγ − Y and similarly Uη₂ = Xν − Y.
We have that

(1 − ε/3) ||Uη₂||₂² ≤ ||ΠUη₂||₂² ≤ ||ΠUη₁||₂² ≤ (1 + ε/3) ||Uη₁||₂².

The first and the last inequality are direct applications of the subspace embedding property (3), whereas the middle inequality follows from the optimality of ν in the embedded subspace. Now, by rearranging and resubstituting terms, this yields

||Xν − Y||₂² ≤ ((1 + ε/3)/(1 − ε/3)) ||Xγ − Y||₂² ≤ (1 + ε) ||Xγ − Y||₂².

One can even show that a distortion of order √ε, i.e., an O(√ε)-subspace embedding, is already enough to get the result. This was shown using a more complicated proof that takes the geometry of the least squares solution into account and uses the property that the solution is obtained by an orthogonal projection onto the columnspace spanned by the data matrix (cf. Clarkson and Woodruff 2009). Put the other way around, by using an (ε/3)-subspace embedding as in Lemma 5, we even have

||Xν − Y||₂² ≤ (1 + ε²) ||Xγ − Y||₂².    (7)

In the following, we investigate the distributions proportional to the likelihood functions, p ∝ L(β | X, Y) and p′ ∝ L(β | ΠX, ΠY), and bound their Wasserstein distance. We begin our contribution with a bound on the distance of their means γ and ν, respectively. We generalize previous results for the specific embedding methods to arbitrary ε-subspace embeddings.

Lemma 6. Given X ∈ R^{n×d}, Y ∈ R^n, let Π be an (ε/3)-subspace embedding for the columnspace of [X, Y]. Now let γ = argmin_{β ∈ R^d} ||Xβ − Y||₂² and similarly define ν = argmin_{β ∈ R^d} ||Π(Xβ − Y)||₂². Then

||γ − ν||₂² ≤ (ε²/σ_min²(X)) ||Xγ − Y||₂².

Proof. Let X = UΣV^T denote the SVD of X. Let η = V^T(γ − ν). First note that γ and ν are both contained in the columnspace of V (cf. Sarlós 2006), which means that V^T is a proper rotation with respect to γ − ν. Thus,

||X(γ − ν)||₂² = ||UΣV^T(γ − ν)||₂² = ||ΣV^T(γ − ν)||₂² = Σ_i σ_i² η_i² ≥ Σ_i σ_min² η_i² = σ_min² ||V^T(γ − ν)||₂² = σ_min² ||γ − ν||₂².

Consequently, it remains to bound ||X(γ − ν)||₂². This can be done by using the fact that the minimizer γ is obtained by projecting Y orthogonally onto the columnspace of X. Therefore, we have X^T(Xγ − Y) = 0 (cf. Clarkson and Woodruff 2009). Furthermore, by Equation (7) it holds that ||Xν − Y||₂² ≤ (1 + ε²) ||Xγ − Y||₂². Now, by plugging this into the Pythagorean theorem and rearranging, we get that

||X(γ − ν)||₂² = ||Xν − Y||₂² − ||Xγ − Y||₂² ≤ ε² ||Xγ − Y||₂².

Putting everything together yields the proposition:

||γ − ν||₂² ≤ (1/σ_min²(X)) ||X(γ − ν)||₂² ≤ (ε²/σ_min²(X)) ||Xγ − Y||₂².

Now that we have derived a bound on the distance of the two means, recall that by Observation 2 we can assume w.l.o.g. γ = ν = 0 when we consider the variances. Namely, it remains to derive a bound on inf E||Z_{m₁} − Z_{m₂}||₂², i.e., the least expected squared Euclidean distance of two points drawn from a joint distribution whose marginals are the mean-centered original distribution and its embedded counterpart. Of course we can bound this quantity by explicitly defining a properly chosen joint distribution and bounding the expected squared distance for this particular choice. This is the idea that yields our next lemma.

Lemma 7. Let p ∝ L(β | X, Y) and p′ ∝ L(β | ΠX, ΠY).
Let Z_{m₁}, Z_{m₂} be the mean-centered versions of the random variables Z₁ ∼ p and Z₂ ∼ p′ that are distributed according to p and p′, respectively. Then we have

inf_{λ ∈ Λ(p,p′)} E_λ ||Z_{m₁} − Z_{m₂}||₂² ≤ ε² tr((X^T X)^{−1}).

Proof. Our plan is to design a joint distribution that deterministically maps points from one distribution to the other in such a way that we can bound the distance of every pair of points. This can be done by utilizing the Dirac delta function δ(·), which is a degenerate probability density function that concentrates all probability mass at zero and has zero density otherwise. Given a bijection g : R^d → R^d, we can define such a joint distribution λ ∈ Λ(p, p′) through its conditional distributions λ(x | y) = δ(x − g(y)) for every y ∈ R^d. It therefore remains to define g. According to Observation 1, when applying the embedding Π, the columnspace of a matrix is expanded or contracted by a factor of at most (1 ± ε). We will make use of this fact in the following way. Let X = UΣV^T and ΠX = ŨΣ̃Ṽ^T denote the SVDs of X and ΠX, respectively. Now, to define the x-y-pairs that will be mapped to each other by g, we consider vectors x, x′, y, y′ ∈ R^d where x′ and y′ are contained in the columnspaces of V and Ṽ, respectively. To obtain the bijection g, let the vectors have the following properties for an arbitrary, but fixed, radius c ≥ 0:

1. ||x′||₂ = ||y′||₂ = c
2. x = ΣV^T x′
3. y = Σ̃Ṽ^T y′
4. ∃ τ > 0 : x = τ y.

Observe that by the first property x′ and y′ lie on a d-dimensional sphere with radius c centered at 0. Therefore, there exists a rotation matrix R ∈ R^{d×d} such that y′ = Rx′. Note that such a map is bijective by definition. The second item defines a map of such spheres to ellipsoids (also centered at 0) given by ΣV^T. Recall that x′ was chosen from the columnspace of V. Thus, this map can be seen as a bijection between the d-dimensional vector space and a d-dimensional subspace contained in n-dimensional space. The third property is defined analogously. The fourth property requires that x and y both lie on a ray emanating from 0. Note that any such ray intersects each ellipsoid exactly once. Our bijection can accordingly be defined as

g : R^d → R^d,  x ↦ Σ̃Ṽ^T R V Σ^{−1} x,

by composing the map ΣV^T, defined in the second item, with the rotation R and finally with Σ̃Ṽ^T from the third property. The map is bijective since it is obtained as the composition of bijections. Now, in order to bound the distance ||Z_{m₁} − Z_{m₂}||₂² for any realization of (Z_{m₁}, Z_{m₂}) according to their joint distribution defined above, we can derive a bound on the parameter τ. Substituting the second and third properties into the fourth, we get ΣV^T x′ = τ Σ̃Ṽ^T y′, which can be rearranged to

y′^T y′ τ = (y′^T Ṽ) Σ̃^{−1} Σ (V^T x′) = Σ_i (y′^T Ṽ)_i (V^T x′)_i σ_i/σ̃_i ≤ Σ_i (y′^T Ṽ)_i (V^T x′)_i σ_i/(σ_i √(1 − ε)) ≤ (1 + ε) Σ_i (y′^T Ṽ)_i (V^T x′)_i ≤ (1 + ε) c².

The first inequality follows from σ̃_i ≥ √(1 − ε) σ_i and the second from the assumption ε ≤ 1/2. This eventually means that τ ≤ (1 + ε), since y′^T y′ = c² by the first property. A lower bound of τ ≥ (1 − ε) can be derived analogously using σ̃_i ≤ √(1 + ε) σ_i. Now we can conclude our proof.
It follows that

inf_{λ′ ∈ Λ(p,p′)} E_{λ′} ||Z_{m₁} − Z_{m₂}||₂² ≤ E_λ ||Z_{m₁} − Z_{m₂}||₂² ≤ E_λ ||ε Z_{m₁}||₂² = ε² E_λ ||Z_{m₁}||₂² = ε² tr((X^T X)^{−1}).

The last equality holds since the expected squared norm of the mean-centered random variable is just the trace of its covariance matrix.

Combining the above results, we get the following lemma.

Lemma 8. Let Π be an (ε/3)-subspace embedding for the columnspace of X. Let p ∝ L(β | X, Y) and p′ ∝ L(β | ΠX, ΠY). Then

W₂²(p, p′) ≤ (ε²/σ_min²(X)) ||Xγ − Y||₂² + ε² tr((X^T X)^{−1}).

Proof. The lemma follows from Definition 2, Observation 2, Lemma 6 and Lemma 7.

Under mild assumptions we can argue that this leads to a (1 + O(ε))-approximation of the likelihood with respect to the Wasserstein weight (see Definition 3).

Corollary 9. Let Π be an (ε/3)-subspace embedding for the columnspace of X. Let p ∝ L(β | X, Y) and similarly let p′ ∝ L(β | ΠX, ΠY). Let κ(X) = σ_max(X)/σ_min(X) be the condition number of X. Assume that for some ρ ∈ (0, 1] we have ||Xγ||₂ ≥ ρ ||Y||₂. Then

W₂(p′) ≤ (1 + (κ(X)/ρ) ε) W₂(p).

Proof. By definition, the squared ℓ₂ Wasserstein weight of p equals its second moment. Since p is a Gaussian measure with mean γ and covariance matrix (X^T X)^{−1}, we thus have W₂²(p) = ||γ||₂² + tr((X^T X)^{−1}) and similarly W₂²(p′) = ||ν||₂² + tr((X^T Π^T Π X)^{−1}). Since Π is an (ε/3)-subspace embedding for the columnspace of X, we know from Observation 1 that all the squared singular values of X are approximated up to less than (1 ± ε) error, and so are their inverses. Therefore, we have that

tr((X^T Π^T Π X)^{−1}) ≤ (1 + ε) tr((X^T X)^{−1}).    (8)

It remains to bound ||ν||₂². To this end we use the assumption that for some ρ ∈ (0, 1] we have ||Xγ||₂ ≥ ρ ||Y||₂. By X^T(Xγ − Y) = 0 and applying the Pythagorean theorem, this means that

||Xγ − Y||₂² = ||Y||₂² − ||Xγ||₂² ≤ ||Xγ||₂² (1/ρ² − 1) ≤ ||Xγ||₂²/ρ².

Now we can apply the triangle inequality, Lemma 6, the above inequality and Definition 1 to get

||ν||₂ ≤ ||γ||₂ + ||ν − γ||₂ ≤ ||γ||₂ + (ε/σ_min(X)) ||Xγ − Y||₂ ≤ ||γ||₂ + (ε/(ρ σ_min(X))) ||Xγ||₂ ≤ ||γ||₂ + (ε/(ρ σ_min(X))) ||X||₂ ||γ||₂ = ||γ||₂ + (ε/ρ) κ(X) ||γ||₂ = (1 + (κ(X)/ρ) ε) ||γ||₂.

Combining this with Inequality (8), the claim follows since κ(X)/ρ ≥ 1 and therefore (1 + ε) ≤ (1 + (κ(X)/ρ) ε)², and finally taking square roots on both sides.

We stress that the assumption that there exists some constant ρ ∈ (0, 1] such that ||Xγ||₂ ≥ ρ ||Y||₂ is very natural and mild in the setting of linear regression, since it means that at least a constant fraction of the dependent variable Y can be explained within the columnspace of the data X (cf. Drineas et al. 2006). If this is not true, then a linear model is not appropriate for the given data at all.

4.2 Bayesian regression

So far we have shown that using subspace embeddings to compress a given data set for regression yields a good approximation to the likelihood. Note that in a Bayesian regression setting, Lemma 8 already implies a similar approximation error for the posterior distribution if the priors for β are chosen to be uniform distributions over R. This is an improper, non-informative choice, p_pre(β) = 1_{R^d}. From this, it follows that

p_post(β | X, Y) ∝ L(β | X, Y) · 1_{R^d} = L(β | X, Y).

The remaining term is just the Gaussian likelihood, which is proper. For regression models, especially on data sets with large n, this covers a considerable portion of the cases of interest (cf. Gelman et al. 2014).
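Under this flat prior the posterior is fully determined by the least squares estimate and the matrix (X^T X)^{-1}, so the quality of the sketched posterior can be inspected directly. The following toy R check compares the two quantities appearing in Lemma 8; the explicit Rademacher embedding, the sizes and the fixed error scale are illustrative assumptions made only for this example.

    set.seed(3)
    n <- 10000; d <- 10; k <- 500
    X  <- matrix(rnorm(n * d), n, d)
    Y  <- drop(X %*% rnorm(d) + rnorm(n))
    Pi <- matrix(sample(c(-1, 1), k * n, replace = TRUE), k, n) / sqrt(k)
    PX <- Pi %*% X
    PY <- drop(Pi %*% Y)

    gamma <- coef(lm.fit(X,  Y))    # posterior mean under the flat prior (OLS estimate)
    nu    <- coef(lm.fit(PX, PY))   # posterior mean based on the sketched data
    max(abs(gamma - nu))            # small, in line with Lemma 6

    # traces of the posterior covariance factors (X^T X)^{-1} and (X^T Pi^T Pi X)^{-1}
    c(sum(diag(solve(crossprod(X)))), sum(diag(solve(crossprod(PX)))))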
We will extend this to arbitrary Gaussian priors p_pre(β), leading to our main result: an approximation guarantee for Gaussian Bayesian regression in its most general form. To this end, let m be the mean of the prior distribution and let S be derived from its covariance matrix by Σ = ς²(S^T S)^{−1}. Now, the posterior distribution is given by

p_post(β | X, Y) ∝ L(β | X, Y) · p_pre(β)
               = (2πς²)^{−n/2} exp(−(1/(2ς²)) ||Xβ − Y||₂²) · (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/(2ς²)) ||S(β − m)||₂²).

Thus, we know that, up to some constants that are independent of β, the exponent of the posterior can be described by

||Xβ − Y||₂² + ||S(β − m)||₂²,    (9)

which contains all the information needed to define the mean and covariance structure of the posterior distribution. Now let

Z = [X; S]   and   z = [Y; Sm],

i.e., Z ∈ R^{(n+d)×d} is obtained by stacking S below X, and z ∈ R^{n+d} by stacking Sm below Y. With these definitions we can rewrite Equation (9) above as ||Zβ − z||₂². This, in turn, can be treated as a (frequentist) regression problem to which we can apply Lemma 8. We just have to use a subspace embedding for the columnspace of [Z, z] instead of only embedding [X, Y]. We will see that it is not necessary to do this explicitly. More precisely, embedding only the data matrix is sufficient to have a subspace embedding for the entire columnspace defined by the data and the prior information and, therefore, to have a proper approximation of the posterior distribution. This is formalized in the following lemma.

Lemma 10. Let M = [M₁^T, M₂^T]^T ∈ R^{(n₁+n₂)×d} be an arbitrary matrix. Suppose Π is an ε-subspace embedding for the columnspace of M₁. Let I_{n₂} ∈ R^{n₂×n₂} be the identity matrix. Then

P = [Π 0; 0 I_{n₂}] ∈ R^{(k+n₂)×(n₁+n₂)}

is an ε-subspace embedding for the columnspace of M.

Proof. Fix an arbitrary x ∈ R^d. We have

| ||PMx||₂² − ||Mx||₂² | = | ||ΠM₁x||₂² + ||M₂x||₂² − ||M₁x||₂² − ||M₂x||₂² | = | ||ΠM₁x||₂² − ||M₁x||₂² | ≤ ε ||M₁x||₂² ≤ ε (||M₁x||₂² + ||M₂x||₂²) = ε ||Mx||₂²,

which concludes the proof by the singular value decomposition M = UΣV^T and the surjectivity of the linear map ΣV^T.

This lemma finally enables us to prove our main theoretical result.

Theorem 11. Let Π be an (ε/3)-subspace embedding for the columnspace of X. Let p_pre(β) be an arbitrary Gaussian distribution with mean m and covariance matrix Σ = ς²(S^T S)^{−1}. Let Z = [X; S] and z = [Y; Sm]. Let μ = argmin_{β ∈ R^d} ||Zβ − z||₂ be the posterior mean. Let p ∝ L(β | X, Y) · p_pre(β) and p′ ∝ L(β | ΠX, ΠY) · p_pre(β). Then

W₂²(p, p′) ≤ (ε²/σ_min²(Z)) ||Zμ − z||₂² + ε² tr((Z^T Z)^{−1}).

Proof. From our previous reasoning we know that approximating the posterior distribution can be reduced to approximating a likelihood function that is defined in terms of the data as well as the parameters of the prior distribution. This has been shown by rewriting Equation (9) above as ||Zβ − z||₂². For that reason, we can apply Lemma 8 to get the desired result if we are given an (ε/3)-subspace embedding for the columnspace of Z.
Using Lemma 10, we know that for this it is sufficient to use an (ε/3)-subspace embedding for the columnspace of [X, Y], independent of the covariance and mean that define the prior distribution.

Similar to Corollary 9, we have the following result concerning the posterior distribution.

Corollary 12. Let Π be an (ε/3)-subspace embedding for the columnspace of X. Let p_pre(β) be an arbitrary Gaussian distribution with mean m and covariance matrix Σ = ς²(S^T S)^{−1}. Let Z = [X; S] and z = [Y; Sm]. Let μ = argmin_{β ∈ R^d} ||Zβ − z||₂ be the posterior mean. Let p ∝ L(β | X, Y) · p_pre(β) and p′ ∝ L(β | ΠX, ΠY) · p_pre(β). Let κ(Z) be the condition number of Z. Assume that for some ρ ∈ (0, 1] we have ||Zμ||₂ ≥ ρ ||z||₂. Then we have

W₂(p′) ≤ (1 + (κ(Z)/ρ) ε) W₂(p).

Both Theorem 11 and Corollary 12 show that the sketch preserves the expected value and the covariance structure of the posterior distribution very well. Note that for normal distributions these parameters fully characterize the distribution, as they are sufficient statistics. Therefore, one can regard the corresponding parameters based on the sketched data set as very accurate approximate sufficient statistics for the posterior distribution.
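To illustrate the construction behind Theorem 11 and Corollary 12, the following R sketch forms the augmented system Z = [X; S], z = [Y; Sm] for a Gaussian prior, sketches only the data block as justified by Lemma 10, and compares the resulting posterior means. The prior parameters, the sizes and the dense Rademacher embedding are illustrative assumptions, not settings taken from the paper.

    set.seed(4)
    n <- 10000; d <- 10; k <- 500; sigma <- 1
    X <- matrix(rnorm(n * d), n, d)
    Y <- drop(X %*% rnorm(d) + rnorm(n, sd = sigma))

    # Gaussian prior N(m, Sigma_pre) with Sigma_pre = sigma^2 (S^T S)^{-1}
    m         <- rep(0, d)
    Sigma_pre <- diag(10, d)
    S         <- chol(sigma^2 * solve(Sigma_pre))    # so that t(S) %*% S = sigma^2 Sigma_pre^{-1}

    Z  <- rbind(X, S);  z  <- c(Y, S %*% m)                       # augmented system [X; S], [Y; Sm]
    Pi <- matrix(sample(c(-1, 1), k * n, replace = TRUE), k, n) / sqrt(k)
    Zs <- rbind(Pi %*% X, S);  zs <- c(drop(Pi %*% Y), S %*% m)   # sketch only the data block (Lemma 10)

    mu        <- coef(lm.fit(Z,  z))    # exact posterior mean
    mu_sketch <- coef(lm.fit(Zs, zs))   # posterior mean based on the sketched data
    max(abs(mu - mu_sketch))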
The sketching methods presented here can handle larger values of d and arbitrary values of n, but employing MCMC on the original data set then becomes infeasible. The standard deviation of the error term is important, because the goodness of the approximation also depends on the goodness of the model (cf. Lemma 8 and Theorem 11). Here, we choose ς ∈ {1, 2, 5, 10}, thus ranging from very well-fitting models to models with quite high error variance.

The generated true values of β follow a zero-inflated Poisson distribution, where the expected value of the Poisson distribution is 3 and the probability of a component exhibiting an excess zero is equal to 0.5. This means that the components of β have no influence, i.e. are 0, with probability 0.5 and follow a Poi(3) distribution with probability 0.5. All components that follow a Poi(3) distribution are multiplied by (−1) with probability 0.5. The data set X is obtained in two steps. At first, a d-dimensional vector that represents the column means is drawn randomly from a N(0, 25) distribution. In a second step, the actual values of X are drawn from a normal distribution with the column mean as expected value and a variance of four. The variance in the columns of X is thus lower than the error variance for two of our choices of ς and equal to or higher than it for the other two. Y is then generated by multiplying X with β and adding the error term, in accordance with the model.

5.2 Regression model

We employ a standard Bayesian linear regression model (cf. Section 3.2)

$$ Y \sim N(X\beta, \varsigma^2 I_n) $$

with independent uniform priors over ℝ for all components of β, which are improper, non-informative prior distributions. For ς, the uniform prior is limited to the positive part of ℝ. We choose an improper uniform prior rather than an inverse gamma prior with small values for the hyperparameters, as Gelman (2006) indicates that such priors can have a skewing effect on the posterior distribution. When using improper prior distributions, it is necessary to ensure that the posterior distributions are proper. For our choice of uniform distributions, this does not pose a problem, since the uniform priors are represented by indicators, which are constant over ℝ, p_pre(β) = 1_{ℝ^d}, and for that reason do not influence the likelihood. Thus, it follows that

$$ p_{\mathrm{post}}(\beta \mid X, Y) \propto L(\beta \mid X, Y) \cdot 1_{\mathbb{R}^d} = L(\beta \mid X, Y). $$

The remaining term is just the Gaussian likelihood, which is proper with respect to both β and ς (cf. Gelman et al. 2014). Although closed-form expressions are known for this model, we employ MCMC for the reasons motivated in Sections 2 and 3.2. Our theoretical guarantees comprise the posterior distributions of β. Our model is thus more general than the theoretical results. As an alternative, ς can be fixed to an estimated value obtained using ‖ΠXβ − ΠY‖₂/√n. The results we present in the following are all based on β.

Figure 1: Running times for simulated data sets with varying number of observations and d = 50 variables (number of observations versus run-time in minutes).
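For concreteness, the following minimal sketch reproduces the data-generating process of Section 5.1 on a small scale. The object names and the reduced dimensions are illustrative assumptions; the actual simulations use the values of n, d and ς stated above.

```r
set.seed(1)
n <- 1000; d <- 50; sigma <- 5

# zero-inflated Poisson coefficients: 0 with probability 0.5,
# otherwise a Poi(3) draw with a random sign
beta <- rpois(d, lambda = 3) * sample(c(-1, 1), d, replace = TRUE) *
  rbinom(d, size = 1, prob = 0.5)

# column means drawn from N(0, 25); entries of X drawn around them with variance 4
col_means <- rnorm(d, mean = 0, sd = 5)
X <- sapply(col_means, function(mu) rnorm(n, mean = mu, sd = 2))

# response according to the model Y = X beta + error
Y <- as.vector(X %*% beta + rnorm(n, sd = sigma))
```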
5.3 Preliminary simulations

In a first step, we consider the running times necessary to carry out Bayesian regression for (non-embedded) data sets with an increasing number of observations n, employing the No-U-Turn Sampler (Hoffman and Gelman 2014), which is implemented in the R-package rstan. We use standard settings and four chains, which are sampled in parallel. The resulting running times are plotted in Figure 1. The running time depends at least linearly on n, with occasional jumps that are probably due to swaps to external memory, which slow down the computation by a large factor. There is an outlier for the case of n = 50, which takes a lot more time to compute than n = 100 and even more than n = 200. Since we have d = 50 variables, the total number of parameters (52) is higher than the number of observations for this case only. While the linear dependency on the number of observations does not pose a problem for small to medium-sized data sets, for big data settings MCMC methods become infeasible. This underlines the usefulness of embedded data sets.

Before we consider our embedding methods, we pick up on the idea of using Π = Xᵀ as sketching matrix. If the data set is given row by row, the sketch ΠX = XᵀX can be computed efficiently using the outer products of the row vectors, XᵀX = Σᵢ xᵢᵀxᵢ, which is a sum of d × d matrices of rank one. Analogously, we have ΠY = XᵀY = Σᵢ xᵢᵀYᵢ. This results in a sketch of size d × (d + 1), which is the smallest possible sketch for problems of full rank and has no error at all, in the sense that the exact likelihood respectively posterior distribution can be calculated from these matrices. We have argued before that this method is not numerically stable in general. As we focus on MCMC in this work, we show how this effect influences the run of the MCMC sampler. We tried analyzing Bayesian linear regression models based on Xᵀ[X, Y], using some of the data sets described in Section 5.1 and also a smaller data set with n = 10 000, which was generated in the same way. We have found that the models do not converge in practice. Increasing the number of iterations does not seem to be a remedy, as the variance of the MCMC sample grows with more iterations. Figure 2 shows an exemplary traceplot for one parameter, consisting of four chains. The range of the sample is enormous. The variation is high for all of the chains; they exhibit standard deviations in the order of 10⁹. When reducing the number of iterations from the 10 000 in Figure 2 to, say, 5 000, the range of the MCMC sample decreases markedly. However, there is no sign of convergence, with minimum and maximum around −4·10⁹ and 4·10⁹, respectively.

Figure 2: Traceplot of MCMC sample (with four chains) for one parameter based on a data set obtained using Π = Xᵀ as sketching matrix. The original data set contains n = 10 000 observations and d = 50 variables.

We can deal with both of these issues using subspace embeddings, as we can underline in our next experiments. We conduct a series of simulations that aims at comparing our proposed method to the standard method on the original data. To obtain the subspace embeddings, we employ the three approaches described in Section 3. As approximation parameter, we choose ε = 0.1 and ε = 0.2 for all three methods.
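The row-wise accumulation just described can be written in a few lines. This is a minimal sketch only, reusing X, Y, n and d from the data-generation sketch above; a production implementation would read the rows from a stream, and the result inherits the numerical problems discussed in the text.

```r
XtX <- matrix(0, d, d)
XtY <- numeric(d)
for (i in seq_len(n)) {
  x_i <- X[i, ]
  XtX <- XtX + tcrossprod(x_i)   # rank-one update x_i^T x_i, a d x d matrix
  XtY <- XtY + x_i * Y[i]        # contribution x_i^T Y_i
}
# cbind(XtX, XtY) is the d x (d+1) "sketch" corresponding to Pi = X^T
```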
We do not recommend using values of ε > 0.2. Table 2 contains the number of observations of the sketches depending on the number of variables and the value of the approximation parameter. The sizes for RAD and SRHT are both set to k = ⌈d log d / ε²⌉ to be comparable. They differ by one due to rounding errors. For the CW sketch we used k equal to the smallest power of two larger than d²/(20ε²). Please note that the CW embeddings generally result in a higher number of observations due to the quadratic dependency on d. However, the opposite is true for 50 variables. This is due to the constants that we used. The constants are set to 1 in the case of RAD and SRHT. For the RAD method this was empirically evaluated by Venkatasubramanian and Wang (2011). For CW the constant may be much smaller, as indicated by a lower bound in Nelson and Nguyen (2013b). Preliminary experiments on small scale led to our choice of 1/20.

  d     ε     RAD      SRHT     CW
  50    0.1   20 546   20 547   16 384
  50    0.2    5 136    5 137    4 096
  100   0.1   47 174   47 175   65 536
  100   0.2   11 793   11 794   16 384

Table 2: Number of observations of the sketches for different values of d and ε

5.4 Comparison of posterior means

To evaluate the results, we first compare the posterior means of the models based on the embedded data sets with the posterior means of the model based on the original data set. Table 3 contains an overview of the sum of squared distances between the embedded data sets' posterior means and those of the original model. Geometrically, this is the squared Euclidean distance of the posterior mean vectors.

  n        sketch   ε     ς = 1   ς = 2   ς = 5   ς = 10
  5·10⁴    RAD      0.1   0.052   0.025   0.021   0.834
  5·10⁴    RAD      0.2   0.014   0.781   0.892   1.512
  5·10⁴    SRHT     0.1   0.001   0.009   0.021   0.165
  5·10⁴    SRHT     0.2   0.004   0.077   0.093   0.757
  5·10⁴    CW       0.1   0.025   0.004   0.021   0.195
  5·10⁴    CW       0.2   0.016   0.040   0.156   0.915
  1·10⁵    RAD      0.1   0.836   0.958
  1·10⁵    RAD      0.2   0.061   0.777
  1·10⁵    SRHT     0.1   0.025   0.964
  1·10⁵    SRHT     0.2   0.171   0.617
  1·10⁵    CW       0.1   0.056   3.844
  1·10⁵    CW       0.2   2.624   2.937

Table 3: Sum of squared distances between posterior mean values of the original model and models based on the respective sketches

As indicated by Theorem 11, the sum of squared distances grows with the standard deviation of the error term. There does not seem to be a systematic difference in performance between the different sketching methods. With larger ε, we usually, but not necessarily, observe an increase in the distance. Please note that some values are missing, because the original models did not converge within reasonable time bounds.

In addition to the comparison to the original models' mean, we also compare the posterior means to the true means. Table 4 contains the sum of the squared distances between the true mean values and the posterior means for d = 50 and varying values of ς. The general picture looks very similar to the results in Table 3. The original model often exhibits the smallest sum of squared distances, but sometimes models based on embedded data sets are closer to the true mean. Again, there does not seem to be a systematic difference between the sketching methods. The squared distances do not seem to be influenced by the value of n, with some squared distances even exhibiting smaller values for larger n.
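The target dimensions in Table 2 follow from the formulas just stated. The helper functions below are a minimal sketch with the natural logarithm and the constants 1 and 1/20 as given above; the CW sizes of Table 2 are reproduced exactly, while the RAD/SRHT sizes reported there deviate somewhat from this simple evaluation, presumably due to implementation details of RaProR.

```r
k_rad_srht <- function(d, eps) ceiling(d * log(d) / eps^2)           # RAD and SRHT
k_cw       <- function(d, eps) 2^ceiling(log2(d^2 / (20 * eps^2)))   # smallest power of two above d^2/(20 eps^2)

k_cw(50, 0.1)    # 16384, matching Table 2
k_cw(100, 0.2)   # 16384, matching Table 2
```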
  n        sketch   ε     ς = 1   ς = 2   ς = 5   ς = 10
  5·10⁴    none           0.000   0.003   0.065    4.614
  5·10⁴    RAD      0.1   0.048   0.016   0.124    1.718
  5·10⁴    RAD      0.2   0.012   0.710   0.506   10.845
  5·10⁴    SRHT     0.1   0.001   0.018   0.032    3.372
  5·10⁴    SRHT     0.2   0.005   0.059   0.046    8.721
  5·10⁴    CW       0.1   0.022   0.011   0.046    6.474
  5·10⁴    CW       0.2   0.014   0.056   0.089    1.870
  1·10⁵    none           0.065   0.035
  1·10⁵    RAD      0.1   0.007   0.031   1.354    0.679
  1·10⁵    RAD      0.2   0.033   0.009   0.117    0.579
  1·10⁵    SRHT     0.1   0.030   0.136   0.040    0.696
  1·10⁵    SRHT     0.2   0.007   0.125   0.387    0.453
  1·10⁵    CW       0.1   0.004   0.232   0.022    4.496
  1·10⁵    CW       0.2   0.011   0.072   3.484    3.473
  5·10⁵    RAD      0.1   0.009   0.223   0.563   12.920
  5·10⁵    RAD      0.2   0.045   0.322   1.729    0.658
  5·10⁵    SRHT     0.1   0.009   0.147   0.418    0.059
  5·10⁵    SRHT     0.2   0.016   0.033   0.085    2.978
  5·10⁵    CW       0.1   0.027   0.097   1.305    0.153
  5·10⁵    CW       0.2   0.050   0.009   0.135    3.579
  1·10⁶    RAD      0.1   0.001   0.016   0.126    3.967
  1·10⁶    RAD      0.2   0.080   0.011   0.072    1.357
  1·10⁶    SRHT     0.1   0.002   0.010   0.599    0.288
  1·10⁶    SRHT     0.2   0.002   0.183   2.029    4.329
  1·10⁶    CW       0.1   0.002   0.289   1.202    4.445
  1·10⁶    CW       0.2   0.003   0.047   0.100    0.395

Table 4: Sum of squared distances between true mean values and posterior means of models based on the respective sketches

5.5 Comparison of fitted values

After this comparison on the level of parameters – whose number is not changed by sketching – we will compare the models on the level of observations, of which the sketches contain merely a fraction of the number in the original data set. We multiply X with the posterior mean vector of β, where this posterior mean can be based on the original data set or on the respective sketches. In a frequentist sense, these are fitted values Ŷ, but all X values are taken from the original data set, not necessarily from the data set the model is based on. This is done to see how close the approximation is on the level of Y values for both ε = 0.1 and ε = 0.2. Figure 3 is a scatterplot which contains smoothed densities. The fitted values based on the original model are on the x-axis while the fitted values based on the CW sketch (with ε = 0.1) are on the y-axis. Darker shades of black stand for more observations.

Figure 3: Comparison of fitted values based on the original data set with n = 50 000, d = 50, ς = 10 and a CW sketch with ε = 0.1. Darker shades of black stand for more observations.
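The comparison just described amounts to two matrix products. The following minimal sketch assumes that beta_hat_full and beta_hat_sketch hold the posterior mean vectors of the model on the original data and of a model on one of the sketches (e.g. column means of the respective MCMC draws), and that X is the original design matrix; these object names are illustrative.

```r
Y_hat_full   <- X %*% beta_hat_full              # fitted values from the original model
Y_hat_sketch <- X %*% beta_hat_sketch            # fitted values from a sketched model
summary(as.vector(Y_hat_sketch - Y_hat_full))    # differences centered around zero (cf. Figure 4)
sum((beta_hat_sketch - beta_hat_full)^2)         # squared Euclidean distance as in Tables 3 and 4
```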
Figure 4: Difference of fitted values according to models based on the respective sketching methods (RAD, SRHT, CW with ε = 0.1 and ε = 0.2) and fitted values according to the model based on the original data set with n = 50 000, d = 50, ς = 10.

Even though the fitted values are based on one of the data sets with the highest standard deviation of the error (n = 50 000, ς = 10), all values are close or reasonably close to the bisecting line. This means that the fitted values obtained by the two models do not differ by much. To get a better overview, Figure 4 depicts the distances between the fitted values as boxplots. Here, all three sketching methods with both ε = 0.1 and ε = 0.2 are included. All six sets of distances are centered around zero. The effect of the approximation parameter ε is evident from the boxplot: the variation is larger for ε = 0.2 regardless of the sketching method. When fixing ε, all three sketching methods exhibit similar results, although the RAD sketch seems to introduce slightly more variation into the differences than the other two sketching methods. Prediction for the data used in learning the model is thus highly accurate.

We have also generated additional data that have not been used in learning the model and employed the posterior mean to predict y-values for these data. The results are very similar to those described above and presented in Figures 3 and 4. Sketching, again, introduces a little more variation, depending on ε. A more formal treatment of prediction accuracy based on similar sketching techniques for the OLS solution is given in Raskutti and Mahoney (2015).

Figure 5: Boxplots of MCMC samples for two parameters (β₁₁ and β₃₃) based on the data set with n = 50 000, d = 50, ς = 5 and the respective sketches.

5.6 Comparison of posterior distributions

As we have conducted Bayesian regression, the model consists not only of a mean value, but of a whole posterior distribution for each parameter. Figure 5 contains two exemplary boxplots of MCMC samples representing marginal posterior distributions. The original data set contains n = 50 000 observations, d = 50 variables and has an error standard deviation of ς = 5.
The medians of the MCMC samples based on the original data set are well represented by the MCMC samples based on sketches. Even though the median based on an embedded data set might be higher or lower for certain parameters, we did not find any systematic biases. The embedding introduces additional variation, which depends on the value of the approximation parameter ε, but does not seem to be influenced by the choice of sketching method.

In regression, a common task is the identification of important variables by means of variable selection. In a Bayesian setting, this can be done via credible intervals, among other approaches. Our results indicate that the identification of important variables is quite accurately possible based on the resulting approximate models. However, one has to take the additional variation into account. For example, when using 95% credible intervals as criterion, one should not compare the endpoints of the credible interval to a fixed value µ. Instead, take the extra variation in the posterior distribution and also possible small shifts of the mean and median into account. We therefore recommend using smaller values of ε when aiming at variable selection (see Figure 5).

5.7 Comparison of running times

One of our aims is to make Bayesian regression feasible on very large data sets. After ensuring that the results are close to those obtained by the analysis on the original data set, we will now contemplate the running time required. The total running time is composed of the time required for the analysis and the time required for the necessary preliminary steps: reading the data set into memory and calculating the sketch.

                                Preprocessing                     Analysis
  n        sketch   ε      ς = 1   ς = 2   ς = 5   ς = 10     ς = 1     ς = 2     ς = 5    ς = 10
  5·10⁴    none            0.32    0.41    0.43    0.44       1095.55   749.10    616.47   498.68
  5·10⁴    RAD      0.1    1.60    1.68    1.68    1.73        315.12   213.42    156.79   154.73
  5·10⁴    RAD      0.2    0.40    0.39    0.42    0.43         23.17    26.00     17.48    21.81
  5·10⁴    SRHT     0.1    0.03    0.02    0.03    0.03        317.34   278.39    181.94   166.41
  5·10⁴    SRHT     0.2    0.02    0.02    0.02    0.02         29.04    30.65     23.26    26.00
  5·10⁴    CW       0.1    0.01    0.01    0.01    0.01        375.22   293.89    164.56   171.82
  5·10⁴    CW       0.2    0.01    0.01    0.01    0.01         26.92    25.77     20.57    22.94
  1·10⁵    none            0.69    0.83    1.02    1.05       2035.81  1617.28
  1·10⁵    RAD      0.1    3.27    3.41    3.41    3.27        278.87   260.80    167.24   182.92
  1·10⁵    RAD      0.2    0.76    0.84    0.84    0.80         21.44    23.21     17.52    23.65
  1·10⁵    SRHT     0.1    0.05    0.06    0.05    0.05        284.96   282.20    128.60   196.48
  1·10⁵    SRHT     0.2    0.04    0.04    0.05    0.04         23.72    26.82     21.52    22.70
  1·10⁵    CW       0.1    0.02    0.03    0.02    0.02        257.50   278.20    186.95   198.78
  1·10⁵    CW       0.2    0.02    0.02    0.02    0.02         21.94    26.29     21.22    23.45
  5·10⁵    none            5.49    5.16    5.92    5.71
  5·10⁵    RAD      0.1   16.88   15.96   16.10   16.36        279.81   313.33    165.85   198.40
  5·10⁵    RAD      0.2    3.73    4.00    4.03    3.85         27.37    27.19     17.22    19.58
  5·10⁵    SRHT     0.1    0.20    0.21    0.20    0.21        310.20   308.01    190.78   190.18
  5·10⁵    SRHT     0.2    0.19    0.18    0.19    0.19         31.37    25.62     22.26    24.76
  5·10⁵    CW       0.1    0.09    0.09    0.09    0.10        335.74   300.32    189.33   166.92
  5·10⁵    CW       0.2    0.09    0.08    0.09    0.09         26.03    25.23     24.39    22.86
  1·10⁶    none           18.23   12.88   12.59   14.09
  1·10⁶    RAD      0.1   51.77  147.42   33.75   34.71        209.19   279.03    215.78   145.64
  1·10⁶    RAD      0.2    7.92    8.46    8.38    8.21         21.27    19.93     22.87    23.43
  1·10⁶    SRHT     0.1    0.41    0.49    0.61    0.62        341.12   264.99    294.04   154.77
  1·10⁶    SRHT     0.2    0.39    1.44    0.68    0.68         26.61    31.32     19.69    23.69
  1·10⁶    CW       0.1    0.19    0.27    0.38    0.46        281.72   232.40    175.49   144.02
  1·10⁶    CW       0.2    0.21    0.19    0.45    0.39         28.58    19.50     22.05     9.72

Table 5: Running times for data sets with d = 50. The "Preprocessing" columns contain the running times of the sketching methods in minutes; for the original data set, the values represent the time required to read the data set into memory, which is a prerequisite for every sketching method. The "Analysis" columns contain the running times of the Bayesian linear regression analysis in minutes.
Table 5 contains the running times required for the Bayesian regression analysis (columns headed "Analysis") and the running times for the preliminary steps (columns headed "Preprocessing"). For the original data sets, the preliminary steps consist of the time required for reading the data set into memory; for all other cases, Table 5 gives the running times required for constructing the sketch. The total running time of an analysis on a sketch is obtained by summing up the reading time, the sketching time, and the time required for the Bayesian linear regression analysis.

Table 5 suggests that both the reading times and the sketching times are stable for data sets of the same size, with the possible exception of an outlier for n = 1 000 000 and ς = 2. Both the reading times and the sketching times grow with the size of the data set. The sketching times grow for smaller values of ε. For RAD sketches with ε = 0.1, the sketching takes longer than the reading of the data set; for all other combinations the opposite is true. CW sketches require the shortest running time of the three sketching methods. Although only a few of the original data sets could be analyzed, the "Analysis" values in Table 5 indicate that the running times for the Bayesian analysis increase with the number of observations in the data set. The running times for the sketched data sets, on the other hand, show no systematic increase for growing values of n. There is some variation in the running times, but this seems to be random fluctuation rather than a trend. Larger values of ε lead to shorter running times in the analysis, which indicates that the trade-off between time and goodness of the approximation is present both in calculating and in analyzing the sketch. The running time of the analyses is similar for all three sketching methods. However, the running times do seem to depend on the value of ς: for higher standard deviations of the error term, the required running time tends to become shorter.

Figure 6 exemplarily shows the total running times in minutes for data sets with d = 50 and ς = 5. This comprises reading, sketching (if applicable) and analyzing the data set. For the sketches, ε = 0.1 was used. To illustrate the potential improvement with increasing n, Figure 6 contains the total running times for n = 50 000 and n = 100 000 side by side. For the original data sets, we observe a large increase in running time as n doubles, whereas the running times on the sketched data sets hardly change.

Figure 6: Total running times in minutes for data sets with n ∈ {50 000, 100 000}, d = 50, ς = 5 and approximation parameter ε = 0.1. For the sketched data sets, the total running time consists of the time for reading, sketching and analyzing the data set. For the original data set, the sketching time is 0 since this step is not applied.
All results presented so far are based on data sets with d = 50 variables. We have conducted Bayesian analyses on data sets with d = 100 with the same parameter values for n and ς. The findings from these simulations are similar to those for d = 50. One exception is that the analysis of CW sketches now takes longer than the analysis of the respective original data sets for n = 50 000. This changes for larger data sets. One strength of the CW method, however, is its efficiency when dealing with streaming data when the number of variables is not too large. Recall that its dependency on d is quadratic, which means that the sketch sizes are already very large for medium-dimensional problems with d ≈ 500 or d ≈ 1 000. Setting the target dimension to a lower value is not a remedy, because this results in a weaker approximation guarantee, as also noticed by Yang et al. (2015). In that case one should consider using one of the slower and denser sketches.

5.8 Streaming example and concluding remarks

The data sets considered so far were chosen to be small enough to allow for Bayesian analysis on the full data set in a convenient time. This is by far not what might be called Big Data. To show that our approach is suitable for analyzing really big amounts of data, we have generated and at the same time sketched a data set. The generation of the data followed the same rules as for the other simulated data sets, but with ς = 0.1. The data set's original size is 10⁹ × 100 double-precision values. This corresponds to about 2 TB in CSV format or at least 750 GB in a binary representation. We have used the CW sketching method, resulting in a sketch of size 65 536 × 100, which requires only around 140 MB of space in CSV format and fits into RAM. Clearly, we cannot compare the results to those on the original data set, but the sum of squared distances between the true mean values of β and the posterior mean of the model amounts to only 3.741·10⁻⁶. The Bayesian regression analysis took 2781 minutes.

In some cases the algebraic structure might strongly depend on a few observations or variables. In such situations it is important to identify these or to retain their contribution in the reduced data set. So far, our model assumptions did not suffer from such ill-behaved situations, but now we assess the performance of our method in this case. We construct data sets where an important part of the target variable falls into a subspace that is spanned by a small constant number of observations. Uniform random subsampling will pick these only with probability O(1/n). Oblivious subsampling techniques in general will have trouble identifying the important observations. In contrast, oblivious subspace embedding techniques preserve these effects with high probability. This effect is observed in practice even when comparing one sketch against the best of 1 000 repetitions of uniform random subsampling.

In conclusion, the simulation study indicates that our proposed method works well for simulated data sets, which are generally well-suited for conducting Bayesian linear regression. Even with a high variance of the error term (and thus a relatively bad model fit), our proposal leads to results similar to those one would obtain on the original data set. The running time of the analysis with the proposed sketches is largely independent of n, giving advantages for very large n.
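The reason the data could be sketched while being generated is that the CW embedding processes one row at a time. The following is a minimal sketch of this update, not the RaProR/C++ implementation; drawing a fresh hash and sign per arriving row is equivalent to hashing the row index, since every observation arrives exactly once. It maintains a k × (d + 1) sketch of [X, Y] in memory that is independent of n.

```r
cw_sketch_stream <- function(next_row, k, d) {
  SA <- matrix(0, nrow = k, ncol = d + 1)   # sketch of [X, Y]
  repeat {
    row <- next_row()                       # numeric vector of length d + 1, or NULL when the stream ends
    if (is.null(row)) break
    h <- sample.int(k, 1)                   # random target row ("hash" of this observation)
    s <- sample(c(-1, 1), 1)                # random sign
    SA[h, ] <- SA[h, ] + s * row            # single sparse update per observation
  }
  SA
}
```

For the 10⁹ × 100 example above, next_row would wrap the data generator (or a file reader), so the full data set never needs to be stored.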
Since the embeddings can be computed while reading the data in sequentially, it is not necessary to load the whole data set into memory at once, which reduces the required memory. For CW embeddings, reading the data in and calculating the sketch takes only marginally longer than reading the data in alone. In our experiments, we found that reading in and sketching takes around 1.01 to 1.04 times longer. This factor is typically higher for small data sets and lower for larger data sets (cf. Table 5). However, when the number of variables is large it may be favorable to use SRHT, whose sketching time is only slightly larger but whose embedding dimension is considerably smaller.

6 Real Data Example

As a real data example, we consider the bike sharing data set taken from Fanaee-T and Gama (2014), which is available in the UCI Machine Learning Repository (Lichman 2013). This is only meant as an exemplary application of the methods to a real data scenario and should not be mistaken for a complete statistical analysis of the data set. The bike sharing data set contains the number of rental bike users per hour over two years as well as additional information about the day and the weather. See Table 6 for an overview of the variables we use in the model. The data set contains some additional variables we do not employ, because they are highly correlated with the variables present in the model. We also made a change to the factor levels of the variable weathersit. In the original data set, this variable contains 4 levels. The fourth level is present only 3 times out of a total of n = 17 379 hours in the data set. To avoid any problems with such an underrepresented level, we combine levels 3 and 4 to obtain a factor with 3 levels. The original levels 3 and 4 stand for "light rain" and "heavy rain". The new level 3 can easily be interpreted as the presence of rain. For a more detailed description of all variables, please refer to the data set's web page on the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

  Variable     Description                     Remark
  cnt          number of rental bikes used     Y-variable
  season       season of the year              factor (4 levels)
  yr           year (2011 or 2012)             factor (2 levels)
  hour         hour (0 to 23)                  factor (24 levels)
  holiday      public holiday                  factor (2 levels)
  weekday      day of the week                 factor (7 levels)
  weathersit   weather ("clear" to "rain")     factor (3 levels)
  atemp        apparent temperature            standardized
  hum          humidity                        standardized
  windspeed    wind speed                      standardized

Table 6: Variables from the bike sharing data set used in the model

The variable cnt contains the number of rental bikes used per hour and is thus a count variable. However, there are around 850 distinct values, which makes analyzing cnt as a continuous variable reasonable. When analyzing such variables with a linear regression model, transforming them using the square root is a common procedure. After the transformation, the values of cnt show some bi-modality, but fit reasonably well to the assumption of a normal distribution.

  d    ε      RAD     SRHT    CW
  40   0.15   6 767   6 767   8 192
  40   0.20   3 807   3 807   4 096

Table 7: Number of observations of the sketches for the bike sharing example. Different values of ε are used for RAD and SRHT sketches; the target dimension of the CW sketches is chosen to be the power of two closest to the size of the RAD and SRHT sketches
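A minimal preprocessing sketch for this example is given below. It assumes the hourly file from the UCI repository (hour.csv, where the hour variable is called hr) and standardizes the continuous variables with scale(); the object names are illustrative, and the actual analysis was run through RaProR and rstan as described next.

```r
bikes <- read.csv("hour.csv")

# combine weathersit levels 3 ("light rain") and 4 ("heavy rain") into a single "rain" level
bikes$weathersit <- factor(pmin(bikes$weathersit, 3),
                           labels = c("clear", "heavy clouds", "rain"))

# square-root transform of the count response
bikes$sqrt_cnt <- sqrt(bikes$cnt)

# design matrix with dummy coding for all factors (intercept included by default)
X <- model.matrix(sqrt_cnt ~ factor(season) + factor(yr) + factor(hr) + factor(holiday) +
                    factor(weekday) + weathersit + scale(atemp) + scale(hum) + scale(windspeed),
                  data = bikes)
Y <- bikes$sqrt_cnt
dim(X)   # 17379 rows and 40 columns (intercept plus 39 variables)
```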
We use the transformed variable cnt as Y-variable and all other variables in Table 6 as X-variables. To handle the factor variables, we use the R function model.matrix to create a design matrix, which is then passed on to RaProR and rstan. The resulting design matrix contains n = 17 379 observations and d = 39 variables plus the intercept. Again, we calculate sketches for all three methods and with two different settings of ε. Because of the size of the data set relative to the number of variables, we choose ε = 0.15 and ε = 0.2 for the RAD and SRHT sketches. For CW, we choose values of k that are closest to the target dimension of the other sketches. This results in the values given in Table 7.

The Bayesian model based on the original data set suggests that all mentioned variables are important for the modeling of the number of bikes used per hour. Figure A.1 in the appendix gives an overview of the posterior distributions for the model based on the original data set. As one might expect, the weather has a strong influence. More bikes are rented when the apparent temperature is high and, to a lesser extent, when the humidity and the wind speed are low. In clear or partly cloudy weather, the number of rented bikes is highest, but the negative influence of heavier clouds is comparatively small. Rainy weather, however, reduces the number of bike users more substantially. In addition to that, fall seems to be the most popular season for bike sharing. Spring and summer also have positive effects in comparison to winter, but the effect sizes are smaller. This might seem surprising at first, especially as the number of rental bike users is highest in summer. This might partly be an effect of the apparent temperature, which is generally higher in summer. There is also a distinct hourly effect. During night time, especially between midnight and 5 am, the number of rented bikes is greatly reduced. On the other hand, between 7 am and 9 pm, a lot of bikes are used, with two peaks at 8 am and 5 pm/6 pm. This might indicate that the service is used by people commuting to and from work. Holidays – which only include days that would otherwise be working days, so only Monday to Friday – have a negative influence on the number of rental bike users. When taking the days of the week into account, Friday and Saturday have the highest positive effect while Sunday seems to be the least popular day. All of these effects based on days have a small influence compared to the variables mentioned before. Lastly, the variable yr also has a positive influence, which indicates a positive trend for this bike sharing service.

All conclusions can similarly be derived from the models based on the sketches. Following our approach in Section 5, we first compare the resulting posterior mean values of β. Table 8 shows the sum of squared distances between the posterior mean values of the original model and of the models based on the three sketching methods, using two values of ε each.
  ε      RAD     SRHT    CW
  0.15   1.790   2.349   0.907
  0.2    6.511   2.732   1.657

Table 8: Sum of squared distances between posterior mean values of the original model and models based on the respective sketches for the bike sharing data set

There is a general increase in the sum of squared distances for ε = 0.2 compared to ε = 0.15, but the amount differs depending on the sketching method. This should not be over-interpreted, however. As these values represent only one realization of a random subspace embedding, there is no evidence for systematic differences.

To see the effects of the differences in the posterior means of β on the level of the y-variable, which is the number of rental bikes used, we compare the fitted values as in Section 5.5. Again, we multiply the original design matrix X with the posterior means of β, where the posterior mean values are obtained from the model on the original design matrix and from the models on the respective sketches. Figure 7 contains the six resulting boxplots.

Figure 7: Difference of fitted values according to models based on the respective sketching methods (RAD, SRHT, CW with ε = 0.15 and ε = 0.2) and fitted values according to the model based on the original bike sharing data set.
As in the simulation study, the differences of the fitted values are centered around zero with only small deviations. Further analysis indicates that the higher deviations occur when the number of bikes used is high, which means that the majority of differences in the fitted values are small relative to the observed value for that data point.

As a last step of our short analysis of the bike sharing data set, we concentrate on the posterior distributions of the factor that takes the weather situation into account. This factor has three levels in our data set. The first level stands for clear weather, which also includes partly cloudy hours, the second level stands for heavier clouds or mist, while the third level models rainy weather, which includes light rain, heavy rain, thunderstorms, and snow. The different levels occur with different relative frequencies: around 66% of the observed hours fall into level 1, 26% into level 2 and 8% into level 3.

Figure 8 shows boxplots of the MCMC samples based on the original design matrix and the sketches. The values represent the marginal posterior distributions of the two dummy variables associated with the variable weathersit. The scales of the boxplots are chosen such that one unit is of the same length on both y-axes. This allows for easy comparison of the variation in the two posterior distributions. Again, we can see that the embedding introduces additional variation, which depends on the value of the approximation parameter ε, but does not seem to be influenced by the choice of the sketching method. In addition, the posterior distributions for the level "heavy clouds" show less variation compared to the posterior distributions for "rain". This is as one might expect, as the number of occurrences is more than three times higher for "heavy clouds" and a larger number of observations tends to reduce the uncertainty. Nonetheless, it is interesting to observe that the variation introduced by the embedding seems to be a factor of the variation present in the original model. While the effect of the level "rain" is undoubtedly negative according to the original model and all sketches used here, the effect of the level "heavy clouds" is close to zero. In the original model, "heavy clouds" would also be seen as an influential factor when using the 95% credible interval as a criterion. The conclusion is the same for all sketching methods when using ε = 0.15. However, when using ε = 0.2 and CW, "heavy clouds" would be seen as not influential.

Figure 8: Boxplots of MCMC samples for the two weather situation parameters ("heavy clouds vs. clear" and "rain vs. clear") based on the original data set and all sketches.
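The credible-interval check used here is straightforward to carry out on the MCMC output. The following is a minimal sketch; beta_draws is assumed to be a matrix of posterior draws with one column per coefficient, e.g. obtained via rstan::extract() from the fit on the original or on a sketched design matrix.

```r
ci <- apply(beta_draws, 2, quantile, probs = c(0.025, 0.975))   # 95% credible intervals
influential <- ci[1, ] > 0 | ci[2, ] < 0                        # TRUE if the interval excludes zero
# with sketched data the intervals are somewhat wider; smaller values of epsilon
# are advisable when variable selection is the goal (cf. Section 5.6)
```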
This stresses again that the endpoints of credible intervals based on sketches exhibit some additional variation and that inference based on them may change, depending on the variation in the original model and the choice of ε. If variable selection is a focus of the regression analysis, we recommend choosing a reasonably small ε (cf. Figure 5 and its discussion in Section 5.6). This example underlines that our method also works well in real-world applications when the original data follow the model assumptions reasonably well.

7 Conclusion

Our paper deals with random projections as a data reduction technique for Bayesian regression. We have shown how projections can be applied to compress the columnspace of a given data matrix with only little distortion. The size of the reduced data set is independent of the number n of observations in the original data set. Therefore, subsequent computations can operate within time and space bounds that are also independent of n, regardless of which algorithm is actually used. While our focus was on MCMC and the No-U-Turn Sampler in particular, we tried INLA as well and observed a considerable reduction in running time while achieving very similar results. However, our proposed reduction method is not limited to these approaches, making it highly flexible. The presented embedding techniques allow for fast application to the data set and do not need the embedding matrices to be stored explicitly. Thus, only very little memory is needed while sketching the data. Furthermore, we have surveyed their useful properties when the computations are performed in sequential streaming as well as in distributed environments. These scenarios are highly desirable when dealing with Big Data (cf. Welling et al. 2014).

We consider the situation where the likelihood is modeled using standard linear regression with a Gaussian error term. We show that the likelihood is approximated within small error. Furthermore, if an arbitrary Gaussian distribution is used as prior distribution, the desired posterior distribution is also well approximated within small error. This includes the case of a uniform prior distribution over ℝ^d, an improper, non-informative choice that is widely used (cf. Gelman et al. 2014). We also show the results to be (1 + O(ε))-approximations of the distributions of interest in the context of Bayesian linear regression. As the structure of both mean and variance is preserved up to small errors, approximate sufficient statistics for the posterior distributions are well recovered. This gives the user all the information that is needed for Bayesian regression analysis.

In our simulation experiments, we found that the approximation works well for simulated data sets. All three sketching methods we considered lead to results that are very similar to Bayesian regression on the full data set and to the true underlying values. The running time for the MCMC analysis based on the sketches is independent of the number of observations n. The calculation of the embedding does depend on n, but requires little more time than reading the data alone. This is especially true when using the CW method. For larger dimensions, however, a CW embedding can be too large. In such a case, the denser SRHT construction also performs very well and is preferable because of its lower dependency on d.
RAD has an even lower dependency on d, but takes considerably more time to calculate. We have applied the methods to a bike sharing data set by Fanaee-T and Gama (2014). The approximation also works well on this real data example, giving very similar results to the original Bayesian regression while adding little additional variation to the posterior distributions.

For future research, we would like to generalize our results to other classes of distributions for the likelihood and to more general priors. As a first step, we have used hierarchical models involving normal, Gamma and exponential distributions as hyperpriors. For normal and Gamma distributions, the results seem promising, whereas using exponential distributions seems more challenging. The recent results on frequentist ℓ_p regression of Woodruff and Zhang (2013) might give rise to efficient streaming algorithms also in the Bayesian regression setting. Another interesting direction would be to consider Gaussian mixtures, since they allow to approximate any continuous distribution. In real-world applications one might exploit domain-specific structure to further reduce the time and space bounds when it indicates that the data itself is of low rank or allows for sparse solution vectors.

Acknowledgements. We would like to thank the anonymous referees for their valuable comments and suggestions which helped to improve this manuscript. This work has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project C4.

References

Ailon, N. and Liberty, E. (2009). Fast dimension reduction using Rademacher series on dual BCH codes. Discrete Comput. Geom., 42(4):615–630.

Alon, N., Matias, Y., and Szegedy, M. (1999). The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137–147.

Balakrishnan, S. and Madigan, D. (2006). A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Anal., 1(2):345–361.

Banerjee, A., Dunson, D. B., and Tokdar, S. T. (2013). Efficient Gaussian process regression for large datasets. Biometrika, 100(1):75–89.

Baraniuk, R., Davenport, M., Devore, R., and Wakin, M. (2007). A simple proof of the restricted isometry property for random matrices. Constr. Approx., 28(3).

Bardenet, R., Doucet, A., and Holmes, C. C. (2014). Towards scaling up Markov Chain Monte Carlo: an adaptive subsampling approach. In Proc. of ICML, pages 405–413.

Batson, J. D., Spielman, D. A., and Srivastava, N. (2012). Twice-Ramanujan sparsifiers. SIAM J. Comput., 41(6):1704–1721.

Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate Bayesian Computation in population genetics. Genetics, 162(4):2025–2035.

Benson, A. R., Gleich, D. F., and Demmel, J. (2013). Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures. In Proc. of IEEE Int. Conf. on Big Data, pages 264–272.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Boutsidis, C., Drineas, P., and Magdon-Ismail, M. (2013). Near-optimal coresets for least-squares regression. IEEE Trans. Inf. Theory, 59(10):6880–6892.

Boutsidis, C. and Gittens, A. (2013). Improved matrix algorithms via the Subsampled Randomized Hadamard Transform. SIAM J. Matrix Anal. Appl., 34(3):1301–1340.
Boutsidis, C. and Magdon-Ismail, M. (2014). Faster SVD-truncated regularized least-squares. In IEEE Symp. Inf. Theory, pages 1321–1325.

Boutsidis, C., Zouzias, A., and Drineas, P. (2010). Random projections for k-means clustering. In Proc. of NIPS, pages 298–306.

Candès, E. J., Romberg, J. K., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489–509.

Clarkson, K. L., Drineas, P., Magdon-Ismail, M., Mahoney, M. W., Meng, X., and Woodruff, D. P. (2013). The fast Cauchy transform and faster robust linear regression. In Proc. of SODA, pages 466–477.

Clarkson, K. L. and Woodruff, D. P. (2009). Numerical linear algebra in the streaming model. In Proc. of STOC, pages 205–214.

Clarkson, K. L. and Woodruff, D. P. (2013). Low rank approximation and regression in input sparsity time. In Proc. of STOC, pages 81–90.

Clarkson, K. L. and Woodruff, D. P. (2015). Sketching for M-estimators: A unified approach to robust regression. In Proc. of SODA, pages 921–939.

Cohen, M. B., Elder, S., Musco, C., Musco, C., and Persu, M. (2015). Dimensionality reduction for k-means clustering and low rank approximation. In Proc. of STOC.

Constantine, P. G. and Gleich, D. F. (2011). Tall and skinny QR factorizations in MapReduce architectures. In Proc. of Int. Workshop on MapReduce and Appl., pages 43–50. ACM.

Csillery, K., Blum, M., Gaggiotti, O., and Francois, O. (2010). Approximate Bayesian Computation (ABC) in practice. Trends Ecol. Evol., 25(7):410–418.

Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Mahoney, M. W. (2009). Sampling algorithms and coresets for ℓ_p regression. SIAM J. Comput., 38(5):2060–2078.

Demmel, J., Grigori, L., Hoemmen, M., and Langou, J. (2012). Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput., 34(1).

Dietzfelbinger, M., Hagerup, T., Katajainen, J., and Penttonen, M. (1997). A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19–51.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inf. Theory, 52(2):1289–1306.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. (2006). Sampling algorithms for ℓ₂ regression and applications. In Proc. of SODA, pages 1127–1136.

Drineas, P., Mahoney, M. W., Muthukrishnan, S., and Sarlós, T. (2011). Faster least squares approximation. Numer. Math., 117(2):219–249.

DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., and Pregibon, D. (1999). Squashing flat files flatter. In Proc. of KDD, pages 6–15.

Fanaee-T, H. and Gama, J. (2014). Event labeling combining ensemble detectors and background knowledge. Prog. in AI, 2(2-3):113–127.

Feldman, D., Schmidt, M., and Sohler, C. (2013). Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proc. of SODA, pages 1434–1453.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal., 1(3):515–534.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014). Bayesian Data Analysis. Texts in Statistical Science. Chapman & Hall/CRC, 3rd edition.

Geppert, L. N., Ickstadt, K., Munteanu, A., Quedenfeld, J., and Sohler, C. (2015). RaProR: Random Projections for Bayesian linear Regression, R-package, Version 1.0.
Givens, C. R. and Shortt, R. M. (1984). A class of Wasserstein metrics for probability distributions. Michigan Math. J., 31(2):231–240.

Golub, G. H. (1965). Numerical methods for solving linear least squares problems. Numerische Mathematik, 7(3):206–216.

Golub, G. H. and van Loan, C. F. (2013). Matrix Computations. Johns Hopkins University Press, 4th edition.

Guhaniyogi, R. and Dunson, D. B. (2014). Bayesian compressed regression. J. Amer. Stat. Assoc., published online.

Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.

Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15:1593–1623.

Horn, R. and Johnson, C. (1990). Matrix Analysis. Cambridge University Press.

Ji, S. and Carin, L. (2007). Bayesian compressive sensing and projection optimization. In Proc. of ICML, pages 377–384.

Jolliffe, I. (2002). Principal Component Analysis. Springer, 2nd edition.

Kannan, R. and Vempala, S. (2009). Spectral algorithms. Found. Trends Theor. Comput. Sci., 4(3-4):157–288.

Kannan, R., Vempala, S., and Woodruff, D. P. (2014). Principal component analysis and higher correlations for distributed data. In Proc. of COLT, pages 1040–1057.

Kerber, M. and Raghvendra, S. (2014). Approximation and streaming algorithms for projective clustering via random projections. CoRR, abs/1407.2063.

Lawson, C. L. and Hanson, R. J. (1995). Solving Least Squares Problems. Classics in Applied Mathematics. SIAM, Philadelphia (PA).

Lichman, M. (2013). UCI Machine Learning Repository.

Ma, P., Mahoney, M. W., and Yu, B. (2014). A statistical perspective on algorithmic leveraging. In Proc. of ICML, pages 91–99.

Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). Likelihood-based data squashing: A modeling approach to instance construction. Data Min. Knowl. Discov., 6(2):173–190.

Martins, T., Simpson, D., Lindgren, F., and Rue, H. (2013). Bayesian computing with INLA: New features. Comput. Stat. Data Anal., 67:68–83.

Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Found. Trends Theor. Comput. Sci., 1(2).

Nelson, J. and Nguyen, H. L. (2013a). OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proc. of FOCS, pages 117–126.

Nelson, J. and Nguyen, H. L. (2013b). Sparsity lower bounds for dimensionality reducing maps. In Proc. of STOC, pages 101–110.

Nelson, J. and Nguyen, H. L. (2014). Lower bounds for oblivious subspace embeddings. In Proc. of ICALP, Part I, pages 883–894.

Paul, S., Boutsidis, C., Magdon-Ismail, M., and Drineas, P. (2014). Random projections for linear support vector machines. ACM Trans. Knowl. Discov. Data, 8(4):22.

Quiroz, M., Villani, M., and Kohn, R. (2015). Speeding up MCMC by efficient data subsampling. stat.ME, abs/1404.4178.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Raskutti, G. and Mahoney, M. (2015). Statistical and algorithmic perspectives on randomized sketching for ordinary least-squares. In Proc. of ICML, pages 617–625.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations (with discussion). J. R. Statist. Soc. B, 71:319–392.

Rusu, F. and Dobra, A. (2007). Pseudo-random number generation for sketch-based estimations. ACM Trans. Database Syst., 32(2):11.

Sarlós, T. (2006). Improved approximation algorithms for large matrices via random projections. In Proc. of FOCS, pages 143–152.

Stan Development Team (2013). Stan: A C++ Library for Probability and Sampling, Version 2.3.

Venkatasubramanian, S. and Wang, Q. (2011). The Johnson-Lindenstrauss transform: An empirical study. In Proc. of ALENEX, pages 164–173.

Villani, C. (2009). Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin.

Welling, M., Teh, Y. W., Andrieu, C., Kominiarczuk, J., Meeds, T., Shahbaba, B., and Vollmer, S. (2014). Bayesian inference & Big Data: A snapshot from a workshop. ISBA Bulletin, 21(4):8–11.

Woodruff, D. P. and Zhang, Q. (2013). Subspace embeddings and ℓ_p-regression using exponential random variables. In Proc. of COLT, pages 546–567.

Yang, J., Meng, X., and Mahoney, M. W. (2015). Implementing randomized matrix algorithms in parallel and distributed environments. CoRR, abs/1502.03032.

Appendix

Figure A.1: Boxplots of the MCMC samples for all β parameters of the bike sharing data set based on the original model (intercept, season, year, hour, holiday, weekday, weather situation, apparent temperature, humidity and wind speed).