Scalable Bayesian Non-linear Matrix Completion



Xiangju Qin*, Paul Blomstedt and Samuel Kaski
Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, 00076 Espoo, Finland
xiangju.qin@helsinki.fi, paul.blomstedt@aalto.fi, samuel.kaski@aalto.fi

Abstract

Matrix completion aims to predict missing elements in a partially observed data matrix which in typical applications, such as collaborative filtering, is large and extremely sparsely observed. A standard solution is matrix factorization, which predicts unobserved entries as linear combinations of latent variables. We generalize to non-linear combinations in massive-scale matrices. Bayesian approaches have been proven beneficial in linear matrix completion, but have not been applied in the more general non-linear case, due to limited scalability. We introduce a Bayesian non-linear matrix completion algorithm, which is based on a recent Bayesian formulation of Gaussian process latent variable models. To solve the challenges regarding scalability and computation, we propose a data-parallel distributed computational approach with a restricted communication scheme. We evaluate our method on challenging out-of-matrix prediction tasks using both simulated and real-world data.

1 Introduction

In matrix completion, one of the most widely used approaches for collaborative filtering, the objective is to predict missing elements of a partially observed data matrix. Such problems are often characterized by large and extremely sparsely observed data sets. The classic linear solution to the problem is to find a factorization of the data matrix Y ∈ R^{N×D} as a product of latent variables X ∈ R^{N×K} and weights W ∈ R^{D×K} (K ≪ N, D), from which elements of Y can be predicted as Y ≈ XW^T.
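As a toy illustration of the linear factorization Y ≈ XW^T, the following sketch recovers the factors from partially observed data with alternating least squares. The sizes, rank, ridge term and the ALS solver are illustrative assumptions, not the Bayesian treatment developed in this paper.

```python
import numpy as np

# Toy alternating-least-squares sketch of linear matrix completion
# (Y ~= X W^T); all sizes and the solver choice are hypothetical.
rng = np.random.default_rng(0)
N, D, K = 60, 30, 3
X_true = rng.normal(size=(N, K))
W_true = rng.normal(size=(D, K))
Y = X_true @ W_true.T + 0.01 * rng.normal(size=(N, D))
mask = rng.random((N, D)) < 0.5                  # observed entries

X, W = rng.normal(size=(N, K)), rng.normal(size=(D, K))
lam = 1e-3                                       # small ridge regularizer
for _ in range(30):
    for n in range(N):                           # update each latent row
        o = mask[n]
        X[n] = np.linalg.solve(W[o].T @ W[o] + lam * np.eye(K),
                               W[o].T @ Y[n, o])
    for j in range(D):                           # update each weight row
        o = mask[:, j]
        W[j] = np.linalg.solve(X[o].T @ X[o] + lam * np.eye(K),
                               X[o].T @ Y[o, j])

# error on the *unobserved* entries, i.e. the completion task
rmse_missing = np.sqrt(np.mean((X @ W.T - Y)[~mask] ** 2))
```

With half of the entries observed and a low true rank, the held-out error approaches the noise level; the probabilistic methods discussed next replace these point estimates with posterior distributions.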
(* Contact author. Currently at the Institute for Molecular Medicine Finland (FIMM), University of Helsinki.)

Probabilistic matrix factorization (PMF) [Salakhutdinov and Mnih, 2008b] formulates the problem as a probabilistic model, regularized by placing priors on X and W, and finds the solution as a maximum a posteriori (MAP) estimate of these matrices. Fully Bayesian matrix factorization [Salakhutdinov and Mnih, 2008a] expands this model by further placing priors on model hyperparameters, and marginalizing these along with X and W. Bayesian matrix factorization brings the advantages of automatic complexity control and better robustness to overfitting. Moreover, the solution comes with an uncertainty estimate, which is useful when the completed matrix is used for decision making. For instance, sparsely observed drug-target interaction matrices are used for deciding which interactions to measure next.

Lawrence and Urtasun [2009] generalized PMF using a Gaussian process latent variable model (GP-LVM) formulation, where the relationship between X and Y is given by Y ≈ f(X), with a GP prior placed over f. The latent matrix X is optimized to find its MAP solution. Note that this formulation also subsumes the linear model as a special case. Subsequently, a variational inference framework for fully Bayesian GP-LVM has been developed [Damianou et al., 2016; Titsias and Lawrence, 2010], building on sparse GP approximations [Quiñonero Candela and Rasmussen, 2005; Snelson and Ghahramani, 2006]. It parametrizes the covariance matrix implied by the GP kernel using a set of M ≪ N auxiliary inducing variables. While Bayesian GP-LVM has been successfully used for dimensionality reduction and extracting latent representations, less consideration has been given to its applicability in matrix completion tasks with extremely sparse data.
Computationally, this is a much more demanding problem, because the variational updates have to be performed separately for each dimension of the data matrix, instead of being performed as a single operation.

Existing approaches for scaling up Bayesian GP-LVM make use of the insight that, conditional on the inducing variables, the data points can be decoupled for parallel computations. In this line of work, Gal et al. [2014] introduced a distributed version of Bayesian GP-LVM. Dai et al. [2014] proposed a similar framework, additionally using GPU acceleration to speed up local computations. Neither of these works demonstrated learning of latent variable models beyond moderately-sized data, nor have they been implemented for sparse matrices, which is needed for the problems considered in this paper. More importantly, current distributed solutions require the worker nodes to communicate with the master node in every iteration, which leads to an accumulating communication overhead as the number of worker units increases with the size of the problem. Vander Aa et al. [2016] reported such a phenomenon for their distributed MPI implementation of Bayesian linear matrix factorization. Finally, our experience indicates that existing distributed implementations may suffer from high memory consumption.

For GP-regression models, with X observed, Deisenroth and Ng [2015] proposed a framework with particularly good scaling properties and efficient use of memory. This framework utilizes a product-of-GP-experts (PoE) formulation [Cao and Fleet, 2014; Ng and Deisenroth, 2014; Tresp, 2000], which makes predictions using a product of independent local models, each operating on a subset of the data. These types of approximations are amenable to embarrassingly parallel computations, and can therefore be scaled up to arbitrarily large data sets, at least in principle.
However, a direct application of PoE to non-linear matrix completion may not produce satisfactory predictions, for two reasons. First, since the target matrix is very sparsely observed, each local model has very limited information to learn an informative model without sharing information. Second, while local models could be combined into larger models to improve predictive performance, this is hindered by the general non-uniqueness of the latent variables in latent variable models.

In this work, we propose a distributed computational strategy which is almost as communication-efficient as embarrassingly parallel computation, but enables local models to share information and avoids the problem of non-uniqueness in aggregating local models. In a nutshell, one data subset is processed first, and the rest of the embarrassingly parallel computations are conditioned on the result. A similar idea was recently presented by Qin et al. [2019] for Bayesian linear matrix completion. The remainder of the paper proceeds as follows: In Section 2 we first provide a brief review of GP-LVMs. Then, in Section 3, we present our framework for scalable Bayesian non-linear matrix completion. An empirical evaluation of the method, using simulations and a benchmark dataset, is given in Section 4. The paper ends with conclusions in Section 5.

2 Gaussian Process Latent Variable Models

A Gaussian process latent variable model (GP-LVM) [Lawrence, 2005] can be constructed from a non-linear multi-output regression model,

$$p(\mathbf{Y} \mid \mathbf{F}, \mathbf{X}, \sigma^2) = \prod_{d=1}^{D} p(\mathbf{y}_{:,d} \mid \mathbf{f}_{:,d}, \mathbf{X}, \sigma^2) = \prod_{d=1}^{D} \prod_{n=1}^{N} \mathcal{N}\!\left(y_{n,d} \mid f_d(\mathbf{x}_{n,:}), \sigma^2\right),$$

by placing a GP prior over the unknown functions f_1, ..., f_D.
Integrating over the space of functions with respect to a zero-mean GP then yields the likelihood

$$p(\mathbf{Y} \mid \mathbf{X}, \theta) = \prod_{d=1}^{D} \int \mathcal{N}(\mathbf{y}_{:,d} \mid \mathbf{f}_{:,d}, \sigma^2 \mathbf{I})\, \mathcal{N}(\mathbf{f}_{:,d} \mid \mathbf{0}, \mathbf{K})\, \mathrm{d}\mathbf{f}_{:,d} = \prod_{d=1}^{D} \mathcal{N}(\mathbf{y}_{:,d} \mid \mathbf{0}, \mathbf{K} + \sigma^2 \mathbf{I}), \quad (1)$$

where K is an N × N kernel matrix defined by a GP covariance function k(x_{s,:}, x_{t,:}). We use θ to collectively denote all parameters, including the noise variance σ² and the parameters of the covariance function.

When values are missing, as is the case in matrix completion, each factor of the likelihood (1) will only account for observed elements; thus we have

$$p(\mathbf{Y} \mid \mathbf{X}, \theta) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{y}_{n_d,d} \mid \mathbf{0}, \mathbf{K} + \sigma^2 \mathbf{I}), \quad (2)$$

where n_d denotes the set of indices of observed elements in column d. Furthermore, the symmetric matrices K and I will only include rows and columns corresponding to the indices n_d.

2.1 Bayesian GP-LVM

For Bayesian GP-LVM, we complement the likelihood (2) by placing a prior on X. A standard choice is to set

$$p(\mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_{n,:} \mid \mathbf{0}, \mathbf{I}). \quad (3)$$

The marginal likelihood of Bayesian GP-LVM is obtained by integrating the model with respect to p(X):

$$p(\mathbf{Y} \mid \theta) = \int p(\mathbf{Y} \mid \mathbf{X}, \theta)\, p(\mathbf{X})\, \mathrm{d}\mathbf{X}. \quad (4)$$

While this operation is intractable in general, Titsias and Lawrence [2010] introduced a variational framework, which leads to a tractable lower bound on the log of the marginal likelihood (4):

$$\begin{aligned} \mathcal{F}(q) &= \int q(\mathbf{X}) \log \frac{p(\mathbf{Y} \mid \mathbf{X}, \theta)\, p(\mathbf{X})}{q(\mathbf{X})}\, \mathrm{d}\mathbf{X} \\ &= \int q(\mathbf{X}) \log p(\mathbf{Y} \mid \mathbf{X}, \theta)\, \mathrm{d}\mathbf{X} - \int q(\mathbf{X}) \log \frac{q(\mathbf{X})}{p(\mathbf{X})}\, \mathrm{d}\mathbf{X} \\ &= \int q(\mathbf{X}) \log p(\mathbf{Y} \mid \mathbf{X}, \theta)\, \mathrm{d}\mathbf{X} - \mathrm{KL}\left(q(\mathbf{X})\, \| \, p(\mathbf{X})\right). \end{aligned} \quad (5)$$

For a detailed treatment of the framework, see [Damianou et al., 2016]. As a by-product of optimizing the lower bound (5), we get a variational approximation q(X) to the posterior p(X | Y) ∝ p(Y | X) p(X), for which we assume a factorized Gaussian form

$$q(\mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_{n,:} \mid \boldsymbol{\mu}_n, \mathbf{S}_n). \quad (6)$$
In our current work, we use this approximation to share information between parallel computations (see Section 3.1).

2.2 Extension to Multi-view Settings

Manifold relevance determination (MRD) [Damianou et al., 2012] extends GP-LVM to a multi-view setting by reformulating the likelihood (1) as ∏_{v∈V} p(Y_v | X, θ_v), where the elements of V index the data views, i.e., matrices conditionally independent given a single latent matrix X. In matrix completion problems, one of the views is typically the target in which prediction is carried out, while the other views constitute side-data. When predicting values in completely unobserved (or new) rows or columns in the target, predictions are effectively done 'outside' of the observed matrix. This can be done with the help of observed data in the side-views, and is referred to as out-of-matrix prediction.

Figure 1: Overview of scalable Bayesian non-linear matrix completion. In the learning phase (left panel), a large matrix is partitioned into 4x1 subsets, which are processed in two stages: in the initial stage only one subset is processed (box with solid borders), after which the remaining subsets are processed in parallel, each coupled with the first subset using incremental learning. The prediction phase (right panel) uses a product of experts, aggregating the means μ_i and variances σ²_i of local experts into a joint Gaussian prediction with μ_PoE and σ²_PoE.

3 Scalable Matrix Completion Using Bayesian GP-LVM

This section presents a computational strategy which enables Bayesian GP-LVM to be scaled up for large-scale matrix completion problems using a product of experts (PoE). In brief, we first partition the observation matrix Y into I disjoint subsets Y^(1), Y^(2), ..., Y^(I) for parallel processing.
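The two-stage communication pattern of the learning phase (Figure 1) can be sketched schematically as follows. The worker functions here are hypothetical placeholders standing in for Bayesian GP-LVM inference on a subset; only the partitioning and communication structure is illustrated.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Schematic of the two-stage learning phase: fit_initial and
# fit_conditioned are made-up stand-ins for subset-level inference.
def fit_initial(subset):
    return {"mean": float(np.mean(subset))}      # stage-1 reference model

def fit_conditioned(subset, ref):
    # stage-2 workers condition on the stage-1 result only;
    # they never communicate with each other
    return {"mean": float(np.mean(subset)), "ref": ref["mean"]}

Y = np.arange(40.0).reshape(8, 5)
subsets = np.array_split(Y, 4, axis=0)           # 4x1 row partition
ref = fit_initial(subsets[0])                    # initial stage (serial)
with ThreadPoolExecutor() as pool:               # second stage (parallel)
    rest = list(pool.map(lambda s: fit_conditioned(s, ref), subsets[1:]))
experts = [ref] + rest
```

The single broadcast of the stage-1 result is the only communication beyond an embarrassingly parallel scheme, which is the property the paper exploits.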
To avoid problems resulting from the unidentifiability of latent variable models, we couple the subsets as follows: in an initial stage only one subset is processed; in the second stage, the remaining subsets are processed in parallel using incremental learning (Section 3.1). For each subset, we use an implementation of the variational inference framework for Bayesian GP-LVM [Titsias and Lawrence, 2010]. Finally, with a set of independently learned Bayesian GP-LVMs, we use PoE to predict unobserved matrix elements (Section 3.2). The proposed strategy is summarized in Figure 1. In Section 3.3, we present a further variant of the method, which uses intermediate aggregation of the submodels, prior to PoE, to improve predictive performance. The scalability of our method is briefly discussed in Section 3.4.

3.1 Coupling Parallel Inferences Using Incremental Learning

To couple the parallel inferences over subsets in the second stage of our method, we use a probabilistic variant of the incremental learning algorithm introduced by Yao et al. [2011] for online learning of (non-Bayesian) non-linear latent variable models. Let Y^(1) be the submatrix processed in the initial stage. Furthermore, denote by Y^(i)_aug = [Y^(1), Y^(i)] ∈ R^{(N_1+N_i)×D}, i = 2, ..., I, the combined submatrix obtained by augmenting Y^(i) with Y^(1). The corresponding combined latent matrix is denoted by X^(i)_aug = [X^(1), X^(i)] ∈ R^{(N_1+N_i)×K}. The objective of incremental learning is to learn the joint latent matrix X^(i)_aug without extensive relearning of X^(1), while still allowing it to be updated. When learning X^(i)_aug, Yao et al. [2011] added a regularizer to the log-likelihood to prevent X^(1) from deviating too much from its initial estimate, and to speed up learning. The original incremental learning algorithm used the Frobenius norm ‖X^(1) − X̂^(1)‖²_F to regularize the updated estimate of X^(1).
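The effect of this Frobenius-norm regularizer can be sketched in a few lines. The log-likelihood term below is a placeholder and lam is a hypothetical regularization weight; only the penalty structure follows Yao et al. [2011].

```python
import numpy as np

# Sketch of the Frobenius-norm regularizer: the objective penalizes
# X(1) for drifting from its initial estimate X(1)-hat. loglik is a
# placeholder value; lam is a made-up weight.
def regularized_objective(X1, X1_hat, loglik, lam=1.0):
    penalty = lam * np.sum((X1 - X1_hat) ** 2)   # squared Frobenius norm
    return loglik - penalty

X1_hat = np.ones((3, 2))
same = regularized_objective(X1_hat, X1_hat, loglik=0.0)       # no drift
moved = regularized_objective(X1_hat + 0.5, X1_hat, loglik=0.0)  # drifted
```

Any update that moves X^(1) away from its initial estimate pays a quadratic cost, which anchors the shared block across parallel runs.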
In our current work, we use the KL-divergence KL(q(X^(1)) ‖ q̂(X^(1))) to ensure that the updated variational posterior approximation q(X^(1)) remains close to the initial approximation q̂(X^(1)). For X^(i), we use the default prior given in Equation (3). Thus, the variational inference for the incremental learning of Bayesian GP-LVM follows the procedure introduced in Section 2.1, with the KL terms in the lower bound of Eq. (5) taking the following form:

$$\mathrm{KL}\left(q(\mathbf{X}^{(i)}_{\mathrm{aug}})\, \| \, p(\mathbf{X}^{(i)}_{\mathrm{aug}})\right) = \mathrm{KL}\left(q(\mathbf{X}^{(1)})\, \| \, \hat{q}(\mathbf{X}^{(1)})\right) + \mathrm{KL}\left(q(\mathbf{X}^{(i)})\, \| \, p(\mathbf{X}^{(i)})\right).$$

For the augmented subsets, the inducing points and kernel hyperparameters are initialized to the values obtained in the initial stage. For initialization of latent variables, we use posterior means for X^(1) and nearest neighbors for X^(i) [Yao et al., 2011].

3.2 Prediction with Product of Experts

Product-of-experts (PoE) prediction for Gaussian process regression [Deisenroth and Ng, 2015] uses the simple idea of combining predictions made by independent GP models (i.e., 'experts') as a product:

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}) = \prod_{i=1}^{I} p_i\!\left(y_* \mid \mathbf{x}_*, \mathbf{X}^{(i)}\right), \quad (7)$$

where x_* is a given test input and y_* is the corresponding output value to be predicted. Under this model, a prediction is proportional to a Gaussian with parameters

$$\mu^{\mathrm{poe}}_* = (\sigma^{\mathrm{poe}}_*)^2 \sum_{i=1}^{I} \sigma_i^{-2}(\mathbf{x}_*)\, \mu_i(\mathbf{x}_*), \qquad (\sigma^{\mathrm{poe}}_*)^{-2} = \sum_{i=1}^{I} \sigma_i^{-2}(\mathbf{x}_*).$$

With latent variable models, the essential difference to the above is that the inputs are unobserved and must therefore be inferred. In matrix completion, we wish to predict the missing part of a partially observed test point y_* = (y^O_*, y^U_*) ∈ R^D, where y^O_* are the observed elements (or side views) and y^U_* are the missing values (or the unobserved target view) to be reconstructed. The prediction task proceeds in two steps.
First, we infer the latent input x_* of the test point, which involves maximizing the variational lower bound on the marginal likelihood

$$p(\mathbf{y}^O_*, \mathbf{Y}) = \int p(\mathbf{y}^O_*, \mathbf{Y} \mid \mathbf{X}, \mathbf{x}_*, \theta)\, p(\mathbf{X}, \mathbf{x}_*)\, \mathrm{d}\mathbf{X}\, \mathrm{d}\mathbf{x}_*$$

to obtain the approximate posterior q(X, x_*) = q(X) q(x_*). The lower bound has the same form as the learning objective function in Equation (5), but for its maximization, the variational distribution q(X) over the training-data latent variables and the parameters θ remain fixed during test time. After obtaining q(x_*), making predictions for y^U_* is approached as GP prediction with uncertain inputs [Damianou et al., 2016; Girard et al., 2003].

In our distributed setting, the experts in PoE correspond to submodels learned from the augmented training subsets formed in the incremental learning phase. To correct for the initial subset Y^(1) being used in I − 1 training sets, we formulate a corrected PoE as follows:

$$p(y_* \mid \mathbf{Y}, \theta) = p_1(y_* \mid \mathbf{Y}^{(1)}, \theta^{(1)}) \times \prod_{i=2}^{I} \left[ p_i(y_* \mid \mathbf{Y}^{(i)}_{\mathrm{aug}}, \theta^{(i)})\, p_1(y_* \mid \mathbf{Y}^{(1)}, \theta^{(1)})^{-1} \right].$$

Finally, denoting the means and variances of the local predictive distributions as μ̂_i and σ̂²_i, respectively, we compute the aggregated statistics of the corrected PoE predictions as

$$\mu^{\mathrm{cpoe}}_* = (\sigma^{\mathrm{cpoe}}_*)^2 \left[ \hat{\sigma}_1^{-2} \hat{\mu}_1 + \sum_{i=2}^{I} \left( \hat{\sigma}_i^{-2} \hat{\mu}_i - \hat{\sigma}_1^{-2} \hat{\mu}_1 \right) \right], \qquad (\sigma^{\mathrm{cpoe}}_*)^{-2} = \hat{\sigma}_1^{-2} + \sum_{i=2}^{I} \left( \hat{\sigma}_i^{-2} - \hat{\sigma}_1^{-2} \right).$$

In their distributed GP framework, Deisenroth and Ng [2015] used a re-weighted variant of PoE, which they coined the robust Bayesian committee machine (rBCM). Although rBCM has been shown to outperform the basic PoE for GP regression, in our current setup we have not observed any advantage of it over PoE. We have therefore formulated our framework using standard PoE, but note that the extension to rBCM is straightforward.
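Both aggregation rules above reduce to simple precision arithmetic; a minimal sketch with made-up expert predictions:

```python
import numpy as np

# Precision-weighted PoE and the corrected variant (formulas above).
# Expert means/variances are illustrative numbers only.
def poe(means, variances):
    p = 1.0 / np.asarray(variances, dtype=float)   # precisions
    var = 1.0 / p.sum()
    return var * np.sum(p * np.asarray(means, dtype=float)), var

def corrected_poe(means, variances):
    # index 0 is the initial-subset expert, counted once; its
    # precision-weighted contribution is subtracted from each
    # augmented expert i >= 1
    m = np.asarray(means, dtype=float)
    p = 1.0 / np.asarray(variances, dtype=float)
    prec = p[0] + np.sum(p[1:] - p[0])
    mu = (p[0] * m[0] + np.sum(p[1:] * m[1:] - p[0] * m[0])) / prec
    return mu, 1.0 / prec

mu_poe, var_poe = poe([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
# with a single augmented expert, the correction cancels the initial
# expert exactly and returns the augmented expert's prediction
mu_c, var_c = corrected_poe([1.0, 3.0], [2.0, 0.5])
```

Note that the corrected combination implicitly assumes the augmented experts are at least as precise as the initial one; otherwise the combined precision could become non-positive.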
3.3 Improved Solution with Intermediate Aggregation

PoE aggregates predictions from local submodels learned on data subsets, effectively using a block-diagonal approximation of the full-data covariance matrix. With larger submodels, PoE provides a closer approximation to the full covariance matrix, which can be expected to result in better predictive accuracy. Here we introduce an intermediate aggregation strategy, by which submodels are aggregated for improved predictive performance, while the initial training of submodels is still done on smaller subsets with lower computational cost. While latent variable models are in general non-identifiable, making a direct aggregation of local models difficult to carry out in a meaningful way, the incremental learning introduced in Section 3.1 encourages identifiability among local models, alleviating the problem.

The aggregation of submodels involves (i) stacking together local variational distributions, which are assumed to be independent across subsets, (ii) concatenating the corresponding data subsets, and finally (iii) aggregating the hyperparameters of the models. The model parameters can be approximated using suitable statistics (e.g., mode, median or mean) of the distributions. In our implementation, we use the mode to approximate the kernel and Gaussian noise variance parameters, and use averaging to estimate the inducing variables.

Since the first subset Y^(1) is used multiple times through incremental learning, the corresponding variational distribution q(X^(1)) is obtained through the following aggregation:

$$q\!\left(\mathbf{X}^{(1)}\right) = \hat{q}_1\!\left(\mathbf{X}^{(1)}\right) \prod_{i=2}^{I} \left[ \hat{q}_i\!\left(\mathbf{X}^{(1)}\right) \hat{q}_1\!\left(\mathbf{X}^{(1)}\right)^{-1} \right] = \prod_{n=1}^{N_1} \mathcal{N}\!\left(\mathbf{x}_{n,:} \mid \hat{\boldsymbol{\mu}}^*_n, \hat{\mathbf{S}}^*_n\right),$$

where

$$\left[\hat{\mathbf{S}}^*_n\right]^{-1} = \left[\hat{\mathbf{S}}^{(1)}_n\right]^{-1} + \sum_{i=2}^{I} \left( \left[\hat{\mathbf{S}}^{(i)}_n\right]^{-1} - \left[\hat{\mathbf{S}}^{(1)}_n\right]^{-1} \right),$$

$$\hat{\boldsymbol{\mu}}^*_n = \hat{\mathbf{S}}^*_n \left( \left[\hat{\mathbf{S}}^{(1)}_n\right]^{-1} \hat{\boldsymbol{\mu}}^{(1)}_n + \sum_{i=2}^{I} \left( \left[\hat{\mathbf{S}}^{(i)}_n\right]^{-1} \hat{\boldsymbol{\mu}}^{(i)}_n - \left[\hat{\mathbf{S}}^{(1)}_n\right]^{-1} \hat{\boldsymbol{\mu}}^{(1)}_n \right) \right).$$
Above, each of the variational distributions q̂_i(X^(1)) is Gaussian, of the form given by Equation (6). Note that after intermediate aggregation, each training subset is used only once to make predictions, and we may use the ordinary PoE formulation in Equation (7) for prediction.

3.4 Computational Cost

Our method aims to leverage the scaling properties of sparse GP for training and those of PoE for prediction. Thus, for data partitioned into subsets of size N_i, i = 1, ..., I, and assuming that a sufficient number of parallel workers is available, the time complexity of training is O(max_i(N_i M²) · D), where M < N_i is the number of inducing points and D reflects the fact that variational updates have to be performed separately for each dimension of sparsely observed data. For prediction, the cost is O(max_i(N_i²)). For incremental learning and intermediate aggregation, N_i refers to the size of the concatenation of multiple subsets. By intermediate aggregation of submodels, we are able to trade off prediction cost against accuracy.

4 Experiments

In this section, we evaluate the predictive performance of the proposed method for out-of-matrix prediction problems on simulated and real-world chemogenomic data, and compare it with two alternative approaches: (i) the embarrassingly parallel or subset-of-data (SoD) approach, which has been widely studied to scale up Gaussian process regression models, and (ii) Macau, Bayesian multi-relational factorization with side information [Simm et al., 2015], supporting out-of-matrix prediction. Macau is implemented with highly optimized C libraries, and is available for experiments on large-scale data. The comparison with the SoD approach shows the advantage of our method in sharing information among submodels, while the comparison with Macau shows the benefit of using Bayesian non-linear matrix factorization.
We emphasize, however, that the choice and effectiveness of a model always depends on the problem at hand.

Simulated data. We generated synthetic data using non-linear signals corrupted with Gaussian noise, using the matern data generator available in the GPy software.¹ The data has three views Y = {Y_v : v = 1, 2, 3}; the dimensions of the views are as follows: N = 25,000, D_1 = 150, D_2 = 100, D_3 = 150. As the task is to perform out-of-matrix prediction, we randomly selected 20% of the rows as a test set, using the remaining rows as the training set. In addition, 80% of the data in the first view were masked as missing, to simulate the sparsely observed target data in a real-world application. The other two views were regarded as side information and were fully observed. Unlike Bayesian GP-LVM, Macau cannot handle missing values in the side data.

Real-world data. We performed the experiments on ExCAPE-DB data [Sun et al., 2017], which is an aggregation of public compound-target bioactivity data and describes interactions between drugs and targets using the pIC50² measure. It has 969,725 compounds and 806 targets with 76,384,863 observed bioactivities. The dataset has 469 chem2vec features as side information, which are generated from ECFP fingerprint features for the compounds using the word2vec software. We used 3-fold cross-validation to split the training and test set, where about 30% of the rows or compounds were chosen as test set in each fold.

4.1 Experimental Setup

The experimental setting for MRD models is: number of inducing points 100, optimization through scaled conjugate gradients (SCG) with 500 iterations. For the SoD approach, the latent variables were initialized with the PPCA method. We ran Macau with Gibbs sampling for 1200 iterations, discarded the first 800 samples as burn-in and saved every second of the remaining samples, yielding in total 200 posterior samples.
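The simulated-data protocol above (three views, a 20% out-of-matrix test split, and an 80%-masked target view) can be sketched as follows. The sizes are scaled down for illustration, and plain Gaussian noise stands in for GPy's matern generator.

```python
import numpy as np

# Scaled-down sketch of the simulated-data protocol: three views,
# held-out rows for out-of-matrix prediction, sparsified target view.
# Sizes and the data generator are illustrative substitutes.
rng = np.random.default_rng(3)
N, D1, D2, D3 = 500, 150, 100, 150
views = [rng.normal(size=(N, d)) for d in (D1, D2, D3)]

# 20% of rows held out as the out-of-matrix test set
test_rows = rng.choice(N, size=N // 5, replace=False)
train_rows = np.setdiff1d(np.arange(N), test_rows)

# 80% of the target view (view 1) masked as missing
target = views[0].copy()
target[rng.random(target.shape) < 0.8] = np.nan
sparsity = np.isnan(target).mean()
```

The side views (views 2 and 3) stay fully observed, matching the setting in which Macau can also be applied.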
We set the dimension of latent variables to K = 10 for the ExCAPE-DB data and K = 5 for the simulated data, for all methods.

For the proposed and SoD methods, we partitioned the simulated data into 10x1 subsets and the ExCAPE-DB data into 400x1 subsets. Other partitions are also possible; we have chosen the size of subsets such that Bayesian inference could be performed for the subsets in reasonable time on a single CPU. Notice that the views with missing values are generally sparsely observed in many real-world applications, which makes it challenging to learn informative models for such data. Following Qin et al. [2019], we reordered the rows and columns of the training data in descending order according to the proportion of observations in them. This makes the first subset the most densely observed block, thus making the resulting submodel informative and facilitating the parallel inferences in the following stages.

¹ https://sheffieldml.github.io/GPy/
² IC50 (units in µM) is the concentration of a drug at which 50% of the target is inhibited. The lower the IC50 of a drug, the less likely the drug is to have some off-target effect (e.g., potential toxicity) that is not desired. pIC50 = −log₁₀(IC50).

We evaluated performance by predictive accuracy and the quality of prediction for downstream ranking tasks. Root mean squared error (RMSE) is a common performance measure for matrix completion. In real-world applications, such as item recommendation or drug discovery, we are more interested in the performance of the ranking task, for instance how many of the recommended items the user actually clicks or buys, or how many drugs recommended by models actually have the desired effect for the disease. For this purpose, we regard matrix completion as a classification task (of whether a prediction is relevant or not at a given threshold), and use the F1- and AUC-ROC scores as performance metrics for ExCAPE-DB. Furthermore, following Qin et al.
[2019], we use the wall-clock time³ to measure the speed-up achieved by parallelization. For our method, the reported wall-clock time is calculated by summing the maximum wall-clock times of the submodels for each inference stage, plus the wall-clock time of making predictions.

For compound activity prediction tasks, we use a pIC50 cutoff (a.k.a. affinity level) at 5 and 6, corresponding to concentrations of 10 µM and 1 µM, respectively. The test set was further filtered by only keeping targets having at least 100 compounds, at least 25 active compounds, and 25 inactive compounds, to ensure a minimum number of positive and negative data points. Macau⁴ was run on compute nodes with 20 CPUs; all the other methods were run on a single CPU. Our implementation is based on the GPy¹ package.

³ Wall-clock time measures the real time between the start and the end of a program. For parallel processes, we use the wall-clock time of the slowest process.
⁴ We ran the Macau version available in the SMURFF software: https://github.com/ExaScience/smurff.

4.2 Results

The results for the simulated and ExCAPE-DB data are given in Tables 1 and 2, respectively. In Table 1, the column 'Full posterior' refers to the performance of MRD learned from the full data; the column 'Intermediate aggregation' refers to our method, which works by first aggregating multiple submodels into a model of reasonable size (as long as the compute node can still accommodate the model to make predictions) and then performing predictions by aggregating predictions from multiple experts with PoE.

It is clear from Table 1 that the model with the full posterior performs better than the other methods in terms of predictive performance; our intermediate aggregation method achieves competitive results while being much faster than the full posterior. The intermediate aggregation method also performs
RMSE: the smaller, the better.

| Kernel | Macau | SoD | PoE | Intermediate aggregation | Full posterior |
|---|---|---|---|---|---|
| Linear | 0.8927 ± 0.010 | 0.747 ± 0.034 | 0.685 ± 0.041 | 0.656 ± 0.038 | 0.656 ± 0.039 |
| RBF | - | 0.825 ± 0.034 | 0.791 ± 0.048 | 0.683 ± 0.038 | 0.658 ± 0.039 |
| Matern32 | - | 0.824 ± 0.032 | 0.772 ± 0.048 | 0.687 ± 0.039 | 0.656 ± 0.038 |

Spearman correlation score: the larger, the better.

| Kernel | Macau | SoD | PoE | Intermediate aggregation | Full posterior |
|---|---|---|---|---|---|
| Linear | 0.6971 ± 0.044 | 0.713 ± 0.056 | 0.726 ± 0.048 | 0.744 ± 0.044 | 0.744 ± 0.045 |
| RBF | - | 0.689 ± 0.060 | 0.651 ± 0.080 | 0.721 ± 0.048 | 0.742 ± 0.045 |
| Matern32 | - | 0.684 ± 0.064 | 0.672 ± 0.083 | 0.718 ± 0.047 | 0.744 ± 0.044 |

Wall-clock time (secs.)

| Kernel | Macau | SoD | PoE | Intermediate aggregation | Full posterior |
|---|---|---|---|---|---|
| Linear | 171.44 | 27730.748 | 55191.296 | 57055.027 | 249782.967 |
| RBF | - | 31371.620 | 53211.605 | 57306.449 | 176075.462 |
| Matern32 | - | 36757.294 | 67197.138 | 71945.822 | 106303.016 |

Table 1: Comparison of performance metrics for different methods on simulated data ('PoE' and 'Intermediate aggregation' are the proposed methods; 'Full posterior' is MRD learned from the full data). The results are averaged over 5 folds.

| Affinity level | Model | RMSE | F1-score | AUC-ROC score | Ratio of successful queries | Wall-clock time (secs.) |
|---|---|---|---|---|---|---|
| 5 | Macau | 1.108 ± 0.069 | 0.886 ± 0.003 | 0.805 ± 0.009 | 0.319 ± 0.024 | 37041.8 |
| 5 | SoD | 0.914 ± 0.023 | 0.890 ± 0.011 | 0.791 ± 0.003 | 0.363 ± 0.006 | 63110.16 |
| 5 | PoE (proposed) | 0.831 ± 0.021 | 0.900 ± 0.001 | 0.805 ± 0.002 | 0.309 ± 0.008 | 92419.06 |
| 5 | Intermediate aggregation, nAggs=10 (proposed) | 0.743 ± 0.009 | 0.919 ± 0.005 | 0.811 ± 0.006 | 0.405 ± 0.018 | 93331.23 |
| 5 | Intermediate aggregation, nAggs=20 (proposed) | 0.736 ± 0.004 | 0.914 ± 0.003 | 0.813 ± 0.004 | 0.455 ± 0.008 | 100492.9 |
| 6 | Macau | 1.123 ± 0.065 | 0.783 ± 0.013 | 0.799 ± 0.006 | 0.318 ± 0.003 | 37041.8 |
| 6 | SoD | 0.930 ± 0.021 | 0.787 ± 0.011 | 0.791 ± 0.005 | 0.381 ± 0.019 | 63110.16 |
| 6 | PoE (proposed) | 0.837 ± 0.022 | 0.846 ± 0.011 | 0.811 ± 0.003 | 0.285 ± 0.004 | 92419.06 |
| 6 | Intermediate aggregation, nAggs=10 (proposed) | 0.775 ± 0.028 | 0.851 ± 0.015 | 0.817 ± 0.003 | 0.376 ± 0.025 | 93331.23 |
| 6 | Intermediate aggregation, nAggs=20 (proposed) | 0.789 ± 0.006 | 0.838 ± 0.005 | 0.816 ± 0.004 | 0.434 ± 0.016 | 100492.9 |

Table 2: Comparison of RMSE, F1-score, AUC-ROC score and the ratio of successful queries (i.e.
queries with AUC-ROC score larger than 0.7 for the targets) for out-of-matrix prediction on ExCAPE-DB by different methods. The first three metrics are calculated only for the successful queries; the ratio is defined as #successfulQueries / #validQueries, where a query target is considered valid if it has at least 100 observed bioactivities, at least 25 active and 25 inactive compounds. The results are averaged over 3 runs.

better than the SoD approach and the variant of our method without the intermediate aggregation step. With a linear kernel, all MRD models perform better than Macau.

For the ExCAPE-DB data, our intermediate aggregation method (aggregating 10 or 20 submodels to obtain larger models for prediction) performs much better than all the other methods in all performance metrics for the different affinity levels. At both affinity levels, all versions of our proposed method perform better than the SoD method in terms of RMSE, F1-score and AUC-ROC score. Again, we observed that all MRD methods perform better than Macau in all performance metrics. In both tables, the wall-clock times of our methods are larger than that of the SoD approach. This is due to the two-stage parallel inference of the proposed scheme.

To summarise, the proposed method with an intermediate aggregation step achieves a better trade-off between predictive accuracy and computation time. The proposed method performs better than the embarrassingly parallel approaches for scalable Gaussian process models and a state-of-the-art, highly optimized implementation of linear Bayesian matrix factorization with side information.

5 Conclusion

In this paper, we have introduced a scalable approach for Bayesian non-linear matrix completion. We have argued that a key factor in constructing distributed solutions for massive-scale data is to limit the communication required between computational units.
To this end, we have introduced a computational scheme which leverages embarrassingly parallel techniques developed for Gaussian process regression, suitably adapting them for Bayesian Gaussian process latent variable models. The resulting framework is almost as communication-efficient as embarrassingly parallel computation, adding only one additional stage of communication, while achieving accuracy close to the non-distributed full-data solution.

Acknowledgements

The authors gratefully acknowledge the computational resources provided by the Aalto Science-IT project and support by the Academy of Finland (Finnish Center for Artificial Intelligence, FCAI, and projects 319264, 292334).

References

[Cao and Fleet, 2014] Yanshuai Cao and David J. Fleet. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. In Modern Nonparametrics 3: Automating the Learning Pipeline workshop at NIPS, 2014.

[Dai et al., 2014] Zhenwen Dai, Andreas Damianou, James Hensman, and Neil Lawrence. Gaussian process models with parallelization and GPU acceleration. arXiv preprint arXiv:1410.4984, 2014.

[Damianou et al., 2012] Andreas Damianou, Carl Henrik Ek, Michalis K. Titsias, and Neil D. Lawrence. Manifold relevance determination. In ICML, 2012.

[Damianou et al., 2016] Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. JMLR, 17(42):1–62, 2016.

[Deisenroth and Ng, 2015] Marc Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In ICML, pages 1481–1490, 2015.

[Gal et al., 2014] Yarin Gal, Mark van der Wilk, and Carl Edward Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In NIPS, pages 3257–3265, 2014.

[Girard et al., 2003] Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero Candela, and Roderick Murray-Smith.
Gaussian process priors with uncertain in- puts application to multiple-step ahead time series fore- casting. In NIPS , pages 545–552, 2003. [ Lawrence and Urtasun, 2009 ] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. In ICML , pages 601–608, 2009. [ Lawrence, 2005 ] Neil Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR , 6(Nov):1783–1816, 2005. [ Ng and Deisenroth, 2014 ] Jun W . Ng and Marc P . Deisen- roth. Hierarchical mixture-of-experts model for lar ge- scale Gaussian process regression. arXiv pr eprint arXiv:1412.3078 , 2014. [ Qin et al. , 2019 ] Xiangju Qin, Paul Blomstedt, Eemeli Lepp ¨ aaho, Pekka Parviainen, and Samuel Kaski. Dis- tributed Bayesian matrix factorization with limited com- munication. Machine Learning , pages 1–26, 2019. [ Qui ˜ nonero Candela and Rasmussen, 2005 ] Joaquin Qui ˜ nonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. JMLR , 6:1939–1959, 2005. [ Salakhutdinov and Mnih, 2008a ] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Marko v chain Monte Carlo. In ICML , pages 880– 887, 2008. [ Salakhutdinov and Mnih, 2008b ] Ruslan Salakhutdino v and Andriy Mnih. Probabilistic matrix factorization. In NIPS , 2008. [ Simm et al. , 2015 ] Jaak Simm, Adam Aran y , Pooya Zakeri, T om Haber , J ¨ org K. W egner , Vladimir Chupakhin, Hugo Ceulemans, and Yves Moreau. Macau: Scalable bayesian multi-relational factorization with side information using mcmc. arXiv pr eprint arXiv:1509.04610 , 2015. [ Snelson and Ghahramani, 2006 ] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS , pages 1257–1264. MIT Press, 2006. [ Sun et al. 
, 2017 ] Jiangming Sun, Nina Jeliazk ova, Vladimir Chupakhin, Jose-Felipe Golib-Dzib, Ola Engkvist, Lars Carlsson, J ¨ org W egner , Hugo Ceulemans, Ivan Georgiev , V edrin Jeliazkov , Nikolay K ochev , Thomas J. Ashby , and Hongming Chen. ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics. Journal of Cheminformatics , 9(1):17:1–17:9, 2017. [ T itsias and Lawrence, 2010 ] Michalis Titsias and Neil Lawrence. Bayesian Gaussian process latent variable model. In AIST A TS , pages 844–851, 2010. [ T resp, 2000 ] V olk er Tresp. A Bayesian committee machine. Neural Computation , 12(11):2719–2741, 2000. [ V ander Aa et al. , 2016 ] T om V ander Aa, Imen Chakroun, and T om Haber . Distributed Bayesian probabilistic ma- trix factorization. In 2016 IEEE International Confer ence on Cluster Computing , pages 346–349, 2016. [ Y ao et al. , 2011 ] Angela Y ao, Juergen Gall, Luc V Gool, and Raquel Urtasun. Learning probabilistic non-linear la- tent variable models for tracking complex activities. In NIPS , pages 1359–1367, 2011.
