Ranking via Robust Binary Classification and Parallel Parameter Estimation in Large-Scale Data
Authors: Hyokun Yun, Parameswaran Raman, S.V.N. Vishwanathan
Ranking via Robust Binary Classification

Hyokun Yun, Department of Statistics, Purdue University, West Lafayette, IN 47907, yun3@purdue.edu
Parameswaran Raman, Department of Computer Science, Purdue University, West Lafayette, IN 47907, params@purdue.edu
S. V. N. Vishwanathan, Departments of Statistics and Computer Science, Purdue University, West Lafayette, IN 47907, vishy@stat.purdue.edu

Abstract

We propose RoBiRank, a ranking algorithm that is motivated by observing a close connection between evaluation metrics for learning to rank and loss functions for robust classification. It shows competitive performance on standard benchmark datasets against a number of other representative algorithms in the literature. We also discuss extensions of RoBiRank to large-scale problems where explicit feature vectors and scores are not given. We show that RoBiRank can be efficiently parallelized across a large number of machines; for a task that requires 386,133 × 49,824,519 pairwise interactions between items to be ranked, RoBiRank finds solutions that are of dramatically higher quality than those found by a state-of-the-art competitor algorithm, given the same amount of wall-clock time for computation.

1 Introduction

Learning to rank is the problem of ordering a set of items according to their relevance to a given context [8]. While a number of approaches have been proposed in the literature, in this paper we provide a new perspective by showing a close connection between ranking and a seemingly unrelated topic in machine learning, namely robust binary classification. In robust classification [13], we are asked to learn a classifier in the presence of outliers.
Standard models for classification such as Support Vector Machines (SVMs) and logistic regression do not perform well in this setting, since the convexity of their loss functions does not let them give up their performance on any of the data points [16]; for a classification model to be robust to outliers, it has to be capable of sacrificing its performance on some of the data points. We observe that this requirement is very similar to what standard metrics for ranking try to evaluate. Discounted Cumulative Gain (DCG) [17] and its normalized version NDCG, popular metrics for learning to rank, strongly emphasize the performance of a ranking algorithm at the top of the list; therefore, a good ranking algorithm in terms of these metrics has to be able to give up its performance at the bottom of the list if that can improve its performance at the top. In fact, we will show that DCG and NDCG can indeed be written as a natural generalization of robust loss functions for binary classification. Based on this observation we formulate RoBiRank, a novel model for ranking, which maximizes a lower bound of (N)DCG. Although non-convexity seems unavoidable for the bound to be tight [9], our bound is based on the class of robust loss functions that are found to be empirically easier to optimize [10]. Indeed, our experimental results suggest that RoBiRank reliably converges to a solution that is competitive with other representative algorithms even though its objective function is non-convex.

While standard deterministic optimization algorithms such as L-BFGS [19] can be used to estimate the parameters of RoBiRank, a more efficient parameter estimation algorithm is necessary to apply the model to large-scale datasets. This is of particular interest in the context of latent collaborative retrieval [24]; unlike the standard ranking task, here the number of items to rank is very large, and explicit feature vectors and scores are not given.
Therefore, we develop an efficient parallel stochastic optimization algorithm for this problem. It has two very attractive characteristics: first, the time complexity of each stochastic update is independent of the size of the dataset; second, when the algorithm is distributed across multiple machines, no interaction between machines is required during most of the execution, so the algorithm enjoys near-linear scaling. This is a significant advantage over serial algorithms, since it is nowadays very easy to deploy a large number of machines thanks to the popularity of cloud computing services, e.g., Amazon Web Services.

We apply our algorithm to the latent collaborative retrieval task on the Million Song Dataset [3], which consists of 1,129,318 users, 386,133 songs, and 49,824,519 records; for this task, a ranking algorithm has to optimize an objective function that consists of 386,133 × 49,824,519 pairwise interactions. With the same amount of wall-clock time given to each algorithm, RoBiRank leverages parallel computing to outperform the state of the art with a 100% lift on the evaluation metric.

2 Robust Binary Classification

Suppose we are given training data which consists of $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where each $x_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i \in \{-1, +1\}$ is the label associated with it. A linear model attempts to learn a $d$-dimensional parameter $\omega$, and for a given feature vector $x$ it predicts label $+1$ if $\langle x, \omega \rangle \geq 0$ and $-1$ otherwise. Here $\langle \cdot, \cdot \rangle$ denotes the Euclidean dot product between two vectors. The quality of $\omega$ can be measured by the number of mistakes it makes:
$$L(\omega) := \sum_{i=1}^{n} I(y_i \cdot \langle x_i, \omega \rangle < 0).$$
The indicator function $I(\cdot < 0)$ is called the 0-1 loss function, because it has a value of 1 if the decision rule makes a mistake and 0 otherwise.
Unfortunately, since the 0-1 loss is a discrete function, its minimization is difficult [11]. The most popular solution to this problem in machine learning is to upper-bound the 0-1 loss by a function that is easy to optimize [2]. For example, logistic regression uses the logistic loss function $\sigma_0(t) := \log_2(1 + 2^{-t})$ to come up with a continuous and convex objective function
$$\overline{L}(\omega) := \sum_{i=1}^{n} \sigma_0(y_i \cdot \langle x_i, \omega \rangle), \quad (1)$$
which upper-bounds $L(\omega)$. It is clear that for each $i$, $\sigma_0(y_i \cdot \langle x_i, \omega \rangle)$ is a convex function in $\omega$; therefore $\overline{L}(\omega)$, a sum of convex functions, is also a convex function, which is relatively easy to optimize [6]. Support Vector Machines (SVMs), on the other hand, can be recovered by using the hinge loss to upper-bound the 0-1 loss.

However, convex upper bounds such as $\overline{L}(\omega)$ are known to be sensitive to outliers [16]. The basic intuition here is that when $y_i \cdot \langle x_i, \omega \rangle$ is a very large negative number for some data point $i$, $\sigma_0(y_i \cdot \langle x_i, \omega \rangle)$ is also very large, and therefore the optimal solution of (1) will try to decrease the loss on such outliers at the expense of its performance on "normal" data points.

In order to construct robust loss functions, consider the following two transformation functions:
$$\rho_1(t) := \log_2(t + 1), \qquad \rho_2(t) := 1 - \frac{1}{\log_2(t + 2)}, \quad (2)$$
which, in turn, can be used to define the following loss functions:
$$\sigma_1(t) := \rho_1(\sigma_0(t)), \qquad \sigma_2(t) := \rho_2(\sigma_0(t)). \quad (3)$$
One can see that $\sigma_1(t) \to \infty$ as $t \to -\infty$, but at a much slower rate than $\sigma_0(t)$ does; its derivative $\sigma_1'(t) \to 0$ as $t \to -\infty$. Therefore, $\sigma_1(\cdot)$ does not grow as rapidly as $\sigma_0(t)$ on hard-to-classify data points. Such loss functions are called Type-I robust loss functions by Ding [10], who also showed that they enjoy statistical robustness properties.
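To make the behavior of these losses concrete, the following is a small numerical sketch (ours, not code from the paper) of $\sigma_0$, $\sigma_1$, and $\sigma_2$ as defined in (2)-(3); the stabilized branch for large negative $t$ is our own implementation detail:

```python
import math

def sigma0(t):
    """Logistic loss with base-2 logarithm: sigma0(t) = log2(1 + 2^{-t})."""
    if t < 0:  # avoid overflow of 2^{-t}: log2(1 + 2^{-t}) = -t + log2(1 + 2^{t})
        return -t + math.log1p(2.0 ** t) / math.log(2.0)
    return math.log1p(2.0 ** (-t)) / math.log(2.0)

def sigma1(t):
    """Type-I robust loss: rho1 applied to the logistic loss."""
    return math.log2(sigma0(t) + 1.0)

def sigma2(t):
    """Type-II robust loss: rho2 applied to the logistic loss."""
    return 1.0 - 1.0 / math.log2(sigma0(t) + 2.0)
```

On a hard-to-classify point, say $t = -100$, $\sigma_0$ is about 100, while $\sigma_1$ is only about $\log_2(101) \approx 6.7$ and $\sigma_2$ stays below its asymptote of 1.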
$\sigma_2(t)$ behaves even better: $\sigma_2(t)$ converges to a constant as $t \to -\infty$, and therefore "gives up" on hard-to-classify data points. Such loss functions are called Type-II loss functions, and they also enjoy statistical robustness properties [10].

In terms of computation, of course, $\sigma_1(\cdot)$ and $\sigma_2(\cdot)$ are not convex, and therefore an objective function based on such loss functions is more difficult to optimize. However, Ding [10] observed that models based on optimization of Type-I functions are often empirically much more successful than those which optimize Type-II functions. Furthermore, the solutions of Type-I optimization are more stable to the choice of parameter initialization. Intuitively, this is because Type-II functions asymptote to a constant, reducing the gradient to almost zero in a large fraction of the parameter space; therefore, it is difficult for a gradient-based algorithm to determine which direction to pursue. See Ding [10] for more details.

3 Ranking Model via Robust Binary Classification

Let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ be a set of contexts, and $\mathcal{Y} = \{y_1, y_2, \ldots, y_m\}$ be a set of items to be ranked. For example, in movie recommender systems $\mathcal{X}$ is the set of users and $\mathcal{Y}$ is the set of movies. In some problem settings, only a subset of $\mathcal{Y}$ is relevant to a given context $x \in \mathcal{X}$; e.g., in document retrieval systems, only a subset of documents is relevant to a query. Therefore, we define $\mathcal{Y}_x \subset \mathcal{Y}$ to be the set of items relevant to context $x$. Observed data can be described by a set $W := \{W_{xy}\}_{x \in \mathcal{X}, y \in \mathcal{Y}_x}$, where $W_{xy}$ is a real-valued score given to item $y$ in context $x$.

We adopt a standard problem setting from the learning to rank literature. For each context $x$ and item $y \in \mathcal{Y}_x$, we aim to learn a scoring function $f(x, y) : \mathcal{X} \times \mathcal{Y}_x \to \mathbb{R}$ that induces a ranking on the item set $\mathcal{Y}_x$; the higher the score, the more important the associated item is in the given context.
To learn such a function, we first extract joint features of $x$ and $y$, which will be denoted by $\phi(x, y)$. Then, we parametrize $f(\cdot, \cdot)$ using a parameter $\omega$, which yields the linear model $f_\omega(x, y) := \langle \phi(x, y), \omega \rangle$, where, as before, $\langle \cdot, \cdot \rangle$ denotes the Euclidean dot product between two vectors. Then $\omega$ induces a ranking on the set of items $\mathcal{Y}_x$; we define $\mathrm{rank}_\omega(x, y)$ to be the rank of item $y$ in a given context $x$ induced by $\omega$. Observe that $\mathrm{rank}_\omega(x, y)$ can also be written as a sum of 0-1 loss functions (see, e.g., Usunier et al. [23]):
$$\mathrm{rank}_\omega(x, y) = \sum_{y' \in \mathcal{Y}_x,\, y' \neq y} I(f_\omega(x, y) - f_\omega(x, y') < 0). \quad (4)$$

3.1 Basic Model

If an item $y$ is very relevant in context $x$, a good parameter $\omega$ should position $y$ at the top of the list; in other words, $\mathrm{rank}_\omega(x, y)$ has to be small, which motivates the following objective function:
$$L(\omega) := \sum_{x \in \mathcal{X}} c_x \sum_{y \in \mathcal{Y}_x} v(W_{xy}) \cdot \mathrm{rank}_\omega(x, y), \quad (5)$$
where $c_x$ is a weighting factor for each context $x$, and $v(\cdot) : \mathbb{R}^+ \to \mathbb{R}^+$ quantifies the relevance level of $y$ in context $x$. Note that $\{c_x\}$ and $v(W_{xy})$ can be chosen to reflect the metric the model is going to be evaluated on (this will be discussed in Section 3.2). Note that (5) can be rewritten using (4) as a sum of indicator functions. Following the strategy in Section 2, one can form an upper bound of (5) by bounding each 0-1 loss function by a logistic loss function:
$$\overline{L}(\omega) := \sum_{x \in \mathcal{X}} c_x \sum_{y \in \mathcal{Y}_x} v(W_{xy}) \cdot \sum_{y' \in \mathcal{Y}_x,\, y' \neq y} \sigma_0(f_\omega(x, y) - f_\omega(x, y')). \quad (6)$$
Just like (1), (6) is convex in $\omega$ and hence easy to minimize.

3.2 DCG

Although (6) enjoys convexity, it may not be a good objective function for ranking. This is because in most applications of learning to rank, it is more important to do well at the top of the list than at the bottom, as users typically pay attention only to the top few items.
Therefore, it is desirable to give up performance on the lower part of the list in order to gain quality at the top. This intuition is similar to that of robust classification in Section 2; a stronger connection will be shown below. Discounted Cumulative Gain (DCG) [17] is one of the most popular metrics for ranking. For each context $x \in \mathcal{X}$, it is defined as:
$$\mathrm{DCG}(\omega) := c_x \sum_{y \in \mathcal{Y}_x} \frac{v(W_{xy})}{\log_2(\mathrm{rank}_\omega(x, y) + 2)}, \quad (7)$$
where $v(t) = 2^t - 1$ and $c_x = 1$. Since $1/\log_2(t + 2)$ decreases quickly and then asymptotes to a constant as $t$ increases, this metric emphasizes the quality of the ranking at the top of the list. Normalized DCG (NDCG) simply normalizes the metric to bound it between 0 and 1 by calculating the maximum achievable DCG value $m_x$ and dividing by it [17].

3.3 RoBiRank

Now we formulate RoBiRank, which optimizes a lower bound of ranking metrics of the form (7). Observe that $\max_\omega \mathrm{DCG}(\omega)$ can be rewritten as
$$\min_\omega \sum_{x \in \mathcal{X}} c_x \sum_{y \in \mathcal{Y}_x} v(W_{xy}) \cdot \left(1 - \frac{1}{\log_2(\mathrm{rank}_\omega(x, y) + 2)}\right). \quad (8)$$
Using (4) and the definition of the transformation function $\rho_2(\cdot)$ in (2), we can rewrite the objective function in (8) as:
$$L_2(\omega) := \sum_{x \in \mathcal{X}} c_x \sum_{y \in \mathcal{Y}_x} v(W_{xy}) \cdot \rho_2\!\left(\sum_{y' \in \mathcal{Y}_x,\, y' \neq y} I(f_\omega(x, y) - f_\omega(x, y') < 0)\right). \quad (9)$$
Since $\rho_2(\cdot)$ is a monotonically increasing function, we can bound (9) with a continuous function by bounding each indicator function using the logistic loss:
$$\overline{L}_2(\omega) := \sum_{x \in \mathcal{X}} c_x \sum_{y \in \mathcal{Y}_x} v(W_{xy}) \cdot \rho_2\!\left(\sum_{y' \in \mathcal{Y}_x,\, y' \neq y} \sigma_0(f_\omega(x, y) - f_\omega(x, y'))\right). \quad (10)$$
This is reminiscent of the basic model in (6); just as we applied the transformation function $\rho_2(\cdot)$ to the logistic loss function $\sigma_0(\cdot)$ to construct the robust loss function $\sigma_2(\cdot)$ in (3), we are again applying the same transformation to (6) to construct a loss function that respects the DCG metric used in ranking.
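As a side illustration (our own toy implementation, not part of the paper), per-context DCG (7) and its normalized version can be computed as below; we use 0-based ranks so the discount matches $\log_2(\mathrm{rank} + 2)$, and break ties by item index:

```python
import math

def dcg(scores, w, v=lambda t: 2.0 ** t - 1.0):
    """Per-context DCG as in (7), with c_x = 1: rank items by score
    (0-based), then sum relevance gains discounted by log2(rank + 2)."""
    order = sorted(range(len(scores)), key=lambda y: -scores[y])
    rank = {y: r for r, y in enumerate(order)}  # rank_w(x, y)
    return sum(v(w[y]) / math.log2(rank[y] + 2) for y in range(len(scores)))

def ndcg(scores, w):
    """NDCG: divide by the maximum achievable DCG, obtained when the
    items are ordered by their relevance scores w."""
    return dcg(scores, w) / dcg(w, w)
```

Ranking the most relevant items first yields NDCG = 1; any other ordering scores strictly less (up to ties).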
In fact, (10) can be seen as a generalization of robust binary classification, obtained by applying the transformation to a group of logistic losses instead of a single logistic loss. In both robust classification and ranking, the transformation $\rho_2(\cdot)$ enables models to give up on part of the problem to achieve better overall performance. As we discussed in Section 2, however, transforming the logistic loss with $\rho_2(\cdot)$ results in a Type-II loss function, which is very difficult to optimize. Hence, instead of $\rho_2(\cdot)$ we use the alternative transformation function $\rho_1(\cdot)$, which generates a Type-I loss function, to define the objective function of RoBiRank:
$$\overline{L}_1(\omega) := \sum_{x \in \mathcal{X}} c_x \sum_{y \in \mathcal{Y}_x} v(W_{xy}) \cdot \rho_1\!\left(\sum_{y' \in \mathcal{Y}_x,\, y' \neq y} \sigma_0(f_\omega(x, y) - f_\omega(x, y'))\right). \quad (11)$$
Since $\rho_1(t) \geq \rho_2(t)$ for every $t > 0$, we have $\overline{L}_1(\omega) \geq \overline{L}_2(\omega) \geq L_2(\omega)$ for every $\omega$. Note that $\overline{L}_1(\omega)$ is continuous and twice differentiable; therefore, standard gradient-based optimization techniques can be applied to minimize it. As is standard, a regularizer on $\omega$ can be added to avoid overfitting; for simplicity, we use the $\ell_2$-norm in our experiments.

3.4 Standard Learning to Rank Experiments

We conducted experiments to check the performance of the objective function (11) in a standard learning to rank setting, with a small number of labels to rank. We pit RoBiRank against the following algorithms: RankSVM [15], the ranking algorithm of Le and Smola [14] (called LSRank in the sequel), InfNormPush [22], IRPush [1], and 8 standard ranking algorithms implemented in RankLib, namely MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet and RandomForests. We use three sources of datasets: LETOR 3.0 [8], LETOR 4.0 and YAHOO LTRC [20], which are standard benchmarks for learning to rank algorithms. Table 1 shows their summary statistics.
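For reference, the per-context objectives evaluated here can be sketched as follows (a toy example we wrote, with $c_x = 1$; not code from the paper): the inner sums of logistic losses are shared between the convex basic model (6) and the RoBiRank objective (11), which merely passes them through $\rho_1$.

```python
import math

def sigma0(t):
    """Base-2 logistic loss."""
    return math.log2(1.0 + 2.0 ** (-t))

def pairwise_losses(scores, y):
    """Inner sum of (6)/(11): logistic upper bounds on the pairwise
    indicators that make up rank_w(x, y) in (4)."""
    return sum(sigma0(scores[y] - scores[yp])
               for yp in range(len(scores)) if yp != y)

def basic_objective(scores, relevance):
    """Convex basic model (6) for one context."""
    return sum(v * pairwise_losses(scores, y) for y, v in enumerate(relevance))

def robirank_objective(scores, relevance):
    """RoBiRank objective (11): the same inner sums passed through rho1."""
    return sum(v * math.log2(pairwise_losses(scores, y) + 1.0)
               for y, v in enumerate(relevance))
```

Both objectives prefer scoring the most relevant items highest, but $\rho_1$ compresses large inner sums, so badly ranked items are penalized less severely under (11) than under (6).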
Each dataset consists of five folds; we consider the first fold, and use the training, validation, and test splits provided. We train with different values of the regularization parameter, and select the parameter with the best NDCG value on the validation dataset. The performance of the model with this parameter on the test dataset is reported. We used an optimized implementation of the L-BFGS algorithm provided by the Toolkit for Advanced Optimization (TAO) for estimating the parameters of RoBiRank. For the other algorithms, we either implemented them using our framework or used the implementations provided by the authors.

Figure 1: Comparison of RoBiRank with a number of competing algorithms on the TD 2004 dataset (NDCG@k for k = 1 to 20). Left: RoBiRank vs. RankSVM, LSRank, InfNormPush, and IRPush. Right: RoBiRank vs. MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet, and RandomForests. Plots are split into two for ease of visualization.

We use values of NDCG at different levels of truncation as our evaluation metric [17]; see Figure 1. RoBiRank outperforms its competitors on most of the datasets; however, due to space constraints we only present plots for the TD 2004 dataset in the main body of the paper. Other plots can be found in Appendix A. The performance of RankSVM seems insensitive to the level of truncation for NDCG. On the other hand, RoBiRank, which uses a non-convex loss function to concentrate its performance at the top of the ranked list, performs much better, especially at low truncation levels. It is also interesting to note that the NDCG@k curve of LSRank is similar to that of RoBiRank, but RoBiRank consistently outperforms it at each level. RoBiRank dominates InfNormPush and IRPush at all levels. When compared to the standard algorithms in Figure 1 (right), RoBiRank again outperforms, especially at the top of the list.
Overall, RoBiRank outperforms IRPush and InfNormPush on all datasets except TD 2003 and OHSUMED, where IRPush seems to fare better at the top of the list. Compared to the 8 standard algorithms, RoBiRank again either outperforms or performs comparably to the best algorithm, except on two datasets (TD 2003 and HP 2003), where MART and RandomForests overtake RoBiRank at a few values of NDCG. We present a summary of the NDCG values obtained by each algorithm in Table 1 in the appendix. On the MSLR30K dataset, some of the additional algorithms like InfNormPush and IRPush did not complete within the time available; this is indicated by dashes in the table.

(Footnotes: RankLib is available at http://sourceforge.net/p/lemur/wiki/RankLib; LETOR 4.0 at http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx; TAO at http://www.mcs.anl.gov/research/projects/tao/index.html.)

4 Latent Collaborative Retrieval

For each context $x$ and item $y \in \mathcal{Y}$, the standard problem setting of learning to rank requires training data to contain a feature vector $\phi(x, y)$ and a score $W_{xy}$ assigned to the $(x, y)$ pair. When the number of contexts $|\mathcal{X}|$ or the number of items $|\mathcal{Y}|$ is large, it might be difficult to define $\phi(x, y)$ and measure $W_{xy}$ for all $(x, y)$ pairs. Therefore, in most learning to rank problems we define the set of relevant items $\mathcal{Y}_x \subset \mathcal{Y}$ to be much smaller than $\mathcal{Y}$ for each context $x$, and then collect data only for $\mathcal{Y}_x$. Nonetheless, this may not be realistic in all situations; in a movie recommender system, for example, every movie is somewhat relevant to each user.

On the other hand, implicit user feedback data is much more abundant. For example, many users on Netflix simply watch movie streams on the system but do not leave an explicit rating. By the action of watching a movie, however, they implicitly express their preference. Such data consist only of positive feedback, unlike traditional learning to rank datasets which have a score $W_{xy}$ for each context-item pair $(x, y)$.
Again, we may not be able to extract feature vectors for each $(x, y)$ pair. In such a situation, we can attempt to learn the score function $f(x, y)$ without a feature vector $\phi(x, y)$ by embedding each context and item in a Euclidean latent space; specifically, we redefine the score function to be
$$f(x, y) := \langle U_x, V_y \rangle,$$
where $U_x \in \mathbb{R}^d$ is the embedding of the context $x$ and $V_y \in \mathbb{R}^d$ is that of the item $y$. Then, we can learn these embeddings with a ranking model. This approach was introduced in Weston et al. [24], and was called latent collaborative retrieval.

Now we specialize the RoBiRank model for this task. Define $\Omega$ to be the set of context-item pairs $(x, y)$ observed in the dataset. Let $v(W_{xy}) = 1$ if $(x, y) \in \Omega$, and 0 otherwise; this is a natural choice since score information is not available. For simplicity, we set $c_x = 1$ for every $x$. Then RoBiRank (11) specializes to:
$$\overline{L}_1(U, V) = \sum_{(x,y) \in \Omega} \rho_1\!\left(\sum_{y' \neq y} \sigma_0(f(x, y) - f(x, y'))\right). \quad (12)$$
Note that the summation inside the parentheses of (12) is now over all items $\mathcal{Y}$ instead of the smaller set $\mathcal{Y}_x$; therefore we omit specifying the range of $y'$ from now on. To avoid overfitting, a regularizer is added to (12); for simplicity, we use the Frobenius norm of $U$ and $V$ in our experiments.

4.1 Stochastic Optimization

When the size of the data $|\Omega|$ or the number of items $|\mathcal{Y}|$ is large, however, methods that require exact evaluation of the function value and its gradient become very slow, since each evaluation takes $O(|\Omega| \cdot |\mathcal{Y}|)$ computation. In this case, stochastic optimization methods are desirable [4]; in this subsection, we develop a stochastic gradient descent algorithm whose complexity is independent of $|\Omega|$ and $|\mathcal{Y}|$. For simplicity, let $\theta$ be the concatenation of all parameters $\{U_x\}_{x \in \mathcal{X}}, \{V_y\}_{y \in \mathcal{Y}}$.
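The specialized objective (12) can be evaluated exactly as in the naive sketch below (our own illustration, not code from the paper), which makes the $O(|\Omega| \cdot |\mathcal{Y}|)$ cost motivating the stochastic algorithm explicit:

```python
import numpy as np

def score(U, V, x, y):
    """Latent score f(x, y) := <U_x, V_y>."""
    return U[x] @ V[y]

def full_objective(U, V, omega):
    """Exact evaluation of the specialized RoBiRank objective (12).
    Each observed pair requires a sum over all items, so the total cost
    is O(|Omega| * |Y|)."""
    sigma0 = lambda t: np.log2(1.0 + 2.0 ** (-t))
    total = 0.0
    for (x, y) in omega:
        margins = U[x] @ V.T  # scores of context x against every item
        t = sum(sigma0(margins[y] - margins[yp])
                for yp in range(V.shape[0]) if yp != y)
        total += np.log2(t + 1.0)  # rho1
    return total
```

Embeddings that place an observed item above the unobserved ones yield a smaller objective value than embeddings that invert the ordering.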
The gradient $\nabla_\theta \overline{L}_1(U, V)$ of (12) is
$$\sum_{(x,y) \in \Omega} \nabla_\theta\, \rho_1\!\left(\sum_{y' \neq y} \sigma_0(f(x, y) - f(x, y'))\right).$$
Finding an unbiased estimator of the gradient whose computation is independent of $|\Omega|$ is not difficult; if we sample a pair $(x, y)$ uniformly from $\Omega$, then it is easy to see that the following estimator is unbiased:
$$|\Omega| \cdot \nabla_\theta\, \rho_1\!\left(\sum_{y' \neq y} \sigma_0(f(x, y) - f(x, y'))\right). \quad (13)$$
This still involves a summation over $\mathcal{Y}$, however, so it requires $O(|\mathcal{Y}|)$ calculation. Since $\rho_1(\cdot)$ is a nonlinear function, it seems unlikely that an unbiased stochastic gradient which also randomizes over $\mathcal{Y}$ can be found; nonetheless, unbiasedness of the estimator is necessary for the convergence guarantees of stochastic gradient descent [18]. We attack this problem by linearizing the objective function through parameter expansion. Note the following property of $\rho_1(\cdot)$ [5]:
$$\rho_1(t) = \log_2(t + 1) \leq -\log_2 \xi + \frac{\xi \cdot (t + 1) - 1}{\log 2}. \quad (14)$$
This holds for any $\xi > 0$, and the bound is tight when $\xi = \frac{1}{t+1}$. Introducing an auxiliary parameter $\xi_{xy}$ for each $(x, y) \in \Omega$ and applying this bound, we obtain an upper bound of (12):
$$\overline{L}(U, V, \xi) := \sum_{(x,y) \in \Omega} \left( -\log_2 \xi_{xy} + \frac{\xi_{xy} \left( \sum_{y' \neq y} \sigma_0(f(x, y) - f(x, y')) + 1 \right) - 1}{\log 2} \right). \quad (15)$$
Now we propose an iterative algorithm in which each iteration consists of a $(U, V)$-step and a $\xi$-step; in the $(U, V)$-step we minimize (15) in $(U, V)$, and in the $\xi$-step we minimize it in $\xi$. Pseudo-code can be found in Algorithm 1 in Appendix B.

$(U, V)$-step. The partial derivative of (15) with respect to $U$ and $V$ is:
$$\nabla_{U,V} \overline{L}(U, V, \xi) = \frac{1}{\log 2} \sum_{(x,y) \in \Omega} \xi_{xy} \sum_{y' \neq y} \nabla_{U,V}\, \sigma_0(f(x, y) - f(x, y')).$$
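The bound (14) is easy to verify numerically; the following check (ours, not from the paper) confirms both the inequality for arbitrary $\xi > 0$ and tightness at $\xi = 1/(t+1)$:

```python
import math, random

random.seed(0)

rho1 = lambda t: math.log2(t + 1.0)
# Right-hand side of (14): -log2(xi) + (xi * (t + 1) - 1) / ln(2)
bound = lambda t, xi: -math.log2(xi) + (xi * (t + 1.0) - 1.0) / math.log(2.0)

for _ in range(1000):
    t = random.uniform(0.0, 100.0)
    xi = random.uniform(1e-3, 10.0)
    assert rho1(t) <= bound(t, xi) + 1e-9             # (14) holds for any xi > 0
    assert abs(rho1(t) - bound(t, 1.0 / (t + 1.0))) < 1e-9  # tight at xi = 1/(t+1)
```

The inequality is the usual tangent-line bound on the logarithm, $\ln u \leq \xi u - 1 - \ln \xi$, applied at $u = t + 1$ and rescaled to base 2, which is why equality occurs exactly at $\xi = 1/(t+1)$.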
Now it is easy to see that the following stochastic procedure gives an unbiased estimate of the above gradient:

- Sample $(x, y)$ uniformly from $\Omega$.
- Sample $y'$ uniformly from $\mathcal{Y} \setminus \{y\}$.
- Estimate the gradient by
$$|\Omega| \cdot (|\mathcal{Y}| - 1) \cdot \frac{\xi_{xy}}{\log 2} \cdot \nabla_{U,V}\, \sigma_0(f(x, y) - f(x, y')). \quad (16)$$

Therefore, a stochastic gradient descent algorithm based on (16) will converge to a local minimum of the objective function (15) with probability one [21]. Note that the time complexity of calculating (16) is independent of $|\Omega|$ and $|\mathcal{Y}|$. Also, it is a function of only $U_x$, $V_y$ and $V_{y'}$; the gradient is zero with respect to all other variables.

$\xi$-step. When $U$ and $V$ are fixed, the minimization over each $\xi_{xy}$ is independent of the others, and a simple analytic solution exists:
$$\xi_{xy} = \frac{1}{\sum_{y' \neq y} \sigma_0(f(x, y) - f(x, y')) + 1}.$$
This of course requires $O(|\mathcal{Y}|)$ work. In principle, we could avoid the summation over $\mathcal{Y}$ by taking a stochastic gradient with respect to $\xi_{xy}$ as we did for $U$ and $V$. However, since the exact solution is very simple to compute, and because most of the computation time is spent on the $(U, V)$-step rather than the $\xi$-step, we found this update rule to be efficient.

4.2 Parallelization

The linearization trick in (15) not only enables us to construct an efficient stochastic gradient algorithm, but also makes it possible to efficiently parallelize the algorithm across multiple machines. Due to lack of space, details are relegated to Appendix C.

4.3 Experiments

In this subsection we use the Million Song Dataset (MSD) [3], which consists of 1,129,318 users ($|\mathcal{X}|$), 386,133 songs ($|\mathcal{Y}|$), and 49,824,519 records ($|\Omega|$) of a user $x$ playing a song $y$ in the training dataset. The objective is to predict the songs from the test dataset that a user is going to listen to. Since explicit ratings are not given, NDCG is not applicable for this task; we use precision at 1 and 10 [17] as our evaluation metrics.
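A minimal single-machine sketch of the two steps follows (our own implementation; the sampling scheme matches (16) and the $\xi$-step formula above, while the learning rate and the stabilized loss are our choices, not the paper's):

```python
import numpy as np

LOG2 = np.log(2.0)
rng = np.random.default_rng(0)

def sigma0(t):
    """log2(1 + 2^{-t}), computed stably via logaddexp."""
    return np.logaddexp(0.0, -t * LOG2) / LOG2

def dsigma0(t):
    """Derivative of sigma0: -1 / (1 + 2^t)."""
    return -1.0 / (1.0 + 2.0 ** t)

def uv_step(U, V, xi, omega, lr):
    """One stochastic (U, V)-update using the unbiased estimator (16)."""
    n_items = V.shape[0]
    x, y = omega[rng.integers(len(omega))]  # sample (x, y) uniformly from Omega
    yp = int(rng.integers(n_items - 1))     # sample y' uniformly from Y \ {y}
    if yp >= y:
        yp += 1
    ux, vy, vyp = U[x].copy(), V[y].copy(), V[yp].copy()
    g = len(omega) * (n_items - 1) * xi[(x, y)] / LOG2 * dsigma0(ux @ (vy - vyp))
    U[x] -= lr * g * (vy - vyp)  # gradient is zero for all other rows of U, V
    V[y] -= lr * g * ux
    V[yp] += lr * g * ux

def xi_step(U, V, xi, omega):
    """Exact xi-step: xi_xy = 1 / (sum_{y' != y} sigma0(f(x,y) - f(x,y')) + 1)."""
    for (x, y) in omega:
        margins = U[x] @ V.T
        s = sum(sigma0(margins[y] - margins[yp])
                for yp in range(V.shape[0]) if yp != y)
        xi[(x, y)] = 1.0 / (s + 1.0)
```

Note that `uv_step` touches only $U_x$, $V_y$, and $V_{y'}$, which is what makes the distributed variant in Section 4.2 possible: updates to disjoint rows do not conflict.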
In our first experiment we study the scaling behavior of RoBiRank as a function of the number of machines. RoBiRank$_p$ denotes the parallel version of RoBiRank distributed across $p$ machines. In Figure 2 (left) we plot mean Precision@1 as a function of the number of machines × the number of seconds elapsed; this is a proxy for CPU time. If an algorithm scales linearly across multiple processors, then all lines in the figure should overlap with each other. As can be seen, RoBiRank exhibits near-ideal speed-up when going from 4 to 32 machines. (The graph for RoBiRank$_1$ is hard to see because it was run for only 100,000 CPU-seconds.)

In our next experiment we compare RoBiRank with a state-of-the-art algorithm from Weston et al. [24], which optimizes a similar objective function (17). We compare how fast the quality of the solution improves as a function of wall-clock time. (As a side note, the original MSD data also provides the number of times a song was played by a user, but we ignored this in our experiment.)

Figure 2: Left: the scaling behavior of RoBiRank on the Million Song Dataset (mean Precision@1 vs. number of machines × seconds elapsed, for RoBiRank$_1$, RoBiRank$_4$, RoBiRank$_{16}$, and RoBiRank$_{32}$). Center and right: performance comparison (mean Precision@1 and mean Precision@10 vs. seconds elapsed) of RoBiRank and Weston et al. [24] when the same amount of wall-clock time for computation is given to each algorithm.

Since the authors of Weston et al. [24] do not make their code available, we implemented their algorithm within our framework using the same data structures and libraries used by our method.
Furthermore, for a fair comparison, we used the same initialization for $U$ and $V$ and performed an identical grid search over the step size parameter for both algorithms. It can be seen from Figure 2 (center, right) that on a single machine the algorithm of Weston et al. [24] is very competitive and outperforms RoBiRank. The reason for this might be the introduction of the additional $\xi$ variables in RoBiRank, which slows down convergence. However, RoBiRank training can be distributed across processors, while it is not clear how to parallelize the algorithm of Weston et al. [24]. Consequently, RoBiRank$_{32}$, which uses 32 machines for its computation, can produce a significantly better model within the same wall-clock time window.

5 Related Work

In terms of modeling, viewing ranking problems as a generalization of binary classification problems is not a new idea; for example, RankSVM defines the objective function as a sum of hinge losses, similarly to our basic model (6) in Section 3.1. However, it does not directly optimize a ranking metric such as NDCG; the objective function and the metric are not immediately related to each other. In this respect, our approach is closer to that of Le and Smola [14], which constructs a convex upper bound on the ranking metric, and Chapelle et al. [9], which improves the bound by introducing non-convexity. The objective function of Chapelle et al. [9] is also motivated by the ramp loss, which is used for robust classification; nonetheless, to our knowledge the direct connection between the ranking metrics of the form (7) (DCG, NDCG) and the robust losses (3) is our novel contribution. Also, our objective function is designed to specifically bound the ranking metric, while Chapelle et al. [9] proposes a general recipe to improve existing convex bounds.

Stochastic optimization of the objective function for latent collaborative retrieval has also been explored in Weston et al. [24].
They attempt to minimize
$$\sum_{(x,y) \in \Omega} \Phi\!\left(1 + \sum_{y' \neq y} I(f(x, y) - f(x, y') < 0)\right), \quad (17)$$
where $\Phi(t) = \sum_{k=1}^{t} \frac{1}{k}$. This is similar to our objective function (15); $\Phi(\cdot)$ and $\rho_1(\cdot)$ are asymptotically equivalent, since both grow logarithmically. However, we argue that our formulation (15) has two major advantages. First, it is a continuous and differentiable function, so gradient-based algorithms such as L-BFGS and stochastic gradient descent have convergence guarantees. The objective function of Weston et al. [24], on the other hand, is not even continuous, since their formulation is based on a function $\Phi(\cdot)$ that is defined only on the natural numbers. Second, through the linearization trick in (15) we are able to obtain an unbiased stochastic gradient, which is necessary for the convergence guarantee, and to parallelize the algorithm across multiple machines as discussed in Section 4.2. It is unclear how these techniques can be adapted for the objective function of Weston et al. [24].

6 Conclusion

In this paper, we developed RoBiRank, a novel model for ranking, based on insights and techniques from the literature on robust binary classification. We then proposed a scalable and parallelizable stochastic optimization algorithm that can be applied to the task of latent collaborative retrieval, which has to handle large-scale data without feature vectors and explicit scores. Experimental results on both learning to rank datasets and a latent collaborative retrieval dataset suggest the advantage of our approach.

As a final note, the experiments in Section 4.3 are arguably unfair towards WSABIE. For instance, one could envisage using clever engineering tricks to derive a parallel variant of WSABIE (e.g., by averaging gradients from various machines), which might outperform RoBiRank on the MSD dataset. While performance on a specific dataset might be better, we would lose global convergence guarantees.
Therefore, rather than obsess over the performance of a specific algorithm on a specific dataset, with this paper we hope to draw the attention of the community to the need for developing principled parallel algorithms for this important problem.

References

[1] S. Agarwal. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In SDM, pages 839–850. SIAM, 2011.
[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[3] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
[4] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, page 351, 2011.
[5] G. Bouchard. Efficient bounds for the softmax function, applications to inference in hybrid models. 2007.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004.
[7] D. Buffoni, P. Gallinari, N. Usunier, and C. Calauzènes. Learning scoring functions with order-preserving losses and standardized supervision. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 825–832, 2011.
[8] O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. Journal of Machine Learning Research - Proceedings Track, 14:1–24, 2011.
[9] O. Chapelle, C. B. Do, C. H. Teo, Q. V. Le, and A. J. Smola. Tighter bounds for structured estimation. In Advances in Neural Information Processing Systems, pages 281–288, 2008.
[10] N. Ding. Statistical Machine Learning in T-Exponential Family of Distributions. PhD thesis, Purdue University, West Lafayette, Indiana, USA, 2013.
[11] V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu.
Agnostic learning of monomials by halfspaces is hard. SIAM J ournal on Computing , 41(6):1558–1590, 2012. [12] R. Gemulla, E. Nijkamp, P . J. Haas, and Y . Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Conference on Knowledge Discovery and Data Mining , pages 69–77, 2011. [13] P . J. Huber . Robust Statistics . John Wile y and Sons, Ne w Y ork, 1981. [14] Q. V . Le and A. J. Smola. Direct optimization of ranking measures. T echnical Report 0704.3359, arXiv , April 2007. http://arxiv.org/abs/0704.3359 . [15] C.-P . Lee and C.-J. Lin. Lar ge-scale linear ranksvm. Neural Computation , 2013. T o Appear . [16] P . Long and R. Servedio. Random classification noise defeats all conv ex potential boosters. Machine Learning J ournal , 78(3):287–304, 2010. [17] C. D. Manning, P . Raghav an, and H. Sch ¨ utze. Intr oduction to Information Retrieval . Cam- bridge Univ ersity Press, 2008. URL http://nlp.stanford.edu/IR- book/ . 9 [18] A. Nemirovski, A. Juditsky , G. Lan, and A. Shapiro. Rob ust stochastic approximation approach to stochastic programming. SIAM J ournal on Optimization , 19(4):1574–1609, 2009. [19] J. Nocedal and S. J. Wright. Numerical Optimization . Springer Series in Operations Research. Springer , 2nd edition, 2006. [20] T . Qin, T .-Y . Liu, J. Xu, and H. Li. Letor: A benchmark collection for research on learning to rank for information retriev al. Information Retrieval , 13(4):346–374, 2010. [21] H. E. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics , 22:400–407, 1951. [22] C. Rudin. The p-norm push: A simple conv ex ranking algorithm that concentrates at the top of the list. The J ournal of Machine Learning Resear c h , 10:2233–2271, 2009. [23] N. Usunier , D. Buffoni, and P . Gallinari. Ranking with ordered weighted pairwise classifica- tion. In Pr oceedings of the International Confer ence on Machine Learning , 2009. [24] J. W eston, C. W ang, R. W eiss, and A. 
Berenzweig. Latent collaborative retriev al. arXiv pr eprint arXiv:1206.4603 , 2012. 10 A Additional Results f or Learning to Rank Experiments In appendix A, we present results from additional experiments that could not be accommodated in the main paper due to space constraints. Figure 3 shows how RoBiRank fares against InfNorm- Push and IRPush on various datasets we used. Figure 4 shows a similar comparison against the 8 algorithms present in RankLib. T able 1 provides descriptiv e statistics of all the datasets we ran our experiments, Ov erall NDCG v alues obtained and v alues of the corresponding regularization param- eters. Overall NDCG values hav e been omitted for the RankLib algorithms as the library doesn’t support its calculation directly . A.1 Sensitivity to Initialization W e also inv estigated the sensitivity of parameter estimation to the choice of initial parameter . W e initialized ω randomly with 10 different seed v alues. Blue lines in Figure 6 sho w mean and standard deviation of NDCG values at dif ferent lev els of truncation; as can be seen, e v en though our objecti ve function is non-con vex, L-BFGS reliably con v erges to solutions with similar test performance. This conclusion is in line with the obs ervation of Ding [10]. W e also tried two more variants; initialization by all-zeroes (red line) and the solution of RankSVM (black line). In most cases it did not affect the quality of solution, but on TD 2003 and HP 2004 datasets, zero initialization gav e slightly better results. A.2 Comparison with other baselines W e also compared RoBiRank against other baselines, namely - Identity Loss (obtained by replacing ρ 1 by the identity result in the con ve x loss of Buffoni et al. [7]). W e sho w the results of these experiments on small-medium LET OR datasets and on a lar ge dataset (million song dataset) in T able 2 and Figure 5. As can be seen, RoBiRank comprehensiv ely outperforms these baselines. 
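The multi-seed protocol of Section A.1 can be mimicked on a toy problem. The sketch below is purely illustrative: the quartic objective, step size, and seeds are hypothetical stand-ins, not the paper's ranking objective. Gradient descent from ten random starts on a non-convex function reaches solutions of essentially identical quality, mirroring the behavior we observe with L-BFGS.

```python
import random

def loss(w):
    # Toy non-convex objective with two symmetric minima at w = +1 and w = -1.
    return (w * w - 1.0) ** 2

def grad(w):
    return 4.0 * w * (w * w - 1.0)

def fit(seed, steps=2000, eta=0.05):
    # Plain gradient descent from a seeded random initialization.
    rng = random.Random(seed)
    w = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        w -= eta * grad(w)
    return w

# Ten different random initializations, as in Section A.1.
finals = [fit(seed) for seed in range(10)]
losses = [loss(w) for w in finals]
# The recovered parameters differ (w = +1 or w = -1 depending on the seed),
# but every run reaches a solution of essentially the same quality.
print(losses)
```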
B Pseudocode of the Serial Algorithm

Algorithm 1 Serial parameter estimation algorithm for latent collaborative retrieval
  η: step size
  repeat
    // (U, V)-step
    repeat
      Sample (x, y) uniformly from Ω
      Sample y' uniformly from Y \ {y}
      U_x ← U_x − η · ξ_xy · ∇_{U_x} σ₀(f(U_x, V_y) − f(U_x, V_{y'}))
      V_y ← V_y − η · ξ_xy · ∇_{V_y} σ₀(f(U_x, V_y) − f(U_x, V_{y'}))
    until convergence in U, V
    // ξ-step
    for (x, y) ∈ Ω do
      ξ_xy ← 1 / ( Σ_{y' ≠ y} σ₀(f(U_x, V_y) − f(U_x, V_{y'})) + 1 )
    end for
  until convergence in U, V and ξ

C Description of the Parallel Algorithm

Suppose there are p machines. The set of contexts X is randomly partitioned into mutually exclusive and exhaustive subsets X^(1), X^(2), …, X^(p), which are of approximately the same size. This partitioning is fixed and does not change over time. The partition of X induces partitions on the other variables as follows: U^(q) := {U_x : x ∈ X^(q)}, Ω^(q) := {(x, y) ∈ Ω : x ∈ X^(q)}, ξ^(q) := {ξ_xy : (x, y) ∈ Ω^(q)}, for 1 ≤ q ≤ p.

Table 1: Descriptive statistics of datasets and experimental results in Section 3.4. Each results cell lists values in the order RoBiRank / RankSVM / LSRank / InfNormPush / IRPush.

Name       |X|      avg |Y_x|   Mean NDCG                                    Regularization parameter
TD 2003    50       981         0.9719 / 0.9219 / 0.9721 / 0.9514 / 0.9685   10^-5 / 10^-3 / 10^-1 / 1 / 10^-4
TD 2004    75       989         0.9708 / 0.9084 / 0.9648 / 0.9521 / 0.9601   10^-6 / 10^-1 / 10^4 / 10^-2 / 10^-4
Yahoo! 1   29,921   24          0.8921 / 0.7960 / 0.871 / 0.8692 / 0.8892    10^-9 / 10^3 / 10^4 / 10 / 10^-9
Yahoo! 2   6,330    27          0.9067 / 0.8126 / 0.8624 / 0.8826 / 0.9068   10^-9 / 10^5 / 10^4 / 10 / 10^-7
HP 2003    150      984         0.9960 / 0.9927 / 0.9981 / 0.9832 / 0.9939   10^-3 / 10^-1 / 10^-4 / 1 / 10^-2
HP 2004    75       992         0.9967 / 0.9918 / 0.9946 / 0.9863 / 0.9949   10^-4 / 10^-1 / 10^2 / 10^-2 / 10^-2
OHSUMED    106      169         0.8229 / 0.6626 / 0.8184 / 0.7949 / 0.8417   10^-3 / 10^-5 / 10^4 / 1 / 10^-3
MSLR30K    31,531   120         0.7812 / 0.5841 / 0.727 / - / -              1 / 10^3 / 10^4 / - / -
MQ 2007    1,692    41          0.8903 / 0.7950 / 0.8688 / 0.8717 / 0.8810   10^-9 / 10^-3 / 10^4 / 10 / 10^-6
MQ 2008    784      19          0.9221 / 0.8703 / 0.9133 / 0.8929 / 0.9052   10^-5 / 10^3 / 10^4 / 10 / 10^-5

Figure 3: Comparison of RoBiRank, RankSVM, LSRank [14], InfNormPush and IRPush. [Plot data omitted: NDCG@k curves for k = 1–20 on TD 2003, TD 2004, Yahoo! Learning to Rank 1 and 2, HP 2003, HP 2004, OHSUMED, MSLR30K, MQ 2007 and MQ 2008.]

Figure 4: Comparison of RoBiRank, MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet and RandomForests. [Plot data omitted: NDCG@k curves for k = 1–20 on TD 2003, TD 2004, Yahoo! Learning to Rank 1 and 2, HP 2003, HP 2004, OHSUMED, MQ 2007 and MQ 2008.]

Table 2: Comparison of RoBiRank against the Identity Loss as described in Section A.2. We report overall NDCG for experiments on small-to-medium datasets, while on the million song dataset (MSD) we report Precision@1.

Name      RoBiRank   Identity Loss
TD 2003   0.9719     0.9575
TD 2004   0.9708     0.9456
HP 2003   0.9960     0.9855
HP 2004   0.9967     0.9841
MQ 2007   0.8903     0.7973
MQ 2008   0.9221     0.8039
MSD       29%        17%

Figure 5: Comparison of RoBiRank with the Identity Loss baseline; see Section A.2. [Plot data omitted: NDCG@k curves for k = 1–20 on TD 2003, TD 2004, HP 2003, HP 2004, MQ 2007 and MQ 2008.]

Figure 6: Performance of RoBiRank based on different initialization methods (random, all-zero, and RankSVM initialization). [Plot data omitted: NDCG@k curves for k = 1–20 on TD 2003, TD 2004, Yahoo! Learning to Rank 1 and 2, HP 2003, HP 2004, OHSUMED, MSLR30K, MQ 2007 and MQ 2008.]

Each machine q stores the variables U^(q), ξ^(q) and Ω^(q). Since the partition of X is fixed, these variables are local to each machine and are never communicated. We now describe how each step of the algorithm is parallelized; pseudocode can be found in Algorithm 2.

Algorithm 2 Multi-machine parameter estimation algorithm for latent collaborative retrieval
  η: step size
  repeat
    // parallel (U, V)-step
    repeat
      Sample a partition Y^(1), Y^(2), …, Y^(p)
      for each machine q ∈ {1, 2, …, p} in parallel do
        Fetch all V_y ∈ V^(q)
        repeat
          Sample (x, y) uniformly from {(x, y) ∈ Ω^(q) : y ∈ Y^(q)}
          Sample y' uniformly from Y^(q) \ {y}
          U_x ← U_x − η · ξ_xy · ∇_{U_x} σ₀(f(U_x, V_y) − f(U_x, V_{y'}))
          V_y ← V_y − η · ξ_xy · ∇_{V_y} σ₀(f(U_x, V_y) − f(U_x, V_{y'}))
        until predefined time limit is exceeded
      end for
    until convergence in U, V
    // parallel ξ-step
    for each machine q ∈ {1, 2, …, p} in parallel do
      Fetch all V_y ∈ V
      for (x, y) ∈ Ω^(q) do
        ξ_xy ← 1 / ( Σ_{y' ≠ y} σ₀(f(U_x, V_y) − f(U_x, V_{y'})) + 1 )
      end for
    end for
  until convergence in U, V and ξ

(U, V)-step: At the start of each (U, V)-step, a new partition of Y is sampled, dividing Y into Y^(1), Y^(2), …, Y^(p), which are also mutually exclusive, exhaustive and of approximately the same size. The difference here is that, unlike the partition of X, a new partition of Y is sampled at every (U, V)-step. Let us define V^(q) := {V_y : y ∈ Y^(q)}.
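The partitioning bookkeeping just described can be sketched in a few lines. Everything concrete here (the set sizes, the round-robin splitter, the toy Ω) is a hypothetical illustration, not the paper's implementation:

```python
import random

def partition(items, p, rng):
    """Randomly split `items` into p mutually exclusive, exhaustive,
    roughly equal-sized subsets (round-robin over a shuffled copy)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[q::p] for q in range(p)]

rng = random.Random(0)
p = 4
X = list(range(20))   # contexts (toy size)
Y = list(range(12))   # items (toy size)
Omega = [(x, rng.randrange(len(Y))) for x in X]  # toy observed pairs

# The partition of X is sampled once and then kept fixed.
X_parts = partition(X, p, rng)
# It induces a fixed partition of Omega (and hence of U and xi):
Omega_parts = [[(x, y) for (x, y) in Omega if x in set(X_parts[q])]
               for q in range(p)]

# By contrast, a *fresh* partition of Y is drawn at every (U, V)-step:
Y_parts = partition(Y, p, rng)

# Machine q may only touch pairs whose item also fell into its Y block.
local = [[(x, y) for (x, y) in Omega_parts[q] if y in set(Y_parts[q])]
         for q in range(p)]
print([len(block) for block in local])
```

Because the X blocks (and the Y blocks within one (U, V)-step) are disjoint, the machines never contend for the same parameters.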
After the partition of Y is sampled, each machine q fetches the V_y's in V^(q) from wherever they were previously stored; in the very first iteration, in which no previous information exists, each machine instead generates and initializes these parameters. Now let us define

L^{(q)}(U^{(q)}, V^{(q)}, \xi^{(q)}) := \sum_{(x,y) \in \Omega^{(q)},\, y \in \mathcal{Y}^{(q)}} \left[ -\log_2 \xi_{xy} + \frac{\xi_{xy} \left( \sum_{y' \in \mathcal{Y}^{(q)},\, y' \neq y} \sigma_0\big(f(U_x, V_y) - f(U_x, V_{y'})\big) + 1 \right) - 1}{\log 2} \right].

In the parallel setting, each machine q runs stochastic gradient descent on L^(q)(U^(q), V^(q), ξ^(q)) instead of on the original function L(U, V, ξ). Since machines overlap neither in the parameters they update nor in the data they access, every machine can progress independently of the others. Although the algorithm takes only a fraction of the data into consideration at a time, this procedure is still guaranteed to converge to a local optimum of the original function L(U, V, ξ), by the stratified stochastic gradient descent (SSGD) scheme of Gemulla et al. [12]. The intuition is as follows: if we take the expectation over the random partition of Y, we have

\nabla_{U,V} L(U, V, \xi) = p^2 \cdot \mathbb{E}\left[ \sum_{1 \leq q \leq p} \nabla_{U,V} L^{(q)}(U^{(q)}, V^{(q)}, \xi^{(q)}) \right].

Therefore, although there is some discrepancy between the function on which we take stochastic gradients and the function we actually aim to minimize, in the long run the bias is washed out and the algorithm converges to a local optimum of the objective function L(U, V, ξ).

ξ-step: In this step, all machines synchronize to retrieve every entry of V. Then each machine can update ξ^(q) independently of the others. When V is very large and cannot fit into the main memory of a single machine, V can be partitioned as in the (U, V)-step and the updates can be calculated in round-robin fashion.
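The ξ-step is an embarrassingly parallel sweep over the locally stored pairs. The sketch below illustrates the update ξ_xy ← 1/(Σ_{y'≠y} σ₀(f(U_x,V_y) − f(U_x,V_{y'})) + 1) on a tiny instance. Note that the exact form of σ₀ is defined in the main paper and is not reproduced in this excerpt; the logistic-type σ₀(t) = log₂(1 + 2^(−t)) used here is an assumed stand-in, and all sizes are hypothetical.

```python
import math

def sigma0(t):
    # Smooth, nonnegative surrogate for the 0/1 loss. The exact sigma_0 is
    # defined in the main paper; this logistic-type form is an assumption.
    return math.log2(1.0 + 2.0 ** (-t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def xi_step(U, V, Omega):
    """xi_xy <- 1 / (sum_{y' != y} sigma0(f(U_x,V_y) - f(U_x,V_y')) + 1),
    with the score f taken as an inner product of latent factors."""
    xi = {}
    for (x, y) in Omega:
        s = sum(sigma0(dot(U[x], V[y]) - dot(U[x], V[yp]))
                for yp in range(len(V)) if yp != y)
        xi[(x, y)] = 1.0 / (s + 1.0)
    return xi

# Tiny hypothetical instance: 2 contexts, 3 items, 2-dimensional factors.
U = [[0.5, -0.2], [0.1, 0.3]]
V = [[0.4, 0.1], [-0.3, 0.2], [0.0, -0.1]]
Omega = [(0, 0), (1, 2)]
xi = xi_step(U, V, Omega)
print(xi)
```

Since σ₀ is nonnegative, every ξ_xy lands in (0, 1], which matches its role as a linearization variable in the objective.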
Note that this parallelization scheme requires each machine to allocate only a 1/p fraction of the memory that would be required for a single-machine execution. Therefore, in terms of space complexity, the algorithm scales linearly with the number of machines.
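A back-of-the-envelope calculation makes the 1/p space claim concrete. All numbers below (context count, pair count, dimensionality) are hypothetical illustrations; only the bookkeeping mirrors the scheme above:

```python
def per_machine_bytes(num_contexts, num_pairs, dim, p, bytes_per_float=8):
    """Memory for the machine-local variables U^(q) and xi^(q).

    Each machine owns |X|/p rows of U (dim floats each) and one xi value per
    locally stored (x, y) pair; V is fetched in blocks and not counted here.
    """
    u_bytes = (num_contexts // p) * dim * bytes_per_float
    xi_bytes = (num_pairs // p) * bytes_per_float
    return u_bytes + xi_bytes

# Hypothetical problem size: 400k contexts, 50M observed pairs, d = 100.
single = per_machine_bytes(400_000, 50_000_000, 100, p=1)
sixteen = per_machine_bytes(400_000, 50_000_000, 100, p=16)
# The local footprint shrinks by roughly the number of machines.
print(single / sixteen)
```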