Representation Learning and Pairwise Ranking for Implicit Feedback in Recommendation Systems

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

Sumit Sidana, Mikhail Trofimov, Oleg Horodnitskii, Charlotte Laclau, Yury Maximov, Massih-Reza Amini

Abstract—In this paper, we propose a novel ranking framework for collaborative filtering with the overall aim of learning user preferences over items by minimizing a pairwise ranking loss. We show that the minimization problem involves dependent random variables and provide a theoretical analysis by proving the consistency of the empirical risk minimization in the worst case where all users choose a minimal number of positive and negative items. We further derive a Neural-Network model that jointly learns a new representation of users and items in an embedded space, as well as the preference relation of users over pairs of items. The learning objective is based on three scenarios of ranking losses that control the ability of the model to maintain the ordering over the items induced from the users' preferences, as well as the capacity of the dot-product defined in the learned embedded space to produce that ordering. The proposed model is by nature suitable for implicit feedback and involves the estimation of only very few parameters. Through extensive experiments on several real-world benchmarks on implicit data, we show the interest of learning the preference and the embedding simultaneously when compared to learning them separately. We also demonstrate that our approach is very competitive with the best state-of-the-art collaborative filtering techniques proposed for implicit feedback.
Index Terms—Recommender Systems; Learning-to-Rank; Neural Networks; Collaborative Filtering

1 INTRODUCTION

In recent years, recommender systems (RS) have attracted a lot of interest in both industry and academic research communities, mainly due to the new challenges that the design of a decisive and efficient RS presents. Given a set of customers (or users), the goal of an RS is to provide a personalized recommendation of products that are likely to be of interest to them. Common examples of applications include the recommendation of movies (Netflix, Amazon Prime Video), music (Pandora), videos (YouTube), news content (Outbrain) or advertisements (Google). The development of an efficient RS is critical from both the company and the consumer perspective. On one hand, users usually face a very large number of options: for instance, Amazon proposes over 20,000 movies in its selection, and it is therefore important to help them take the best possible decision by narrowing down the choices they have to make. On the other hand, major companies report a significant increase in traffic and sales coming from personalized recommendations: Amazon declares that 35% of its sales are generated by recommendations, two-thirds of the movies watched on Netflix are recommended, and 28% of ChoiceStream users said that they would buy more music, provided that the recommendations match their tastes and interests.¹

• Univ. Grenoble Alpes/CNRS. E-mail: {fname.lname}@univ-grenoble-alpes.fr
• Federal Research Center "Computer Science and Control" of Russian Academy of Sciences. E-mail: mikhail.trofimov@phystech.edu
• Center for Energy Systems, Skolkovo Institute of Science and Technology. E-mail: Oleg.Gorodnitskii@skoltech.ru
• T-4 and CNLS, Los Alamos National Laboratory, and Center for Energy Systems, Skolkovo Institute of Science and Technology. E-mail: yury@lanl.gov

Manuscript received April 19, 2005; revised August 26, 2015.
Two main approaches have been proposed to tackle this problem [1]. The first one, referred to as the Content-Based recommendation technique [2], makes use of existing contextual information about the users (e.g. demographic information) or items (e.g. textual description) for recommendation. The second approach, referred to as collaborative filtering (CF) and undoubtedly the most popular one, relies on past interactions and recommends items to users based on the feedback provided by other, similar users. Feedback can be explicit, in the form of ratings, or implicit, which includes clicks, browsing over an item or listening to a song. Such implicit feedback is readily available in abundance but is more challenging to take into account, as it does not clearly depict the preference of a user for an item. Explicit feedback, on the other hand, is very hard to obtain in abundance. The adaptation of CF systems designed for another type of feedback has been shown to be sub-optimal, as the basic hypothesis of these systems inherently depends on the nature of the feedback [3]. Further, learning a suitable representation of users and items has been shown to be the bottleneck of these systems [4], mostly in cases where contextual information over users and items, which would allow a richer representation, is unavailable. In this paper we are interested in learning user preferences, mostly provided in the form of implicit feedback, in RS. Our aim is twofold and concerns:

1. Talk by Xavier Amatriain, "Recommender Systems", Machine Learning Summer School 2014 @ CMU.
1) the development of a theoretical framework for learning user preferences in recommender systems, and its analysis in the worst case where all users provide a minimum of positive/negative feedback;
2) the design of a new neural-network model based on this framework that simultaneously learns the preference of users over pairs of items and their representations in an embedded space, without requiring any contextual information.

We extensively validate our proposed approach over standard benchmarks with implicit feedback by comparing it to state-of-the-art models. The remainder of this paper is organized as follows. In Section 2, we define the notation and the proposed framework, and analyze its theoretical properties. Then, Section 3 provides an overview of existing related methods. Section 4 is devoted to numerical experiments on four real-world benchmark data sets, including binarized versions of MovieLens and Netflix, and one real data set on online advertising. We compare different versions of our model with state-of-the-art methods, showing the appropriateness of our contribution. Finally, we summarize the study and give possible future research perspectives in Section 5.

2 USER PREFERENCE AND EMBEDDING LEARNING WITH NEURAL NETS

We denote by $\mathcal{U} \subseteq \mathbb{N}$ (resp. $\mathcal{I} \subseteq \mathbb{N}$) the set of indices over users (resp. over items). Further, for each user $u \in \mathcal{U}$, we consider two subsets of items $\mathcal{I}^-_u \subset \mathcal{I}$ and $\mathcal{I}^+_u \subset \mathcal{I}$ such that: i) $\mathcal{I}^-_u \neq \emptyset$ and $\mathcal{I}^+_u \neq \emptyset$; ii) for any pair of items $(i, i') \in \mathcal{I}^+_u \times \mathcal{I}^-_u$, $u$ has a preference, symbolized by $\succ_u$. Hence $i \succ_u i'$ means that user $u$ prefers item $i$ over item $i'$. From this preference relation, a desired output $y_{i,u,i'} \in \{-1, +1\}$ is defined over each triplet $(i, u, i') \in \mathcal{I}^+_u \times \mathcal{U} \times \mathcal{I}^-_u$ as:

$$y_{i,u,i'} = \begin{cases} +1 & \text{if } i \succ_u i', \\ -1 & \text{otherwise.} \end{cases} \quad (1)$$
2.1 Learning objective

The learning task we address is to find a scoring function $f$ from the class of functions $\mathcal{F} = \{f \mid f : \mathcal{I} \times \mathcal{U} \times \mathcal{I} \to \mathbb{R}\}$ that minimizes the ranking loss:

$$\mathcal{L}(f) = \mathbb{E}\left[\frac{1}{|\mathcal{I}^+_u||\mathcal{I}^-_u|} \sum_{i \in \mathcal{I}^+_u} \sum_{i' \in \mathcal{I}^-_u} \mathbb{1}_{y_{i,u,i'} f(i,u,i') < 0}\right], \quad (2)$$

where $|\cdot|$ measures the cardinality of sets and $\mathbb{1}_\pi$ is the indicator function, equal to 1 if the predicate $\pi$ is true and 0 otherwise. Here we suppose that there exists a mapping function $\Phi : \mathcal{U} \times \mathcal{I} \to \mathcal{X} \subseteq \mathbb{R}^k$ that projects a pair of user and item indices into a feature space of dimension $k$, and a function $g : \mathcal{X} \to \mathbb{R}$, such that each function $f \in \mathcal{F}$ can be decomposed as:

$$\forall u \in \mathcal{U}, (i, i') \in \mathcal{I}^+_u \times \mathcal{I}^-_u, \quad f(i, u, i') = g(\Phi(u, i)) - g(\Phi(u, i')). \quad (3)$$

In the next section we present a Neural-Network model that learns the mapping function $\Phi$ and outputs the function $f$ based on a non-linear transformation of the user-item feature representation, defining the function $g$. The loss (2) is a pairwise ranking loss and is related to the Area under the ROC Curve [5]. The learning objective is hence to find a function $f$ from the class of functions $\mathcal{F}$ with a small expected risk, by minimizing the empirical error over a training set $S = \{(z_{i,u,i'} \doteq (i, u, i'), y_{i,u,i'}) \mid u \in \mathcal{U}, (i, i') \in \mathcal{I}^+_u \times \mathcal{I}^-_u\}$, constituted over $N$ users, $\mathcal{U} = \{1, \ldots, N\}$, and their respective preferences over $M$ items, $\mathcal{I} = \{1, \ldots, M\}$:

$$\hat{\mathcal{L}}(f, S) = \frac{1}{N} \sum_{u \in \mathcal{U}} \frac{1}{|\mathcal{I}^+_u||\mathcal{I}^-_u|} \sum_{i \in \mathcal{I}^+_u} \sum_{i' \in \mathcal{I}^-_u} \mathbb{1}_{y_{i,u,i'}\left(g(\Phi(u,i)) - g(\Phi(u,i'))\right) < 0}. \quad (4)$$

However, this minimization problem involves dependent random variables: for each user $u$ and item $i$, all comparisons $g(\Phi(u, i)) - g(\Phi(u, i'))$, $i' \in \mathcal{I}^-_u$, involved in the empirical error (4) share the same observation $\Phi(u, i)$.
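A direct way to read Eq. (4) is as the fraction of misordered (preferred, non-preferred) pairs, averaged per user and then over users. The sketch below computes this empirical loss given precomputed scores playing the role of $g(\Phi(u, i))$; the scoring function itself and the data layout are illustrative assumptions, not the paper's implementation:

```python
def empirical_ranking_loss(scores, pos_items, neg_items):
    """Empirical pairwise ranking loss of Eq. (4): the fraction of
    misordered (preferred, non-preferred) pairs per user, averaged
    over users.  scores[u][i] plays the role of g(Phi(u, i))."""
    per_user = []
    for u, positives in pos_items.items():
        errors = sum(
            1
            for i in positives
            for ip in neg_items[u]
            if scores[u][i] - scores[u][ip] < 0  # indicator 1_{y f(i,u,i') < 0}
        )
        per_user.append(errors / (len(positives) * len(neg_items[u])))
    return sum(per_user) / len(per_user)

# One user who prefers item 0 over items 1 and 2; the scores rank
# item 0 highest, so no pair is misordered and the loss is zero.
print(empirical_ranking_loss({0: {0: 2.0, 1: 0.5, 2: 1.0}}, {0: [0]}, {0: [1, 2]}))  # 0.0
```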
Different studies have proposed generalization error bounds for learning with interdependent data [6]. Among the prominent works that address this problem is a series of contributions based on the idea of graph coloring, introduced in [7], which consists in dividing a graph $\Omega = (\mathcal{V}, \mathcal{E})$ linking dependent variables, represented by its nodes $\mathcal{V}$, into $J$ sets of independent variables, called the exact proper fractional cover of $\Omega$ and defined as:

Definition 1 (Exact proper fractional cover of $\Omega$, [7]). Let $\Omega = (\mathcal{V}, \mathcal{E})$ be a graph. $\mathcal{C} = \{(\mathcal{M}_j, \omega_j)\}_{j \in \{1, \ldots, J\}}$, for some positive integer $J$, with $\mathcal{M}_j \subseteq \mathcal{V}$ and $\omega_j \in [0, 1]$, is an exact proper fractional cover of $\Omega$ if: i) it is proper: $\forall j$, $\mathcal{M}_j$ is an independent set, i.e., there are no connections between vertices in $\mathcal{M}_j$; ii) it is an exact fractional cover of $\Omega$: $\forall v \in \mathcal{V}, \sum_{j : v \in \mathcal{M}_j} \omega_j = 1$.

The weight $W(\mathcal{C})$ of $\mathcal{C}$ is given by $W(\mathcal{C}) \doteq \sum_{j=1}^{J} \omega_j$, and the minimum weight $\chi^*(\Omega) = \min_{\mathcal{C} \in \mathcal{K}(\Omega)} W(\mathcal{C})$ over the set $\mathcal{K}(\Omega)$ of all exact proper fractional covers of $\Omega$ is the fractional chromatic number of $\Omega$.

Figure 1 depicts an exact proper fractional cover corresponding to the problem we consider, for a toy problem with one user $u$ and $|\mathcal{I}^+_u| = 2$ items preferred over $|\mathcal{I}^-_u| = 3$ other ones. In this case, the nodes of the dependency graph correspond to the 6 pairs formed by pairing the user with each of the preferred items and with each of the non-preferred items involved in the empirical loss (4). Among all the sets containing independent pairs of examples, the one shown in Figure 1(c) is the exact proper fractional cover of $\Omega$, and the fractional chromatic number is in this case $\chi^*(\Omega) = |\mathcal{I}^-_u| = 3$. By mixing the idea of graph coloring with the Laplace transform, Hoeffding-like concentration inequalities for the sum of dependent random variables were proposed in [7].
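For a single user, one valid exact proper fractional cover consistent with Figure 1(c) can be built by "rotating" the negative items: set $\mathcal{M}_j$ pairs the $k$-th positive item with the $(k+j)$-th negative item modulo $n^-$, so no two pairs inside a set share an item. The construction below is an illustrative sketch, not code from the paper:

```python
def fractional_cover(pos, neg):
    """Partition all (positive, negative) item pairs of one user into
    len(neg) sets; inside each set no two pairs share an item, so the
    corresponding loss terms are independent.  With unit weights this is
    an exact proper fractional cover of weight len(neg), assuming
    len(pos) <= len(neg)."""
    n = len(neg)
    return [
        [(p, neg[(k + j) % n]) for k, p in enumerate(pos)]
        for j in range(n)
    ]

# Toy case of Figure 1: 2 preferred items, 3 non-preferred ones,
# giving 3 independent sets of 2 pairs each.
for M in fractional_cover(["i1+", "i2+"], ["i1-", "i2-", "i3-"]):
    print(M)
```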
[Figure 1: A toy problem with 1 user who prefers $|\mathcal{I}^+_u| = 2$ items over $|\mathcal{I}^-_u| = 3$ other ones (top). The dyadic representation of pairs constituted of the representation of the user with each of the representations of preferred and non-preferred items (middle). Different coverings of the dependent set, (a) and (b), as well as the exact proper fractional cover, (c), corresponding to the smallest disjoint sets containing independent pairs.]

In [8] this result is extended to provide a generalization of the bounded differences inequality of [9] to the case of interdependent random variables. This extension then paved the way for the definition of the fractional Rademacher complexity, which generalizes the idea of Rademacher complexity and allows one to derive generalization bounds for scenarios where the training data are made of dependent observations. In the worst-case scenario, where all users provide the fewest interactions over the items, which constitutes the bottleneck of all recommendation systems:

$$\forall u \in S, \quad |\mathcal{I}^-_u| = n^-_* = \min_{u' \in S} |\mathcal{I}^-_{u'}|, \qquad |\mathcal{I}^+_u| = n^+_* = \min_{u' \in S} |\mathcal{I}^+_{u'}|,$$

the empirical loss (4) is upper-bounded by:

$$\hat{\mathcal{L}}(f, S) \le \hat{\mathcal{L}}^*(f, S) = \frac{1}{N n^-_* n^+_*} \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}^+_u} \sum_{i' \in \mathcal{I}^-_u} \mathbb{1}_{y_{i,u,i'} f(i,u,i') < 0}. \quad (5)$$
Following [10, Proposition 4], a generalization error bound can be derived for the second term of the inequality above, based on local Rademacher complexities that exploit second-order (i.e. variance) information, inducing faster convergence rates. For the sake of presentation, and to be in line with the learning of representations of users and items in an embedded space introduced in Section 2.2, let us consider kernel-based hypotheses, with $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a positive semi-definite (PSD) kernel and $\Phi : \mathcal{U} \times \mathcal{I} \to \mathcal{X}$ its associated feature mapping function. Further, we consider linear functions in the feature space with bounded norm:

$$\mathcal{G}_B = \{g_{\mathbf{w}} \circ \Phi : (u, i) \in \mathcal{U} \times \mathcal{I} \mapsto \langle \mathbf{w}, \Phi(u, i)\rangle \mid \|\mathbf{w}\| \le B\}, \quad (6)$$

where $\mathbf{w}$ is the weight vector defining the kernel-based hypotheses and $\langle\cdot,\cdot\rangle$ denotes the dot product. We further define the following associated function class:

$$\mathcal{F}_B = \{z_{i,u,i'} \doteq (i, u, i') \mapsto g_{\mathbf{w}}(\Phi(u, i)) - g_{\mathbf{w}}(\Phi(u, i')) \mid g_{\mathbf{w}} \in \mathcal{G}_B\},$$

and the parameterized family $\mathcal{F}_{B,r}$ which, for $r > 0$, is defined as:

$$\mathcal{F}_{B,r} = \{f : f \in \mathcal{F}_B, \; \mathbb{V}[f] \doteq \mathbb{V}_{z,y}[\mathbb{1}_{y f(z) < 0}] \le r\},$$

where $\mathbb{V}[\cdot]$ denotes the variance. The fractional Rademacher complexity introduced in [8] entails our analysis:

$$\mathfrak{R}_S(\mathcal{F}) = \frac{2}{m} \mathbb{E}_{\boldsymbol{\xi}} \sum_{j=1}^{n^-_*} \mathbb{E}_{\mathcal{M}_j} \sup_{f \in \mathcal{F}} \sum_{\substack{\alpha \in \mathcal{M}_j \\ z_\alpha \in S}} \xi_\alpha f(z_\alpha),$$

where $m = N \times n^+_* \times n^-_*$ is the total number of triplets $z$ in the training set and $(\xi_i)_{i=1}^{m}$ is a sequence of independent Rademacher variables verifying $P(\xi_i = 1) = P(\xi_i = -1) = \frac{1}{2}$.

Theorem 1. Let $\mathcal{U}$ be a set of $N$ independent users, such that each user $u \in \mathcal{U}$ prefers $n^+_*$ items over $n^-_*$ ones in a predefined set $\mathcal{I}$ of items. Let $S = \{(z_{i,u,i'} \doteq (i, u, i'), y_{i,u,i'}) \mid u \in \mathcal{U}, (i, i') \in \mathcal{I}^+_u \times \mathcal{I}^-_u\}$ be the associated training set. Then, for any $1 > \delta > 0$, the following generalization bound holds for all $f \in \mathcal{F}_{B,r}$ with probability at least $1 - \delta$:

$$\mathcal{L}(f) \le \hat{\mathcal{L}}^*(f, S) + \frac{2 B\, C(S)}{N n^+_*} + \frac{5}{2}\left(\sqrt{\frac{2 B\, C(S)}{N n^+_*}} + \sqrt{\frac{r}{2}}\right)\sqrt{\frac{\log \frac{1}{\delta}}{n^+_*}} + \frac{25}{48}\,\frac{\log \frac{1}{\delta}}{n^+_*},$$

where

$$C(S) = \sqrt{\frac{1}{n^-_*} \sum_{j=1}^{n^-_*} \mathbb{E}_{\mathcal{M}_j}\Big[\sum_{\substack{\alpha \in \mathcal{M}_j \\ z_\alpha \in S}} d(z_\alpha, z_\alpha)\Big]},$$

$z_\alpha = (i_\alpha, u_\alpha, i'_\alpha)$ and $d(z_\alpha, z_\alpha) = \kappa(\Phi(u_\alpha, i_\alpha), \Phi(u_\alpha, i_\alpha)) + \kappa(\Phi(u_\alpha, i'_\alpha), \Phi(u_\alpha, i'_\alpha)) - 2\kappa(\Phi(u_\alpha, i_\alpha), \Phi(u_\alpha, i'_\alpha))$.

The proof is given in the Appendix. This result suggests that:
• even though the training set $S$ contains interdependent observations, following [11, Theorem 2.1, p. 38], Theorem 1 gives insights on the consistency of the empirical risk minimization principle with respect to (5);
• in the case where the feature space $\mathcal{X} \subseteq \mathbb{R}^k$ is of finite dimension, lower values of $k$ involve lower kernel estimates and hence a lower complexity term $C(S)$, which implies a tighter generalization bound.

2.2 A Neural Network model to learn user preferences

Some studies proposed to find the dyadic representation of users and items in an embedded space using neighborhood similarity information [12] or Bayesian Personalized Ranking (BPR) [13]. In this section we propose a feed-forward Neural Network, denoted RecNet, that jointly learns the embedding representation, $\Phi(\cdot)$, and the scoring function, $f(\cdot)$, defined previously. The input of the network is a triplet $(i, u, i')$ composed of the indices of an item $i$, a user $u$ and a second item $i'$, such that the user $u$ has a preference over the pair of items $(i, i')$ expressed by the desired output $y_{i,u,i'}$, defined with respect to the preference relation $\succ_u$ (Eq. 1). Each index in the triplet is then transformed into a corresponding binary indicator vector $\mathbf{i}$, $\mathbf{u}$, $\mathbf{i}'$, with all entries equal to 0 except the one indicating the position of the user or item in its respective set, which is equal to 1. Hence, the one-hot vector corresponding to the binary vector representation of user $u \in \mathcal{U}$ is:

$$\mathbf{u}^\top = (\underbrace{0, \ldots, 0}_{u-1}, 1, \underbrace{0, \ldots, 0}_{N-u}).$$

The network then entails three successive layers, namely the Embedding (SG), Mapping and Dense hidden layers depicted in Figure 2.

• The Embedding layer transforms the sparse binary representations of the user and of each item into denser real-valued vectors. We denote by $U_u$ and $V_i$ the transformed vectors of user $u$ and item $i$, and by $U = (U_u)_{u \in \mathcal{U}}$ and $V = (V_i)_{i \in \mathcal{I}}$ the corresponding matrices. Note that, as the binary indicator vectors of users and items contain a single non-null entry, each entry of the corresponding dense vector in the SG layer is connected by only one weight to that entry.
• The Mapping layer is composed of two groups of units, each obtained from the element-wise product between the user representation vector $U_u$ of a user $u$ and a corresponding item representation vector $V_i$ of an item $i$, inducing the feature representation of the pair $(u, i)$: $\Phi(u, i)$.
• Each of these groups is also fully connected to the units of a Dense layer composed of successive hidden layers (see Section 4 for details on the number of hidden units and the activation function used in this layer).

The model is trained such that the output of each of the dense layers reflects the relationship between the corresponding item and user, mathematically defined by a multivariate real-valued function $g(\cdot)$. Hence, for an input $(i, u, i')$, the output of each of the dense layers is a real-valued score reflecting the preference associated with the corresponding pair $(u, i)$ or $(u, i')$ (i.e. $g(\Phi(u, i))$ or $g(\Phi(u, i'))$). Finally, the prediction given by RecNet for an input $(i, u, i')$ is:

$$f(i, u, i') = g(\Phi(u, i)) - g(\Phi(u, i')). \quad (7)$$
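A minimal numerical sketch of this forward pass is given below. The single tanh hidden layer, the dimensions and the random initialization are assumptions for illustration only; the paper defers the actual depth and activation to Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k, h = 4, 6, 8, 16                 # users, items, embedding dim, hidden units

U = rng.normal(scale=0.1, size=(N, k))   # user embedding matrix (Embedding layer)
V = rng.normal(scale=0.1, size=(M, k))   # item embedding matrix
W1 = rng.normal(scale=0.1, size=(k, h))  # dense hidden layer (assumed single layer)
w2 = rng.normal(scale=0.1, size=h)       # output unit

def g(u, i):
    """Score of a (user, item) pair: the Mapping layer is the element-wise
    product Phi(u, i) = U_u * V_i, followed by the Dense part."""
    phi = U[u] * V[i]                    # Mapping layer
    return float(np.tanh(phi @ W1) @ w2) # Dense layer and output

def f(i, u, ip):
    """RecNet prediction for a triplet (Eq. 7)."""
    return g(u, i) - g(u, ip)

# By construction the prediction is antisymmetric in the two items.
print(round(f(0, 1, 2) + f(2, 1, 0), 10))  # 0.0
```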
2.3 Algorithmic implementation

We decompose the ranking loss as a linear combination of two logistic surrogates:

$$\mathcal{L}_{c,p}(f, U, V, S) = \mathcal{L}_c(f, S) + \mathcal{L}_p(U, V, S), \quad (8)$$

where the first term reflects the ability of the non-linear transformation of user and item feature representations, $g(\Phi(\cdot,\cdot))$, to respect the relative ordering of items with respect to users' preferences:

$$\mathcal{L}_c(f, S) = \frac{1}{|S|} \sum_{(z_{i,u,i'},\, y_{i,u,i'}) \in S} \log\left(1 + e^{y_{i,u,i'}\left(g(\Phi(u,i')) - g(\Phi(u,i))\right)}\right). \quad (9)$$

The second term focuses on the quality of the compact dense vector representations of items and users, as measured by the ability of the dot-product in the resulting embedded vector space to respect the relative ordering of preferred items by users:

$$\mathcal{L}_p(U, V, S) = \frac{1}{|S|} \sum_{(z_{i,u,i'},\, y_{i,u,i'}) \in S} \left[\log\left(1 + e^{y_{i,u,i'} U_u^\top (V_{i'} - V_i)}\right) + \lambda\left(\|U_u\| + \|V_{i'}\| + \|V_i\|\right)\right], \quad (10)$$

where $\lambda$ is a regularization parameter over the user and item norms. Finally, one can also consider a version in which both losses are assigned different weights:

$$\mathcal{L}_{c,p}(f, U, V, S) = \alpha \mathcal{L}_c(f, S) + (1 - \alpha) \mathcal{L}_p(U, V, S), \quad (11)$$

where $\alpha \in [0, 1]$ is a real-valued parameter balancing ranking prediction ability and expressiveness of the learned item and user representations. Both options are discussed in the experimental section.

Training phase. The training of RecNet is done by back-propagating [14] the error gradients from the output to both the deep and embedding parts of the model using mini-batch stochastic optimization (Algorithm 1). During training, the input layer takes a random set $\tilde{S}_n$ of size $n$ of interactions, builds triplets $(i, u, i')$ based on this set, and generates a sparse representation from the id vectors corresponding to the picked user and the pair of items. The binary vectors of the examples in $\tilde{S}_n$ are then propagated through the network, and the ranking error (Eq. 8) is back-propagated.
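The two surrogates of Eqs. (9)-(11) can be sketched per triplet as follows; this is a simplified per-example version in which the averaging over $S$ and the training loop are omitted:

```python
import numpy as np

def loss_c(g_pos, g_neg, y):
    """Per-triplet logistic surrogate of L_c (Eq. 9), with
    g_pos = g(Phi(u, i)) and g_neg = g(Phi(u, i'))."""
    return float(np.log1p(np.exp(y * (g_neg - g_pos))))

def loss_p(U_u, V_pos, V_neg, y, lam=1e-4):
    """Per-triplet surrogate of L_p (Eq. 10): the dot-product in the
    embedded space must respect the ordering, plus norm regularization."""
    margin = U_u @ (V_neg - V_pos)
    reg = lam * (np.linalg.norm(U_u) + np.linalg.norm(V_neg) + np.linalg.norm(V_pos))
    return float(np.log1p(np.exp(y * margin)) + reg)

def loss_cp(g_pos, g_neg, U_u, V_pos, V_neg, y, alpha=0.5):
    """Weighted combination of Eq. (11)."""
    return alpha * loss_c(g_pos, g_neg, y) + (1 - alpha) * loss_p(U_u, V_pos, V_neg, y)

# A correctly ordered triplet (y = +1, g_pos > g_neg) incurs a smaller loss.
print(loss_c(2.0, 0.0, 1) < loss_c(0.0, 2.0, 1))  # True
```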
[Figure 2: The architecture of RecNet, trained to reflect the preference of a user $u$ over a pair of items $i$ and $i'$: the input indices are binarized, embedded, mapped through $\Phi(\cdot,\cdot)$ and passed to the dense layers $g$, producing the output $g(\Phi(u, i)) - g(\Phi(u, i'))$.]

Algorithm 1 RecNet: Learning phase
  Input: $T$: maximal number of epochs;
         a set of users $\mathcal{U} = \{1, \ldots, N\}$; a set of items $\mathcal{I} = \{1, \ldots, M\}$
  for $ep = 1, \ldots, T$
    Randomly sample a mini-batch $\tilde{S}_n \subseteq S$ of size $n$ from the original user-item matrix
    for all $((i, u, i'), y_{i,u,i'}) \in \tilde{S}_n$
      Propagate $(i, u, i')$ from the input to the output
    Back-propagate the pairwise ranking error (Eq. 8) estimated over $\tilde{S}_n$
  Output: user and item latent feature matrices $U$, $V$ and the model weights

Model testing. As for the prediction phase, shown in Algorithm 2, a ranked list $N_{u,k}$ of the $k \ll M$ preferred items for each user in the test set is maintained while retrieving the set $\mathcal{I}$. Given the latent representations of the triplets and the learned weights, the first two items of $\mathcal{I}$ are placed in $N_{u,k}$ in a way which ensures that the preferred one, $i^*$, is in the first position. Then, the algorithm retrieves the next item $i \in \mathcal{I}$ by comparing it to $i^*$. This step is simply carried out by comparing the model's output over the concatenated binary indicator vectors of $(i^*, u, i)$ and $(i, u, i^*)$. Hence, if $f(i, u, i^*) > f(i^*, u, i)$, which from Equation 7 is equivalent to $g(\Phi(u, i)) > g(\Phi(u, i^*))$, then $i$ is predicted to be preferred over $i^*$, i.e. $i \succ_u i^*$, and it is placed first in $N_{u,k}$ instead of $i^*$. Here we assume that the predicted preference relation $\succ_u$ is transitive, which ensures that the predicted order in the list is respected. Otherwise, if $i^*$ is predicted to be preferred over $i$, then $i$ is compared to the second preferred item in the list, using the model's prediction as before, and so on.
The new item, $i$, is inserted in $N_{u,k}$ if it is found to be preferred over another item in $N_{u,k}$. By repeating the process until the end of $\mathcal{I}$, we obtain a ranked list of the $k$ most preferred items for the user $u$. Algorithm 2 does not require an ordering of the whole set of items, as in most cases we are only interested in the relevancy of the top ranked items for assessing the quality of a model. Further, its complexity is at most $O(k \times M)$, which is convenient when $M \gg 1$. The merits of a similar algorithm have been discussed in [15] but, as pointed out above, the basic assumption for inserting a new item into the ranked list $N_{u,k}$ is that the predicted preference relation induced by the model should be transitive, which may not hold in general. In our experiments, we also tested a more conventional inference algorithm which, for a given user $u$, consists in ordering the items in $\mathcal{I}$ with respect to the output given by the function $g$; we did not find any substantial difference in the performance of RecNet, as presented in the following section.

Algorithm 2 RecNet: Testing phase
  Input: a user $u \in \mathcal{U}$; a set of items $\mathcal{I} = \{1, \ldots, M\}$;
         $N_{u,k} \leftarrow \emptyset$: the list of the $k$ preferred items in $\mathcal{I}$ for $u$;
         the output of RecNet learned over a training set: $f$
  Apply $f$ to the first two items of $\mathcal{I}$, note the preferred one $i^*$ and place it at the top of $N_{u,k}$
  for $i = 3, \ldots, M$
    if $g(\Phi(u, i)) > g(\Phi(u, i^*))$ then
      Add $i$ to $N_{u,k}$ at rank 1
    else
      $j \leftarrow 1$
      while $j \le k$ AND $g(\Phi(u, i)) < g(\Phi(u, i_g))$   // where $i_g = N_{u,k}(j)$
        $j \leftarrow j + 1$
      if $j \le k$ then insert $i$ in $N_{u,k}$ at rank $j$
  Output: $N_{u,k}$

3 (UN-)RELATED WORK

This section provides an overview of the state-of-the-art approaches that are most similar to ours.

3.1 Neural Language Models

Neural language models have proven successful in many natural language processing tasks,
including speech recognition, information retrieval and sentiment analysis. These models are based on the distributional hypothesis, stating that words occurring in the same contexts with similar frequencies are similar. In order to capture such similarities, these approaches embed the word distribution into a low-dimensional continuous space using Neural Networks, which has led to the development of several powerful and highly scalable language models such as the word2vec Skip-Gram (SG) model [16, 17]. The recent work of [18] has opened new opportunities to extend word representation learning to characterize more complicated pieces of information. In fact, this paper established the equivalence between the SG model with negative sampling and implicitly factorizing a pointwise mutual information (PMI) matrix. Further, it demonstrated that word embedding can be applied to different types of data, provided that it is possible to design an appropriate context matrix for them. This idea has been successfully applied to recommendation systems, where different approaches attempt to learn representations of items and users in an embedded space in order to address the recommendation problem more efficiently [19, 20, 21]. In [22], the authors used a bag-of-words vector representation of items and users, from which the latent representations of the latter are learned through word2vec. [20] proposed a model that relies on the intuitive idea that pairs of items which are scored in the same way by different users are similar. The approach reduces to finding the latent representations of users and items with the traditional Matrix Factorization (MF) approach, while simultaneously learning item embeddings using a co-occurrence shifted positive PMI (SPPMI) matrix defined by items and their context. The latter is used as a regularization term in the traditional objective function of MF.
Similarly, in [21] the authors proposed Prod2Vec, which embeds items using a Neural-Network language model applied to time series of user purchases. This model was further extended in [23] which, by defining appropriate context matrices, proposed a new model called Meta-Prod2Vec. Their approach learns a representation for both items and the side information available in the system. The embedding of this additional information is further used to regularize the item embedding. Inspired by the concept of sequences of words, the approach proposed by [19] defines the consumption of items by users as trajectories. The embedding of items is then learned using the SG model, and the users' embeddings are further used to predict the next item in the trajectory. In all these approaches, the learned item and user representations are employed to make predictions with predefined or fixed similarity functions (such as dot-products) in the embedded space.

3.2 Learning-to-Rank with Neural Networks

Motivated by automatically tuning the parameters involved in combining different scoring functions, Learning-to-Rank approaches were originally developed for Information Retrieval (IR) tasks and are grouped into three main categories: pointwise, listwise and pairwise [24]. Pointwise approaches [25, 26] assume that each queried document pair has an ordinal score. Ranking is then formulated as a regression problem, in which the rank value of each document is estimated as an absolute quantity. In the case where relevance judgments are given as pairwise preferences (rather than relevance degrees), it is usually not straightforward to apply these algorithms for learning. Moreover, pointwise techniques do not consider the inter-dependency among documents, so the position of documents in the final ranked list is missing from the regression-like loss functions used for parameter tuning.
Listwise approaches [27, 28, 29], on the other hand, take the entire ranked list of documents for each query as a training instance. As a direct consequence, these approaches are able to differentiate documents from different queries and to consider their position in the output ranked list at the training stage. Listwise techniques aim to directly optimize a ranking measure, so they generally face a complex optimization problem dealing with non-convex, non-differentiable and discontinuous functions. Finally, in pairwise approaches [30, 31, 32, 33] the ranked list is decomposed into a set of document pairs. Ranking is thereby considered as the classification of pairs of documents, such that a classifier is trained by minimizing the number of misorderings in the ranking. In the test phase, the classifier assigns a positive or negative class label to a document pair, indicating which of the two documents should be ranked higher. Perhaps the first Neural Network model for ranking was RankProp, originally proposed by [34]. RankProp is a pointwise approach that alternates between two phases: learning the desired real outputs by minimizing a Mean Squared Error (MSE) objective, and modifying the desired values themselves to reflect the current ranking given by the net. Later, [35] proposed RankNet, a pairwise approach that learns a preference function by minimizing a cross-entropy cost over pairs of relevant and irrelevant examples. SortNet, proposed in [36, 37], also learns a preference function, by minimizing a ranking loss over pairs of examples that are selected iteratively with the overall aim of maximizing the quality of the ranking. The three approaches above consider the problem of Learning-to-Rank for IR without learning an embedding.
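As an illustration of the pairwise family, a RankNet-style objective reduces to a logistic loss over the score difference of a relevant/irrelevant pair; the following is a sketch of that idea, not the exact model of [35]:

```python
import math

def pairwise_cross_entropy(s_rel, s_irr):
    """RankNet-style pairwise loss sketch: the probability that the
    relevant document outranks the irrelevant one is modelled as a
    logistic of the score difference, and the loss is its negative
    log-likelihood, log(1 + exp(-(s_rel - s_irr)))."""
    return math.log1p(math.exp(-(s_rel - s_irr)))

# The loss shrinks as the relevant document's score pulls ahead.
print(pairwise_cross_entropy(3.0, 1.0) < pairwise_cross_entropy(1.5, 1.0))  # True
```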
4 EXPERIMENTAL RESULTS

We conducted a number of experiments aimed at evaluating how the simultaneous learning of user and item representations, as well as of the preferences of users over items, can be efficiently handled with RecNet. To this end, we considered four real-world benchmarks commonly used for collaborative filtering. We validated our approach with respect to the different hyper-parameters that impact the accuracy of the model and compared it with competitive state-of-the-art approaches. We ran all experiments on a cluster of five 32-core Intel Xeon @ 2.6GHz CPU (with 20MB cache per core) systems with 256GB of RAM, running the Debian GNU/Linux 8.6 (wheezy) operating system. All subsequently discussed components were implemented in Python 3 using the TensorFlow library, version 1.4.0.²,³

4.1 Datasets

We report results obtained on three publicly available movie datasets for the task of personalized top-N recommendation: MOVIELENS⁴ 100K (ML-100K), MOVIELENS 1M (ML-1M) [38] and NETFLIX⁵, and one clicks dataset, KASANDR-Germany⁶ [39], a recently released data set for on-line advertising.

• ML-100K, ML-1M and NETFLIX consist of user-movie ratings, on a scale of one to five, collected from a movie recommendation service and the Netflix company. The latter was released to support the Netflix Prize competition⁷. For all three datasets, we only keep users who have rated at least five movies and remove users who gave the same rating to all movies. In addition, for NETFLIX, we take a subset of the original data and randomly sample 20% of the users and 20% of the items.
In the following experiments, as we only compare with approaches developed for ranking purposes and our model is designed to handle implicit feedback, these three datasets are made binary such that a rating greater than or equal to 4 is set to 1, and to 0 otherwise.

• The original KASANDR dataset contains the interactions and clicks made by the users of Kelkoo, an online advertising platform, across twenty European countries. In this article, we used the subset of KASANDR that only considers interactions from Germany. It gathers 17,764,280 interactions from 521,685 users on 2,299,713 offers belonging to 272 categories and spanning 801 merchants. For KASANDR, we remove users who gave the same rating to all offers; this implies that all users who never clicked, or always clicked, on each and every offer shown to them were removed.

Table 1 provides basic statistics on these collections after the pre-processing discussed above.

TABLE 1: Statistics of the collections used in our experiments, after pre-processing.

            # of users   # of items   # of interactions   Sparsity
ML-100K          943         1,682            100,000     93.685%
ML-1M          6,040         3,706          1,000,209     95.530%
NETFLIX       90,137         3,560          4,188,098     98.700%
KASANDR       25,848     1,513,038          9,489,273     99.976%

4.2 Experimental set-up

Compared baselines. In order to validate the framework defined in the previous section, we compare the following approaches.

Footnotes:
2. https://www.tensorflow.org/
3. For research purposes we will make available all the code implementing Algorithms 1 and 2 used in our experiments, as well as all the pre-processed datasets.
4. https://movielens.org/
5. http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
6. https://archive.ics.uci.edu/ml/datasets/KASANDR
7. B. James and L. Stan, The Netflix Prize (2007).
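The pre-processing applied to the explicit-rating collections (keep users with at least five ratings, drop constant raters, binarize at the threshold of 4) can be sketched as follows. The (user, item, rating) triple format is a simplified assumption for illustration, not the actual dataset file format:

```python
from collections import defaultdict

def preprocess(ratings, min_ratings=5, threshold=4):
    """Binarize explicit ratings and filter users, mimicking the paper's
    pre-processing. `ratings` is a list of (user, item, rating) triples
    (a simplified stand-in for the actual dataset files).

    - keep only users with at least `min_ratings` ratings;
    - drop users whose ratings are all identical (no preference signal);
    - set the feedback to 1 when rating >= threshold, and to 0 otherwise.
    """
    by_user = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
    out = []
    for u, pairs in by_user.items():
        if len(pairs) < min_ratings:
            continue
        if len({r for _, r in pairs}) == 1:  # constant rater, no signal
            continue
        out.extend((u, i, int(r >= threshold)) for i, r in pairs)
    return out
```

Users who always clicked or never clicked (KASANDR) are the implicit-feedback analogue of the constant raters removed here.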
• BPR-MF [13] provides an optimization criterion based on implicit feedback, namely the maximum posterior estimator derived from a Bayesian analysis of the pairwise ranking problem, and proposes an algorithm based on Stochastic Gradient Descent to optimize it. The model can further be extended to the explicit feedback case.
• Co-Factor [20], developed for implicit feedback, constrains the objective of matrix factorization to jointly use item representations with a factorized, shifted positive pointwise mutual information matrix of item co-occurrence counts. The model was found to outperform WMF [40], also proposed for implicit feedback.
• LightFM [41] was first proposed to deal with the problem of cold-start using meta information. As with our approach, it relies on learning the embedding of users and items with the Skip-gram model and optimizes the cross-entropy loss.
• RecNet_p focuses on the quality of the latent representation of users and items by learning the preference and the representation through the ranking loss L_p (Eq. 10).
• RecNet_c focuses on the accuracy of the score obtained at the output of the framework and therefore learns the preference and the representation through the ranking loss L_c (Eq. 9).
• RecNet_{c,p} uses a linear combination of L_p and L_c as the objective function, with α ∈ ]0, 1[. We study the two situations presented before (w.r.t. the presence/absence of a supplementary weighting hyper-parameter).

Evaluation protocol. For each dataset, we sort the interactions according to time, and take 80% for training the model and the remaining 20% for testing it. In addition, we remove all users and offers that do not occur during the training phase. We study two different scenarios for the prediction phase: (1) for a given user, the prediction is done only on the items that were shown to him or her; (2) the prediction is done over the set of all items, regardless of any knowledge about previous interactions.
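The chronological split described above can be sketched as follows; the (timestamp, user, item) triple format is an assumption for illustration:

```python
def temporal_split(interactions, train_ratio=0.8):
    """Chronological train/test split used in the evaluation protocol.

    `interactions` is a list of (timestamp, user, item) triples (a
    hypothetical format). The first 80% of interactions in time order
    form the training set; users and items that never occur during the
    training phase are removed from the test set.
    """
    ordered = sorted(interactions)          # sort by timestamp
    cut = int(len(ordered) * train_ratio)
    train, test = ordered[:cut], ordered[cut:]
    seen_users = {u for _, u, _ in train}
    seen_items = {i for _, _, i in train}
    test = [(t, u, i) for t, u, i in test
            if u in seen_users and i in seen_items]
    return train, test
```

Splitting by time, rather than at random, prevents the model from training on interactions that occur after the ones it is evaluated on.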
In the context of movie recommendation, a shown item is defined as a movie for which the given user provided a rating. For KASANDR, the definition is quite straightforward, as the data were collected from an on-line advertising platform where the items are displayed to the users, who can either click on or ignore them.

The first setting is arguably the most common in academic research, but it abstracts away from the real-world problem: at the time of making the recommendation, the notion of shown items is not available, therefore forcing the RS to consider the set of all items as potential candidates. As a result, in this setting, for ML-100K, ML-1M, KASANDR and NETFLIX, we only consider on average 25, 72, 6 and 8 items for prediction per user, respectively. The goal of the second setting is to reflect this real-world scenario, and we can expect lower results than in the first setting as the size of the search space of items increases considerably. To summarize, predicting only among the items that were shown to a user evaluates the model's capability of retrieving highly rated items among the shown ones, while predicting among all items measures the performance of the model in terms of its ability to recommend offers the user would like to engage with.

Fig. 3: MAP@1 as a function of the dimension of the embedding for (a) ML-100K, (b) ML-1M, (c) KASANDR and (d) NETFLIX. [Plots of MAP@1 against embedding sizes 1-20 for RecNet_c, RecNet_p and RecNet_{c,p}; axis data omitted.]
TABLE 2: Best parameters for RecNet_p, RecNet_c and RecNet_{c,p} when prediction is done only on shown offers; k denotes the dimension of the embeddings and λ the regularization parameter. We also report the number of hidden units per layer. Columns c, p and c,p stand for RecNet_c, RecNet_p and RecNet_{c,p}.

          |      ML-100K       |       ML-1M        |      NETFLIX       |      KASANDR
          |   c     p    c,p   |   c     p     c,p  |   c     p    c,p   |   c      p     c,p
k         |   1     2     2    |  16     1      1   |   9     2     6    |  19      1     18
λ         | 0.05  0.005 0.005  | 0.05  0.0001 0.001 | 0.05  0.01   0.05  | 0.0001  0.05  0.005
# units   |  32    64    16    |  32    16     32   |  64    16     16   |  64     16    64

TABLE 3: Best parameters for RecNet_p, RecNet_c and RecNet_{c,p} when prediction is done on all offers; k denotes the dimension of the embeddings and λ the regularization parameter. We also report the number of hidden units per layer. Columns c, p and c,p stand for RecNet_c, RecNet_p and RecNet_{c,p}.

          |      ML-100K       |       ML-1M        |      NETFLIX       |      KASANDR
          |   c     p    c,p   |   c     p     c,p  |   c      p    c,p  |   c      p     c,p
k         |  15     5     8    |   2    11      2   |   3     13     1   |   4     16     14
λ         | 0.001 0.001 0.001  | 0.05  0.0001 0.001 | 0.0001 0.001 0.001 | 0.001  0.0001 0.05
# units   |  32    16    16    |  32    64     32   |  32     64    64   |  32     64    64

All comparisons are based on a common ranking metric, namely the Mean Average Precision (MAP). First, let us recall that the Average Precision (AP@ℓ) is defined over the precision, Pr (the fraction of recommended items clicked by the user), at rank ℓ:

    AP@ℓ = (1/ℓ) Σ_{j=1}^{ℓ} r_j · Pr(j),

where the relevance judgments r_j are binary (i.e. equal to 1 when the item is clicked or preferred, and 0 otherwise). The MAP is then the mean of these APs across all users. In the following results, we report MAP at ranks ℓ = 1 and ℓ = 10.

Hyper-parameter tuning. First, we provide a detailed study of the impact of the different hyper-parameters involved in the proposed RecNet framework. For all datasets, hyper-parameter tuning is done on a separate validation set.

• The size of the embedding is chosen among k ∈ {1, ..., 20}.
The impact of k on the performance is presented in Figure 3.
• We use ℓ2 regularization on the embeddings and choose λ ∈ {0.0001, 0.001, 0.005, 0.01, 0.05}.
• We run RecNet with 1 hidden layer with ReLU activation functions, where the number of hidden units is chosen in {16, 32, 64}.
• In order to train RecNet, we use ADAM [42] and found the learning rate η = 1e-3 to be the most efficient for all our settings. For the other parameters involved in ADAM, i.e. the exponential decay rates for the moment estimates, we keep the default values (β1 = 0.9, β2 = 0.999 and ε = 10^-8).
• Finally, we fix the number of epochs to T = 10,000 in advance and the size of the mini-batches to n = 512.
• One can see that all three versions of RecNet perform best with a rather small number of hidden units, only one hidden layer and a low dimension for the representation. As a consequence, they involve few parameters to tune during training.
• In terms of the ability to recover a relevant ranked list of items for each user, we also tune the hyper-parameter α (Eq. 11), which balances the weight given to the two terms in RecNet_{c,p}. These results are shown in Figure 4, where the values of α are taken in the interval [0, 1].

Fig. 4: MAP@1, MAP@5 and MAP@10 as a function of the value of α for (a) ML-100K, (b) ML-1M, (c) KASANDR and (d) NETFLIX. [Plots of MAP against α ∈ {0.1, ..., 0.9}; axis data omitted.]
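The MAP metric reported in these figures and in the tables below can be computed directly from its definition; the sketch assumes one binary relevance list per user, ordered by the model's predicted scores:

```python
def average_precision_at(relevance, l):
    """AP@l as defined in the text: (1/l) * sum_{j=1..l} r_j * Pr(j),
    where `relevance` is the binary relevance list of the ranked items
    and Pr(j) is the precision within the top j positions."""
    hits, score = 0, 0.0
    for j, r in enumerate(relevance[:l], start=1):
        hits += r                  # number of relevant items in the top j
        score += r * hits / j      # r_j * Pr(j)
    return score / l

def mean_average_precision(relevance_per_user, l):
    """MAP@l: the mean of AP@l across all users."""
    aps = [average_precision_at(rel, l) for rel in relevance_per_user]
    return sum(aps) / len(aps)
```

For example, a user whose top-3 ranked items have relevance [1, 0, 1] gets AP@3 = (1·1 + 0 + 1·(2/3))/3 = 5/9.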
While α seems to play a significant role on ML-100K and KASANDR, we can see that for ML-1M the results in terms of MAP are stable, regardless of the value of α.

From Figure 3, when prediction is done on the interacted offers, it is clear that the best MAP@1 results are generally obtained with small sizes k of the item and user embedded vector spaces. These empirical results support our theoretical analysis, where we found that small k induces smaller generalization bounds. This observation on the dimension of the embedding is also in agreement with the conclusion of [41], which uses the same technique for representation learning. For instance, one can see that on ML-1M, the highest MAP is achieved with a dimension of embedding equal to 1. Since, in the interacted offers setting, the prediction is done among the very few shown offers, RecNet then makes non-personalized recommendations: having k = 1 means that the recommendations for a given user with a positive (negative) value are done by sorting the positive (negative) items according to their learned scalar embeddings, which can therefore be seen as a bi-polar popularity model. This means that in such cases, popularity-based and non-personalized approaches are perhaps the best way to make recommendations. For reproducibility purposes, we report the best combination of parameters for each variant of RecNet in Table 2 and Table 3.

4.3 Results

Hereafter, we compare and summarize the performance of RecNet with the baseline methods on the various datasets. Empirically, we observed that the version of RecNet_{c,p} where both L_c and L_p have an equal weight during training gives better results on average, and we therefore only report these results. Tables 4 and 5 report all results.
In addition, in each case we statistically compare the performance of the algorithms. The best result per column is marked with *, and the symbol ↓ indicates that the performance is significantly worse than the best result, according to a Wilcoxon rank-sum test used at a p-value threshold of 0.01 [43].

TABLE 4: Results of all state-of-the-art approaches for implicit feedback when prediction is done only on offers shown to users. The best result per column is marked with *, and ↓ indicates a result that is statistically significantly worse than the best, according to a Wilcoxon rank-sum test with p < 0.01.

                  ML-100K             ML-1M              NETFLIX             KASANDR
               MAP@1    MAP@10    MAP@1    MAP@10    MAP@1    MAP@10    MAP@1    MAP@10
BPR-MF         0.613↓   0.608↓    0.788↓   0.748↓    0.909*   0.842↓    0.857↓   0.857↓
LightFM        0.772↓   0.770↓    0.832↓   0.795↓    0.800↓   0.793↓    0.937↓   0.936↓
CoFactor       0.718↓   0.716↓    0.783↓   0.741↓    0.693↓   0.705↓    0.925↓   0.918↓
RecNet_c       0.894*   0.848*    0.877↓   0.835     0.880↓   0.847*    0.958↓   0.963↓
RecNet_p       0.881↓   0.846     0.876↓   0.839*    0.875↓   0.844     0.915↓   0.923↓
RecNet_{c,p}   0.888↓   0.842     0.884*   0.839*    0.879↓   0.847*    0.970*   0.973*

TABLE 5: Results of all state-of-the-art approaches for recommendation on all implicit feedback datasets when prediction is done on all offers. The best result per column is marked with *, and ↓ indicates a result that is statistically significantly worse than the best, according to a Wilcoxon rank-sum test with p < 0.01.

                  ML-100K             ML-1M              NETFLIX             KASANDR
               MAP@1    MAP@10    MAP@1    MAP@10    MAP@1    MAP@10    MAP@1    MAP@10
BPR-MF         0.140↓   0.261*    0.048↓   0.097↓    0.035↓   0.072↓    0.016↓   0.024↓
LightFM        0.144↓   0.173↓    0.028↓   0.096↓    0.006↓   0.032↓    0.002↓   0.003↓
CoFactor       0.056↓   0.031↓    0.089↓   0.033↓    0.049↓   0.030↓    0.002↓   0.001↓
RecNet_c       0.106↓   0.137↓    0.067↓   0.093↓    0.032↓   0.048↓    0.049↓   0.059↓
RecNet_p       0.239*   0.249     0.209*   0.220*    0.080*   0.089*    0.100↓   0.100↓
RecNet_{c,p}   0.111↓   0.134↓    0.098↓   0.119↓    0.066↓   0.087     0.269*   0.284*

Setting 1: interacted items. When the prediction is done over offers the user interacted with (Table 4), the RecNet architecture, regardless of the weight given to α, beats all the other algorithms on KASANDR, ML-100K and ML-1M. However, on NETFLIX, BPR-MF outperforms our approach in terms of MAP@1. This may be owing to the fact that the binarized NETFLIX movie dataset is strongly biased towards popular movies; in such datasets, the majority of users have usually watched one or another of the popular movies and rated them well. In NETFLIX, around 75% of the users have given ratings greater than 4 to the top-10 movies. We believe that this phenomenon adversely affects the performance of RecNet. However, on KASANDR, which is the only truly implicit dataset, RecNet significantly outperforms all other approaches.

Setting 2: all items. When the prediction is done over all offers (Table 5), we can make two observations. First, all the algorithms suffer an extreme drop in performance in terms of MAP. Second, the RecNet framework significantly outperforms all the other algorithms on all datasets, and this difference is all the more important on KASANDR, where for instance RecNet_{c,p} is on average 15 times more effective. We believe that our model is a fresh departure from models that learn a pairwise ranking function without knowledge of the embeddings, or that learn embeddings without learning any pairwise ranking function: while learning the pairwise ranking function, our model is aware of the embeddings learned so far, and vice versa. We demonstrate that this simultaneous optimization of the two ranking losses helps in learning the hidden features of the implicit data and improves the performance of RecNet.
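The significance marks in Tables 4 and 5 rely on a Wilcoxon rank-sum test over per-user metric values. A self-contained sketch using the large-sample normal approximation is given below; it uses average ranks for ties but omits the tie correction in the variance, so it is a simplified stand-in for the exact procedure of [43], not a reimplementation of it:

```python
import math

def ranksum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    x and y are two samples (e.g. per-user AP values of two systems).
    Ties receive average ranks; no tie correction is applied to the
    variance, so this is only an approximation for heavily tied data.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # assign each distinct value the average of the ranks it occupies
    rank, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        for k in range(i, j):
            rank[pooled[k]] = (i + 1 + j) / 2.0   # mean of ranks i+1..j
        i = j
    w = sum(rank[v] for v in x)                   # rank sum of sample x
    mu = n1 * (n1 + n2 + 1) / 2.0                 # mean of W under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    # two-sided p-value from the standard normal distribution
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
```

A result is then marked ↓ when the p-value against the best system falls below the 0.01 threshold.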
Comparison between RecNet versions. One can note that when optimizing the ranking losses of Eq. 8, Eq. 9 or Eq. 10, we simultaneously learn the representation and the preference function; the main difference is the amount of emphasis put on learning one or the other. The results presented in both tables tend to demonstrate that, in almost all cases, optimizing the linear combination of the pairwise ranking loss and the embedding loss (RecNet_{c,p}) yields better overall recommendations than optimizing standalone losses to learn the embeddings and the pairwise preference function. For instance, when the prediction is done over offers the user interacted with (Table 4), RecNet_{c,p} outperforms RecNet_p and RecNet_c on ML-1M, KASANDR and NETFLIX. When prediction is done on all offers (Table 5), RecNet_{c,p} outperforms RecNet_p and RecNet_c on KASANDR. Thus, in the interacted offers setting, optimizing the ranking and embedding losses simultaneously boosts performance on all datasets. In the all offers setting, however, optimizing both losses simultaneously is beneficial in the case of truly implicit feedback datasets such as KASANDR (recall that all the other datasets were synthetically made implicit).

5 CONCLUSION

We presented and analyzed a learning-to-rank framework for recommender systems which consists of learning user preferences over items. We showed that the minimization of a pairwise ranking loss over user preferences involves dependent random variables, and provided a theoretical analysis by proving the consistency of the empirical risk minimization in the worst case where all users choose a minimal number of positive and negative items. From this analysis we then proposed RecNet, a new neural-network based model for learning the user preference, in which the users' and items' representations and the function modeling the users' preferences over pairs of items are learned simultaneously. The learning phase is guided by a ranking objective that can capture the ranking ability of the prediction function as well as the expressiveness of the learned embedded space, in which the preference of users over items is respected by the dot-product function defined over that space. The training of RecNet is carried out using the back-propagation algorithm in mini-batches defined over a user-item matrix containing implicit information in the form of subsets of preferred and non-preferred items. The learning capability of the model over both the prediction and representation problems shows their interconnection, and also that the proposed double ranking objective conjugates them well. We assessed and validated the proposed approach through extensive experiments, using four popular collections proposed for the task of recommendation. Furthermore, we studied two different settings for the prediction phase and demonstrated that the performance of each approach is strongly impacted by the set of items considered for making the prediction.

For future work, we would like to extend RecNet to take into account additional contextual information regarding users and/or items. More specifically, we are interested in the integration of data of different natures, such as text or demographic information. We believe that this information can be taken into account without much effort and that, by doing so, it is possible to improve the performance of our approach while tackling the problem of providing recommendations for new users/items, also known as the cold-start problem. A second important extension will be the development of an on-line version of the proposed algorithm, in order to make the approach suitable for real-time applications and on-line advertising.
Finally, we have shown that choosing a suitable α, which controls the trade-off between the ranking and embedding losses, greatly impacts the performance of the proposed framework, and we believe that an interesting extension would be to learn this hyper-parameter automatically and to make it adaptive during the training phase.

ACKNOWLEDGEMENTS

This work was partly done under the Calypso project supported by the FEDER program of the Région Auvergne-Rhône-Alpes. The research of Yury Maximov at LANL was supported by the Center for Nonlinear Studies (CNLS).

REFERENCES

[1] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Recommender Systems Handbook, 1st ed. New York, NY, USA: Springer-Verlag New York, Inc., 2010.
[2] P. Lops, M. de Gemmis, and G. Semeraro, "Content-based recommender systems: State of the art and trends," in Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds. Springer, 2011, pp. 73–105.
[3] R. White, J. M. Jose, and I. Ruthven, "Comparing explicit and implicit feedback techniques for web retrieval: TREC-10 interactive track report," in Proceedings of TREC, 2001.
[4] H. Wang, N. Wang, and D. Yeung, "Collaborative deep learning for recommender systems," in Proceedings of SIGKDD, 2015, pp. 1235–1244.
[5] N. Usunier, M. Amini, and P. Gallinari, "A data-dependent generalisation error bound for the AUC," in ICML '05 Workshop on ROC Analysis in Machine Learning, Bonn, Germany, 2005.
[6] M.-R. Amini and N. Usunier, Learning with Partially Labeled and Interdependent Data. New York, NY, USA: Springer, 2015.
[7] S. Janson, "Large deviations for sums of partly dependent random variables," Random Structures and Algorithms, vol. 24, no. 3, pp. 234–248, 2004.
[8] N. Usunier, M.-R. Amini, and P. Gallinari, "Generalization error bounds for classifiers trained with interdependent data," in Proceedings of NIPS, 2006, pp. 1369–1376.
[9] C. McDiarmid, "On the method of bounded differences," Surveys in Combinatorics, pp. 148–188, 1989.
[10] L. Ralaivola and M. Amini, "Entropy-based concentration inequalities for dependent variables," 2015, pp. 2436–2444.
[11] V. Vapnik, The Nature of Statistical Learning Theory. Springer Science & Business Media, 2000.
[12] M. Volkovs and G. W. Yu, "Effective latent models for binary feedback in recommender systems," in Proceedings of SIGIR, 2015, pp. 313–322.
[13] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "BPR: Bayesian personalized ranking from implicit feedback," in Proceedings of UAI, 2009, pp. 452–461.
[14] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade, 2nd ed., 2012, pp. 421–436.
[15] N. Ailon and M. Mohri, "An efficient reduction of ranking to classification," in Proceedings of COLT, 2008, pp. 87–98.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.
[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of NIPS, 2013, pp. 3111–3119.
[18] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Proceedings of NIPS, 2014, pp. 2177–2185.
[19] É. Guàrdia-Sebaoun, V. Guigue, and P. Gallinari, "Latent trajectory modeling: A light and efficient way to introduce time in recommender systems," in Proceedings of RecSys, 2015, pp. 281–284.
[20] D. Liang, J. Altosaar, L. Charlin, and D. M. Blei, "Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence," in Proceedings of RecSys, 2016, pp. 59–66.
[21] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, and D. Sharp, "E-commerce in your inbox: Product recommendations at scale," in Proceedings of SIGKDD, 2015, pp. 1809–1818.
[22] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua, "Neural collaborative filtering," in Proceedings of WWW, 2017, pp. 173–182.
[23] F. Vasile, E. Smirnova, and A. Conneau, "Meta-Prod2Vec: Product embeddings using side-information for recommendation," in Proceedings of RecSys, 2016, pp. 225–232.
[24] T. Liu, "Learning to rank for information retrieval," Foundations and Trends in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
[25] K. Crammer and Y. Singer, "Pranking with ranking," in Proceedings of NIPS, 2001, pp. 641–647.
[26] P. Li, C. J. C. Burges, and Q. Wu, "McRank: Learning to rank using multiple classification and gradient boosting," in Proceedings of NIPS, 2007, pp. 897–904.
[27] Y. Shi, M. Larson, and A. Hanjalic, "List-wise learning to rank with matrix factorization for collaborative filtering," in Proceedings of RecSys, 2010, pp. 269–272.
[28] J. Xu and H. Li, "AdaRank: A boosting algorithm for information retrieval," in Proceedings of SIGIR, 2007, pp. 391–398.
[29] J. Xu, T. Liu, M. Lu, H. Li, and W. Ma, "Directly optimizing evaluation measures in learning to rank," in Proceedings of SIGIR, 2008, pp. 107–114.
[30] W. W. Cohen, R. E. Schapire, and Y. Singer, "Learning to order things," Journal of Artificial Intelligence Research (JAIR), vol. 10, pp. 243–270, 1999.
[31] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer, "An efficient boosting algorithm for combining preferences," Journal of Machine Learning Research, vol. 4, pp. 933–969, 2003.
[32] T. Joachims, "Optimizing search engines using clickthrough data," in Proceedings of SIGKDD, 2002, pp. 133–142.
[33] J. Pessiot, T. Truong, N. Usunier, M. Amini, and P. Gallinari, "Learning to rank for collaborative filtering," in Proceedings of ICEIS, 2007, pp. 145–151.
[34] R. Caruana, S. Baluja, and T. M. Mitchell, "Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation," in Proceedings of NIPS, 1995, pp. 959–965.
[35] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender, "Learning to rank using gradient descent," in Proceedings of ICML, 2005, pp. 89–96.
[36] L. Rigutini, T. Papini, M. Maggini, and M. Bianchini, "A neural network approach for learning object ranking," in Proceedings of ICANN, 2008, pp. 899–908.
[37] L. Rigutini, T. Papini, M. Maggini, and F. Scarselli, "SortNet: Learning to rank by a neural preference function," IEEE Transactions on Neural Networks, vol. 22, no. 9, pp. 1368–1380, 2011.
[38] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Transactions on Interactive Intelligent Systems, vol. 5, no. 4, pp. 19:1–19:19, Dec. 2015.
[39] S. Sidana, C. Laclau, M.-R. Amini, G. Vandelle, and A. Bois-Crettez, "KASANDR: A large-scale dataset with implicit feedback for recommendation," in Proceedings of SIGIR, 2017.
[40] Y. Hu, Y. Koren, and C. Volinsky, "Collaborative filtering for implicit feedback datasets," in Proceedings of ICDM, 2008, pp. 263–272.
[41] M. Kula, "Metadata embeddings for user and item cold-start recommendations," in Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems, co-located with RecSys, 2015, pp. 14–21.
[42] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[43] E. Lehmann and H. D'Abrera, Nonparametrics: Statistical Methods Based on Ranks. Springer, 2006.

Sumit Sidana is a PhD student at Grenoble Alpes University, Grenoble. He received a bachelor's degree in Information Technology from SVIET, India, and a master's degree in computer science from IIIT, Hyderabad. His research interests include machine learning, recommender systems, probabilistic graphical models and deep learning.

Mikhail Trofimov is a PhD student at the Federal Research Center Computer Science and Control of the Russian Academy of Sciences, Moscow. He received a bachelor's degree in applied math and physics and a master's degree in intelligent data analysis from the Moscow Institute of Physics and Technology. His research interests include machine learning on sparse data, tensor approximation methods and counterfactual learning.

Oleg Gorodnitskii is an MSc student at the Skoltech Institute of Science and Technology. He received a bachelor's degree in applied math and physics from the Moscow Institute of Physics and Technology. His research interests include machine learning, recommender systems, optimization methods and deep learning.

Charlotte Laclau received the PhD degree from the University of Paris Descartes in 2016. She is a postdoctoral researcher in the Machine Learning group at the University of Grenoble Alpes. Her research interests include statistical machine learning, data mining and particularly unsupervised learning in information retrieval.

Yury Maximov is a postdoc at CNLS and the Theoretical Division of Los Alamos National Laboratory, and Assistant Professor at the Center of Energy Systems at the Skolkovo Institute of Science and Technology. His research is in optimization methods and machine learning theory.

Massih-Réza Amini is a Professor at the University of Grenoble Alpes and head of the Machine Learning group. His research is in statistical machine learning and he has contributed to developing machine learning techniques for information retrieval and text mining.
