Choice by Elimination via Deep Neural Networks


Authors: Truyen Tran, Dinh Phung, Svetha Venkatesh

Centre for Pattern Recognition and Data Analytics
Deakin University, Geelong, Australia
{truyen.tran,dinh.phung,svetha.venkatesh}@deakin.edu.au

October 1, 2018

Abstract

We introduce Neural Choice by Elimination, a new framework that integrates deep neural networks into probabilistic sequential choice models for learning to rank. Given a set of items to choose from, the elimination strategy starts with the whole item set and iteratively eliminates the least worthy item in the remaining subset. We prove that choice by elimination is equivalent to marginalizing out random Gompertz latent utilities. Coupled with the choice model are the recently introduced highway networks for approximating arbitrarily complex rank functions. We evaluate the proposed framework on a large-scale public dataset with over 425K items, drawn from the Yahoo! learning to rank challenge, and demonstrate that the proposed method is competitive against state-of-the-art learning to rank methods.

1 Introduction

People often rank options when making choices. Ranking is central in many social and individual contexts, ranging from elections [16], sports [17], information retrieval [22], and question answering [1] to recommender systems [32]. We focus on a setting known as learning to rank (L2R), in which the system learns to choose and rank items (e.g., a set of documents, potential answers, or shopping items) in response to a query (e.g., keywords, a question, or a user). The two main elements of an L2R system are the rank model and the rank function. One of the most promising rank models is the listwise model [22], in which all items responding to the query are considered simultaneously. Most existing work in L2R focuses on designing listwise rank losses rather than formal models of choice, a crucial aspect of building preference-aware applications. One of the few exceptions is the Plackett-Luce model, which originates from Luce's axioms of choice [24] and was later used in the context of L2R under the name ListMLE [34].

The Plackett-Luce model offers a natural interpretation of making sequential choices. First, the probability of choosing an item is proportional to its worth. Second, once the most probable item is chosen, the next item is picked from the remaining items in the same fashion. However, the Plackett-Luce model suffers from two drawbacks. First, the model spends effort separating items far down the rank list, even though only the first few items usually matter in practice; the effort of getting the ordering right should be spent on the more important items. Second, Plackett-Luce is inadequate for explaining many competitive situations, e.g., sport tournaments and buying preferences, where the ranking process is reversed: the worst items are eliminated first [33].

Addressing these drawbacks, we introduce a probabilistic sequential rank model termed choice by elimination. At each step, we remove one item and repeat until no item is left. The rank of the items is then the reverse of the elimination order.

[Figure 1: Neural choice by elimination with 4-layer highway networks for ranking three items (A, B, C). Empty boxes represent hidden layers that share the same parameters (hence, recurrent). Double circles are gates that control information flow. Dark filled circles represent the chosen item at each step; here the elimination order is (B, A, C).]
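To make the sequential process concrete: with the three items of Fig. 1, the elimination order (B, A, C) yields the ranking (C, A, B). The following minimal NumPy sketch is our illustration, not code from the paper; the softmin elimination rule anticipates Eq. (4) below, and the example scores are made up:

```python
import numpy as np

def sample_elimination_ranking(scores, rng):
    """Draw a ranking by repeatedly eliminating the least worthy item.

    Each remaining item is eliminated with probability proportional to
    exp(-score); the ranking is the reverse of the elimination order.
    """
    remaining = list(range(len(scores)))
    eliminated = []
    while remaining:
        p = np.exp(-np.asarray([scores[i] for i in remaining]))
        p /= p.sum()                       # softmin over surviving items
        k = rng.choice(len(remaining), p=p)
        eliminated.append(remaining.pop(k))
    return eliminated[::-1]                # best-to-worst ranking

rng = np.random.default_rng(0)
print(sample_elimination_ranking(np.array([2.0, 0.5, 1.0]), rng))
```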
This elimination process has an important property: near the end of the process, only the best items compete against one another. This is unlike the selection process in Plackett-Luce, where the best items are contrasted against all other alternatives. We may face difficulty separating items of similarly high quality, but we can ignore irrelevant items effortlessly. The elimination model thus spends more of its effort on ranking the worthy items.

Once the ranking model has been specified, the next step in L2R is to design a rank function $f(x)$ of query-specific item attributes $x$ [22]. We leverage the newly introduced highway networks [28] as a rank function approximator. Highway networks are a compact deep neural network architecture that allows information and gradients to pass through hundreds of hidden layers between the input features and the function output. The highway networks coupled with the proposed elimination model constitute a new framework termed Neural Choice by Elimination (NCE), illustrated in Fig. 1. The framework is an alternative to the current state of the art in L2R, which involves tree ensembles [12, 5, 4] trained with hand-crafted metric-aware losses [7, 21]. Unlike tree ensembles, where typically hundreds of trees are maintained, highway networks can be trained with dropout [27] to produce an implicit ensemble with only one thin network. Hence we aim to establish that deep neural networks are competitive in L2R. While shallow neural networks have been used in ranking before [6], they were outperformed by tree ensembles [21]. Deep neural nets are compact and more powerful [3], but they have not been measured against tree-based ensembles on generic L2R problems. We empirically demonstrate the effectiveness of the proposed ideas on a large-scale public dataset from the Yahoo! L2R challenge, with 18.4 thousand queries and 425 thousand documents in total.

To summarize, our paper makes the following contributions: (i) introducing a new neural sequential choice model for learning to rank; and (ii) establishing that deep nets are scalable and competitive as rank function approximators in large-scale settings.

2 Background

2.1 Related Work

The elimination process has been observed in multiple competitive situations, such as multiple-round contests and buying decisions [14]. Choice by elimination of distractors has long been studied in the psychological literature, since the pioneering work of Tversky [33]. These backward elimination models may offer a better explanation than forward selection when eliminating aspects are available [33]. However, existing studies mostly concern selecting a single best choice; multiple sequential eliminations are much less studied [2]. Second, most prior work has been evaluated on a handful of items with several attributes, whereas we consider hundreds of thousands of items with thousands of attributes. Third, the cross-field connection with data mining has not been made. The link between choice models and Random Utility Theory has been well studied since Thurstone in the 1920s and remains an active topic [2, 30, 31]. Deep neural networks for L2R have been studied in the last two years [18, 10, 26, 11, 25]. Our work contributes a formal account of human choice, together with the newly introduced highway networks, validated on large-scale public datasets against state-of-the-art methods.
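Before turning to the rank models, note that the sections below treat the rank function $f(x)$ as a black box. The following sketch shows the kind of shared-parameter highway stack depicted in Fig. 1 that can play this role; the tanh transform, the gate parameterization, the negative gate bias, and all names are our illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_rank_score(x, W_h, b_h, W_t, b_t, w_out, depth=4):
    """Map item attributes x to a scalar rank score f(x) with a
    recurrent (parameter-shared) highway stack."""
    h = x
    for _ in range(depth):
        t = sigmoid(W_t @ h + b_t)        # transform gate in (0, 1)
        h_hat = np.tanh(W_h @ h + b_h)    # candidate nonlinear update
        h = t * h_hat + (1.0 - t) * h     # carry the rest of h unchanged
    return float(w_out @ h)

d = 8
rng = np.random.default_rng(0)
W_h, W_t = rng.normal(0, 0.1, (2, d, d))
b_h, b_t = np.zeros(d), -np.ones(d)       # negative gate bias favours carrying
w_out = rng.normal(0, 0.1, d)
print(highway_rank_score(rng.normal(size=d), W_h, b_h, W_t, b_t, w_out))
```

The carry path $(1 - t) \odot h$ is what lets information and gradients flow through very deep stacks, which is the property the framework relies on.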
2.2 Plackett-Luce

We now review the Plackett-Luce model [24], a forward selection method in learning to rank, also known in the L2R literature as ListMLE [34]. Given a query and a set of response items $I = (1, 2, \dots, N)$, the rank choice is an ordering of items $\pi = (\pi_1, \pi_2, \dots, \pi_N)$, where $\pi_i$ is the index of the item at rank $i$. For simplicity, assume that each item $\pi_i$ is associated with a set of attributes, denoted by $x_{\pi_i} \in \mathbb{R}^p$. A rank function $f(x_{\pi_i})$ is defined on $\pi_i$ and is independent of the other items.

We aim to characterize the rank permutation model $P(\pi)$. Let us start from the classic probabilistic result that any joint distribution of $N$ variables can be factorized according to the chain rule:

\[
P(\pi) = P(\pi_1) \prod_{i=2}^{N} P(\pi_i \mid \pi_{1:i-1}) \tag{1}
\]

where $\pi_{1:i-1}$ is shorthand for $(\pi_1, \pi_2, \dots, \pi_{i-1})$, and $P(\pi_i \mid \pi_{1:i-1})$ is the probability that item $\pi_i$ has rank $i$ given all existing higher ranks $1, \dots, i-1$. The factorization can be interpreted as follows: choose the first item in the list with probability $P(\pi_1)$, then choose the second item from the remaining items with probability $P(\pi_2 \mid \pi_1)$, and so on. Luce's axioms of choice assert that an item is chosen with probability proportional to its worth. This translates into the following choice model:

\[
P(\pi_i \mid \pi_{1:i-1}) = \frac{\exp(f(x_{\pi_i}))}{\sum_{j=i}^{N} \exp(f(x_{\pi_j}))}
\]

Learning by maximum likelihood minimizes the log-loss:

\[
\ell_1(\pi) = \sum_{i=1}^{N-1} \left[ -f(x_{\pi_i}) + \log \sum_{j=i}^{N} \exp(f(x_{\pi_j})) \right] \tag{2}
\]

3 Choice by Elimination

We note that the factorization in Eq. (1) is not unique: if we permute the indices of the items, the factorization still holds. Here we derive a reverse Plackett-Luce model as follows:

\[
P(\pi) = Q(\pi_N) \prod_{i=1}^{N-1} Q(\pi_i \mid \pi_{i+1:N}) \tag{3}
\]

where $Q(\pi_i \mid \pi_{i+1:N})$ is the probability that item $\pi_i$ receives rank $i$ given all existing lower ranks $i+1, i+2, \dots, N$. Since $\pi_N$ is the most irrelevant item in the list, $Q(\pi_N)$ can be regarded as the probability of eliminating that item. The entire process is thus backward elimination: the next irrelevant item $\pi_k$ is eliminated, given that the more extraneous items $\pi_i$, $i > k$, have already been eliminated. It is reasonable to assume that the probability of an item being eliminated is inversely proportional to its worth, which suggests the following specification:

\[
Q(\pi_i \mid \pi_{i+1:N}) = \frac{\exp(-f(x_{\pi_i}))}{\sum_{j=1}^{i} \exp(-f(x_{\pi_j}))} \tag{4}
\]

Note that, owing to the specific choices of conditional distributions, the distributions in Eqs. (1) and (3) are generally not the same. With this model, the log-loss takes the form:

\[
\ell_2(\pi) = \sum_{i=1}^{N} \left[ f(x_{\pi_i}) + \log \sum_{j=1}^{i} \exp(-f(x_{\pi_j})) \right] \tag{5}
\]
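Both log-losses reduce to a handful of log-sum-exp terms over the scores. The minimal NumPy/SciPy sketch of Eqs. (2) and (5) below is ours, for illustration; a practical implementation would use stable batched operations inside an automatic differentiation framework:

```python
import numpy as np
from scipy.special import logsumexp

def listmle_loss(scores):
    """Plackett-Luce log-loss, Eq. (2); scores[i] = f(x_{pi_i}),
    listed from rank 1 (best) to rank N (worst)."""
    return sum(-scores[i] + logsumexp(scores[i:])
               for i in range(len(scores) - 1))

def elimination_loss(scores):
    """Choice-by-elimination log-loss, Eq. (5); the i-th term scores
    eliminating pi_i from the surviving set {pi_1, ..., pi_i}."""
    return sum(scores[i] + logsumexp(-scores[:i + 1])
               for i in range(len(scores)))

scores = np.array([2.0, 1.0, 0.0])   # a list already in decreasing worth
print(listmle_loss(scores), elimination_loss(scores))
```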
3.1 Derivation using Random Utility Theory

Random Utility Theory [2, 29] offers an alternative explanation of the ordering of items. Assume that there exist latent utilities $\{u_i\}$, one per item $\{\pi_i\}$; the ordering probability $P^*(\pi)$ is defined as $\Pr(u_1 \ge u_2 \ge \dots \ge u_N)$. Here we show that choice by elimination is linked to the Gompertz distribution. Let $u_j \ge 0$ denote the latent random utility of item $\pi_j$ and let $v_j = e^{b u_j}$. The Gompertz distribution has the PDF

\[
P_j(u_j) = b \eta_j v_j \exp(-\eta_j v_j + \eta_j)
\]

and the CDF

\[
F_j(u_j) = 1 - \exp(-\eta_j (v_j - 1)),
\]

where $b > 0$ is the scale and $\eta_j > 0$ is the shape parameter. At rank $i$, choosing the worst item $\pi_i$ translates into ensuring $u_i \le u_j$ for all $j < i$. Random Utility Theory states that the probability of choosing $\pi_i$ is obtained by integrating out all latent utilities subject to these inequality constraints:

\[
Q(\pi_i \mid \pi_{i+1:N}) = \int_0^{+\infty} P_i(u_i) \prod_{j<i} \left( \int_{u_i}^{+\infty} P_j(u_j) \, du_j \right) du_i
= \int_0^{+\infty} P_i(u_i) \prod_{j<i} \big( 1 - F_j(u_i) \big) \, du_i.
\]

Setting $\eta_j = \exp(-f(x_{\pi_j}))$ and carrying out the integral recovers the softmin specification of Eq. (4), so choice by elimination is exactly the marginalization of random Gompertz latent utilities.
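As a quick numerical sanity check of this equivalence, one can simulate the latent utilities and compare how often each item attains the minimum utility against the softmin probabilities of Eq. (4). The sketch below is ours; it assumes $b = 1$ and the shape parameterization $\eta_j = \exp(-f(x_{\pi_j}))$ under which the marginalization recovers Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.array([1.0, 0.0, -0.5])   # rank scores f(x) for three items
eta = np.exp(-f)                 # assumed Gompertz shape per item
b = 1.0                          # common scale; any b > 0 gives the same result

# Inverse-CDF sampling: if E ~ Exp(1), then u = log(1 + E/eta) / b has
# survival function exp(-eta * (exp(b*u) - 1)), i.e. it is Gompertz(b, eta).
n = 200_000
E = rng.exponential(size=(n, len(f)))
u = np.log1p(E / eta) / b

# The first item eliminated is the one with the smallest latent utility.
first_out = np.argmin(u, axis=1)
empirical = np.bincount(first_out, minlength=len(f)) / n

softmin = eta / eta.sum()        # Eq. (4) applied to the full item set
print(empirical)                 # should match softmin up to Monte Carlo error
print(softmin)
```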