DUM: Diversity-Weighted Utility Maximization for Recommendations

DUM: Diver sity-W eighted Utility Maximization f or Recommendations Azin Ashkan ∗ , Branisla v Kv eton ∗ , Shlomo Berkovsky ∗∗ , Zheng W en ∗∗∗ ∗ T echnicolor , United States {azin.ashkan,branislav.kveton}@technicolor.com ∗∗ CSIR O, Australia shlomo.berkovsky@csiro.au ∗∗∗ Y ahoo Labs, United States zhengwen@yahoo-inc.com ABSTRA CT The need for div ersiﬁcation of recommendation lists manifests in a number of recommender systems use cases. Howe ver , an increase in div ersity may undermine the utility of the recommendations, as relev ant items in the list may be replaced by more diverse ones. In this work we propose a novel method for maximizing the utility of the recommended items subject to the div ersity of user’ s tastes, and show that an optimal solution to this problem can be found greed- ily . W e ev aluate the proposed method in two online user studies as well as in an ofﬂine analysis incorporating a number of ev alua- tion metrics. The results of evaluations sho w the superiority of our method ov er a number of baselines. K eywords Recommender systems, polymatroid, div ersity , utility 1. INTR ODUCTION The popularity of recommender systems has soared in the re- cent years. They are widely used in social networks, entertain- ment, eCommerce, W eb search, and many other online services [20]. Recommenders deal with the information overload problem and select items on behalf of their users. T ypically , a recommender scores recommendable items according to their match to the user’ s preferences and interests, as encapsulated in the user proﬁles, and then recommends a list of top-scoring items. A naïv e selection of top-scoring items may , howe ver , yield a sub- optimal recommendation list. For instance, collaborativ e recom- menders tend to recommend most popular items appearing in the proﬁles of numerous users [13]. While being good recommenda- tions on their own, these items are likely to be known to the user and bear little value. Likewise, content-based recommenders may target user’ s f av orite topics and recommend homogeneous lists that ov erlook other potentially interesting topics [16]. This has brought to the fore the problem of di versity in recommender systems, which has been studied in a number of works [5, 10, 17, 24, 31]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proﬁt or commercial advantage and that copies bear this notice and the full citation on the ﬁrst page. T o copy otherwise, to republish, to post on servers or to redistrib ute to lists, requires prior speciﬁc permission and/or a fee. Copyright 20XX A CM X-XXXXX-XX-X/XX/XX ...$15.00. In a nutshell, the diversity problem deals with the construction of recommendation lists that co ver as wide as possible range of topics of interest. The problem is particularly acute for users with eclectic interests, having no single dominant topic but rather interested in multiple topics. In this case, it is important for the recommenda- tion list to include items that touch upon many topics, in order to increase the chance of answering the current user’ s need. Repercus- sions of the diversity problem can be recognized also in other rec- ommender system use cases. Consider the group recommendation problem in heterogeneous (in terms of interests) groups. Another example of the need for di versity is in sequential recommendations, like in the music domain. In both cases, the recommendation list should incorporate diverse items that either appeal to a number of group members or represent a number of music genres [29]. The need for di versity manifests itself also be yond recommender systems. Consider an ambiguous W eb search query . Having no knowledge about the context of the query , a search engine may present results pertaining to different interpretations of the query , so that the user can pick the desired one and reformulate the query [3]. Another instance comes from text summarization. Unless the de- sired topic of the summary is kno wn, it should include references to as many aspects of the original document as possible. Also, div er- siﬁcation may be useful in computer supported collaborati ve work. There, formation of virtual groups may need to bring together users with complementary skills and expertise areas, such that the div er - sity of the group is important. In all the above diversiﬁcation use cases, it is of paramount im- portance to maintain the trade-of f between increasing the list div er- sity and maintaining the utility of the results [4, 11, 30]. Di versity typically comes at the account of decreasing the rele vance of items, as relev ant but redundant items are substituted with less rele vant but more diverse ones. Hence, there is a need to strike the balance be- tween the two objectives [4], a modular relev ance function and a submodular div ersity function, for which an approximation to the optimal solution can be computed greedily [18]. In this work, we introduce a different objectiv e diversiﬁcation function and sho w that an optimal solution to the di versity problem can be found greedily . W e propose a parameter-free method, de- noted as diversity-weighted utility maximization ( DUM ), which max- imizes the utility of the items recommended to users, subject to the div ersity of their interests. W e cast this problem as ﬁnding the max- imum of a modular function subject to a submodular constraint [8], which is known to hav e an optimal greedy solution. This solution guarantees that items in the recommendation list cov er different interests in the user proﬁle, such that each topic of interest is repre- sented by items with high utility . In other words, the utility of items remains the primary concern, but it is subjecti ve to maintaining the div ersity and avoiding redundancy in the list. W e discuss sev eral interpretations of DUM and identify suitable submodular div ersity functions. W e conduct an extensi ve ev aluation of the proposed approach. W e present two online studies using crowdsourcing, which com- pare the percei ved quality of the lists generated by DUM with base- line settings maximizing a linear combination of utility and diver - sity . The results show the superiority of the lists generated by DUM ov er the baseline methods and we characterize the cases when this superiority is prominent. W e also present an ofﬂine ev aluation that applies a variety of metrics to (a) exemplify the trade-off between div ersity and utility in recommendations; and (b) demonstrate that DUM successfully outperforms the baselines. Overall, our analyses demonstrate that DUM can effecti vely deliv er personalized recom- mendations with high degree of utility and diversity , while not re- quiring a-priori parameter tuning. In summary , the contribution of this work is two-fold. Firstly , we propose a parameter-free and computationally efﬁcient method aimed at improving the diversity of the recommendation lists, while maintaining their utility . Secondly , we present experimental ev al- uations – online user studies and ofﬂine experiments alike – that demonstrate solid empirical e vidence supporting the validity of the proposed approach. Note: The following notation is used throughout this paper . Let A and B be sets, and e be an element of a set. W e use A + e instead of A ∪ { e } and A + B instead of A ∪ B . Furthermore, we use A − e instead of A \ { e } and A − B instead of A \ B . W e represent or der ed sets by vectors and also refer to them as lists . 2. RELA TED WORK A common approximation to diversiﬁed ranking is based on the notion of maximal mar ginal r elevance ( MMR ) proposed by Carbonell and Goldstein [4]. In this approach, utility (e.g., relev ance) and di- versity are represented by independent metrics. Marginal relev ance is deﬁned as a weighted combination of these two metrics, to ac- count for the trade-off between utility and div ersity . Gi ven a stan- dard ranking of items, R , a diversiﬁed re-ranking of these items, S , is created (Algorithm 1). In each iteration, an item e ∗ ∈ R − S is chosen, such that it maximizes the marginal rele vance: e ∗ = arg max e ∈ R − S (1 − λ ) w ( e ) + λf ( S + e ) (1) where w ( . ) and f ( . ) represent the notions of utility and div ersity , respectiv ely , and the parameter λ controls the trade-of f between the two. T ypically , the utility w is a modular function of S , whereas the div ersity f is a submodular function of S . The existing approaches differ in ho w they account for dif ferent aspects of query or user (or any other entity of interest) to model f ( . ) . Implicit approaches assume that similar items should be penal- ized since they cover similar aspects. For instance, Y u et al. [26] compute f ( S + e ) = − max d 0 ∈ S Sim( e, e 0 ) to measure the redun- dancy of user intent e with respect to a set of selected intents S , where Sim( e, e 0 ) is the cosine similarity between the user intents e and e 0 . Gollapudi and Sharma [9] propose multiple div ersiﬁcation objectiv es considering the tradeof f between relevance and div ersity and using various axioms, relevance functions, and distance func- tions. Their distance functions are deﬁned based on various implicit metrics, e.g., document content, in order to capture the pairwise similarity between any pair of documents. On the other hand, explicit approaches model different aspects (e.g., query intent, query topic, or movie genre) directly , and pro- mote div ersity by maximizing the coverage of selected items with Algorithm 1 MMR : Maximal Marginal Relev ance Input: Standard ordering of items R S ← () , n = | R | while | S | < n do e ∗ ← arg max e ∈ R − S λ w ( e ) + (1 − λ ) f ( S + e ) R ← R − e ∗ Append item e ∗ to list S end while Output: List of recommended items S respect to these aspects. For instance, Santos et al. [21] deﬁne f ( S + e ) = Σ t ∈T q P ( t | q ) P ( e, ¯ S | t ) where P ( d, ¯ S | t ) represents the likelihood of document e satisfying topic t while the ones in S fail to do so. Also, P ( t | q ) denotes the popularity of t among all possible topics T q that may satisfy a user’ s information need from issuing query q . In addition to the abov e approaches in di versifying existing rank- ings, another group of work directly learns a diverse ranking by maximizing a submodular objective function. Among these ap- proaches, Radlinski et al. [19] and Y ue and Guestrin [27] propose online learning algorithms for optimizing a class of submodular objectiv e functions for div ersiﬁed retriev al and recommendation, respectiv ely . Agrawal et al. [1], on the other hand, address search result div ersiﬁcation in an ofﬂine setting, with respect to the topical categories of documents. The authors target the maximization of a submodular objecti ve function following the deﬁnition of mar ginal relev ance. They propose a greedy algorithm to approximate the ob- jectiv e function and show that an optimal solution can be found in a special case, where each document belongs to exactly one category . V allet and Castells [23] study personalization in combination with div ersity such that the two objectiv es complement each other in addressing various query aspects and satisfying user needs. In particular , they generalize the work of Agrawal et al. [1] and San- tos et al. [21] to the personalized versions by exploiting av ailable information about user preferences. Most of these studies target diversity in information retriev al, while there has been a growing interest in recommendation di versi- ﬁcation more recently . One of the initial works in recommendation div ersiﬁcation is by Ziegler et al. [31], who argue that user satisfac- tion does not solely depend on the accuracy of recommendation re- sults. The authors propose a similarity metric, the intra-list similar- ity (ILS), which computes the average pairwise similarity of items in a list. A higher value of the metric denotes a lower diversity . They use this metric in their topic div ersiﬁcation model to control a balance between the accuracy and div ersity of recommendations. Zhang et al. [28] formulate the div ersiﬁcation problem as ﬁnding the best possible subset of items to be recommended over all possi- ble subsets. They address this as the maximization of the diversity of a list of recommended items, subject to maintaining the accuracy of the items. Zhou et al. [30] propose a hybrid method that targets the maximization of a weighted combination of independent utility- and diversity-based approaches requiring parameter tuning to con- trol the tradeoff between the objecti ves of the two approaches. Most of the existing div ersiﬁcation approaches consider the max- imization of an objectiv e function to satisfy a user’ s need in terms of utility and div ersity of the result list. Most of these approaches are based on the idea behind the maximal marginal relev ance where a submodular objecti ve function (Equation 1) is maximized. There- fore, a (1 − 1 /e ) -approximation to the optimal solution can be com- puted greedily [18]. This paper is an extension to our prior work in [2], where we introduce a ne w objecti ve function for recommen- dation diversiﬁcation, the optimal solution of which can be found greedily . This objectiv e function targets the utility as the primary concern, and maximizes it subject to maintaining the div ersity of user’ s tastes. In this paper , we elaborate on the intuitions and de- tails behind the proposed greedy algorithm, and provide extensiv e online and ofﬂine ev aluations on its performance in practice. W e show that this method is computationally efﬁcient and parameter- free, and it guarantees that high-utility items appear at the top of the recommendation list, as long as they contribute to the div ersity of the list. 3. MO TIV A TING EXAMPLES In this section, we discuss sev eral motiv ating examples for our work. A more formal description of our method and its analysis are presented in Section 4. Consider the following recommendation problem. A system rec- ommends to a user movies from a ground set of four mo vies: ID Movie Utility Action Comedy e name w ( e ) 1 Inception 0.8 X 2 Spider-Man 2 0.7 X 3 Grown Ups 2 0.5 X 4 The Sweep 0.2 X The user likes either Action or Comedy movies, depending on the mood of the user , but the system does not know the user’ s mood. The user chooses the ﬁrst recommended movie e that matches the genre that the user currently prefers and is satisﬁed proportionally to the utility of the movie w ( e ) , the probability that e is liked. Our goal is to recommend a minimal list of movies that maximizes the user’ s satisfaction and also covers all user’ s preferences, irrespec- tiv e of the user’ s mood. The optimal solution to our problem is a list of two movies, S = (1 , 3) . When the user prefers Action movies, the user selects the ﬁrst recommended movie in the list, Inception , and is satisﬁed with probability 0 . 8 . This is substantially greater than if Spider-Man 2 , another Action movie, was in the list instead of Inception . On the other hand, when the user prefers Comedy movies, the user selects the second recommended movie in the list, Gr own Ups 2 , and is satisﬁed with probability 0 . 5 . This is substantially greater than if The Sweep , another Comedy movie, was in the list instead of Gr own Ups 2 . Note that the solution S can be computed greedily . In particular, S is a list of two highest-utility mo vies, one from each genre. Now suppose that we add to the ground set a movie that is both Action and Comedy , and its utility is 0 . 7 : ID Movie Utility Action Comedy e name w ( e ) 1 Inception 0.8 X 2 Spider-Man 2 0.7 X 3 Gro wn Ups 2 0.5 X 4 The Sweep 0.2 X 5 Kindergarten Cop 0.6 X X The optimal solution to the problem is a list S = (1 , 5) . When the user prefers Action movies, the user selects the ﬁrst recommended movie, Inception , and is satisﬁed with probability 0 . 8 . On the other hand, when the user prefers Comedy movies, the user selects the second recommended movie, Kindergarten Cop , and is satisﬁed with probability 0 . 6 . Note again that the solution S can be com- puted greedily . It is a list of two highest-utility movies, one from each genre. Finally , we replace the last mo vie in the ground set with a movie whose utility is 0.9: ID Movie Utility Action Comedy e name w ( e ) 1 Inception 0.8 X 2 Spider-Man 2 0.7 X 3 Gro wn Ups 2 0.5 X 4 The Sweep 0.2 X 5 Indiana Jones and 0.9 X X the Last Crusade The optimal solution to the problem is a single movie, S = (5) . The reason is that Indiana Jones and the Last Crusade is the highest- utility movie in both Action and Comedy . Hence, it is the best rec- ommendation irrespectiv e of the user’ s mood. Note again that the solution S can be computed greedily . It is the highest-utility movie that belongs to both genres In all three examples, the optimal solutions can be computed greedily . This is not by chance. In the next section, we generalize the ideas expressed in these examples and introduce the notion of div erse recommendations where the optimal solution can be found greedily . This is the main contribution of our paper . 4. DIVERSITY -WEIGHTED UTILITY MAX- IMIZA TION Our objective is to maximize the utility of recommending a list of items to a user subject to the di versity of their tastes. W e present the formal deﬁnition of our method in Section 4.1, followed by Sec- tion 4.2 where we sho w that the optimal solution of the method can be found efﬁciently . The interpretations and intuitions behind our method are explained in Section 4.3. W e show that the length of the list recommended by our method can be controlled by consid- ering different diversity constraints dependent on user preferences in Section 4.4. 4.1 Problem F ormulation Let E = { 1 , . . . , L } be a ground set of L recommendable items, such as movies or songs. Let w ∈ ( R + ) L be a vector of item utilities, such as item popularity scores or predicted ratings. The e -th entry of w , w ( e ) , is the utility of item e . The objectiv e of the diversiﬁcation method is to maximize the satisfaction of the user subject to the di versity of their tastes. How- ev er , an increase in di versity typically comes at the account of a de- crease in the utility of the items in the list, e.g., relev ant but redun- dant items are substituted by less relev ant but more diverse items. Addressing this tradeoff and striking the balance between increas- ing the diversity and maintaining the utility is an important chal- lenge for any div ersiﬁcation method. Considering the utility as the primary concern, we aim to expos the user to a variety of choices in the recommendation list, while losing the minimal amount of utility in the provision of these choices. In order to recommend a list of items that maximizes the user’ s utility of choice, we target at maximizing the utility of the recom- mendation list weighted by the increase in div ersity . In other words, each increase in diversity is covered by the item with the highest possible utility . Formally , our di versity-weighted utility maximiza- tion ( DUM ) problem is formulated as: A ∗ = arg max A ∈ Θ L X k =1 g A ( a k ) w ( a k ) , (2) where A = ( a 1 , . . . , a L ) is an ordered set of items E , Θ is the set of all permutations of E , and A ∗ = ( a ∗ 1 , . . . , a ∗ L ) is the optimal solution to the problem. The vector g A ∈ ( R + ) L are the gains in div ersity associated with items E . In particular: g A ( e ) = f ( A k − 1 + e ) − f ( A k − 1 ) (3) is the gain in div ersity associated with choosing item e giv en a set of previously chosen items in A , where k is such that a k = e and A k = { a 1 , . . . , a k } is an unordered set of the ﬁrst k items in A . The function f : 2 E → R + is a div ersity function from subsets of the ground set E to non-negativ e real numbers. The diversity function f can have many different forms. F or instance, f ( X ) can be the number of unique genres covered by movies X recommended by a recommender system. Alternatively , f ( X ) can be the av erage pairwise dissimilarity between a set of products X recommended by a shopping website. In this work, we assume that the div ersity function f is monotonically incr easing : ∀ X ⊆ E , e ∈ E − X : f ( X + e ) − f ( X ) ≥ 0 , (4) the diversity of any set X does not decrease when any item e is added to this set. This assumption is quite natural. W e also assume that f ( ∅ ) = 0 , the div ersity of the empty set is zero. This assump- tion is without loss of generality . In particular, it can be always satisﬁed by subtracting f ( ∅ ) from f . 4.2 Greedy Solution For a general monotonic function f , the optimization problem (2) is NP-hard. Ho we ver , when f is submodular , the problem can be cast as ﬁnding a maximum-weight basis of a polymatroid [8] and can be solved greedily . W e ﬁrst present the greedy algorithm and then argue that it is optimal. The pseudo-code of the greedy algorithm for diversity-weighted utility maximization ( DUM ) is shown in Algorithm 2. The algorithm works as follo ws. First, the items E are sorted in decreasing order according to their utility , w ( a ∗ 1 ) ≥ . . . ≥ w ( a ∗ L ) , and placed into A ∗ = ( a ∗ 1 , . . . , a ∗ L ) . Then we examine the items in this order . When g A ∗ ( a ∗ k ) > 0 , item a ∗ k is added to the list of recommended items S . When g A ∗ ( a ∗ k ) = 0 , item a ∗ k is not added to S because it does not contribute to the div ersity of S . Finally , the algorithm returns the recommendation list S . W e illustrate DUM on the second example in Section 3. In this example, A ∗ = (1 , 2 , 5 , 3 , 4) , and the diversity gains of movies 2 , 3 and 4 are zero due to the contribution of their preceding movies in the list. Therefore, these movies are not placed into the recom- mendation list, and S = (1 , 5) . DUM has several notable properties. First, it is parameter-fr ee . That is, DUM does not require any parameter tuning and therefore should be robust in practice. Second, DUM is a greedy method and therefore is computationally efﬁcient . In particular, suppose that the diversity function f is an oracle that can be queried in O (1) time. Then the time complexity of DUM is O ( L log L ) , comparable to the complexity of sorting L numbers. Finally , DUM computes the optimal solution to the optimization problem (2). In the rest of Section 4, we analyze DUM both in terms of A ∗ and S . Note that the solutions A ∗ and S are equiv alent in the sense that S is a list obtained from A ∗ by eliminating the items that hav e zero contribution in the objectiv e function (2). Therefore, the values of Algorithm 2 DUM : Diversity-W eighted Utility Maximization Input: Ground set E W eight vector w // Compute the maximum-weight basis of a polymatroid Let a ∗ 1 , . . . , a ∗ L be an ordering of items E such that: w ( a ∗ 1 ) ≥ . . . ≥ w ( a ∗ L ) A ∗ ← ( a ∗ 1 , . . . , a ∗ L ) // Generate the list of recommended items S S ← () for k = 1 , . . . , L do g A ∗ ( a ∗ k ) ← f ( A ∗ k − 1 + a ∗ k ) − f ( A ∗ k − 1 ) if ( g A ∗ ( a ∗ k ) > 0) then Append item a ∗ k to list S end if end for Output: List of recommended items S the solutions are identical. So the difference in treatment is purely technical and allows us to reduce o verhead in notation. The optimality of DUM can be argued based on the following ob- servation. Our optimization problem (2) is equiv alent to maximiz- ing a modular function on a polymatroid [8], a well-known combi- natorial optimization problem that can be solved greedily . In par- ticular , let M = ( E , f ) be a polymatroid, where E is its ground set and f is a submodular diversity function. Let: P M = (5) ( x : x ∈ R L , x ≥ 0 , ∀ X ⊆ E : X e ∈ X x ( e ) ≤ f ( X ) ) be the independence polyhedron associated with function f . Then the maximum-weight basis of M is deﬁned as: x ∗ = arg max x ∈ P M h w , x i , (6) where w ∈ ( R + ) L is a vector of non-negati ve weights. Because P M is a submodular polytope and the weights w are non-negati ve, the optimization problem (6) is equiv alent to ﬁnding the order of dimensions A in which h w , x i is maximized [8]. This problem can be written formally as (2) and has the same greedy solution as in DUM . In particular, the items E are sorted in decreasing order according to their weights, w ( a ∗ 1 ) ≥ . . . ≥ w ( a ∗ L ) , and placed into A ∗ = ( a ∗ 1 , . . . , a ∗ L ) . Finally , x ∗ = g A ∗ . 4.3 Interpr etation In this section, we discuss sev eral interpretations of DUM . With- out loss of generality , we assume that the different aspects of user’ s taste are represented by a ﬁnite set of topics T = { 1 , . . . , M } . For example, in a movie recommendation domain, these topics can be the genres of movies, such as T = { Drama , Comedy , Action } . Our ﬁrst observation is that if the div ersity of a set of items is measured by the number of unique topics covered by the items, then DUM generates a list of items, where each topic is covered by the highest-utility item in that topic. L E M M A 1. Let the diversity function f be deﬁned as the num- ber of topics cover ed by items X : f ( X ) = X t ∈T 1 {∃ e ∈ X : item e cov ers topic t } . Then DUM returns a recommendation list S , wher e each topic t is cover ed by the highest-utility item that belongs to t . Mor eover , the length of S is at most |T | . P R O O F . The ﬁrst claim is proved by contradiction. Let e ∗ t be the item with the highest utility that belongs to topic t . Suppose that item e ∗ t is not chosen by DUM , e ∗ t is not in list S generated by DUM . Then g A ∗ ( e ∗ t ) = 0 , which implies that another item must hav e cov ered topic t before item e ∗ t . Ho we ver , this is a contradiction, since e ∗ t is the item with the highest utility from t , and therefore DUM must hav e tested it before any other item that co vers t . The second claim follows from the fact that g A ∗ ( a ∗ k ) > 0 im- plies that the v alue of the di versity function f increases by at least one. By deﬁnition, f ( X ) ≤ |T | for any X . Therefore, the maxi- mum number of items added to S is |T | . Our second observation is that our objective (2) can be viewed as maximizing the expected utility of choosing an item when the div ersity gains g A ( e ) are viewed as the probabilities of choosing items. This interpretation is motiv ated by the cascade model [7] of user beha vior , which considers the relationship between successive items in a list. In this model, users scan the list from top to the bottom and eventually stop because either their information need is satisﬁed or their patience is exhausted. Speciﬁcally , note that for any ordering of items A : L X k =1 g A ( a k ) = L X k =1 [ f ( A k − 1 + a k ) − f ( A k − 1 )] = f ( E ) − f ( ∅ ) + L − 1 X k =1 [ f ( A k ) − f ( A k )] = f ( E ) . (7) The ﬁrst equality is due to the deﬁnition of the div ersity gains (3). The second equality follows from the fact that A k = A k − 1 + a k . The last equality is due to the observ ation that f ( ∅ ) = 0 . It follows that: ∀ e ∈ E : g A ( e ) f ( E ) ∈ [0 , 1] , 1 f ( E ) L X k =1 g A ( a k ) = 1 , (8) and therefore g A ( e ) /f ( E ) can be interpreted as the probability of choosing item e , gi ven that none of the earlier recommended items A k − 1 is chosen. Under this assumption, P L k =1 g A ( a k ) w ( a k ) is the expected utility of choosing an item, scaled up by f ( E ) . 4.4 Diversity Function The length of the recommendation list S generated by DUM de- pends on the div ersity function f . In extreme cases, this list may include all items. For instance, consider a problem where: ∀ X ⊆ E , e ∈ E − X : f ( X + e ) − f ( X ) > 0 , (9) the div ersity increases when any item e is added to any subset of items X . Then for any ordering A , g A ( e ) > 0 for all items e . As a result, g A ∗ ( e ) > 0 for all items e and DUM returns the list of all items sorted in the descending order of utility . This result is mathematically correct. But it is not a very useful diverse recom- mendation. T o get useful diverse recommendations, it is important to control the maximum number of items returned by DUM , e.g., by choosing appropriate div ersity functions. For instance, for the div ersity func- tion in Lemma 1, the maximum length of the recommendation list S is equal to the number of topics |T | . In this section, we general- ize the ideas from Section 4.3 and propose another class of diversity functions that are suitable for DUM . Consider the case where dif ferent users may have different toler- ance for redundancy in the recommendation list due to their inter- ests and preferences [14, 25]. These dif ferences can be modeled by a diversity function that assigns different weights to each topic of interest. In particular , the function can be deﬁned as: f ( X ) = X t ∈T min ( X e ∈ X 1 { item e cov ers topic t } , N t ) , (10) where N t is the number of items from topic t that is required to be in the recommendation list of a given user . In the next lemma, we characterize the output of DUM for the abov e function. L E M M A 2. Let the diversity function f be deﬁned as in (10) . Then DUM returns a r ecommendation list S such that each topic t is cover ed by at least N t items of the highest utility that cover topic t . Mor eover , the length of S is at most P t ∈T N t . P R O O F . The ﬁrst claim is proved by contradiction. Let e ∗ t,k be the k -th item with the highest utility from topic t , where k ≤ N t . Suppose that item e ∗ t,k is not chosen by DUM , e ∗ t,k is not in list S generated by DUM . Then g A ∗ ( e ∗ t,k ) = 0 , which implies that topic t must hav e been covered at least N t times before DUM tests item e ∗ t . Howe ver , note that this is a contradiction, since e ∗ t,k is among the ﬁrst N t items that cover topic t , and therefore among the ﬁrst N t items from that topic that are tested by DUM . The second claim follows from the fact that g A ∗ ( a ∗ k ) > 0 im- plies that the v alue of the di versity function f increases by at least one. By deﬁnition, f ( X ) ≤ P t ∈T N t for any X . Therefore, the maximum number of items added to S is P t ∈T N t . The div ersity function in (10) allows for controlling the length of the recommendation list S . In particular, if topic t is irrelev ant for the user , N t should be set to 0 . As a rule of thumb, more relev ant topics t should be assigned higher weights N t . 5. EXPERIMENTS The proposed method is ev aluated in two online user studies and in an of ﬂine ev aluation. In each case, we compare the performance of DUM to variants of MMR , since many existing diversiﬁcation ap- proaches are based on the objectiv e function of MMR (Section 2). In theory , the optimal solution to DUM can be found greedily , while MMR ﬁnds only a (1 − 1 /e ) -approximation to the optimal solution. Through the empirical ev aluation, we show that DUM satisﬁes the users’ needs better than MMR , and it is superior in recommending lists that satisfy utility and div ersity at the same time. W e conduct two online studies using Amazon’ s Mechanical T urk 1 (MT). In the ﬁrst study , we evaluate separately the recommendation lists generated by DUM and MMR , by asking MT workers to identify in the lists a movie that matches their genre of interest and indi- cate the relev ance of this movie. In the second study , we compare the DUM and MMR recommendation lists, by asking MT workers to judge the cov erage of two movie genres by the lists. W e also report the ﬁndings of an ofﬂine study , where we perform a ﬁne-grained assessment of the ev aluated methods by creating user preference proﬁles and considering various combinations of genres. 1 http://www .mturk.com Figure 1: A portion of our Mechanical T urk questionnaire in user study 1 . W e only show the ﬁrst two lists of recommended movies. 5.1 User Study 1 In the ﬁrst study , we ev aluate the div ersity and utility of DUM in a movie recommendation application. W e compare DUM to three variants of MMR , which are parametrized by λ ∈  1 3 , 2 3 , 0 . 99  . The ground set E are 10 k most frequently rated IMDb 2 movies. The utility of movie e , w ( e ) , is the number of ratings assigned to this movie. The values of w ( e ) are normalized such that the maxi- mum utility is 1 , i.e., max e ∈ E w ( e ) = 1 . The div ersity function f is deﬁned as in Lemma 1. W e also normalize f such that the maxi- mum div ersity of is 1, i.e., max e ∈ E f ( e ) = 1 . The set of topics T includes 8 most popular movie genres in E : T = { Drama , Comedy , Thriller , Romance , (11) Action , Crime , Adventur e , Horr or } . W e restrict our attention to 8 genres only since 8 genres can be always cov ered by 8 mo vies, a reasonably short list of movies that can be ev aluated by a MT worker . All methods are evaluated in 200 MT human intelligence tasks (HITs). In each HIT , we initially ask the worker to choose a genre of interest. Then, we generate four recommendation lists: one by DUM and three by MMR for different values of λ . The generation of the lists is independent of the genre chosen by the worker . Fi- nally , we ask the worker to ev aluate the lists. For each list, we ask two questions. First, we ask the worker to identify a movie in the list that matches the chosen genre. This question addresses the di- versity of the list, whether the chosen genre is covered by the list. The worker can also answer “none” if the list does not contain a 2 http://www .imdb .com DUM MMR λ = 1 3 λ = 2 3 λ = 0 . 99 List includes a movie that 84.0% 70.5% 67.0% 66.5% matches the chosen genre The chosen movie is 77.0% 64.5% 62.5% 62.5% a good recommendation T able 1: Comparison of DUM and MMR in user study 1 . For each method, we report the per centage of times that the worker ﬁnds a matching mo vie in the list and the per centage of times that the matching movie is a good r ecommendation. movie from the chosen genre. If the worker identiﬁes a matching movie, we ask the worker if the movie is a good recommendation for the chosen genre. This question addresses the utility of the list, whether the chosen genre is cov ered by a good movie in the list. A screenshot of our MT questionnaire is shown in Figure 1. In each HIT , we present the four recommendation lists in a ran- dom order . This eliminates the position bias . In addition, in each HIT the set of recommendable movies contains 3 . 3 k movies chosen at random from the 10 k movies in E . Hence, the recommendation lists differ across the HITs, which eliminates the item bias , i.e., the workers cannot prefer one method over another only because the recommended movies are inherently more likable. All the rec- ommendation lists are of the same length – the length of the list produced by DUM . W e adopt this methodology because we want to compare the lists for the same number of movies in the lists. Note that that we do not put MMR into an y disadv antage. In particular , for any DUM list, MMR can generate lists that are either of higher utility or more diverse than the DUM list, when the value of λ is large or small, respectiv ely . This can be seen in T able 2, for instance. Our HITs are completed by 34 master workers, who are MT’ s elite workers chosen based on the high quality of the work. Each worker is asked to complete at most 8 HITs. This guarantees that our HITs are completed by more than just a handful of work ers. On av erage, a worker spends 72 seconds per HIT , i.e., 19 seconds to ev aluate a list of up to 8 movies. Later in this section, we present two permutation tests that sho w that our results are highly unlikely under the hypothesis that the workers are of lo w quality , or that the questions are answered randomly . This implies that the workers hav e reasonable expertise in e valuating the HITs. The results of our study are presented in T able 1. For each com- pared method, we report the percentage of times that the worker ﬁnds a movie in the list that matches the chosen genre and the per- centage of times that the matching movie is a good recommenda- tion. W e observe two major trends. Firstly , the percentage of times that the worker ﬁnds a matching movie in the DUM list is 13 . 5% higher than in the list generated by the best performing baseline, MMR with λ = 1 3 . This result is statis- tically signiﬁcant and we sho w it using a permutation test. The test statistic is the difference in the percentage of times that the worker ﬁnds a matching mo vie in the lists generated by the best and second best performing methods. The null hypothesis is that all compared methods are equally good. Under this hypothesis, we permute the answers of the workers 10 6 times, generate an empirical distribu- tion of the test statistic, and observe that the value of 13 . 5% or higher is less likely than 0 . 0001 . So we reject the null hypothesis with p < 0 . 0001 . Secondly , the percentage of times that the worker considers the chosen movie to be a good genre-matching recommendation in the DUM list is 12 . 5% higher than in the list generated by the best per- forming baseline, MMR with λ = 1 3 . This result is statistically sig- niﬁcant and we show it again using a permutation test. The test DUM The Shawshank Redemption drama crime The Dark Knight drama thriller action crime The Lord of the Rings 1 action adventure Forrest Gump drama romance Back to the Future comedy adventure The Shining drama horror MMR ( λ = 1 / 3 ) The Dark Knight drama thriller action crime Dr . Phibes Rises Again comedy romance adventure horror The Shawshank Redemption drama crime Pulp Fiction thriller crime Fight Club drama The Godfather drama crime MMR ( λ = 2 / 3 ) The Dark Knight drama thriller action crime The Shawshank Redemption drama crime The Lord of the Rings 1 action adventure Pulp Fiction thriller crime Fight Club drama The Godfather drama crime MMR ( λ = 0 . 99 ) The Shawshank Redemption drama crime The Dark Knight drama thriller action crime Pulp Fiction thriller crime Fight Club drama The Godfather drama crime The Lord of the Rings 1 action adventure T able 2: F our recommended lists in user study 1 where DUM outperforms MMR . statistic is the difference in the percentage of times that the worker ﬁnds a good recommendation in the lists generated by the best and second best performing methods. The null hypothesis is that all compared methods are equally good. Under this hypothesis, we permute the answers of the workers 10 6 times, generate an empir- ical distribution of the test statistic, and observe that the value of 12 . 5% or higher is less likely than 0 . 001 . So we reject the null hypothesis with p < 0 . 001 . Overall, this user study sho ws that the div ersity and the utility of recommendation lists generated by DUM are perceiv ed superior to those of the lists generated by MMR . W e note that for all the methods compared in T able 1, the ra- tio between the percentage of times that the genre-matching movie is a good recommendation and that the matching movie is found is always between 0 . 92 and 0 . 94 . This implies that if a matching movie found, it is v ery likely to be considered a good recommenda- tion. W e conjecture that this is due to the high popularity of movies in the ground set E , which practically guarantees the utility of the recommended movies and minimizes the differences between the compared methods. In T able 2, we show a real-life example illustrating ho w DUM out- performs MMR . Here, DUM covers all the 8 mo vie genres by popular movies. These movies are well known and can be easily matched to any chosen target genre. Howev er , MMR with λ = 0 . 99 assigns insufﬁcient weight to diversity and therefore cov ers only ﬁ ve movie genres. The result is that this MMR list is unsuitable for users who like Comedy , Romance , or Horr or movies. MMR with λ = 2 3 has the same problem. On the other hand, MMR with λ = 1 3 assigns too much weight to di versity and therefore covers four mo vie genres by a relatively unknown movie, “Phibes Rises Again”. These genres are not covered by any other movie in the list. The result is that this MMR list is likely to be of a low utility for users who like Comedy , Romance , Adventur e , and Horr or movies. Figure 2: Our Mechanical T urk questionnaire in user study 2 for t 1 = Drama and t 2 = Romance . 5.2 User Study 2 In the second study , we ev aluate DUM on a speciﬁc problem of recommending a diverse set of movies that cover exactly two gen- res. W e again compare DUM to three variants of MMR , which are parameterized by λ ∈  1 3 , 2 3 , 0 . 99  . The compared methods are ev aluated by MT workers. In each HIT , we ask the w orker to consider a situation where Bob and Alice go for a vacation and can take sev eral movies with them. Bob and Alice prefer two different movie genres. W e generate four recom- mendation lists: one by DUM and three by MMR for different values of λ . For each list, we ask the worker to indicate whether the list is appropriate for both Bob and Alice, only for one of them, or for none of them. A screenshot of our MT questionnaire is shown in Figure 2. Each HIT is associated with two movie genres, t 1 and t 2 , the preferences of Bob and Alice in the HIT . W e generate three HITs for each pair of the 18 most frequent IMDb movie genres, so that the recommendation lists are e valuated 3 18 × 17 2 = 459 times. Like in Section 5.1, the ground set E are 10 k most frequently rated IMDb movies. The utility of movie e , w ( e ) , is the number of rat- ings assigned to this movie. The div ersity function f is deﬁned as in Lemma 2. The topics are T = { t 1 , t 2 } and N t 1 = N t 2 = 4 . For this setting, DUM generates a list of at most 8 movies, at least 4 from each genre. The utility and div ersity are normalized as in Section 5.1. In each HIT , the order of the recommendation lists is randomized and the length of the lists is determined as in Sec- tion 5.1. Our HITs are completed by 57 master workers. Each worker is asked to complete at most 10 HITs. This guarantees that our HITs are completed by more than just a handful of workers. On av erage, a worker spends 57 seconds per HIT , i.e., 14 seconds to Suitable DUM MMR for λ = 1 3 λ = 2 3 λ = 0 . 99 Bob and Alice 74.51% 64.92% 58.39% 28.98% Bob or Alice 23.53% 32.68% 39.43% 66.67% Neither 1.96% 2.40% 2.18% 4.36% T able 3: Comparison of DUM and MMR in user study 2. For each method, we report the percentage of times that the worker iden- tiﬁes the recommended list as suitable for both Bob and Alice; only for Bob or only f or Alice; or for neither of them. ev aluate a list of up to 8 movies. In our analysis, we do not differ - entiate between suboptimal answers “Suitable only for Alice” and “Suitable only for Bob” and collapse the two into a single answer “Suitable for Alice or Bob”. The results of the second user study are presented in T able 3. W e observe that the workers consider the DUM list to be suitable for both Bob and Alice in 74 . 51% of cases. This is 9 . 6% higher than the best performing baseline, MMR with λ = 1 3 . This result is statistically signiﬁcant and we show it using a permutation test. The test statistic is the difference in the percentage of times that the recommended lists, generated by the best and second best per- forming methods, are suitable for both Bob and Alice. The null hypothesis is that all compared methods are equally good. Under this hypothesis, we permute the answers of the workers 10 6 times, generate an empirical distribution of the test statistic, and observe that the value of 9 . 6% or higher is less likely than 0 . 0001 . So we reject the null hypothesis with p < 0 . 0001 . Similarly to Section 5.1, our permutation test can be also inter- preted as showing that our results are highly unlikely under the hypothesis that the workers are of low quality , the lists are rated randomly . This implies that our workers hav e reasonable expertise in ev aluating our HITs. In T able 4, we show another real-life example illustrating how DUM outperforms MMR for t 1 = Horr or and t 2 = Action . Here, DUM cov ers each movie genre by four most popular movies from that genre. Howe ver , MMR with λ = 0 . 99 assigns insufﬁcient weight to div ersity and therefore recommends only most popular items that happen to be Action movies. So the recommendation list is un- suitable for users who like Horr or movies. MMR with λ = 2 3 has a similar behavior and is strongly dominated by Horror movies. On the other hand, MMR with λ = 1 3 assigns too much weight to div ersity and therefore recommends many Horr or movies that are also Action movies. These are less popular than the most popular Horr or movies that are not Action . So the list is of a low utility for users who like Horr or movies. T o sum up, DUM outperforms MMR in cases, where items from one topic hav e a higher utility than items from the other topic, and the items at the intersection of the two topics also hav e a low utility . While DUM recommends a mixture of high utility items from each topic, MMR either prefers items at the intersection of the topics, when the value of λ is low; or recommends high-utility items from the dominant topic only , when the value of λ is high. 5.3 Ofﬂine Evaluation The main goal of the ofﬂine ev aluation is to assess the perfor- mance of DUM under various conditions, such as recommendations across multiple users with their interest proﬁles deﬁned based on different combinations of topics. W e use the 1M MovieLens dataset [15] in the ofﬂine e valuation. The dataset consists of 1 million ratings on a 1-to-5 stars scale, assigned by about 6000 users to about 4000 movies. W e remove DUM The Dark Knight action The Lord of the Rings 1 action The Matrix action Inception action The Shining horror Alien horror Psycho horror Shaun of the Dead horror MMR ( λ = 1 / 3 ) Zombieland horror action From Dusk T ill Dawn horror action Dawn of the Dead horror action Resident Evil horror action The Dark Knight action The Lord of the Rings 1 action The Matrix action Inception action MMR ( λ = 2 / 3 ) The Dark Knight action The Lord of the Rings 1 action The Matrix action Inception action The Lord of the Rings 2 action The Dark Knight Rises action The Lord of the Rings 3 action The Shining horror MMR ( λ = 0 . 99 ) The Dark Knight action The Lord of the Rings 1 action The Matrix action Inception action The Lord of the Rings 2 action The Dark Knight Rises action The Lord of the Rings 3 action A vatar action T able 4: F our recommended lists in user study 2 where DUM outperforms MMR . The topics are t 1 = Horror and t 2 = Action . users having less than 300 ratings, so that for each user we have enough data to create a user proﬁle and recommend movies. W e end up with 1000 users and a total of 515k ratings. Movies rated by each user are split randomly into the training and test set with the 2 : 1 ratio. W e use matrix factorization [22] to predict the rating of movies in the test set and feed the predicted ratings as the utility scores into the DUM and MMR methods. The split is performed three times for each user, and the reported results are based on the av erage of experiments conducted on these splits. For each user, the training set is used for creating their interest proﬁle, whereas the test set contains the recommendable movies (along with their actual and predicted utility). On av erage, a user proﬁle is created based on 343 mo vies and the recommendation list is selected from a set of 171 candidates. These steps are carried out as follows: User proﬁle creation: There are 18 genres of movies in the dataset, and each movie belongs to one or more of these genres. For each user , we create a multinomial distribution over the popu- larity of genres of the mo vies rated by this user in the training data, assuming that users rated movies that they had watched. W e sam- ple 10 times from this distribution to create the user’ s preference proﬁle over genres, and normalize it so that the sum of the scores equals 1. For each user with the preference score r t for genre t , we set N t = b r t × K c in (10), where K is the length of the recom- mendation list. That is, the coverage of each genre in the result list is proportional to the preference of the user for that genre. 0.91 0.927 0.9441 0.9611 0.9782 0.9952 1.7631 1.7767 1.7903 1.8039 1.8175 1.8312 mean nDCG (Utility) mean ILD (Diversity) DUM MMR 0.7523 0.7566 0.7609 0.7652 0.7695 0.7738 1.7644 1.7773 1.7903 1.8032 1.8162 1.8291 mean nDCG (Utility) mean ILD (Diversity) DUM MMR (a) (b) Figure 3: Perf ormance of DUM in terms of diversity and utility is compared to the performance of MMR for all the settings of the parameter λ . (a) The actual rating of movies is the utility scor e, (b) The predicted rating of mo vies is the utility score. 0 0.2 0.4 0.6 0.8 1 1.7634 1.7773 1.7913 1.8052 1.8192 1.8331 mean ILD (diversity) 0 0.2 0.4 0.6 0.8 1 0.7513 0.756 0.7607 0.7654 0.7701 0.7748 mean nDCG (utility) Parameter λ in MMR MMR utility DUM utility MMR diversity DUM diversity Figure 4: T radeoff between the diversity and utility of the rec- ommendation lists across all users f or all the settings of the pa- rameter λ in MMR . Di versity and utility scores achieved by DUM (independent of λ ) are sho wn for comparison pur poses. Recommendation: Movies in the test data are used as the ground set E of recommendable movies, from which each diversiﬁcation method ﬁnds the list of K = 10 movies to recommend to each user . The predicted utility of movies is used in the recommendation. The reason for using the predicted utility instead of the readily av ailable movie ratings is to keep the ev aluation as close as possible to real- world recommendation scenarios, where the utility of items is not known. When we evaluate the performance of the studied methods, we use the actual utility score, i.e., the rating assigned by a user to a test set movie. 5.3.1 Evaluation Metrics W e use three evaluation metrics to compare the performance of DUM to various settings of MMR : a div ersity metric, a utility metric, and a compound metric that considers both diversity and utility . W e chose these particular metrics due to two reasons. First, we wanted to ev aluate the performance of our method with respect to div ersity and utility individually (ﬁrst two metrics), as well as in combination (third metric). Second, we wanted them to be different from the objective function of DUM in order to avoid any potential bias. Thus, the compound metric combines diversity and utility in a different manner from what DUM does. Intra-list distance (ILD) [24, 28] is a common metric that mea- sures the di versity of a recommendation list as the a verage distance between pairs of recommended items. The dual of this measure is the intra-list similarity [31]. W e use ILD to measure distance based div ersity of a recommendation list in our experiment: I LD = 2 | S | ( | S | − 1) X e ∈ S X e 0 ∈ S d( e, e 0 ) (12) where d( e, e 0 ) measures the distance between two items e and e 0 in a list S . W e choose the Euclidean distance between the genre vectors of two movies as the distance function d . Note that this metric is cardinally different from the diversity function exploited by DUM , which is shown in (10). Discounted cumulative gain (DCG) [12] measures the accumu- lated utility gain of items in the recommendations list from the top to the bottom, with the gain of each item s k being discounted by its position k in the list: D C G = | S | X k =1 w ( s k ) log( k + 1) (13) Here, w ( s k ) is the utility of item s k at rank k in the list. W e estimate the utility of a movie for a user by the rating that the user assigned to the mo vie. W e also use the normalized DCG (nDCG), which is in the range [0, 1]. nDCG is computed as nDCG = DCG/IDCG, where IDCG is the ideal gain achiev able when all the listed items hav e the highest utility score. Expected intra-list distance (EILD) [24] is a compound metric that combines utility and div ersity . EILD measures the av erage intra-list distance (ILD) with respect to rank-sensiti vity and utility: E I LD = | S | X k =1 | S | X k 0 =1 C k disc( k )rdisc( k 0 | k ) w ( s k ) w ( s k 0 )d( s k , s k 0 ) (14) where disc( k ) = 1 / log( k + 1) is the discount function at rank k in the list and rdisc( k 0 | k ) = disc(max(1 , k 0 − k )) is the relative rank discount. In order to av oid bias, we use the normalization con- stant proposed in [24] and set C k = 1 C / P | S | k 0 =1 disc( k 0 | k ) w ( s k 0 ) where C = P | S | k =1 disc( k ) . W e compute each of these metrics for e very recommendation list provided to a user . Then, we a verage them across the three runs for ev ery user to compute user-based mean of the metric. This is per- formed for DUM and all settings of MMR , and the mean of each metric for each method is computed across all the users and reported. 5.3.2 Evaluation Results Figure 3 shows the performance of DUM against MMR in terms of div ersity and utility metrics. In Figure 3-a, the actual rating of movies in the test set is used as the utility of movies in the recom- mendation step, whereas in Figure 3-b the prediction produced by matrix f actorization is used as the utility score. In both ﬁgures, MMR exhibits a trade-off between the values of mean ILD (as a measure of div ersity) and mean nDCG (as a measure of utility). This trade- off is due to different values of the tuning parameter λ in different settings of MMR . F or low values of λ the utility is prioritized, such that the diversity of lists generated by MMR is low , but the utility is high. An opposite situation is observed for high λ , when the div ersity gets prioritized. It can be seen that the performance of DUM with respect to both metrics is superior to any settings of MMR , re gardless of the way the utility score is obtained. For instance, in Figures 3-b, DUM achieves nDCG of 0.767 (compared to the highest nDCG of 0.774 achiev ed by MMR for λ = 0 ) and ILD of 1.811 (compared to the highest ILD of 1.829 achiev ed by MMR for λ = 1 ). It should be highlighted that the utility and div ersity cannot be optimized by MMR simultane- ously , since they are achiev ed for different values of λ , while DUM competes with both of them at the same time. Also recall that DUM is parameter-free, and its superiority o ver MMR becomes clear . Comparing Figures 3-a and 3-b, we observe that, as expected, the exact knowledge of the utility improves the performance of both DUM and MMR . Howev er , this knowledge is unav ailable to practi- cal recommenders. Hence, we use the predicted utility values in the recommendation step in the following experiments, in order to mimic the conditions of a real-world recommendation scenario. The trade-off between the diversity and utility objectiv es in MMR for various values of λ is visualized in Figure 4. As λ increases, the div ersity of the list recommended by MMR increases whereas its utility decreases. It can be seen that λ = 0 . 49 is the operating point for MMR , where the utility and the div ersity curves intersect. On the contrary , the utility and div ersity of DUM are stable and both are abov e the operating point of MMR . Another argument in fa vor of DUM is obtained through the EILD metric that combines div ersity and utility . A comparison between DUM and all the possible settings of MMR with respect to EILD are plotted in Figure 5. It can be seen that DUM signiﬁcantly outper- forms MMR for lo w/moderate values of λ , which correspond to cases where utility is prioritized, or both utility and div ersity are similarly important. MMR starts outperforming DUM for λ > 0 . 65 , when the importance of di versity takes o ver , which may not be a fa v orite ob- jectiv e in real-w orld recommnedations. This conﬁrms the superior- ity of DUM in balancing the utility and di versity goals with no prior parameterization, compared to a method that explicitly targets the maximization of their weighted combination. 6. CONCLUSION Much research in recommender systems has focused on the ac- curacy , but ov erlooked issues related to the composition of the rec- ommendation lists. Increasing the diversity of the lists poses a trade-off to the utility , such that the problem of maximizing util- ity subject to di versity is an important challenge. In this work, we propose the div ersity-weighted utility maximization ( DUM ) method and sho w that the problem can be cast as ﬁnding the maximum of a modular function on a polymatroid, which is known to hav e an op- 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.113 1.1238 1.1346 1.1453 1.1561 1.1668 1.1776 1.1883 1.1991 Parameter λ in MMR mean EILD (diversity) DUM MMR Figure 5: Performance of DUM in terms of expected diversity (w .r .t. utility and rank) is compared to the performance of MMR for all the settings of the parameter λ . timal greedy solution. This parameter-free method guarantees that items in the recommendation list cover different aspects of user’ s taste, such that each aspect is cov ered by items with high utility . W e conduct two online user studies. The diversity and utility of DUM are evaluated in a movie recommendation scenario, and the perceiv ed div ersity of DUM is ev aluated in a speciﬁc problem of rec- ommending a div erse set of movies that cover exactly two genres. In both studies, we found that DUM outperforms baseline models that maximize a linear combination of utility and diversity . W e also report an ofﬂine ev aluation of DUM using a suite of diversity and utility metrics. The results show that DUM effecti vely balances the trade-off between diversity and utility: our method achiev es per- formance comparable to the best performing baselines of di versity and utility , if ex ecuted individually . Moreo ver , a combined metric of div ersity and utility sho ws the superiority of parameter -free DUM ov er the baseline methods that need to be parameterized. Most div ersiﬁcation methods use MMR objective function, to lin- early combine modular and submodular functions of utility and div ersity , respectiv ely . Our work is orthogonal to these methods in the sense that the DUM objectiv e function maximizes a modular function subject to a submodular constraint. W e demonstrate sig- niﬁcant improv ements ov er various settings of MMR , while we in- tend to conduct a more encompassing comparison with other vari- ants MMR in the future. Another future direction is to account for the nov elty of the recommended items [5, 6] with respect to prior consumption history of the user . This may be incorporated into the div ersity function by considering, apart from the diversity contri- bution, also the no velty contrib ution of items in the list. Another issue that deserves further in vestig ation is the changes that need to be introduced in the diversity metric and in the tol- erance for redundancy across different domains and applications. For instance, a metric of di versity applicable for ne ws ﬁltering may differ substantially from the metric we derived for the movie rec- ommendation task in this work. Furthermore, user’ s tolerance for redundancy of news items that are in agreement with their own opinion may differ from their tolerance for redundancy of items having an opposite opinion. W e intend to address these questions in our future works. 7. REFERENCES [1] R. Agrawal, S. Gollapudi, A. Halv erson, and S. Ieong. Div ersifying search results. In Pr oceedings of the Second A CM International Confer ence on W eb Sear ch and Data Mining , pages 5–14, 2009. [2] A. Ashkan, B. Kveton, S. Berko vsky , and Z. W en. Div ersiﬁed utility maximization for recommendations. In P oster Proceedings of the 8 th A CM Confer ence on Recommender Systems , 2014. [3] G. Capannini, F . M. Nardini, R. Perego, and F . Silvestri. Efﬁcient di versiﬁcation of web search results. Pr oceedings of the VLDB Endowment , 4(7):451–459, 2011. [4] J. Carbonell and J. Goldstein. The use of MMR, div ersity-based reranking for reordering documents and producing summaries. In Pr oceedings of the 21 st annual international A CM SIGIR confer ence on Resear ch and development in information r etrieval , pages 335–336, 1998. [5] P . Castells, S. V arg as, and J. W ang. Novelty and di versity metrics for recommender systems: choice, discov ery and relev ance. In International W orkshop on Diversity in Document Retrieval (DDR 2011) at the 33 rd Eur opean Confer ence on Information Retrieval , pages 29–36, 2011. [6] C. L. Clarke, M. K olla, G. V . Cormack, O. V echtomov a, A. Ashkan, S. Büttcher , and I. MacKinnon. Nov elty and div ersity in information retrie v al ev aluation. In Proceedings of the 31 st annual international A CM SIGIR confer ence on Resear ch and de velopment in information r etrieval , pages 659–666, 2008. [7] N. Craswell, O. Zoeter , M. T aylor, and B. Ramse y . An experimental comparison of click position-bias models. In Pr oceedings of the 2008 International Conference on W eb Sear ch and Data Mining , pages 87–94, 2008. [8] J. Edmonds. Submodular functions, matroids, and certain polyhedra. In Combinatorial Structur es and Their Applications: Pr oceedings of the Calgary International Confer ence on Combinatorial Structur es and Their Applications , pages 69–87. 1970. [9] S. Gollapudi and A. Sharma. An axiomatic approach for result div ersiﬁcation. In Pr oceedings of the 18 th WWW Confer ence , pages 381–390, 2009. [10] M. Halvey , P . Punitha, D. Hannah, R. V illa, F . Hopfgartner , A. Goyal, and J. M. Jose. Di versity , assortment, dissimilarity , variety: A study of div ersity measures using lo w lev el features for video retriev al. In Advances in Information Retrieval , pages 126–137. 2009. [11] D. Jannach, L. Lerche, F . Gedikli, and G. Bonnin. What recommenders recommend–an analysis of accuracy , popularity , and sales div ersity ef fects. In In Pr oceedings of the 21 st Confer ence on User Modeling, Adaptation, and P ersonalization , pages 25–37. 2013. [12] K. Järvelin and J. K ekäläinen. Cumulated gain-based ev aluation of IR techniques. ACM T ransactions on Information Systems (TOIS) , 20(4):422–446, 2002. [13] Y . Koren and R. M. Bell. Adv ances in collaborati ve ﬁltering. In Recommender Systems Handbook , pages 145–186. 2011. [14] A. Lad and Y . Y ang. Learning to rank relevant and no vel documents through user feedback. In Pr oceedings of the 19 th A CM international confer ence on Information and knowledge manag ement , pages 469–478, 2010. [15] S. Lam and J. Herlocker . MovieLens 1M Dataset. http://www .grouplens.org/node/12, 2014. [16] P . Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook , pages 73–105. 2011. [17] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: how accurac y metrics ha ve hurt recommender systems. In CHI’06 extended abstr acts on Human factors in computing systems , pages 1097–1101, 2006. [18] G. L. Nemhauser , L. A. W olsey , and M. L. Fisher . An analysis of approximations for maximizing submodular set functions - I. Mathematical Pr ogramming , 14(1):265–294, 1978. [19] F . Radlinski, R. Kleinberg, and T . Joachims. Learning div erse rankings with multi-armed bandits. In Pr oceedings of the 25 th international confer ence on Machine learning , pages 784–791, 2008. [20] F . Ricci, L. Rokach, B. Shapira, and P . B. Kantor , editors. Recommender Systems Handbook . Springer , 2011. [21] R. L. Santos, C. Macdonald, and I. Ounis. Exploiting query reformulations for web search result div ersiﬁcation. In Pr oceedings of the 19th international conference on W orld wide web , pages 881–890, 2010. [22] C. Thurau, K. Kersting, M. W ahabzada, and C. Bauckhage. Con ve x non-neg ativ e matrix factorization for massi ve datasets. Knowledge and information systems , 29(2):457–478, 2011. [23] D. V allet and P . Castells. Personalized di versiﬁcation of search results. In Pr oceedings of the 35 th international A CM SIGIR confer ence on Resear ch and development in information r etrieval , pages 841–850, 2012. [24] S. V argas and P . Castells. Rank and relev ance in novelty and div ersity metrics for recommender systems. In Pr oceedings of the ﬁfth A CM confer ence on Recommender systems , pages 109–116, 2011. [25] S. V argas, P . Castells, and D. V allet. Explicit relev ance models in intent-oriented information retriev al div ersiﬁcation. In Pr oceedings of the 35 th international A CM SIGIR confer ence on Resear ch and development in information r etrieval , pages 75–84, 2012. [26] J. Y u, S. Mohan, D. P . Putthi vidhya, and W .-K. W ong. Latent dirichlet allocation based div ersiﬁed retrie v al for e-commerce search. In Pr oceedings of the 7th ACM international confer ence on W eb sear ch and data mining , pages 463–472, 2014. [27] Y . Y ue and C. Guestrin. Linear submodular bandits and their application to div ersiﬁed retrie v al. In Advances in Neural Information Pr ocessing Systems , pages 2483–2491, 2011. [28] M. Zhang and N. Hurley . A voiding monoton y: improving the div ersity of recommendation lists. In Pr oceedings of the 2008 A CM confer ence on Recommender systems , pages 123–130, 2008. [29] Y . C. Zhang, D. Ó. Séaghdha, D. Quercia, and T . Jambor . Auralist: introducing serendipity into music recommendation. In Pr oceedings of the ﬁfth ACM international confer ence on W eb sear ch and data mining , pages 13–22, 2012. [30] T . Zhou, Z. K uscsik, J.-G. Liu, M. Medo, J. R. W akeling, and Y .-C. Zhang. Solving the apparent div ersity-accuracy dilemma of recommender systems. Pr oceedings of the National Academy of Sciences , 107(10):4511–4515, 2010. [31] C.-N. Ziegler , S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic div ersiﬁcation. In Pr oceedings of the 14 th international confer ence on W orld W ide W eb , pages 22–32, 2005.

DUM: Diversity-Weighted Utility Maximization for Recommendations

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment