RecSim: A Configurable Simulation Platform for Recommender Systems

Eugene Ie†1, Chih-wei Hsu1, Martin Mladenov1, Vihan Jain1, Sanmit Narvekar§2, Jing Wang1, Rui Wu1, and Craig Boutilier†1

1 Google Research
2 Department of Computer Science, University of Texas at Austin

September 27, 2019

Abstract

We propose RecSim, a configurable platform for authoring simulation environments for recommender systems (RSs) that naturally supports sequential interaction with users. RecSim allows the creation of new environments that reflect particular aspects of user behavior and item structure at a level of abstraction well-suited to pushing the limits of current reinforcement learning (RL) and RS techniques in sequential interactive recommendation problems. Environments can be easily configured to vary assumptions about: user preferences and item familiarity; user latent state and its dynamics; and choice models and other user response behavior. We outline how RecSim offers value to RL and RS researchers and practitioners, and how it can serve as a vehicle for academic-industrial collaboration.

1 Introduction

Practical recommender systems (RSs) are rapidly evolving, as advances in artificial intelligence, machine learning, natural language understanding, automated speech recognition and voice user interfaces facilitate the development of collaborative interactive recommenders (CIRs). While traditional recommenders, such as those based on collaborative filtering [Konstan et al., 1997, Breese et al., 1998, Salakhutdinov and Mnih, 2007], typically recommend items that myopically maximize predicted user engagement (e.g., through item rating, score or utility), CIRs explicitly use a sequence of interactions to maximize user engagement or satisfaction.
CIRs often use conversational methods [Vinyals and Le, 2015, Ghazvininejad et al., 2018], example critiquing or preference elicitation [Chen and Pu, 2012, Christakopoulou et al., 2016], bandit-based exploration [Li et al., 2010, 2016, Christakopoulou and Banerjee, 2018], or reinforcement learning [Sun and Zhang, 2018] to explore the space of options in collaboration with the user, in order to uncover good outcomes or maximize user engagement over extended horizons. While a topic of increasing research activity in AI, especially in the subareas mentioned above, the deployment of CIRs in practice remains limited. This is due, in no small part, to several challenges that researchers in these areas face when developing modeling techniques and algorithms that adequately reflect qualitative characteristics of user interaction dynamics.

Modeling the dynamics of user interaction is central to devising good algorithmic and modeling techniques for CIRs. The next generation of recommenders will increasingly focus on modeling sequential user interaction and optimizing users' long-term engagement and overall satisfaction. Setting aside questions of user interface design and natural language interaction (we return to these topics in the concluding section), this makes CIRs a natural setting for the use of reinforcement learning (RL). Indeed, RSs have recently emerged as a useful application area for the RL community. Unfortunately, the usual practice of developing recommender algorithms using static data sets, even those with temporal extent (e.g., the MovieLens 1M dataset [Harper and Konstan, 2016]), does not easily extend to the RL setting involving interaction sequences. In particular, the inability to easily extract predictions regarding the impact of counterfactual actions on user behavior makes applying RL to such datasets challenging.

* https://github.com/google-research/recsim
† Corresponding authors: {eugeneie, cboutilier}@google.com.
§ Work done while at Google Research.
This is further exacerbated by the fact that data generated by RSs that optimize for myopic engagement are unlikely to follow action distributions similar to those of policies striving for long-term user engagement. (This will itself generally limit the effectiveness of off-policy techniques such as inverse propensity scoring and other forms of importance weighting.)

To facilitate the study of RL algorithms in RSs, we developed RecSim, a configurable platform for authoring simulation environments that allows both researchers and practitioners to challenge and extend existing RL methods in synthetic recommender settings. Our goal is not to create a "perfect" simulator; we do not expect policies learned in simulation to be deployed in live systems. Rather, we expect simulations that mirror specific aspects of user behavior found in real systems to serve as a controlled environment for developing, evaluating and comparing recommender models and algorithms (especially those designed for sequential user-system interaction). As an open-source platform, RecSim will also aid reproducibility and sharing of models within the research community, which in turn will support increased researcher engagement at the intersection of RL and RecSys. For the RS practitioner interested in applying RL, RecSim can challenge assumptions made in standard RL algorithms in stylized recommender settings and identify the pitfalls of those assumptions, allowing practitioners to focus on the additional abstractions needed in RL algorithms. This in turn reduces live-experiment cycle time via rapid development and model refinement in simulation, and minimizes the potential for negative impact on users in real-world systems.

Figure 1: Data flow through the components of RecSim.

The remainder of the paper is organized as follows. We provide an overview of RecSim along with its relationship to RL and RecSys.
We then conclude this introduction by elaborating specific goals (and non-goals) of the platform, and suggesting ways in which both the RecSys and RL research communities, as well as industrial practitioners, might best take advantage of RecSim. We briefly discuss related efforts in Section 2. We outline the basic components of RecSim in Section 3 and describe the software architecture in Section 4. We describe several case studies in Section 5, designed to illustrate some of the uses to which RecSim can be put, and conclude with a discussion of potential future developments in Section 6.

1.1 RecSim: A Brief Sketch

RecSim is a configurable platform that allows the natural, albeit abstract, specification of an environment in which a recommender interacts with a corpus of documents (or recommendable items) and a set of users, to support the development of recommendation algorithms. Fig. 1 illustrates its main components. We describe these in greater detail in Section 3, but provide a brief sketch here to allow deeper discussion of our motivations.

The environment consists of a user model, a document model and a user-choice model. The (recommender) agent interacts with the environment by recommending slates of documents (of fixed or dynamic length) to users. The agent has access to observable features of the users and (candidate) documents to make recommendations. The user model samples users from a prior distribution over (configurable) user features: these may include latent features such as personality, satisfaction and interests; observable features such as demographics; and behavioral features such as session length, visit frequency, or time budget. The document model samples items from a prior over document features, which again may incorporate latent features such as document quality, and observable features such as topic, document length and global statistics (e.g., ratings, popularity).
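To make this concrete, a document prior of the kind just described might be sketched as follows. This is a loose illustration only; the class and field names are our own invention, not RecSim's actual API.

```python
import random

class DocumentSampler:
    """Toy document prior: samples documents with an observable topic and
    a latent quality drawn from a Gaussian. Illustrative only."""

    def __init__(self, num_topics, mean_quality=0.0, quality_std=0.1, seed=0):
        self.num_topics = num_topics
        self.mean_quality = mean_quality
        self.quality_std = quality_std
        self.rng = random.Random(seed)

    def sample_document(self, doc_id):
        return {
            "id": doc_id,
            "topic": self.rng.randrange(self.num_topics),  # observable feature
            "length": 4.0,                                 # constant length, for simplicity
            "quality": self.rng.gauss(self.mean_quality,
                                      self.quality_std),   # latent feature
        }

sampler = DocumentSampler(num_topics=5)
candidates = [sampler.sample_document(i) for i in range(10)]
```

In an actual environment, which features of each sampled document are exposed to the agent versus the user is itself part of the configuration, as discussed next.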
The level of observability for both user and document features is customizable, so that developers have the flexibility to capture different RS operating regimes and investigate particular research questions.

When the agent recommends documents to a user, the user response is determined by a user choice model. The user's choice of document depends on observable document features (e.g., topic, perceived appeal) and all user features (e.g., interests). Other aspects of a user's response (e.g., time spent with a document, or post-consumption rating) can also depend on latent document features (e.g., document quality, length). Once a document is consumed, the user state undergoes a transition through a configurable (user) transition model. For example, user interest in a document's topic might increase or decrease; the user's remaining (time) budget may decrease at different rates depending on document quality; and user satisfaction may increase or decrease depending on document-interest match and document quality. Developers can evaluate overall user engagement in the simulated environment to compare policies derived using different RS or RL models and algorithms. We illustrate different configurations with three use cases in Section 5.

1.2 RecSim and RL

One motivation for RecSim is to provide environments that facilitate the development of new RL algorithms for recommender applications. While RL has shown considerable success in games, robotics, physical system control and computational modeling [Mnih et al., 2015, Silver et al., 2016, Haarnoja et al., 2018, Lazic et al., 2018], large-scale deployment of RL in real-world applications has proven challenging [Dulac-Arnold et al., 2019]. RSs, in particular, have recently emerged as a useful domain for the RL community that could serve to bridge this gap: the ubiquity of RSs in commercial products makes them ripe for demonstrating RL's real-world impact.
Unfortunately, applying RL to real-world RSs poses many challenges not widely studied in the mainstream RL literature, among them:

• Generalization across users: Most RL research focuses on models and algorithms involving a single environment. A typical commercial RS interacts with millions of users, and each user is a distinct (and possibly independent) partially observable Markov decision process (POMDP). (For the purposes of RecSim, we treat an RS's interaction with one user as having no impact on the state of another user. We recognize that multi-agent interaction often occurs across users in practice, and that the RS's objectives (e.g., fairness) may induce further dependence in the policies applied to different users; we ignore such considerations here, though see the concluding section.) However, as in collaborative filtering, contextual bandits, and related models for recommendation, it is critical that the recommender agent generalizes across users, e.g., by modeling the different environments as a contextual MDP [Hallak et al., 2015]. Large-scale recommenders rarely have enough experience with any single user to make good recommendations without such generalization.

• Combinatorial action spaces: Many, if not most, recommenders propose slates of items to users. Slate recommendation has been explored in non-sequential settings, capturing point-wise user choice models using non-parametric means [Ai et al., 2018, Bello et al., 2018, Jiang et al., 2019]. However, modeling such a combinatorial action space in the context of sequential recommendations poses challenges to existing RL algorithms [Sunehag et al., 2015, Metz et al., 2017], as the assumptions they make render them ineffective for exploration and generalization in large-scale recommenders.

• Large, dynamic, stochastic action spaces: In many large-scale recommenders, the set of recommendable items is generated dynamically and stochastically. For example, a video recommendation engine may operate over a pool of videos in constant, minute-by-minute flux: injection of fresh content (e.g., latest news), changes in content availability (e.g., copyright considerations or user-initiated deletions), and surging or declining content popularity, to name a few. This poses an interesting challenge for standard RL techniques, as the action space is not fixed [Boutilier et al., 2018, Chandak et al., 2019].

• Severe partial observability and stochasticity: Interaction with users means that an RS is operating in a latent-state MDP; it must capture various aspects of the user's state (e.g., interests, preferences, satisfaction, activity, mood) that generally emit very noisy signals via the user's observed behavior. Moreover, exogenous unobservable events further complicate the interpretation of a user's behavior (e.g., if a user turned off a music recommendation, was it because she did not like the recommendation, or did someone ring her doorbell?). Taken together, these factors mean that recommender agents must learn to act in environments with extremely low signal-to-noise ratios [Mladenov et al., 2019].

• Long horizons: There is evidence that some aspects of user latent state evolve very slowly over long horizons. For example, Hohnhold et al. [2015] show that ad quality and ad load induce slow but detectable changes in ad effectiveness over periods of months, while Wilhelm et al. [2018] show that diversifying video recommendations on YouTube induces similarly slow, persistent changes in user engagement. Maximizing long-term user engagement often requires reasoning about MDPs with extremely long horizons, which can be challenging for many current RL methods [Mladenov et al., 2019]. In display advertising, user responses such as clicks and conversions can occur days after the recommendation [Chapelle and Li, 2011, Chapelle, 2014], which requires agents to model delayed feedback or abrupt changes in reward signals.

• Other practical challenges: These include accurate off-policy estimation in inherently logs-based production environments, and costly policy evaluation in live systems. In addition, there are often multiple evaluation criteria for RSs, among which the trade-off [Rodriguez et al., 2012], and hence the corresponding reward function, may not be obvious.

Because of these and other challenges, direct application of published RL approaches often fails to perform well or to scale [Dulac-Arnold et al., 2019, Ie et al., 2019, Mladenov et al., 2019]. Broadly speaking, RL research has often looked past many of these problems, in part because access to suitable data, real-world systems, or simulation environments has been lacking.

1.3 RecSim and RecSys

Environments in which the user's state (both observed and latent) can evolve as the user interacts with a recommender pose new challenges not just for RL, but for RS research as well. As noted above, traditional RS research deals with "static" users. However, in recent years RS research has increasingly started to explore sequential patterns in user interaction using HMMs, RNNs and related methods [He and McAuley, 2016, Hidasi et al., 2016, Wu et al., 2017]. Interest in applying RL to optimize these sequences has been rarer [Shani et al., 2005], though the recent successes of deep RL have spurred activity in the use of RL for recommendation [Gauci et al., 2018, Zheng et al., 2018, Chen et al., 2018, Zhao et al., 2018, Ie et al., 2019].
However, much of this work has been developed in proprietary RSs, has used specially crafted synthetic user models, or has adapted static data sets to the RL task. The use of RecSim will allow more systematic exploration of RL methods in RS research.

Moreover, the configurability of RecSim can help support RS research on more static aspects of recommender algorithms. For instance, user transition models can be "vacuous," so that the user state never changes. However, RecSim allows the developer to configure the user state (including its relationship to documents) to be arbitrarily complex, and to vary which parts of the state are observable to the recommender itself. In addition, the user choice model allows one to configure various methods by which users choose among recommended items, and their induced responses or behaviors. This can be used to rapidly develop and refine novel collaborative filtering methods, contextual bandit algorithms and the like, or simply to test the robustness of existing recommendation schemes to various assumptions about user choice and response behavior.

1.4 Non-objectives

The main goal of RecSim is to allow the straightforward specification and sharing of the main environment components involved in simulating the sequential interaction of an RS with a user. It does not (directly) provide learning algorithms (e.g., collaborative filtering or reinforcement learning agents) that generate recommendations. Its main aim is to support the development, refinement, analysis and comparison of such algorithms. That said, RecSim is distributed with several baseline algorithms, both typical algorithms from the literature (e.g., a simple contextual bandit) and some recent RL-based recommender algorithms, as outlined below, to: (a) allow for straightforward "out-of-the-box" testing, and (b) serve as exemplars of the APIs for those implementing new recommender agents.
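To give a flavor of the agent-environment contract, the following minimal sketch shows what such a recommender-agent API might look like. The class names, method signature, and the toy baseline are hypothetical illustrations of the pattern, not RecSim's actual interface.

```python
from abc import ABC, abstractmethod

class AbstractRecommenderAgent(ABC):
    """Hypothetical sketch of a recommender-agent interface (names illustrative only)."""

    def __init__(self, slate_size):
        self.slate_size = slate_size

    @abstractmethod
    def step(self, user_observation, candidate_documents):
        """Return a slate (list of document ids) given observable state and candidates."""

class GreedyTopicAgent(AbstractRecommenderAgent):
    """Toy baseline: score each candidate by a fixed per-topic weight, return the top-k."""

    def __init__(self, slate_size, topic_weights):
        super().__init__(slate_size)
        self.topic_weights = topic_weights

    def step(self, user_observation, candidate_documents):
        ranked = sorted(candidate_documents,
                        key=lambda d: self.topic_weights.get(d["topic"], 0.0),
                        reverse=True)
        return [d["id"] for d in ranked[:self.slate_size]]

agent = GreedyTopicAgent(slate_size=2, topic_weights={0: 0.9, 1: 0.1})
docs = [{"id": 0, "topic": 1}, {"id": 1, "topic": 0}, {"id": 2, "topic": 0}]
slate = agent.step(user_observation={}, candidate_documents=docs)
```

The essential point is that an agent sees only candidate documents and the observable slice of user state, and returns a slate; anything latent must be inferred from interaction history.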
Rather than providing realistic recommender simulations that reflect user behavior with full fidelity, we anticipate that researchers and practitioners will create new environments that reasonably reflect particular aspects of user behavior at a level of abstraction well-suited to pushing the capabilities of existing modeling techniques and algorithms. RecSim is released with a variety of different user state, transition and choice models, and several corresponding document models. Several of these correspond to those used in the case studies discussed in Sec. 5. These are included primarily as illustrations of the principles laid out above. Specifically, we do not advocate their use as benchmarks (except for researchers interested in the very specific phenomena they study).

While RecSim environments will not reflect the full extent of user behavior in most practical recommender settings, the platform can serve as a vehicle to facilitate collaboration and identify synergies between academic and industrial researchers. In particular, through the use of "stylized user models" that reflect certain aspects of user behavior, industrial researchers can share both qualitative and quantitative observations of user interaction in real-world systems that are detailed enough to meaningfully inform the development of models and algorithms in the research community, while revealing neither user data nor sensitive industrial practices. We provide an illustrative example of how this might work in one of the case studies outlined below.

2 Related Work

We briefly outline a selection of related work on the use of simulation, first in RL, and then in RS and dialogue systems.

2.1 RL Platforms

Simulation has played an outsized role in the evaluation of RL methods in recent years.
The Arcade Learning Environment (ALE) [Bellemare et al., 2013] introduced a now well-known platform for testing algorithms on a suite of Atari 2600 games. Since then, numerous RL evaluation benchmarks and environments have been proposed. We mention only a few here to draw contrasts with our goals, and refer to Castro et al. [2018] for an overview of related work and the various roles such platforms can play.

The OpenAI Gym [Brockman et al., 2016] is one of the most widely used platforms, consisting of a collection of environments (including both discrete, e.g., Atari, and continuous, e.g., MuJoCo-based, settings) against which RL algorithms can be benchmarked and compared. Our work shares OpenAI Gym's emphasis on offering environments rather than agents, but differs in that we focus on allowing the authoring of environments to push the development of algorithms that handle new domain characteristics, rather than on benchmarking. Once configured, however, a RecSim environment is wrapped in an OpenAI Gym environment which, given its popularity, can facilitate RL experimentation and evaluation. The Dopamine framework [Castro et al., 2018], by contrast, provides simple implementations of a variety of (value-based) RL methods that support the rapid development of new algorithms for research purposes. While easily plugged into new environments (RecSim itself is integrated into Dopamine, as discussed below), it does not provide support for authoring environments. Other frameworks also provide standard RL algorithms with libraries integrated with OpenAI Gym [Gauci et al., 2018, Guadarrama et al., 2018]. ELF [Tian et al., 2017] is a platform that allows configuration of real-time strategy games (RTSs) to support the development of new RL methods, overcoming the challenges of doing research with commercial games (e.g., by allowing access to internal game state).
It allows configuration of some aspects of the game (e.g., action space) and its parameters, sharing RecSim's motivation of environment configurability. Like RecSim, it also supports hierarchy and multi-timescale actions. Unlike RecSim, ELF pays special attention to supporting different training regimes and is designed for fast performance. Zhang et al. [2018] develop a set of "natural" RL benchmarks that augment traditional RL tasks with richer image- and video-based state input to "widen the state space of the MDP," seeking to overcome the simplicity of the state space in many simulated RL benchmarks. They share our motivation to press RL algorithms to address new phenomena, but RecSim focuses on supporting the configuration of fundamentally new structure in state space, action space, observability, system dynamics and agent objectives, and emphasizes the use of stylized models to challenge the fundamental assumptions of many current RL models, algorithms and training paradigms.

2.2 RecSys and Dialogue Environments

Rohde et al. [2018] propose RecoGym, a stylized RS simulation environment integrated with OpenAI Gym. It provides a configurable environment for studying sequential user interaction that combines organic navigation with intermittent recommendation (or ads). While RecoGym supports sequential interaction, it does not allow user state transitions; instead, the focus is on bandit-style feedback, and RL/sequentiality is handled within the learning agent (especially exploration, CTR estimation, etc.). It does allow configuration of user response behavior, item/user dimensionality, etc. The offline coupling of agent training with simulation of user behavior was studied by Schatzmann et al. [2007] using a rule-based approach.
Similar rule-based environments have also recently been explored to aid the evaluation of goal-oriented dialogue agents [Wei et al., 2018], while the use of learning to enhance rule-based environment dynamics has been explored in dialogue-based [Peng et al., 2018] and interactive search [Liu et al., 2019] systems. More recently, generative adversarial networks have been used to generate virtual users for high-fidelity recommender environments, to support learning policies that can be transferred to real systems [Shi et al., 2019, Zhao et al., 2019]. As environments differ widely across systems and commercial products, we propose that stylized models of environments that reflect specific aspects of user behavior will prove valuable in developing new RL/RS approaches of practical import. We thus emphasize ease of authoring environments using stylized models (and, in the future, the ability to plug in learned models), rather than focusing on "sim-to-real" transfer using high-fidelity models.

3 Simulation Components

We discuss the main components of RecSim in further detail (Sec. 3.1) and illustrate their role and interaction in a specific recommendation environment (Sec. 3.2).

3.1 Main Components

Fig. 1 illustrates the main components of RecSim. The environment consists of a user model, a document model and a user-choice model. The (recommender) agent interacts with the environment by recommending slates of documents to a user. The agent uses observable user and (candidate) document features to make its recommendations. Since observable history is used in many RL/RS agents, we provide tools that allow the developer to add various summaries of user history to help with recommendation or exploration. The document model samples items from a prior over document features, including latent features such as document quality, and observable features such as topic or global statistics (e.g., ratings, popularity).
Agents and users can be configured to observe different document features, so developers have the flexibility to capture different RS operating regimes (e.g., model predictions or statistical summaries of other users' engagement with a document may be available to the agent but not observable to the user).

The user model samples users from a prior over (configurable) user features, including latent features such as personality, satisfaction and interests; observable features such as demographics; and behavioral features such as session length, visit frequency, and (time) budget. The user model also includes a transition model, described below. When the agent recommends documents to a user, the user response is determined by a user choice model [Louviere et al., 2000]. The user's choice of document depends on observable document features (e.g., topic, perceived appeal) and all user features, latent or observable (e.g., interests). Other aspects of the response (e.g., time spent, rating) can themselves depend on latent (as well as observable) document features if desired (e.g., document quality, length). Specific choice models include the multinomial logit [Louviere et al., 2000] and the exponentiated cascade [Joachims, 2002]. Once a document is consumed, the user state transitions through a configurable (user) transition model. For example, user interest in a document's topic might increase or decrease; the user's remaining (time) budget may decrease at different rates depending on document quality; and user satisfaction may increase or decrease depending on document-interest match and document quality. Developers can evaluate overall user engagement in the simulated environments to compare policies derived using different RL and recommendation approaches.

RecSim can be viewed as a dynamic Bayesian network that defines a probability distribution over trajectories of slates, choices, and observations.
In particular, the probability of a trajectory of user observations o_t and choices c_t, recommended slates A_t, and candidate documents D_t factorizes as:

\[
p(o_1, \ldots, o_N, c_1, \ldots, c_N, A_1, \ldots, A_N)
= \sum_{(z_0, \ldots, z_N)} \Big[ p(z_0)\, p(A_0)\, p(c_0 \mid A_0, z_0)
\prod_{t=1}^{N} p(o_t \mid z_t)\, p(z_t \mid z_{t-1}, A_t, c_t)\, p(c_t \mid A_t, z_{t-1})\, p(A_t \mid D_t, H_{t-1})\, p(D_t) \Big],
\]

where z_t is the user state, p(z_t | z_{t-1}, A_t, c_t) the transition model, p(c_t | A_t, z_{t-1}) the choice model, p(o_t | z_t) the observation model, and p(A_t | D_t, H_{t-1}) the recommender policy, which may depend on the entire history of observables H_{t-1} up to that point.

RecSim ships with several default environments and recommender agents, but developers are encouraged to develop their own environments to stress-test recommendation algorithms against users exhibiting a variety of behaviors.

3.2 SlateQ Simulation Environment

To illustrate how different RecSim components can be configured, we describe a specific slate-based recommendation environment, used by Ie et al. [2019], constructed for testing RL algorithms with combinatorial actions in recommendation settings. We briefly review experiments using this environment in one of our use cases below.

To capture fundamental elements of user interest in the recommendation domain, the environment assumes a set of topics (or user interests) T. Documents are drawn from a content distribution P_D over topic vectors. Each document d in the set of documents D has: an associated topic vector d ∈ [0,1]^|T|, where d_j is the degree to which d reflects topic j; a length ℓ(d) (e.g., the length of a video, music track or news article); and an inherent quality L_d, representing its topic-independent attractiveness to the average user.
Quality varies randomly across documents, with document d's quality distributed according to N(μ_{T(d)}, σ²), where μ_t is a topic-specific mean quality for each t ∈ T. Other environment realizations may adopt simplifying assumptions, such as: assuming each document d has only a single topic T(d), so that d = e_i for some i ≤ |T| (i.e., a one-hot topic encoding); using the same constant length ℓ for all documents; or assuming fixed quality variance across all topics.

The user model assumes users u ∈ U have varying degrees of interest in topics (with some prior distribution P_U), ranging from −1 (completely uninterested) to 1 (fully interested), with each user u associated with an interest vector u ∈ [−1,1]^|T|. User u's interest in document d is given by the dot product I(u,d) = u·d. The user's interest in topics evolves over time as they consume different documents. A user's satisfaction S(u,d) with a consumed document d is a function f(I(u,d), L_d) of user u's interest and document d's quality. Alternative implementations could include: a convex combination modeling the user's satisfaction, such as S(u,d) = (1 − α)I(u,d) + αL_d, where α balances user-interest-driven and document-quality-driven satisfaction; or a stylized model that stochastically nudges user interest I_t in topic t = T(d) after consumption of document d using ∆(I_t) = (−y|I_t| + y) · (−I_t), where y ∈ [0,1] denotes the fraction of the distance between the current interest level and the maximum level (1 or −1). Each user can also be assumed to have a fixed budget B_u of time to engage with content during a session. Each consumed document d reduces user u's budget by the document length ℓ(d), less a bonus b < ℓ(d) that increases with the document's appeal S(u,d).

Figure 2: Control flow (single user) in the RecSim architecture.
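The user dynamics just described can be sketched in a few lines of code. This is our own illustrative sketch of the stated formulas, not the environment's actual implementation; the parameter values and the bonus rule are assumptions made for concreteness.

```python
ALPHA = 0.3  # balances interest- vs. quality-driven satisfaction (illustrative value)
Y = 0.3      # fraction of the distance toward the interest extreme (illustrative value)

def satisfaction(interest, quality, alpha=ALPHA):
    """S(u, d) = (1 - alpha) * I(u, d) + alpha * L_d."""
    return (1 - alpha) * interest + alpha * quality

def interest_delta(interest, y=Y):
    """Delta(I_t) = (-y|I_t| + y) * (-I_t): a nudge whose size shrinks near the extremes."""
    return (-y * abs(interest) + y) * (-interest)

def consume(user, doc):
    """Update a user dict in place after consuming a doc dict (loose sketch only)."""
    interest = user["interest"][doc["topic"]]
    s = satisfaction(interest, doc["quality"])
    user["interest"][doc["topic"]] = interest + interest_delta(interest)
    # Toy bonus rule: b grows with appeal S(u, d) but stays below the document length.
    bonus = min(doc["length"] - 1e-6, max(0.0, s))
    user["budget"] -= doc["length"] - bonus
    return s

user = {"interest": [0.5, -0.2], "budget": 20.0}
doc = {"topic": 0, "length": 4.0, "quality": 0.8}
consume(user, doc)
```

Running this once with the values above nudges the user's interest in topic 0 toward zero and charges the budget the document length minus the appeal-driven bonus.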
Other session-termination mechanisms can also be configured in RecSim. To model realistic RSs, the user choice model assumes the recommended document's topic is observable to the user before choice and consumption. However, the document's quality is not observable to the user prior to consumption; it is revealed afterward, and drives the user's state transition. Popular choice functions, such as the conditional choice model and the exponential cascade model, are provided in RecSim to model user choice from a document slate.

4 Software Architecture

In this section, we provide a more detailed description of the simulator architecture and outline some common elements of the environment and recommender agents.

4.1 Simulator

Figure 2 presents the control flow for a single user in the RecSim architecture. The environment consists of a user model, a document model, and a user-choice model. The simulator serves as the interface between the environment and the agent, and manages the interactions between the two using the following steps:

1. The simulator requests the user state from the user model, both the observable and latent user features. The simulator also queries the document model for a set of (candidate) documents that have been made available for recommendation. (These documents may be fixed, sampled, or determined in some other fashion.)

2. The simulator sends the candidate documents and the observable portion of the user state to the agent. (Recall that the recommender agent does not have direct access to user latent features, though the agent is free to estimate them based on its interaction history with that user and/or all other users.)

3. The agent uses its current policy to return a slate to the simulator to be "presented" to the user. (E.g., the agent might rank all candidate documents using some scoring function and return the top k.)

4. The simulator forwards the recommended slate of documents and the full user state (observable and latent) to the user choice model. (Recall that the user's choice, and other behavioral responses, can depend on all aspects of user state.)

5. Using the specified choice and response functions, the user choice model generates a (possibly stochastic) user choice/response to the recommended slate, which is returned to the simulator.

6. The simulator then sends the user choice and response to both: the user model, so it can update the user state using the transition model; and the agent, so it can update its policy given the user response to the recommended slate.

In the current design, the simulator simulates each user sequentially. (We recognize that this design is somewhat limiting; the next planned release of RecSim will support interleaved user interaction.) Each episode is a multi-turn history of the interactions between the agent and a single user. At the beginning of each episode, the simulator asks the environment to sample a user model. An episode terminates when it is long enough or the user model transitions to a terminal state. Similar to Dopamine [Castro et al., 2018], we also define an iteration, for bookkeeping purposes, to consist of a fixed number of turns spanning multiple episodes. At each iteration, the simulator aggregates and logs relevant metrics generated since the last iteration. The simulator can also checkpoint the agent's model state, so that the agent can restart from that state after an interruption.

When applying RL to practical recommender applications, batch RL [Riedmiller, 2005] is an important consideration. A new RS will generally need to learn from data gathered by existing/legacy RSs, since on-policy training and exploration can have a negative impact on users.
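The six-step control flow above can be condensed into a short skeleton. All of the classes below are hypothetical stand-ins for the corresponding RecSim components (the real API differs); the sketch only illustrates the division of responsibilities among user model, document model, choice model, agent, and simulator.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins for RecSim's components; names are illustrative.

@dataclass
class Response:
    clicked_doc: int
    reward: float

class ToyUserModel:
    def __init__(self, budget=5):
        self.budget = budget
        self.interest = random.random()       # latent feature, hidden from the agent
    def is_terminal(self):
        return self.budget <= 0
    def state(self):
        return {}, {"interest": self.interest}  # (observable, latent)
    def update(self, response):
        self.budget -= 1                      # each consumed item uses one unit of budget

class ToyDocumentModel:
    def sample_candidates(self, n=10):
        return list(range(n))

class ToyChoiceModel:
    def choose(self, slate, observable, latent):
        return Response(clicked_doc=random.choice(slate), reward=1.0)

class ToyRandomAgent:
    def recommend(self, observable, candidates, slate_size):
        return random.sample(candidates, slate_size)
    def update(self, response):
        pass                                  # a learning agent would train here

def run_episode(user_model, document_model, choice_model, agent, slate_size=3):
    """One episode: repeat steps 1-6 until the user model reaches a terminal state."""
    total_reward = 0.0
    while not user_model.is_terminal():
        observable, latent = user_model.state()                       # step 1
        candidates = document_model.sample_candidates()               # step 1
        slate = agent.recommend(observable, candidates, slate_size)   # steps 2-3
        response = choice_model.choose(slate, observable, latent)     # steps 4-5
        user_model.update(response)                                   # step 6
        agent.update(response)                                        # step 6
        total_reward += response.reward
    return total_reward
```

Note that only the choice model ever sees the latent user features; the agent receives just the observable portion, mirroring the information flow in Figure 2.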
To facilitate batch RL investigations, RecSim allows one to log the trace of all (simulated) user interactions for offline training using a suitable RL agent. Specifically, it logs each episode as a TensorFlow [Abadi et al., 2016] SequenceExample, against which developers can develop and test batch RL methods (and evaluate the resulting agent through additional environment interaction).

RecSim relies on TensorFlow's TensorBoard to visualize aggregate metrics over time, whether during agent training or at evaluation time using freshly sampled users (and documents, if desired). RecSim includes a number of typical metrics (e.g., average cumulative reward and episode length) as well as some basic diversity metrics. It uses a separate simulation process to run evaluation (in parallel across multiple users), which logs metrics for a specified number of episodes. The evaluation phase can be configured in multiple ways to evaluate, say, the robustness of a trained recommendation agent to changes in the environment (e.g., changes to the user state distribution, transition functions, or choice models). For example, as we discuss in one use case below, we can train the agent using trajectory data assuming a particular user-choice model, but evaluate the agent with simulated users instantiated with a different user-choice model. This separate evaluation process is analogous to the splitting of games into train and test sets in ALE (to assess over-fitting with hyperparameter tuning).

4.2 Environment

The environment provides APIs for the simulator to perform the steps described in Figure 2. Once all components in the environment are defined, the environment is wrapped in an OpenAI Gym [Brockman et al., 2016] environment. OpenAI Gym has proven popular for specifying novel environments to train/test numerous RL algorithms.
Developers can thus readily incorporate state-of-the-art RL algorithms for recommender domains. Because OpenAI Gym is intended for RL evaluation, developers are required to define a reward function for each environment; this is usually interpreted as the primary criterion (or at least one of the criteria) for evaluation of the recommender agent, and will generally be a function of a user's (history of) responses.

4.3 Recommender Agent Architecture

RecSim provides several abstractions with APIs for the simulation steps in Figure 2 so developers can create agents in a configurable way with reusable modules. We do not focus on modules specific to RL algorithms (e.g., replay memory [Lin and Mitchell, 1992]). Instead, RecSim offers stackable hierarchical agent layers, each intended to solve a more abstract recommendation problem. A hierarchical agent layer does not materialize a slate of documents (i.e., an RS action), but relies on one or more base agents to do so. The hierarchical agent architecture in RecSim can roughly be summarized as follows: a hierarchical agent layer receives an observation and reward from the environment; it preprocesses the raw observation and passes it to one or more base agents. Each base agent outputs either a slate or an abstract action (depending on the use case), which is then post-processed by the layer to create/output the slate (concrete action). Hierarchical layers are recursively stackable in a fashion similar to Keras [Chollet et al., 2015] layers. Hierarchical layers are defined by their pre- and post-processing functions and can play many roles depending on how these are implemented. For example, a layer can be used as a pure feature injector: it can extract some feature from the (history of) observations and pass it to the base agent, while keeping the post-processing function vacuous. This allows the decoupling of feature engineering from agent engineering.
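The feature-injector pattern can be sketched as follows. The class and method names are hypothetical stand-ins, not the actual RecSim API: the layer accumulates click statistics, injects them into the observation seen by its base agent, and leaves post-processing vacuous (the base agent's slate is returned unchanged).

```python
# Hypothetical sketch of a hierarchical agent layer used as a pure feature
# injector; names are illustrative, not RecSim's API.

class FeatureInjectorLayer:
    def __init__(self, base_agent):
        self.base_agent = base_agent
        self.click_counts = {}                 # running per-document click statistics

    def step(self, reward, observation):
        # Pre-process: inject the derived feature into the observation.
        enriched = dict(observation, click_counts=dict(self.click_counts))
        slate = self.base_agent.step(reward, enriched)
        # Post-process: vacuous here. A slate-building layer would instead
        # turn the base agent's abstract action into a concrete slate.
        return slate

    def record_click(self, doc_id):
        self.click_counts[doc_id] = self.click_counts.get(doc_id, 0) + 1

class GreedyBase:
    """Toy base agent: recommend the documents with the most recorded clicks."""
    def step(self, reward, observation):
        counts = observation["click_counts"]
        docs = observation["candidates"]
        return sorted(docs, key=lambda d: -counts.get(d, 0))[:2]
```

Because the layer, not the base agent, owns the click statistics, the same base agent can be reused with different feature-engineering layers stacked on top.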
Various regularizers can be implemented in a similar fashion by modifying the reward. Layers may also be stateful and dynamic, as the pre- or post-processing functions may implement parameter updates or learning mechanisms.

We demonstrate the use of the hierarchical agent layer using an environment with users having latent topic interests. In this environment, recommender agents are tasked with exploration to uncover user interests by showing documents with different topics. Exploration can be powered by, say, a contextual bandit algorithm. A hierarchical agent layer can be used to log bandit feedback (i.e., per-topic impression and click statistics). Its base agent exploits that bandit feedback to return a slate of recommended documents with the "best" topic(s) using some bandit algorithm. These impression and click statistics should not be part of the user model, since neither a user's choice nor transition depends on them; indeed, they are agent-specific. However, it is useful to have a common ClusterClickStatsLayer for modeling user history in RecSim, since many agents use those statistics in their algorithms. Another set of sufficient statistics commonly used in POMDP reinforcement learning is that of finite histories of observables. We have implemented FixedLengthHistoryLayer, which records observations about the user, documents, and user responses during the last few turns. Agents can utilize that layer to, say, model temporally dynamic behavior without requiring direct access to the user model.

A slightly more subtle illustration can be found in the TemporalAggregationLayer. It was recently shown [Mladenov et al., 2019] how reducing the control frequency of an agent may improve performance in environments where observability is low and/or the available state representations are limited.
The T emporalAg gr e gationLayer reduces the control frequency of its base layer by calling its step function once ev ery k steps, then remembering the features of the returned slate and trying to reproduce a similar slate for the remaining k − 1 control periods. It also provides an optional switching cost regularizer that penalizes the agent for switching to a slate with different features. In this way , the concept of temporal aggregation/re gularization can be applied to any base agent. Finally , we provide a general hierarchical agent that can wrap an arbitrary list of agents as abstract actions, implementing arbitrary tree-structured hierarchical agent architectures. R E C S I M provides a set of baseline agents to facilitate the e valuation of new recom- mender agents. T abularQAg ent implements the Q-learning algorithm [W atkins and Dayan, 1992] (by discretizing observations if necessary). The size of tab ular representation of state- action space is exponential in the length of observ ations; and it enumerates all possible 14 recommendation slates (actions) in order to maximize ov er Q v alues during training and serving/e valuation—hence it is a suitable baseline only for the smallest en vironments. Full- SlateQAgent implements a deep Q-Network (DQN) agent [Mnih et al., 2015] by treating each slate as a single action and querying the DQN for maximizing action. The inputs of this DQN are observations of the en vironment. Slate enumeration generally limits the number of candidate documents that can be e v aluated at each interaction (b ut see the SlateQ use case belo w). RandomAgent recommends random slates with no duplicates. T o be self-contained, R E C S I M also pro vides an adapter for applying the DQN agent in Dopamine to the simulation en vironments packaged with R E C S I M . 
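To make concrete why treating each slate as a single action limits the candidate set: the number of ordered, duplicate-free slates of size k drawn from n candidates is n!/(n − k)!, so the action space grows polynomially of degree k in the candidate count. A two-line check:

```python
from math import perm

# Number of distinct ordered slates (no duplicates) of size k = 3 as the
# candidate set |D| grows; this is the action space a slate-as-action agent
# such as FullSlateQAgent must enumerate at every step.
for n in (10, 50, 100):
    print(n, perm(n, 3))   # n! / (n - 3)!
```

Even at k = 3, a hundred candidates already yield nearly a million distinct slate actions, which motivates the decomposition approach of the SlateQ use case below.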
In addition, several standard multi-armed bandit algorithms [Auer et al., 2002, Garivier and Cappe, 2011, Agrawal and Goyal, 2013] are provided to support experimentation with exploration (e.g., of latent topic interests).

5 Case Studies

We outline three use cases developed within RecSim: one a standard bandit approach, the other two driving recent research into novel RL techniques for recommendation. These illustrate the range of uses to which RecSim can be put, even using environments with fairly simple, stylized models of user interaction.

5.1 Latent State Bandits

In this study, we examine how the tradeoff between immediate and long-term reward affects the value of exploration. The environment samples users whose latent topic interests are not directly observable by the agent. An agent can discover these interests using various exploration strategies. This single-item recommendation environment assumes a set of topics (or user interests) T. Documents are drawn from a content distribution P_D over topics. Each document d in the set of documents D has: an associated topic vector d, a one-hot encoding of d's sole topic T(d); and an inherent quality L_d, representing its topic-independent attractiveness. L_d is distributed according to lnN(μ_T(d), σ²), where μ_t is a topic-specific mean quality for any t ∈ T. The user model assumes users u ∈ U have varying degrees of interest in topics (with some prior distribution P_U). Each user u has a static interest vector u. User u's interest in document d is given by the dot product I(u, d) = ud. The probability that u chooses d is proportional to a function of topic affinity and document quality: f(I(u, d) + L_d). We evaluate agents using the total user clicks induced over a session.
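The environment above can be sketched in a few lines, assuming f = exp, i.e., a multinomial-logit choice rule (the text leaves f abstract, so this particular choice, and the parameter values, are illustrative assumptions):

```python
import numpy as np

# Sketch of the single-item latent-state bandit environment, assuming f = exp
# (a multinomial logit); parameter values are illustrative only.
rng = np.random.default_rng(1)
num_topics = 4
topic_mean_quality = rng.normal(0.0, 0.5, size=num_topics)   # mu_t per topic

def sample_document():
    topic = rng.integers(num_topics)
    quality = rng.lognormal(topic_mean_quality[topic], 0.3)   # L_d ~ lnN(mu_t, sigma^2)
    return topic, quality

def choice_probs(user_interest, docs):
    """P(u chooses d) proportional to exp(I(u, d) + L_d) over the offered docs."""
    scores = np.array([user_interest[t] + q for t, q in docs])
    w = np.exp(scores - scores.max())                         # numerically stable softmax
    return w / w.sum()

user_interest = rng.uniform(-1.0, 1.0, size=num_topics)      # static latent interests
docs = [sample_document() for _ in range(5)]
p = choice_probs(user_interest, docs)
```

Scaling the interest term up or down relative to L_d is exactly the high- versus low-topic-affinity knob studied below.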
We can configure P_U and the distribution over d conditioned on P_U so that topic affinity influences user choice more than document quality: I(u, d) > L_d. Intuitively, exploration or planning (RL) is critical in this case; it will be less so when I(u, d) ∼ L_d (low topic affinity). The following table presents the results of applying different exploration/planning strategies, using the agents described in Sec. 4: RandomAgent, TabularQAgent, FullSlateQAgent, and a bandit agent powered by UCB1 [Auer et al., 2002]. The latter three agents employ the ClusterClickStatsLayer for per-topic impression and click counts. We also implement an "omniscient" greedy agent, which knows the user choice model and user prior, to myopically optimize expected reward f(I(û, d) + L_d), where û is the average user (interest). We see that UCB1 and Q-learning perform far better than the other agents in the high-affinity environment. (Parenthesized values are relative improvements over Random.)

Strategy     Avg. CTR (%), Low Topic Affinity    Avg. CTR (%), High Topic Affinity
Random       7.86                                14.97
Greedy       9.59 (22.01%)                       17.56 (17.30%)
TabularQ     8.24 (4.83%)                        20.16 (34.67%)
FullSlateQ   9.64 (22.60%)                       23.28 (55.51%)
UCB1         9.76 (24.17%)                       25.17 (68.14%)

5.2 Tractable Decomposition for Slate RL

Many RSs recommend slates, i.e., multiple items simultaneously, inducing an RL problem with a large combinatorial action space that is challenging for exploration, generalization, and action optimization. While recent RL methods for such combinatorial action spaces [Sunehag et al., 2015, Metz et al., 2017] take steps to address this problem, they are unable to scale to problems of the size encountered in large, real-world RSs. Ie et al. [2019] used RecSim to study decomposition techniques for estimating Q-values of whole recommendation slates.
The algorithm exploits certain assumptions about user choice behavior (the process by which a user selects and/or engages with items on a slate) to construct a decomposition based on a linear combination of constituent item Q-values. While these assumptions are minimal and seem natural for recommender settings, the authors used RecSim to study (a) the efficacy of the decomposed TD/Q-learning algorithm variants over the myopic policies commonly found in commercial recommenders, and (b) the robustness of the estimation algorithm under user-choice-model deviations.

In their simulations, user topic interests are observable and shift with exposure to documents with specific topics. The number of topics is finite, with some having high average quality and others having lower quality. Document quality impacts how quickly a user's (time) budget decays during a session. While users may have initial interest in low-quality topics, their interests shift to higher-quality topics if the agent successfully determines which topics have greater long-term value. The results demonstrate that using RL to plan long-term interactions can provide significant value in terms of overall engagement. While full Q-learning, including optimal slate search in both training and inference, may yield the best overall long-term user engagement, a significant portion of the gains can be captured using a less costly variant of on-policy TD-learning coupled with greedy slate construction at serving time. The authors also found that the decomposition technique is robust to user-choice-model shifts: gains over myopic approaches are still possible even if the assumed user choice model differs. We refer to Ie et al. [2019] for additional environment details and results.
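The decomposition in Ie et al. [2019] takes the form Q(s, A) = Σ_{i∈A} P(i | s, A) · Q̄(s, i): the Q-value of a slate is the choice-probability-weighted sum of per-item long-term values. A minimal sketch, assuming a conditional-logit choice model over item scores and a naive greedy serving heuristic (both illustrative; the paper's actual choice model and slate-optimization procedure differ in details):

```python
import numpy as np

# Sketch of a SlateQ-style decomposition: Q(s, A) = sum_i P(i|s,A) * Qbar(s, i),
# with an illustrative conditional-logit choice model over item scores.

def slate_q(item_scores, item_q_values):
    """Slate Q-value from per-item scores (driving choice) and per-item LTVs."""
    scores = np.asarray(item_scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # P(i | s, A) under conditional logit
    return float(probs @ np.asarray(item_q_values, dtype=float))

def greedy_slate(candidates, q_of, score_of, k):
    """Naive greedy serving: repeatedly add the item that most increases Q(s, A)."""
    slate = []
    for _ in range(k):
        best = max((c for c in candidates if c not in slate),
                   key=lambda c: slate_q([score_of(x) for x in slate + [c]],
                                         [q_of(x) for x in slate + [c]]))
        slate.append(best)
    return slate
```

The key computational benefit is that only per-item quantities Q̄(s, i) need to be learned, avoiding enumeration of the combinatorial slate action space during training.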
5.3 Advantage Amplification over Long Horizons

Experiments in real-world ads systems and RSs suggest that some aspects of user latent state evolve very slowly [Hohnhold et al., 2015, Wilhelm et al., 2018]. Such slow user-learning behavior, in environments with a low signal-to-noise ratio (SNR), poses severe challenges for end-to-end event-level RL. Mladenov et al. [2019] used RecSim to investigate this issue. The authors developed a simulation environment in which documents have an observable quality that ranges within [0, 1]. Documents at the 0-end of the scale are termed chocolate, and lead to large amounts of immediate engagement, while documents at the 1-end, termed kale, generate lower engagement but tend to increase satisfaction. User satisfaction is modeled as a variable in [0, 1] that stochastically (and slowly) increases or decreases with the consumption of different types of content. Pure chocolate documents generate engagement drawn from lnN(μ_choc, σ_choc), pure kale documents from lnN(μ_kale, σ_kale), while mixed documents interpolate linearly between the parameters of the two distributions in proportion to their "kaleness".

One possible response to the difficulties of learning in slowly evolving environments with low SNR is the use of temporally aggregated hierarchical policies. Mladenov et al. [2019] implement two approaches as hierarchical agent layers in RecSim that can modify a base agent: temporal aggregation (repeating actions for some predetermined period k) and temporal regularization (subtracting a constant λ from the reward whenever A_t ≠ A_{t−1}, in terms of document features). These hierarchical agent nodes amplify the differences between Q-values of actions (the advantage function), making the learned policy less susceptible to low-SNR effects.
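The two mechanisms can be sketched as wrappers around a base agent, in the same layer style as Section 4.3. All names below are hypothetical, and the exact bookkeeping of when the switching penalty is charged is a simplification of the scheme in Mladenov et al. [2019]:

```python
# Hypothetical sketches of temporal aggregation and temporal regularization
# as hierarchical layers; names and details are illustrative, not RecSim's API.

class TemporalAggregationLayer:
    """Query the base agent only every k steps; repeat its last action in between."""
    def __init__(self, base_agent, k):
        self.base_agent, self.k = base_agent, k
        self.t, self.last_action = 0, None

    def step(self, reward, observation):
        if self.t % self.k == 0:
            self.last_action = self.base_agent.step(reward, observation)
        self.t += 1
        return self.last_action

class TemporalRegularizationLayer:
    """Subtract lam from the base agent's reward whenever its action changed
    on the previous step (a stylized A_t != A_{t-1} switching cost)."""
    def __init__(self, base_agent, lam):
        self.base_agent, self.lam = base_agent, lam
        self.switched, self.last_action = False, None

    def step(self, reward, observation):
        if self.switched:
            reward -= self.lam               # charge the switch just taken
        action = self.base_agent.step(reward, observation)
        self.switched = self.last_action is not None and action != self.last_action
        self.last_action = action
        return action

class CyclingBase:
    """Toy base agent that records the rewards it sees and holds each action
    for three steps before moving on."""
    def __init__(self):
        self.rewards, self.i = [], 0
    def step(self, reward, observation):
        self.rewards.append(reward)
        self.i += 1
        return self.i // 3
```

Both wrappers leave the base agent untouched, which is what makes the amplification idea applicable to any base agent.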
In simulation, temporal aggregation was shown to improve the quality of learned policies to a point almost identical to the case where user satisfaction is fully observed. We refer to Mladenov et al. [2019] for environment details and results.

6 Next Steps

While RecSim in its current form provides ample opportunity for researchers and practitioners to probe and question assumptions made by RL/RS algorithms in stylized environments, we recognize the broader interest in the community to develop models that address the "sim-to-real" gap. To that end, we are developing methodologies to fit stylized user models using production usage logs, as well as additional hooks in the framework to plug in such user models, to create environments that are more faithful to specific (e.g., commercial) RSs. We expect that such fitted stylized user models, especially when abstracted suitably to address concerns about data privacy and business practices, may facilitate industry/academic collaborations. That said, we view this less as directly tackling "sim-to-real" transfer, and more as a means of aligning research objectives around realistic problem characteristics that reflect the needs and behaviors of real users.

Our initial emphasis in this release of RecSim is facilitating the creation of new simulation environments that draw attention to modeling and algorithmic challenges pertinent to RSs. Naturally, there are many directions for further development of RecSim. For example, we are extending it to allow concurrent execution that deviates from the single-user control flow depicted in Fig. 2. Concurrent execution will not only improve simulation throughput, but also reflects how RS agents operate in real-world production settings. In particular, it will allow investigation of phenomena, such as "distributed exploration" across users, that are not feasible within the current serial user control flow.
Finally, modern CIRs will involve rich forms of mixed-mode interactions that cover a variety of system actions (e.g., preference elicitation, providing endorsements, navigation chips) and user responses (e.g., example critiquing, indirect/direct feedback, query refinements), not to mention unstructured natural-language interaction. Furthermore, real-world users typically transition across the search-browsing spectrum over multiple RS sessions. Our next major release will incorporate some of these interaction modalities.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695, 2016.

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pages 99–107, Scottsdale, AZ, 2013.

Qingyao Ai, Keping Bi, Jiafeng Guo, and W. Bruce Croft. Learning a deep listwise context model for ranking refinement. In Proceedings of the 41st Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR-18), pages 135–144, Ann Arbor, MI, 2018.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. Seq2Slate: Re-ranking and slate optimization with RNNs. arXiv:1810.02019 [cs.IR], 2018.
Craig Boutilier, Alon Cohen, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov, and Dale Schuurmans. Planning and learning with stochastic action sets. In International Joint Conference on Artificial Intelligence (IJCAI), pages 4674–4682, Stockholm, 2018.

Jack S. Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 43–52, Madison, WI, 1998.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540 [cs.LG], 2016.

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv:1812.06110 [cs.LG], 2018.

Yash Chandak, Georgios Theocharous, Blossom Metevier, and Philip S. Thomas. Reinforcement learning when all actions are not always available. [cs.LG], 2019.

Olivier Chapelle. Modeling delayed feedback in display advertising. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-14), pages 1097–1105, 2014.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24 (NIPS-11), pages 2249–2257, 2011.

Li Chen and Pearl Pu. Critiquing-based recommenders: Survey and emerging trends. User Modeling and User-Adapted Interaction, 22(1):125–150, April 2012.

Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed Chi. Top-k off-policy correction for a REINFORCE recommender system. In 12th ACM International Conference on Web Search and Data Mining (WSDM-19), pages 456–464, Melbourne, Australia, 2018.

François Chollet et al. Keras.
https://github.com/fchollet/keras, 2015.

Konstantina Christakopoulou and Arindam Banerjee. Learning to interact with users: A collaborative-bandit approach. In Proceedings of the 2018 SIAM International Conference on Data Mining (SDM-18), pages 612–620, San Diego, CA, 2018.

Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-16), pages 815–824, New York, NY, 2016.

Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv:1904.12901 [cs.LG], 2019.

Aurelien Garivier and Olivier Cappe. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.

Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, and Xiaohui Ye. Horizon: Facebook's open source applied reinforcement learning platform. arXiv:1811.00260 [cs.LG], 2018.

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. A knowledge-grounded neural conversation model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 5110–5117, New Orleans, 2018.

Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Neal Wu, Chris Harris, Vincent Vanhoucke, and Eugene Brevdo. TF-Agents: A library for reinforcement learning in TensorFlow. https://github.com/tensorflow/agents, 2018. Accessed 25-June-2019.
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. arXiv:1812.05905 [cs.LG], 2018.

Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov decision processes. arXiv:1502.02259 [stat.ML], 2015.

F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1–19:19, 2016.

Ruining He and Julian McAuley. Fusing similarity models with Markov chains for sparse sequential recommendation. In Proceedings of the IEEE International Conference on Data Mining (ICDM-16), Barcelona, 2016.

Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. In 4th International Conference on Learning Representations (ICLR-16), San Juan, Puerto Rico, 2016.

Henning Hohnhold, Deirdre O'Brien, and Diane Tang. Focusing on the long-term: It's good for users and business. In Proceedings of the Twenty-first ACM International Conference on Knowledge Discovery and Data Mining (KDD-15), pages 1849–1858, Sydney, 2015.

Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, and Craig Boutilier. SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2592–2599, Macau, 2019.

Ray Jiang, Sven Gowal, Timothy A. Mann, and Danilo J. Rezende. Beyond greedy ranking: Slate optimization via List-CVAE. In Proceedings of the Seventh International Conference on Learning Representations (ICLR-19), New Orleans, 2019.

Thorsten Joachims. Optimizing search engines using clickthrough data.
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-02), pages 133–142, 2002.

Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon, and John Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87, 1997.

Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems 31, pages 3818–3827, Montreal, 2018.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW-10), pages 661–670, 2010.

Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-16), 2016.

Long-Ji Lin and Tom M. Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Technical Report CS-92-138, Carnegie Mellon University, Department of Computer Science, May 1992.

Qianlong Liu, Baoliang Cui, Zhongyu Wei, Baolin Peng, Haikuan Huang, Hongbo Deng, Jianye Hao, Xuanjing Huang, and Kam-Fai Wong. Building personalized simulator for interactive search. In Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pages 5109–5115, Macau, 2019.

Jordan J. Louviere, David A. Hensher, and Joffre D. Swait. Stated Choice Methods: Analysis and Application. Cambridge University Press, Cambridge, 2000.

Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep RL. arXiv:1705.05035 [cs.LG], 2017.

Martin Mladenov, Ofer Meshi, Jayden Ooi, Dale Schuurmans, and Craig Boutilier.
Advantage amplification in slowly evolving latent-state environments. In International Joint Conference on Artificial Intelligence (IJCAI), pages 3165–3172, Macau, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Kam-Fai Wong. Deep Dyna-Q: Integrating planning for task-completion dialogue policy learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL-18), pages 2182–2192, Melbourne, 2018.

Martin Riedmiller. Neural fitted Q-iteration: First experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning, pages 317–328, Porto, Portugal, 2005.

M. Rodriguez, C. Posse, and E. Zhang. Multiple objective optimization in recommender systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, pages 11–18. ACM, 2012.

David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. RecoGym: A reinforcement learning environment for the problem of product recommendation in online advertising. arXiv:1808.00720 [cs.IR], 2018.

Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS-07), pages 1257–1264, Vancouver, 2007.

Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 149–152, Rochester, New York, April 2007. Association for Computational Linguistics.
URL https://www.aclweb.org/anthology/N07-2038.

Guy Shani, David Heckerman, and Ronen I. Brafman. An MDP-based recommender system. Journal of Machine Learning Research, 6:1265–1295, 2005.

Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the Thirty-third AAAI Conference on Artificial Intelligence (AAAI-19), pages 4902–4909, Honolulu, 2019.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Yueming Sun and Yi Zhang. Conversational recommender system. [cs.IR], 2018.

Peter Sunehag, Richard Evans, Gabriel Dulac-Arnold, Yori Zwols, Daniel Visentin, and Ben Coppin. Deep reinforcement learning with attention for slate Markov decision processes with high-dimensional states and actions. arXiv:1512.01124 [cs.AI], 2015.

Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C. Lawrence Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. In Advances in Neural Information Processing Systems 30 (NIPS-17), pages 2659–2669, Long Beach, CA, 2017.

Oriol Vinyals and Quoc V. Le. A neural conversational model. [cs.CL], 2015.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

Wei Wei, Quoc Le, Andrew Dai, and Jia Li. AirDialogue: An environment for goal-oriented dialogue research. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18), pages 3844–3854, Brussels, 2018.

Mark Wilhelm, Ajith Ramanathan, Alexander Bonomo, Sagar Jain, Ed H. Chi, and Jennifer Gillenwater.
Practical diversified recommendations on YouTube with determinantal point processes. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM-18), pages 2165–2173, Torino, Italy, 2018.

Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM-17), pages 495–503, Cambridge, UK, 2017.

Amy Zhang, Yuxin Wu, and Joelle Pineau. Natural environment benchmarks for reinforcement learning. arXiv:1811.06032 [cs.LG], 2018.

Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys-18), pages 95–103, Vancouver, 2018.

Xiangyu Zhao, Long Xia, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Toward simulating environments in reinforcement learning based recommendations. [cs.IR], 2019.

Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference (WWW-18), pages 167–176, 2018.
