Split Q Learning: Reinforcement Learning with Two-Stream Rewards


Authors: Baihan Lin, Djallel Bouneffouf, Guillermo Cecchi

Affiliations: Center for Theoretical Neuroscience, Columbia University, New York, NY 10027, USA (B. Lin); IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA (D. Bouneffouf, G. Cecchi). Contact: Baihan.Lin@Columbia.edu, {dbouneffouf, gcecchi}@us.ibm.com

Abstract

Drawing inspiration from behavioral studies of human decision making, we propose a general parametric framework for reinforcement learning that extends the standard Q-learning approach to incorporate a two-stream framework of reward processing, with biases biologically associated with several neurological and psychiatric conditions, including Parkinson's and Alzheimer's diseases, attention-deficit/hyperactivity disorder (ADHD), addiction, and chronic pain. For the AI community, the development of agents that react differently to different types of rewards can enable us to understand a wide spectrum of multi-agent interactions in complex real-world socioeconomic systems. Moreover, from the behavioral modeling perspective, our parametric framework can be viewed as a first step towards a unifying computational model capturing reward-processing abnormalities across multiple mental conditions, as well as user preferences in long-term recommendation systems.

1 Introduction

In order to better understand and model human decision-making behavior, scientists usually investigate reward-processing mechanisms in healthy subjects [Perry and Kramer, 2015]. However, neurodegenerative and psychiatric disorders, often associated with reward-processing disruptions, can provide an additional resource for a deeper understanding of human decision-making mechanisms.
Furthermore, from the perspective of evolutionary psychiatry, various mental disorders, including depression, anxiety, ADHD, addiction and even schizophrenia, can be considered "extreme points" in a continuous spectrum of behaviors and traits developed for various purposes during evolution; somewhat less extreme versions of those traits can actually be beneficial in specific environments (e.g., ADHD-like fast-switching attention can be life-saving in certain settings). Thus, modeling decision-making biases and traits associated with various disorders may enrich existing computational decision-making models, leading to potentially more flexible and better-performing algorithms.

Herein, we focus on reward-processing biases associated with several mental disorders, including Parkinson's and Alzheimer's disease, ADHD, addiction and chronic pain. Our questions are: is it possible to extend standard reinforcement learning algorithms to mimic human behavior in such disorders? Can such generalized approaches outperform standard reinforcement learning algorithms on specific tasks? We show that both questions can be answered positively. We build upon Q-learning, a state-of-the-art approach to the RL problem, and extend it to a parametric version that splits the reward information into a positive stream and a negative stream, with various reward-processing biases known to be associated with particular disorders. For example, it was shown that (unmedicated) patients with Parkinson's disease appear to learn better from negative rather than from positive rewards [Frank et al., 2004]; another example is addictive behaviors, which may be associated with an inability to forget strong stimulus-response associations from the past, i.e., to properly discount past rewards [Redish et al., 2007].
More specifically, we propose a parametric model that introduces weights on incoming positive and negative rewards, and on reward histories, extending the standard parameter update rules in Q-learning; tuning the parameter settings allows us to better capture specific reward-processing biases.

2 Proposed Approach: Split Q Learning

We propose Split Q Learning (SQL), outlined in Algorithm 1, which updates the Q values using four weight parameters: φ1 and φ3 weight the previously accumulated positive and negative values, respectively, while φ2 and φ4 weight the positive and negative streams when they are combined to select actions. The algorithm maintains two Q tables, Q+ and Q−, which respectively record the positive and negative feedback.

Algorithm 1: Split Q Learning (SQL)
1: for each episode t do
2:   Initialize s
3:   repeat
4:     Q(s, a) := φ2 Q+(s, a) + φ4 Q−(s, a)
5:     Take action a = argmax_{a'} Q(s, a'); observe s' ∈ S, r ∈ R(s)
6:     Q+(s, a) := φ1 Q̂+(s, a) + α_t (r+ + γ V̂+(s') − Q̂+(s, a))
7:     Q−(s, a) := φ3 Q̂−(s, a) + α_t (r− + γ V̂−(s') − Q̂−(s, a))
8:   until s is terminal

Table 1: Algorithm Parameters

                      φ1         φ2         φ3         φ4
AD (addiction)        1 ± 0.1    1 ± 0.1    0.5 ± 0.1  1 ± 0.1
ADHD                  0.2 ± 0.1  1 ± 0.1    0.2 ± 0.1  1 ± 0.1
AZ (Alzheimer's)      0.1 ± 0.1  1 ± 0.1    0.1 ± 0.1  1 ± 0.1
CP (chronic pain)     0.5 ± 0.1  0.5 ± 0.1  1 ± 0.1    1 ± 0.1
bvFTD                 0.5 ± 0.1  100 ± 10   0.5 ± 0.1  1 ± 0.1
PD (Parkinson's)      0.5 ± 0.1  1 ± 0.1    0.5 ± 0.1  100 ± 10
M ("moderate")        0.5 ± 0.1  1 ± 0.1    0.5 ± 0.1  1 ± 0.1
standard SQL          1          1          1          1

3 Reward Processing with Different Biases

In this section we describe how specific constraints on the model parameters of the proposed algorithm can yield different reward-processing biases, and introduce several instances of the SQL model with parameter settings reflecting particular biases.
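As an illustration, the update rules in Algorithm 1 can be sketched as a small tabular agent in Python. This is a minimal sketch, not the authors' implementation: the epsilon-greedy exploration, the max/min split of the scalar reward into r+ and r−, and the per-stream bootstrap values V̂+ and V̂− (taken here as each stream's own maximum over actions) are our assumptions; the four φ weights follow the notation of Table 1.

```python
import random
from collections import defaultdict

class SplitQAgent:
    """Sketch of Split Q-Learning (Algorithm 1) with two tabular Q tables."""

    def __init__(self, actions, phi1=1.0, phi2=1.0, phi3=1.0, phi4=1.0,
                 alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = actions
        self.phi1, self.phi2, self.phi3, self.phi4 = phi1, phi2, phi3, phi4
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # Two Q tables recording the positive and negative streams.
        self.q_pos = defaultdict(float)
        self.q_neg = defaultdict(float)

    def q(self, s, a):
        # Line 4: combine the streams with the stream weights phi2, phi4.
        return self.phi2 * self.q_pos[(s, a)] + self.phi4 * self.q_neg[(s, a)]

    def act(self, s):
        # Line 5: greedy action on the combined value (epsilon-greedy here).
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q(s, a))

    def update(self, s, a, r, s_next):
        # Split the scalar reward into positive and negative parts
        # (one natural choice; the split is not fully specified here).
        r_pos, r_neg = max(r, 0.0), min(r, 0.0)
        v_pos = max(self.q_pos[(s_next, b)] for b in self.actions)
        v_neg = max(self.q_neg[(s_next, b)] for b in self.actions)
        # Lines 6-7: phi1/phi3 discount each stream's accumulated value,
        # then a standard TD update is applied within the stream.
        self.q_pos[(s, a)] = (self.phi1 * self.q_pos[(s, a)]
                              + self.alpha * (r_pos + self.gamma * v_pos
                                              - self.q_pos[(s, a)]))
        self.q_neg[(s, a)] = (self.phi3 * self.q_neg[(s, a)]
                              + self.alpha * (r_neg + self.gamma * v_neg
                                              - self.q_neg[(s, a)]))
```

With all φ set to 1 (the "standard SQL" row of Table 1), each stream performs an ordinary TD update on its half of the reward; setting, e.g., φ4 = 100 (the PD row) makes action selection dominated by the negative stream.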
The parameter settings are summarized in Table 1, as elaborated in [Bouneffouf et al., 2017]. Of course, these models should be treated only as first approximations of the reward-processing biases in mental disorders, since the actual changes in reward processing are much more complicated, and the parameter settings must be learned from actual patient data. Herein, we first consider these models as specific variations of our general method, inspired by certain aspects of the corresponding diseases, and focus primarily on the computational aspects of our algorithm, demonstrating that the proposed parametric extension of Q-learning can learn better than baseline Q-learning due to its added flexibility.

In the second step of this research, we use inverse reinforcement learning (IRL) [Abbeel and Ng, 2004] to learn the most likely reward function R_E of a human subject, treated as an executing expert E, given a collected behavioral trajectory consisting of a sequence of state-action pairs. Given a weight vector w, one can compute the optimal policy π_w for the corresponding reward function R_w and estimate its feature expectations μ̂(π_w). IRL compares μ̂(π_w) with the expert's feature expectations μ̂_E to learn the best-fitting weight vectors w. Instead of a single weight vector, the IRL algorithm learns a set of possible weight vectors and asks the agent designer to pick the most appropriate one by inspecting the corresponding policies. In this way, we learn the parameters φ1, φ2, φ3, φ4 for human subjects.

4 Reward Scaling in Reinforcement Learning

To demonstrate that our proposed two-stream parametric extension of Q-learning can learn better than baseline Q-learning, we tested our agents in nine computer games: Pacman, Catcher, FlappyBird, Pixelcopter, Pong, PuckWorld, Snake, WaterWorld, and Monster Kong.
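Returning briefly to the IRL-based parameter fitting of Section 3, the feature-expectation comparison can be sketched as follows. This is a hedged sketch of the generic apprenticeship-learning quantities, not the authors' code: the feature map, the trajectory format, and the Euclidean nearest-μ criterion (standing in for the designer's inspection step described in the text) are illustrative assumptions.

```python
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma=0.95):
    """Empirical mu(pi): average discounted feature sum over trajectories,
    mu = E[ sum_t gamma^t * phi(s_t) ] (Abbeel and Ng, 2004)."""
    mus = []
    for traj in trajectories:            # traj: list of (state, action) pairs
        mu = sum((gamma ** t) * np.asarray(feature_fn(s))
                 for t, (s, _a) in enumerate(traj))
        mus.append(mu)
    return np.mean(mus, axis=0)

def pick_best_weights(candidate_ws, mu_by_w, mu_expert):
    """Among candidate weight vectors (as tuples), pick the one whose induced
    policy's feature expectations are closest to the expert's."""
    return min(candidate_ws,
               key=lambda w: np.linalg.norm(mu_by_w[w] - mu_expert))
```

In the paper's setting, each candidate weight vector would encode the φ parameters being fitted to a subject's behavioral trajectory.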
In each game, we tested both stationary and non-stationary environments by rescaling the size and frequency of the reward signals in the two streams. Preliminary results suggest that SQL outperforms classical Q-learning in the long term under certain conditions (for example, positive-only and normal reward environments in Pacman). Our results also suggest that SQL behaves differently during transitions between reward environments. To understand this discrepancy, we further developed a variant of SQL that updates its four bias parameters adaptively with the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm [Srinivas et al., 2009].

5 Conclusion and Future Work

This research proposes a novel parametric family of algorithms for the RL problem, extending classical Q-learning to model a wide range of potential reward-processing biases. Our approach draws inspiration from the extensive literature on decision-making behavior in neurological and psychiatric disorders stemming from disturbances of the reward-processing system, and demonstrates the flexibility of our multi-parameter model, which allows tuning the weights on the incoming two-stream rewards and on memories of the prior reward history. Our preliminary results support multiple prior observations about reward-processing biases in a range of mental disorders, indicating the potential of the proposed model and its future extensions to capture reward-processing aspects across various neurological and psychiatric conditions.
The contribution of this research is two-fold: from the AI perspective, we propose a more powerful and adaptive approach to RL, outperforming state-of-the-art Q-learning; from the neuroscience perspective, this work is a first attempt at a general, unifying model of reward processing and its disruptions across a wide population, including both healthy subjects and those with mental disorders, which has the potential to become a useful computational tool for neuroscientists and psychiatrists studying such disorders. Among the directions for future work, we plan to investigate the optimal parameters in a series of computer games evaluated on different criteria, for example, longest survival time vs. highest final score. In addition, we plan to explore multi-agent interactions given different reward-processing biases. These discoveries can help build more interpretable real-world RL systems. On the neuroscience side, the next steps would include further tuning and extending the proposed model to better capture observations in the modern literature, as well as testing the model on both healthy subjects and patients with specific mental conditions.

References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.

[Bouneffouf et al., 2017] Djallel Bouneffouf, Irina Rish, and Guillermo A. Cecchi. Bandit models of human behavior: Reward processing in mental disorders. In International Conference on Artificial General Intelligence, pages 237–248. Springer, 2017.

[Frank et al., 2004] Michael J. Frank, Lauren C. Seeberger, and Randall C. O'Reilly. By carrot or by stick: Cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–1943, 2004.

[Perry and Kramer, 2015] David C. Perry and Joel H. Kramer. Reward processing in neurodegenerative disease. Neurocase, 21(1):120–133, 2015.

[Redish et al., 2007] A. David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson. Reconciling reinforcement learning models with behavioral extinction and renewal: Implications for addiction, relapse, and problem gambling. Psychological Review, 114(3):784, 2007.

[Srinivas et al., 2009] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
