QMDP-Net: Deep Learning for Planning under Partial Observability


Authors: Peter Karkus, David Hsu, Wee Sun Lee

Peter Karkus 1,2   David Hsu 1,2   Wee Sun Lee 2
1 NUS Graduate School for Integrative Sciences and Engineering
2 School of Computing, National University of Singapore
{karkus, dyhsu, leews}@comp.nus.edu.sg

Abstract

This paper introduces the QMDP-net, a neural network architecture for planning under partial observability. The QMDP-net combines the strengths of model-free learning and model-based planning. It is a recurrent policy network, but it represents a policy for a parameterized set of tasks by connecting a model with a planning algorithm that solves the model, thus embedding the solution structure of planning in a network learning architecture. The QMDP-net is fully differentiable and allows for end-to-end training. We train a QMDP-net on different tasks so that it can generalize to new ones in the parameterized task set and "transfer" to other similar tasks beyond the set. In preliminary experiments, QMDP-net showed strong performance on several robotic tasks in simulation. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.

1 Introduction

Decision-making under uncertainty is of fundamental importance, but it is computationally hard, especially under partial observability [24]. In a partially observable world, the agent cannot determine the state exactly based on the current observation; to plan optimal actions, it must integrate information over the past history of actions and observations. See Fig. 1 for an example. In the model-based approach, we may formulate the problem as a partially observable Markov decision process (POMDP). Solving POMDPs exactly is computationally intractable in the worst case [24].
Approximate POMDP algorithms have made dramatic progress on solving large-scale POMDPs [17, 25, 29, 32, 37]; however, manually constructing POMDP models or learning them from data remains difficult. In the model-free approach, we directly search for an optimal solution within a policy class. If we do not restrict the policy class, the difficulty is data and computational efficiency. We may choose a parameterized policy class. The effectiveness of policy search is then constrained by this a priori choice.

Deep neural networks have brought unprecedented success in many domains [16, 21, 30] and provide a distinct new approach to decision-making under uncertainty. The deep Q-network (DQN), which consists of a convolutional neural network (CNN) together with a fully connected layer, has successfully tackled many Atari games with complex visual input [21]. Replacing the post-convolutional fully connected layer of DQN by a recurrent LSTM layer allows it to deal with partial observability [10]. However, compared with planning, this approach fails to exploit the underlying sequential nature of decision-making.

We introduce the QMDP-net, a neural network architecture for planning under partial observability. The QMDP-net combines the strengths of model-free learning and model-based planning. A QMDP-net is a recurrent policy network, but it represents a policy by connecting a POMDP model with an algorithm that solves the model, thus embedding the solution structure of planning in a network learning architecture. Specifically, our network uses QMDP [18], a simple but fast approximate POMDP algorithm, though other more sophisticated POMDP algorithms could be used as well.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Fig. 1: A robot learning to navigate in partially observable grid worlds. (a) The robot has a map. It has a belief over the initial state, but does not know the exact initial state. (b) Local observations are ambiguous and are insufficient to determine the exact state. (c, d) A policy trained on expert demonstrations in a set of randomly generated environments generalizes to a new environment. It also "transfers" to a much larger real-life environment, represented as a LIDAR map [12].

A QMDP-net consists of two main network modules (Fig. 2). One represents a Bayesian filter, which integrates the history of an agent's actions and observations into a belief, i.e., a probabilistic estimate of the agent's state. The other represents the QMDP algorithm, which chooses the action given the current belief. Both modules are differentiable, allowing the entire network to be trained end-to-end.

We train a QMDP-net on expert demonstrations in a set of randomly generated environments. The trained policy generalizes to new environments and also "transfers" to more complex environments (Fig. 1c-d). Preliminary experiments show that QMDP-net outperformed state-of-the-art network architectures on several robotic tasks in simulation. It successfully solved difficult POMDPs that require reasoning over many time steps, such as the well-known Hallway2 domain [18]. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperformed the QMDP algorithm in our experiments, as a result of end-to-end learning.

2 Background

2.1 Planning under Uncertainty

A POMDP is formally defined as a tuple (S, A, O, T, Z, R), where S, A, and O are the state, action, and observation spaces, respectively. The state-transition function T(s, a, s') = P(s' | s, a) defines the probability of the agent being in state s' after taking action a in state s. The observation function Z(s, a, o) = P(o | s, a) defines the probability of receiving observation o after taking action a in state s.
The reward function R(s, a) defines the immediate reward for taking action a in state s.

In a partially observable world, the agent does not know its exact state. It maintains a belief, which is a probability distribution over S. The agent starts with an initial belief b_0 and updates the belief b_t at each time step t with a Bayesian filter:

    b_t(s') = τ(b_{t-1}, a_t, o_t) = η Z(s', a_t, o_t) Σ_{s∈S} T(s, a_t, s') b_{t-1}(s),    (1)

where η is a normalizing constant. The belief b_t recursively integrates information from the entire past history (a_1, o_1, a_2, o_2, ..., a_t, o_t) for decision making. POMDP planning seeks a policy π that maximizes the value, i.e., the expected total discounted reward:

    V_π(b_0) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_{t+1}) | b_0, π ],    (2)

where s_t is the state at time t, a_{t+1} = π(b_t) is the action that the policy π chooses at time t, and γ ∈ (0, 1) is a discount factor.

2.2 Related Work

To learn policies for decision making in partially observable domains, one approach is to learn models [6, 19, 26] and solve the models through planning. An alternative is to learn policies directly [2, 5]. Model learning is usually not end-to-end. While policy learning can be end-to-end, it does not exploit model information for effective generalization. Our proposed approach combines model-based and model-free learning by embedding a model and a planning algorithm in a recurrent neural network (RNN) that represents a policy and then training the network end-to-end.

RNNs have been used earlier for learning in partially observable domains [4, 10, 11]. In particular, Hausknecht and Stone extended DQN [21], a convolutional neural network (CNN), by replacing its post-convolutional fully connected layer with a recurrent LSTM layer [10]. Similarly, Mirowski et al. [20] considered learning to navigate in partially observable 3-D mazes.
The learned policy generalizes over different goals, but in a fixed environment. Instead of using the generic LSTM, our approach embeds algorithmic structure specific to sequential decision making in the network architecture and aims to learn a policy that generalizes to new environments.

The idea of embedding specific computation structures in the neural network architecture has been gaining attention recently. Tamar et al. implemented value iteration in a neural network, called the Value Iteration Network (VIN), to solve Markov decision processes (MDPs) in fully observable domains, where an agent knows its exact state and does not require filtering [34]. Okada et al. addressed a related problem of path integral optimal control, which allows for continuous states and actions [23]. Neither addresses the issue of partial observability, which drastically increases the computational complexity of decision making [24]. Haarnoja et al. [9] and Jonschkowski and Brock [15] developed end-to-end trainable Bayesian filters for probabilistic state estimation. Silver et al. introduced the Predictron for value estimation in Markov reward processes [31]. They do not deal with decision making or planning. Both Shankar et al. [28] and Gupta et al. [8] addressed planning under partial observability. The former focuses on learning a model rather than a policy. The learned model is trained on a fixed environment and does not generalize to new ones. The latter proposes a network learning approach to robot navigation in an unknown environment, with a focus on mapping. Its network architecture contains a hierarchical extension of VIN for planning and thus does not deal with partial observability during planning. The QMDP-net extends the prior work on network architectures for MDP planning and for Bayesian filtering. It imposes the POMDP model and computation structure priors on the entire network architecture for planning under partial observability.
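To make the filtering step concrete before moving on, here is a minimal numpy sketch of the belief update in Eq. (1) for a small tabular POMDP. The state/action/observation sizes, the random T and Z tables, and the `belief_update` helper are all hypothetical, not taken from the paper:

```python
import numpy as np

# Hypothetical small tabular POMDP: 4 states, 2 actions, 3 observations.
rng = np.random.default_rng(0)
nS, nA, nO = 4, 2, 3
T = rng.random((nS, nA, nS)); T /= T.sum(axis=2, keepdims=True)  # T[s, a, s'] = P(s'|s, a)
Z = rng.random((nS, nA, nO)); Z /= Z.sum(axis=2, keepdims=True)  # Z[s', a, o] = P(o|s', a)

def belief_update(b, a, o):
    """One step of the Bayesian filter in Eq. (1)."""
    predicted = T[:, a, :].T @ b           # sum_s T(s, a, s') b(s), for each s'
    unnormalized = Z[:, a, o] * predicted  # weight by the observation likelihood
    return unnormalized / unnormalized.sum()  # eta: normalize to a distribution

b = np.full(nS, 1.0 / nS)      # uniform initial belief b_0
b = belief_update(b, a=1, o=2)
```

The update is linear in the belief followed by an element-wise product and normalization, which is exactly the structure that the network modules later exploit.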
3 Overview

We want to learn a policy that enables an agent to act effectively in a diverse set of partially observable stochastic environments. Consider, for example, the robot navigation domain in Fig. 1. The environments may correspond to different buildings. The robot agent does not observe its own location directly, but estimates it based on noisy readings from a laser range finder. It has access to building maps, but does not have models of its own dynamics and sensors. While the buildings may differ significantly in their layouts, the underlying reasoning required for effective navigation is similar in all buildings. After training the robot in a few buildings, we want to place the robot in a new building and have it navigate effectively to a specified goal.

Formally, the agent learns a policy for a parameterized set of tasks in partially observable stochastic environments: W_Θ = {W(θ) | θ ∈ Θ}, where Θ is the set of all parameter values. The parameter value θ captures a wide variety of task characteristics that vary within the set, including environments, goals, and agents. In our robot navigation example, θ encodes a map of the environment, a goal, and a belief over the robot's initial state. We assume that all tasks in W_Θ share the same state space, action space, and observation space. The agent does not have prior models of its own dynamics, sensors, or task objectives. After training on tasks for some subset of values in Θ, the agent learns a policy that solves W(θ) for any given θ ∈ Θ.

A key issue is a general representation of a policy for W_Θ, without knowing the specifics of W_Θ or its parameterization. We introduce the QMDP-net, a recurrent policy network. A QMDP-net represents a policy by connecting a parameterized POMDP model with an approximate POMDP algorithm and embedding both in a single, differentiable neural network. Embedding the model allows the policy to generalize over W_Θ effectively.
Embedding the algorithm allows us to train the entire network end-to-end and learn a model that compensates for the limitations of the approximate algorithm.

Let M(θ) = (S̄, Ā, Ō, f_T(·|θ), f_Z(·|θ), f_R(·|θ)) be the embedded POMDP model, where S̄, Ā, and Ō are the shared state, action, and observation spaces designed manually for all tasks in W_Θ, and f_T(·|·), f_Z(·|·), f_R(·|·) are the state-transition, observation, and reward functions to be learned from data.

Fig. 2: QMDP-net architecture. (a) A policy maps a history of actions and observations to a new action. (b) A QMDP-net is an RNN that imposes structure priors for sequential decision making under partial observability. It embeds a Bayesian filter and the QMDP algorithm in the network. The hidden state of the RNN encodes the belief for POMDP planning. (c) A QMDP-net unfolded in time.

It may appear that a perfect answer to our learning problem would have f_T(·|θ), f_Z(·|θ), and f_R(·|θ) represent the "true" underlying models of dynamics, observation, and reward for the task W(θ). This is true only if the embedded POMDP algorithm is exact, but not true in general. The agent may learn an alternative model to mitigate an approximate algorithm's limitations and obtain an overall better policy. In this sense, while the QMDP-net embeds a POMDP model in the network architecture, it aims to learn a good policy rather than a "correct" model.

A QMDP-net consists of two modules (Fig. 2). One encodes a Bayesian filter, which performs state estimation by integrating the past history of agent actions and observations into a belief. The other encodes QMDP, a simple but fast approximate POMDP planner [18].
QMDP chooses the agent's actions by solving the corresponding fully observable Markov decision process (MDP) and performing one-step look-ahead search on the MDP values weighted by the belief.

We evaluate the proposed network architecture in an imitation learning setting. We train on a set of expert trajectories with randomly chosen task parameter values in Θ and test with new parameter values. An expert trajectory consists of a sequence of demonstrated actions and observations (a_1, o_1, a_2, o_2, ...) for some θ ∈ Θ. The agent does not access the ground-truth states or beliefs along the trajectory during training. We define the loss as the cross entropy between predicted and demonstrated action sequences and use RMSProp [35] for training. See Appendix C.7 for details. Our implementation in TensorFlow [1] is available online at http://github.com/AdaCompNUS/qmdp-net.

4 QMDP-Net

We assume that all tasks in a parameterized set W_Θ share the same underlying state space S, action space A, and observation space O. We want to learn a QMDP-net policy for W_Θ, conditioned on the parameters θ ∈ Θ. A QMDP-net is a recurrent policy network. The inputs to a QMDP-net are the action a_t ∈ A and the observation o_t ∈ O at time step t, as well as the task parameter θ ∈ Θ. The output is the action a_{t+1} for time step t+1.

A QMDP-net encodes a parameterized POMDP model M(θ) = (S̄, Ā, Ō, T = f_T(·|θ), Z = f_Z(·|θ), R = f_R(·|θ)) and the QMDP algorithm, which selects actions by solving the model approximately. We choose S̄, Ā, and Ō of M(θ) manually, based on prior knowledge of W_Θ, specifically, prior knowledge of S, A, and O. In general, S̄ ≠ S, Ā ≠ A, and Ō ≠ O. The model states, actions, and observations may be abstractions of their real-world counterparts in the task. In our robot navigation example (Fig. 1), while the robot moves in a continuous space, we choose S̄ to be a grid of finite size.
We can do the same for Ā and Ō, in order to reduce representational and computational complexity. The transition function T, observation function Z, and reward function R of M(θ) are conditioned on θ and are learned from data through end-to-end training. In this work, we assume that T is the same for all tasks in W_Θ to simplify the network architecture; in other words, T does not depend on θ.

End-to-end training is feasible because a QMDP-net encodes both a model and the associated algorithm in a single, fully differentiable neural network. The main idea for embedding the algorithm in a neural network is to represent linear operations, such as matrix multiplication and summation, by convolutional layers, and to represent maximum operations by max-pooling layers. Below we provide some details on the QMDP-net's architecture, which consists of two modules, a filter and a planner.

Fig. 3: A QMDP-net consists of two modules. (a) The Bayesian filter module incorporates the current action a_t and observation o_t into the belief. (b) The QMDP planner module selects the action according to the current belief b_t.

Filter module. The filter module (Fig. 3a) implements a Bayesian filter. It maps a belief, action, and observation to a next belief, b_{t+1} = f(b_t | a_t, o_t). The belief is updated in two steps. The first accounts for actions, the second for observations:

    b'_t(s) = Σ_{s'∈S̄} T(s', a_t, s) b_t(s'),    (3)
    b_{t+1}(s) = η Z(s, o_t) b'_t(s),    (4)

where o_t ∈ O is the observation received after taking action a_t ∈ A and η is a normalization factor.

We implement the Bayesian filter by transforming Eq. (3) and Eq. (4) into layers of a neural network. For ease of discussion, consider our N × N grid navigation task (Fig. 1a-c). The agent does not know its own state and only observes neighboring cells.
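As a concrete illustration of the two-step update of Eqs. (3) and (4) on such a grid, the prediction can be written as a local 3×3 stencil applied to the belief image, and the correction as an element-wise product with an observation-likelihood map. The stencil and likelihood values below are made up for the sketch:

```python
import numpy as np

N = 6
b = np.zeros((N, N)); b[2, 1] = b[3, 1] = 0.5   # uncertain belief over two cells

# Hypothetical 3x3 transition stencil for one action:
# move one cell "right" with prob. 0.8, stay in place with prob. 0.2.
kernel = np.zeros((3, 3)); kernel[1, 0] = 0.8; kernel[1, 1] = 0.2

# Eq. (3), prediction: b_pred[i, j] = sum_{u,v} kernel[u, v] * b[i+u-1, j+v-1]
pad = np.pad(b, 1)
b_pred = np.zeros_like(b)
for u in range(3):
    for v in range(3):
        b_pred += kernel[u, v] * pad[u:u + N, v:v + N]

# Eq. (4), correction: multiply by P(o_t | s) and renormalize (the eta factor).
Z_map = np.full((N, N), 0.1); Z_map[:, 2] = 0.9  # hypothetical likelihood of o_t
b_new = Z_map * b_pred
b_new /= b_new.sum()
```

Both steps are linear or element-wise, which is why they can be expressed as convolutional and multiplication layers in the network.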
It has access to the task parameter θ that encodes the obstacles, the goal, and a belief over initial states. Given the task, we choose M(θ) to have an N × N state space. The belief, b_t(s), is now an N × N tensor. Eq. (3) is implemented as a convolutional layer with |Ā| convolutional filters. We denote this convolutional layer by f_T. The kernel weights of f_T encode the transition function T in M(θ). The output of the convolutional layer, b'_t(s, a), is an N × N × |Ā| tensor; it encodes the updated belief after taking each of the actions a ∈ Ā.

We need to select the belief corresponding to the last action taken by the agent, a_t. We could directly index b'_t(s, a) by a_t if Ā = A. In general Ā ≠ A, so we cannot use simple indexing. Instead, we use "soft indexing". First we encode actions in A into actions in Ā through a learned function f_A, which maps a_t to an indexing vector w^a_t, a distribution over actions in Ā. We then weight b'_t(s, a) by w^a_t along the action dimension:

    b'_t(s) = Σ_{a∈Ā} b'_t(s, a) w^a_t(a).    (5)

Eq. (4) incorporates observations through an observation model Z(s, o). Here Z(s, o) is an N × N × |Ō| tensor that represents the probability of receiving observation o ∈ Ō in state s ∈ S̄. In our grid navigation task, observations depend on the obstacle locations. We condition Z on the task parameter, Z(s, o) = f_Z(s, o | θ) for θ ∈ Θ. The function f_Z is a neural network mapping from θ to Z(s, o); in this paper f_Z is a CNN.

Z(s, o) encodes observation probabilities for each of the observations o ∈ Ō. We need the observation probabilities for the last observation o_t. In general Ō ≠ O and we cannot index Z(s, o) directly. Instead, we use soft indexing again. We encode observations in O into observations in Ō through f_O, a function mapping from o_t to an indexing vector w^o_t, a distribution over Ō.
We then weight Z(s, o) by w^o_t:

    Z(s) = Σ_{o∈Ō} Z(s, o) w^o_t(o).    (6)

Finally, we obtain the updated belief, b_{t+1}(s), by multiplying b'_t(s) and Z(s) element-wise and normalizing over states. In our setting the initial belief for the task W(θ) is encoded in θ. We initialize the belief in the QMDP-net through an additional encoding function, b_0 = f_B(θ).

Planner module. The QMDP planner (Fig. 3b) performs value iteration at its core. Q values are computed by iteratively applying Bellman updates:

    Q_{k+1}(s, a) = R(s, a) + γ Σ_{s'∈S̄} T(s, a, s') V_k(s'),    (7)
    V_k(s) = max_a Q_k(s, a).    (8)

Actions are then selected by weighting the Q values with the belief. We can implement value iteration using convolutional and max-pooling layers [28, 34]. In our grid navigation task, Q(s, a) is an N × N × |Ā| tensor. Eq. (8) is expressed by a max-pooling layer, where Q_k(s, a) is the input and V_k(s) is the output. Eq. (7) is an N × N convolution with |Ā| convolutional filters, followed by an addition with R(s, a), the reward tensor. We denote this convolutional layer by f'_T. The kernel weights of f'_T encode the transition function T, similarly to f_T in the filter. Rewards for a navigation task depend on the goal and obstacles. We condition rewards on the task parameter, R(s, a) = f_R(s, a | θ); f_R maps from θ to R(s, a), and in this paper f_R is a CNN.

We implement K iterations of Bellman updates by stacking the layers representing Eq. (7) and Eq. (8) K times with tied weights. After K iterations we get Q_K(s, a), the approximate Q values for each state-action pair. We weight the Q values by the belief to obtain action values:

    q(a) = Σ_{s∈S̄} Q_K(s, a) b_t(s).    (9)

Finally, we choose the output action through a low-level policy function, f_π, mapping from q(a) to the action output, a_{t+1}.
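In tabular form (ignoring the convolutional structure), the planner's computation in Eqs. (7)-(9) amounts to K Bellman backups with tied weights followed by a belief-weighted action choice. The sketch below uses hypothetical random T and R tables and a greedy stand-in for the low-level policy f_π:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, K, gamma = 5, 3, 30, 0.95
T = rng.random((nS, nA, nS)); T /= T.sum(axis=2, keepdims=True)  # T[s, a, s']
R = rng.random((nS, nA))                                         # R[s, a]

# Eqs. (7)-(8): K value-iteration updates with tied weights
# (the same T and R are reused at every iteration).
Q = np.zeros((nS, nA))
for _ in range(K):
    V = Q.max(axis=1)                              # Eq. (8): V_k(s) = max_a Q_k(s, a)
    Q = R + gamma * np.einsum("sap,p->sa", T, V)   # Eq. (7)

# Eq. (9): weight the Q values by the current belief to get action values.
b = np.full(nS, 1.0 / nS)
q = b @ Q                                          # q(a) = sum_s Q_K(s, a) b(s)
action = int(np.argmax(q))                         # greedy stand-in for f_pi
```

In the network, the same computation is carried out on N × N value images with f'_T as the convolution and max-pooling over the action channel.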
The QMDP-net naturally extends to higher-dimensional discrete state spaces (e.g., our maze navigation task), where n-dimensional convolutions can be used [14]. While M(θ) is restricted to a discrete space, we can handle continuous tasks W_Θ by simultaneously learning a discrete M(θ) for planning and the functions f_A, f_O, f_B, f_π that map between the states, actions, and observations of W_Θ and M(θ).

5 Experiments

The main objective of the experiments is to understand the benefits of structure priors on learning neural-network policies. We create several alternative network architectures by gradually relaxing the structure priors and evaluate the architectures on simulated robot navigation and manipulation tasks. While these tasks are simpler than, for example, Atari games in terms of visual perception, they are in fact very challenging because of the sophisticated long-term reasoning required to handle partial observability and distant future rewards. Since the exact state of the robot is unknown, a successful policy must reason over many steps to gather information and improve state estimation through partial and noisy observations. It must also reason about the trade-off between the cost of information gathering and the reward in the distant future.

5.1 Experimental Setup

We compare the QMDP-net with a number of related alternative architectures. Two are QMDP-net variants. Untied QMDP-net relaxes the constraints on the planning module by untying the weights representing the state-transition function over the different CNN layers. LSTM QMDP-net replaces the filter module with a generic LSTM module. The other two architectures do not embed POMDP structure priors at all. CNN+LSTM is a state-of-the-art deep CNN connected to an LSTM. It is similar to the DRQN architecture proposed for reinforcement learning under partial observability [10]. RNN is a basic recurrent neural network with a single fully connected hidden layer.
RNN contains no structure specific to planning under partial observability.

Each experimental domain contains a parameterized set of tasks W_Θ. The parameters θ encode an environment, a goal, and a belief over the robot's initial state. To train a policy for W_Θ, we generate random environments, goals, and initial beliefs. We construct ground-truth POMDP models for the generated data and apply the QMDP algorithm. If the QMDP algorithm successfully reaches the goal, we retain the resulting sequence of actions and observations (a_1, o_1, a_2, o_2, ...) as an expert trajectory, together with the corresponding environment, goal, and initial belief. It is important to note that the ground-truth POMDPs are used only for generating expert trajectories and not for learning the QMDP-net.

For a fair comparison, we train all networks using the same set of expert trajectories in each domain. We perform a basic search over training parameters, the number of layers, and the number of hidden units for each network architecture. Below we briefly describe the experimental domains. See Appendix C for implementation details.

Grid-world navigation. A robot navigates in an unknown building given a floor map and a goal. The robot is uncertain of its own location. It is equipped with a LIDAR that detects obstacles in its direct neighborhood. The world is uncertain: the robot may fail to execute desired actions, possibly because of wheel slippage, and the LIDAR may produce false readings. We implemented a simplified version of this task in a discrete n × n grid world (Fig. 1c). The task parameter θ is represented as an n × n image with three channels. The first channel encodes the obstacles in the environment, the second channel encodes the goal, and the last channel encodes the belief over the robot's initial state. The robot's state represents its position in the grid.
It has five actions: moving in each of the four canonical directions or staying put. The LIDAR observations are compressed into four binary values corresponding to obstacles in the four neighboring cells. We consider both a deterministic and a stochastic variant of the domain. The stochastic variant adds action and observation uncertainties: the robot fails to execute the specified move action and stays in place with probability 0.2, and the observations are faulty with probability 0.1, independently in each direction. We trained a policy using expert trajectories from 10,000 random environments, 5 trajectories from each environment. We then tested on a separate set of 500 random environments.

Fig. 4: Highly ambiguous observations in a maze. The four observations (in red) are the same, even though the robot states are all different.

Maze navigation. A differential-drive robot navigates in a maze with the help of a map, but it does not know its pose (Fig. 1d). This domain is similar to the grid-world navigation, but it is significantly more challenging. The robot's state contains both its position and orientation. The robot cannot move freely because of kinematic constraints. It has four actions: move forward, turn left, turn right, and stay put. The observations are relative to the robot's current orientation, and the increased ambiguity makes it more difficult to localize the robot, especially when the initial state is highly uncertain. Finally, successful trajectories in mazes are typically much longer than those in randomly generated grid worlds. Again we trained on expert trajectories in 10,000 randomly generated mazes and tested in 500 new ones.

Fig. 5: Object grasping using touch sensing. (a) An example [3]. (b) Simplified 2-D object grasping. Objects from the training set (top) and the test set (bottom).

2-D object grasping. A robot gripper picks up novel objects from a table using a two-finger hand with noisy touch sensors at the fingertips. The gripper uses the fingers to perform compliant motions while maintaining contact with the object, or to grasp the object. It knows the shape of the object to be grasped, for example from an object database. However, it does not know its own pose relative to the object and relies on the touch sensors to localize itself. We implemented a simplified 2-D variant of this task, modeled as a POMDP [13]. The task parameter θ is an image with three channels encoding the object shape, the grasp point, and a belief over the gripper's initial pose. The gripper has four actions, each moving in a canonical direction unless it touches the object or the environment boundary. Each finger has 3 binary touch sensors at the tip, resulting in 64 distinct observations. We trained on expert demonstrations on 20 different objects, with 500 randomly sampled poses for each object. We then tested on 10 previously unseen objects in random poses.

5.2 Choosing QMDP-Net Components for a Task

Given a new task W_Θ, we need to choose an appropriate neural network representation for M(θ). More specifically, we need to choose S̄, Ā, and Ō, and a representation for the functions f_R, f_T, f'_T, f_Z, f_O, f_A, f_B, f_π. This provides an opportunity to incorporate domain knowledge in a principled way. For example, if W_Θ has a local and spatially invariant connectivity structure, we can choose convolutions with small kernels to represent f_T, f_R, and f_Z.

In our experiments we use S̄ = N × N for N × N grid navigation, and S̄ = N × N × 4 for N × N maze navigation, where the robot has 4 possible orientations. We use |Ā| = |A| and |Ō| = |O| for all tasks except object grasping, where |O| = 64 and |Ō| = 16. We represent f_T, f_R, and f_Z by CNN components with 3 × 3 and 5 × 5 kernels, depending on the task.
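For instance, a learned 3×3 kernel for f_T can be constrained to a valid transition distribution by a softmax over its entries, so that convolving a belief with it preserves total probability mass. The logits below are random placeholders, not learned weights:

```python
import numpy as np

rng = np.random.default_rng(2)
logits = rng.standard_normal((3, 3))   # hypothetical learnable kernel weights

# Softmax over the kernel entries -> non-negative weights summing to 1.
w = np.exp(logits - logits.max())
w /= w.sum()

# Convolving a belief with a normalized kernel preserves probability mass
# (for mass away from the grid border).
N = 8
b = np.zeros((N, N)); b[4, 4] = 1.0    # point belief in the interior
pad = np.pad(b, 1)
out = np.zeros_like(b)
for u in range(3):
    for v in range(3):
        out += w[u, v] * pad[u:u + N, v:v + N]
```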
We enforce that f_T and f_Z are proper probability distributions by using softmax and sigmoid activations on the convolutional kernels, respectively. Finally, f_O is a small fully connected component, f_A is a one-hot encoding function, f_π is a single softmax layer, and f_B is the identity function.

We can adjust the amount of planning in a QMDP-net by setting K. A large K allows information to propagate to more distant states without affecting the number of parameters to learn. However, it results in deeper networks that are computationally expensive to evaluate and more difficult to train. We used K = 20 ... 116, depending on the problem size. We were able to transfer policies to larger environments by increasing K up to 450 when executing the policy.

In our experiments the representation of the task parameter θ is isomorphic to the chosen state space S̄. While the architecture is not restricted to this setting, we rely on it to represent f_T, f_Z, and f_R by convolutions with small kernels. Experiments with a more general class of problems are an interesting direction for future work.

5.3 Results and Discussion

The main results are reported in Table 1. Some additional results are reported in Appendix A. For each domain, we report the task success rate and the average number of time steps for task completion. Comparing completion times is meaningful only when the success rates are similar.

QMDP-net successfully learns policies that generalize to new environments. When evaluated on new environments, the QMDP-net has a higher success rate and faster completion time than the alternatives in nearly all domains. To better understand the performance difference, we also compared the architectures in a fixed environment for navigation. Here only the initial state and the goal vary across the task instances, while the environment remains the same. See the results in the last row of Table 1.
The QMDP-net and the alternatives have comparable performance. Even RNN performs very well. Why? In a fixed environment, a network may learn the features of an optimal policy directly, e.g., going straight towards the goal. In contrast, the QMDP-net learns a model for planning, i.e., for generating a near-optimal policy in a given arbitrary environment.

POMDP structure priors improve the performance of learning complex policies. Moving across Table 1 from left to right, we gradually relax the POMDP structure priors on the network architecture. As the structure priors weaken, so does the overall performance. However, strong priors sometimes over-constrain the network and result in degraded performance. For example, we found that tying the weights of f_T in the filter and f'_T in the planner may lead to worse policies. While both f_T and f'_T represent the same underlying transition dynamics, using different weights allows each to choose its own approximation and thus gives greater flexibility. We shed some light on this issue and visualize the learned POMDP model in Appendix B.

QMDP-net learns "incorrect", but useful models. Planning under partial observability is intractable in general, and we must rely on approximation algorithms. A QMDP-net encodes both a POMDP model and QMDP, an approximate POMDP algorithm that solves the model. We then train the network end-to-end. This provides the opportunity to learn an "incorrect", but useful model that compensates for the limitations of the approximation algorithm, in a way similar to reward shaping in reinforcement learning [22]. Indeed, our results show that the QMDP-net achieves a higher success rate than QMDP in nearly all tasks. In particular, QMDP-net performs well on the well-known Hallway2 domain, which is designed to expose the weakness of QMDP resulting from its myopic planning horizon.
The planning algorithm is the same for both the QMDP-net and QMDP, but the QMDP-net learns a more effective model from expert demonstrations. This is true even though QMDP generates the expert data for training. We note that the expert data contain only successful QMDP demonstrations. When both successful and unsuccessful QMDP demonstrations were used for training, the QMDP-net did not perform better than QMDP, as one would expect.

QMDP-net policies learned in small environments transfer directly to larger environments. Learning a policy for large environments from scratch is often difficult. A more scalable approach would be to learn a policy in small environments and transfer it to large environments by repeating the reasoning process. To transfer a learned QMDP-net policy, we simply expand its planning module by adding more recurrent layers. Specifically, we trained a policy in randomly generated 30×30 grid worlds with K = 90. We then set K = 450 and applied the learned policy to several real-life environments, including Intel Lab (100×101) and Freiburg (139×57), using their LIDAR maps (Fig. 1c) from the Robotics Data Set Repository [12]. See the results for these two environments in Table 1. Additional results with different K settings and other buildings are available in Appendix A.

Table 1: Performance comparison of QMDP-net and alternative architectures for recurrent policy networks. SR is the success rate in percentage. Time is the average number of time steps for task completion. D-n and S-n denote deterministic and stochastic variants of a domain with environment size n×n.

             QMDP        QMDP-net    Untied      LSTM        CNN+LSTM    RNN
                                     QMDP-net    QMDP-net
Domain       SR    Time  SR    Time  SR    Time  SR    Time  SR    Time  SR    Time
Grid D-10    99.8   8.8  99.6   8.2  98.6   8.3  84.4  12.8  90.0  13.4  87.8  13.4
Grid D-18    99.0  15.5  99.0  14.6  98.8  14.8  43.8  27.9  57.8  33.7  35.8  24.5
Grid D-30    97.6  24.6  98.6  25.0  98.8  23.9  22.2  51.1  19.4  45.2  16.4  39.3
Grid S-18    98.1  23.9  98.8  23.9  95.9  24.0  23.8  55.6  41.4  65.9  34.0  64.1
Maze D-29    63.2  54.1  98.0  56.5  95.4  62.5   9.8  57.2   9.2  41.4   9.8  47.0
Maze S-19    63.1  50.5  93.9  60.4  98.7  57.1  18.9  79.0  19.2  80.8  19.6  82.1
Hallway2     37.3  28.2  82.9  64.4  69.6 104.4  82.8  89.7  77.8  99.5  68.0 108.8
Grasp        98.3  14.6  99.6  18.2  98.9  20.4  91.4  26.4  92.8  22.1  94.1  25.7
Intel Lab    90.2  85.4  94.4 107.7  20.0  55.3     -     -     -     -     -     -
Freiburg     88.4  66.9  93.2  81.1  37.4  51.7     -     -     -     -     -     -
Fixed grid   98.8  17.4  98.6  17.6  99.8  17.0  97.0  19.7  98.4  19.9  98.0  19.8

6 Conclusion

A QMDP-net is a deep recurrent policy network that embeds POMDP structure priors for planning under partial observability. While generic neural networks learn a direct mapping from inputs to outputs, a QMDP-net learns how to model and solve a planning task. The network is fully differentiable and allows for end-to-end training. Experiments on several simulated robotic tasks show that learned QMDP-net policies successfully generalize to new environments and transfer to larger environments as well. The POMDP structure priors and end-to-end training substantially improve the performance of learned policies. Interestingly, while a QMDP-net encodes the QMDP algorithm for planning, learned QMDP-net policies sometimes outperform QMDP.

There are many exciting directions for future exploration. First, a major limitation of our current approach is the state space representation. The value iteration algorithm used in QMDP iterates through the entire state space and is well known to suffer from the "curse of dimensionality". To alleviate this difficulty, the QMDP-net, through end-to-end training, may learn a much smaller abstract state space representation for planning. One may also incorporate hierarchical planning [8].
Second, QMDP makes strong approximations in order to reduce computational complexity. We want to explore the possibility of embedding more sophisticated POMDP algorithms in the network architecture. While these algorithms provide stronger planning performance, their algorithmic sophistication increases the difficulty of learning. Finally, we have so far restricted the work to imitation learning. It would be exciting to extend it to reinforcement learning. Based on earlier work [28, 34], this is indeed promising.

Acknowledgments

We thank Leslie Kaelbling and Tomás Lozano-Pérez for insightful discussions that helped to improve our understanding of the problem. The work is supported in part by Singapore Ministry of Education AcRF grant MOE2016-T2-2-068 and National University of Singapore AcRF grant R-252-000-587-112.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/.
[2] J. A. Bagnell, S. Kakade, A. Y. Ng, and J. G. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pages 831–838, 2003.
[3] H. Bai, D. Hsu, W. S. Lee, and V. A. Ngo. Monte Carlo value iteration for continuous-state POMDPs. In Algorithmic Foundations of Robotics IX, pages 175–191, 2010.
[4] B. Bakker, V. Zhumatiy, G. Gruener, and J. Schmidhuber. A robot that reinforcement-learns to identify and memorize important previous observations. In International Conference on Intelligent Robots and Systems, pages 430–435, 2003.
[5] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[6] B. Boots, S. M. Siddiqi, and G. J. Gordon. Closing the learning-planning loop with predictive state representations.
The International Journal of Robotics Research, 30(7):954–966, 2011.
[7] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[8] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. arXiv preprint, 2017.
[9] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel. Backprop KF: Learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems, pages 4376–4384, 2016.
[10] M. J. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint, 2015.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[12] A. Howard and N. Roy. The robotics data set repository (Radish), 2003. URL http://radish.sourceforge.net/.
[13] K. Hsiao, L. P. Kaelbling, and T. Lozano-Pérez. Grasping POMDPs. In International Conference on Robotics and Automation, pages 4685–4692, 2007.
[14] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
[15] R. Jonschkowski and O. Brock. End-to-end learnable histogram filters. In Workshop on Deep Learning for Action and Interaction at NIPS, 2016. URL http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-16-NIPS-WS.pdf.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces.
In Robotics: Science and Systems, 2008.
[18] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning, pages 362–370, 1995.
[19] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In Advances in Neural Information Processing Systems, pages 1555–1562, 2002.
[20] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint, 2016.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[22] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, pages 278–287, 1999.
[23] M. Okada, L. Rigazio, and T. Aoshima. Path integral networks: End-to-end differentiable optimal control. arXiv preprint arXiv:1706.09597, 2017.
[24] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[25] J. Pineau, G. J. Gordon, and S. Thrun. Applying metric-trees to belief-point POMDPs. In Advances in Neural Information Processing Systems, 2003.
[26] G. Shani, R. I. Brafman, and S. E. Shimony. Model-based online learning of POMDPs. In European Conference on Machine Learning, pages 353–364, 2005.
[27] G. Shani, J. Pineau, and R. Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-agent Systems, 27(1):1–51, 2013.
[28] T. Shankar, S. K. Dwivedy, and P. Guha. Reinforcement learning via recurrent convolutional neural networks.
In International Conference on Pattern Recognition, pages 2592–2597, 2016.
[29] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.
[30] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[31] D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, et al. The predictron: End-to-end learning and planning. arXiv preprint, 2016.
[32] M. T. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
[33] C. Stachniss. Robotics 2D-laser dataset. URL http://www.ipb.uni-bonn.de/datasets/.
[34] A. Tamar, S. Levine, P. Abbeel, Y. Wu, and G. Thomas. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2146–2154, 2016.
[35] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, pages 26–31, 2012.
[36] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[37] N. Ye, A. Somani, D. Hsu, and W. S. Lee. DESPOT: Online POMDP planning with regularization. Journal of Artificial Intelligence Research, 58:231–266, 2017.

A Supplementary Experiments

A.1 Navigation on Large LIDAR Maps

We provide results on additional environments for the LIDAR map navigation task. The LIDAR maps are obtained from [33]. See Section C.5 for details. Intel corresponds to the Intel Research Lab.
Freiburg corresponds to Freiburg, Building 079. Belgioioso corresponds to the Belgioioso Castle. MIT corresponds to the western wing of the MIT CSAIL building. We note the grid size N×M for each environment. A QMDP-net policy is trained on the 30×30-D grid navigation domain on randomly generated environments using K = 90. We then execute the learned QMDP-net policy with different K settings, i.e., we add convolutional layers to the planner that share the same kernel weights. We report the task success rate and the average number of time steps for task completion.

Table 2: Additional results for navigation on large LIDAR maps.

                    QMDP        QMDP-net    QMDP-net    QMDP-net    Untied
                                K=450       K=180       K=90        QMDP-net
Domain              SR    Time  SR    Time  SR    Time  SR    Time  SR    Time
Intel 100×101       90.2  85.4  94.4 108.0  83.4  89.6  40.8  78.6  20.0  55.3
Freiburg 139×57     88.4  66.9  92.0  91.4  93.2  81.1  55.8  68.0  37.4  51.7
Belgioioso 151×35   95.8  63.9  95.4  71.8  90.6  62.0  60.0  54.3  41.0  47.7
MIT 41×83           94.4  42.6  91.4  53.8  96.0  48.5  86.2  45.4  66.6  41.4

In the conventional setting, when value iteration is executed on a fully known MDP, increasing K improves the value function approximation and thus improves the policy, in return for increased computation. In a QMDP-net, increasing K has two effects on the overall planning quality. On the one hand, the estimation accuracy of the latent values increases, and reward information can propagate to more distant states. On the other hand, the learned latent model does not necessarily fit the true underlying model, and it can be overfitted to the K setting used during training. Therefore a too high K can degrade the overall performance. We found that K_test = 2 K_train significantly improved success rates in all our test cases.
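The mechanism behind changing K at execution time can be sketched in a few lines of NumPy. This is a simplified stand-in for the learned planner, not the paper's TensorFlow implementation; the grid size, rewards and transition kernels below are invented for illustration. The point is that value iteration here is K applications of the same convolution, so K can be raised at test time without adding parameters:

```python
import numpy as np

def conv2(x, kernel):
    """2-D correlation of x with a 3x3 kernel, zero padding at the borders."""
    n, m = x.shape
    p = np.pad(x, 1)
    out = np.empty_like(x)
    for i in range(n):
        for j in range(m):
            out[i, j] = np.sum(p[i:i+3, j:j+3] * kernel)
    return out

def plan(R, kernels, K, gamma=0.9):
    """K iterations of value iteration as convolutions.
    R: (A, N, N) per-action reward images; kernels: (A, 3, 3) per-action
    transition kernels, shared across all iterations."""
    V = np.zeros(R.shape[1:])
    for _ in range(K):
        Q = np.stack([R[a] + gamma * conv2(V, kernels[a])
                      for a in range(len(kernels))])
        V = Q.max(axis=0)
    return V

# Toy problem: 10x10 grid, reward only at the goal cell (0, 0),
# two actions with uniform local transition kernels.
R = np.zeros((2, 10, 10))
R[:, 0, 0] = 1.0
kernels = np.full((2, 3, 3), 1.0 / 9.0)

V_short = plan(R, kernels, K=5)    # value has not reached the far corner yet
V_long = plan(R, kernels, K=12)    # same weights, deeper planning
```

With K = 5 the value at the far corner of the 10×10 grid is still zero, since each iteration propagates value by at most one cell; rerunning the same kernels with K = 12 carries value all the way across.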
Further increasing to K_test = 5 K_train was beneficial in the Intel and Belgioioso environments, but it slightly decreased success rates for the Freiburg and MIT environments.

We compare QMDP-net to its untied variant, Untied QMDP-net. We cannot expand the layers of Untied QMDP-net during execution; in consequence, its performance is poor. Note that the other alternative architectures we considered are specific to the input size and thus are not applicable.

A.2 Learning "Incorrect" but Useful Models

We demonstrate that an "incorrect" model can result in better policies when solved by the approximate QMDP algorithm. We compute QMDP policies on a POMDP with modified reward values, then evaluate the policies using the original rewards. We use the deterministic 29×29 maze navigation task, where QMDP did poorly. We attempt to shape rewards manually. Our motivation is to break symmetry in the model, and to implicitly encourage information gathering and compensate for the one-step look-ahead approximation in QMDP.

Modified 1. We increase the cost of the stay action to 20 times its original value.

Modified 2. We increase the cost of the stay action to 50 times its original value, and the cost of the turn-right action to 10 times its original value.

Table 3: QMDP policies computed on an "incorrect" model and evaluated on the "correct" model.

Variant      SR     Time   Original reward
Original     63.2   54.1   1.09
Modified 1   65.0   58.1   1.71
Modified 2   93.0   71.4   4.96

Why does the "correct" model result in poor policies when solved by QMDP? At a given point, the Q value for one set of possible states may be high for the turn-left action and low for the turn-right action, while for another set of states it may be the opposite way around. In expectation, both next states have lower value than the current one, so the policy chooses the stay action; the robot does not gather information and is stuck in one place.
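This failure mode can be reproduced with a toy calculation. The Q values and belief below are invented for illustration; the sketch only shows how belief-averaging per-state Q values makes the stay action the QMDP choice, and how penalizing stay, as the modified models do, forces an information-gathering turn instead:

```python
import numpy as np

actions = ["turn_left", "turn_right", "stay"]
# Rows: two candidate states consistent with the belief; columns: actions.
# Hypothetical values: each state strongly prefers a different turn.
Q = np.array([
    [ 1.0, -1.0, 0.5],   # in state s1, turning left is best
    [-1.0,  1.0, 0.5],   # in state s2, turning right is best
])
belief = np.array([0.5, 0.5])

# QMDP action values: expectation of Q(s, a) under the belief.
q_mdp = belief @ Q
assert actions[int(np.argmax(q_mdp))] == "stay"   # the turns cancel out

# Raising the cost of "stay" (as the modified models above do) breaks
# the symmetry in favor of a turn.
Q_shaped = Q.copy()
Q_shaped[:, 2] -= 1.0
assert actions[int(np.argmax(belief @ Q_shaped))] != "stay"
```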
These results demonstrate that planning on an "incorrect" model may improve performance on the "correct" model.

B Visualizing the Learned Model

B.1 Value Function

We plot the value function predicted by a QMDP-net for the 18×18 stochastic grid navigation task. We used K = 54 iterations in the QMDP-net. As one would expect, states close to the goal have high values.

Fig. 6: Map of a test environment and the corresponding learned value function V_K.

B.2 Belief Propagation

We plot the execution of a learned QMDP-net policy and the internal belief propagation on the 18×18 stochastic grid navigation task. The first row in Fig. 7 shows the environment, including the goal (red) and the unobserved pose of the robot (blue). The second row shows ground-truth beliefs for reference. We do not access ground-truth beliefs during training, except for the initial belief. The third row shows beliefs predicted by a QMDP-net. The last row shows the difference between the ground-truth and predicted beliefs.

Fig. 7: Policy execution and belief propagation in the 18×18 stochastic grid navigation task.

The figure demonstrates that the QMDP-net was able to learn a reasonable filter for state estimation in a noisy environment. In the depicted example, the initial belief is uniform over approximately half of the state space (Step 0). Due to the highly uncertain initial belief and the observation noise, the robot stays in place for two steps (Steps 1 and 2). After two steps the state estimate is still highly uncertain, but it is mostly spread out to the right of the goal. Therefore, moving left is a reasonable choice (Step 3). After an additional stay action (Step 4), the belief distribution is concentrated enough and the robot starts moving towards the goal (not shown).

B.3 State-Transition Function

We plot the learned and ground-truth state-transition functions. Columns of the table correspond to actions. The first row shows the ground-truth transition function.
The second row shows f_T, the learned state-transition function in the filter. The third row shows f'_T, the learned state-transition function in the planner.

Fig. 8: Learned transition function T in the 18×18 stochastic grid navigation task.

While both f_T and f'_T represent the same underlying transition dynamics, the learned transition probabilities differ between the filter and the planner. Using different weights allows each module to choose its own approximation and thus provides greater flexibility. The actions in the learned model are abstractions of the agent's actions a ∈ A. Indeed, in the planner the learned transition probabilities for the abstract counterpart of an action a_i ∈ A do not match the transition probabilities of a_i ∈ A.

B.4 Reward Function

Next, we plot the learned reward function R for each action a ∈ A.

Fig. 9: Learned reward function R in the 18×18 stochastic grid navigation domain.

While the learned rewards do not directly correspond to rewards in the underlying task, they are reasonable: obstacles are assigned negative rewards and the goal is assigned a positive reward. Note that the learned reward values correspond to the reward after taking an action; therefore they should be interpreted together with the corresponding transition probabilities (third row of Fig. 8).

C Implementation Details

C.1 Grid-World Navigation

We implement the grid navigation task in randomly generated discrete N×N grids where each cell has probability p = 0.25 of being an obstacle. The robot has 5 actions: move in the four canonical directions and stay put. Observations are four binary values corresponding to obstacles in the four neighboring cells. We consider a deterministic variant (denoted by -D) and a stochastic variant (denoted by -S). In the stochastic variant the robot fails to execute each action with probability P_t = 0.2, in which case it stays in place. The observations are faulty with probability P_o = 0.1 independently in each direction. Since we receive observations from 4 directions, the probability of receiving the correct observation vector is only 0.9^4 ≈ 0.656.

The task parameter, θ, is an N×N×3 image that encodes information about the environment. The first channel encodes obstacles: 1 for obstacles, 0 for free space. The second channel encodes the goal: 1 for the goal, 0 otherwise. The third channel encodes the initial belief over robot states; each pixel value corresponds to the probability of the robot being in the corresponding state.

We construct a ground-truth POMDP model to obtain expert trajectories for training. It is important to note that the learning agent has no access to the ground-truth POMDP models. In the ground-truth model the robot receives a reward of -0.1 for each step, +20 for reaching the goal, and -10 for bumping into an obstacle. We use QMDP to solve the POMDP model, and execute the QMDP policy to obtain expert trajectories.

We use 10,000 random grids for training. Initial and goal states are sampled uniformly from the free space. We exclude samples where there is no feasible path. The initial belief is uniform over a random fraction of the free space that includes the underlying initial state. More specifically, the number of non-zero values in the initial belief is sampled from {1, 2, ..., N_f/2, N_f}, where N_f is the number of free cells in the grid. For each grid we generate 5 expert trajectories with different initial states, initial beliefs and goals. Note that we do not access the true beliefs after the first step, nor the underlying states along the trajectory. We test on a set of 500 environments generated separately under equal conditions. We declare failure after 10N steps without reaching the goal. Note that the expert policy is sub-optimal and it may fail to reach the goal. We exclude these samples from the training set but include them in the test set.
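As a concrete illustration, the N×N×3 task parameter image described above can be assembled as follows. This is a hypothetical sketch (the helper name `make_theta`, the grid size and the random seeds are our own), not the code used to generate the training data:

```python
import numpy as np

def make_theta(obstacle_map, goal, initial_belief):
    """Assemble the N x N x 3 task parameter image described above.
    obstacle_map: (N, N) binary array, 1 = obstacle;
    goal: (row, col); initial_belief: (N, N) probability distribution."""
    n = obstacle_map.shape[0]
    theta = np.zeros((n, n, 3), dtype=np.float32)
    theta[:, :, 0] = obstacle_map            # channel 0: obstacles
    theta[goal[0], goal[1], 1] = 1.0         # channel 1: goal
    theta[:, :, 2] = initial_belief          # channel 2: initial belief
    return theta

n = 10
# Random grid: each cell is an obstacle with probability 0.25.
grid = (np.random.default_rng(1).random((n, n)) < 0.25).astype(np.float32)
free = np.argwhere(grid == 0)

# Uniform initial belief over a random subset of the free cells.
subset = free[np.random.default_rng(2).choice(len(free), size=len(free) // 2,
                                              replace=False)]
b0 = np.zeros((n, n), dtype=np.float32)
b0[subset[:, 0], subset[:, 1]] = 1.0 / len(subset)

theta = make_theta(grid, goal=tuple(free[0]), initial_belief=b0)
```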
We choose the structure of M(θ), the model in QMDP-net, to match the structure of the underlying task. The transition functions in the filter, f_T, and the planner, f'_T, are both 3×3 convolutions. While they both represent the same transition function, we do not tie their weights. We apply a softmax function to the kernel matrix so that its values sum to one. The reward function, f_R, is a CNN with two convolutional layers: the first has a 3×3 kernel, 150 filters and ReLU activation; the second has a 1×1 kernel, 5 filters and linear activation. The observation model, f_Z, is a similar two-layer CNN: the first convolution has a 3×3 kernel, 150 filters and linear activation; the second has a 1×1 kernel, 17 filters and linear activation. The action mapping, f_A, is a one-hot encoding function. The observation mapping, f_O, is a fully connected network with one hidden layer of 17 units and tanh activation; it has 17 output units with softmax activation. The low-level policy function, f_π, is a single softmax layer. The state space mapping function, f_B, is the identity function. Finally, we choose the number of iterations in the planner module, K = {30, 54, 90}, for grids of size N = {10, 18, 30}, respectively.

The 3×3 convolutions in f_T and f_Z imply that T and O are spatially invariant and local. In the underlying task the locality assumption holds, but spatial invariance does not: transitions depend on the arrangement of obstacles. Nevertheless, the additional flexibility in the model allows QMDP-net to learn high-quality policies, e.g., by shaping the rewards and the observation function.

C.2 Maze Navigation

In the maze navigation task a differential drive robot has to navigate to a given goal. We generate random mazes on N×N grids using Kruskal's algorithm. The state space has 3 dimensions, where the third dimension represents the 4 possible orientations of the robot. The goal configuration is invariant to the orientation.
The robot now has 4 actions: move forward, turn left, turn right and stay put. The initial belief is chosen in a similar manner to the grid navigation case, but in the 3-D space. The observations are identical to grid navigation, but they are relative to the robot's orientation, which significantly increases the difficulty of state estimation. The stochastic variant (denoted by -S) has motion and observation noise identical to the grid navigation case. Training and test data are prepared identically as well. We use K = {76, 116} for mazes of size N = {19, 29}, respectively.

We use a model in QMDP-net with a 3-dimensional state space of size N×N×4 and an action space with 4 actions. The components of the network are chosen identically to the previous case, except that all CNN components operate on 3-D tensors of size N×N×4. While it would be possible to use 3-D convolutions, we instead treat the third dimension as channels of a 2-D image and use conventional 2-D convolutions. If the output of the last convolutional layer is of size N×N×N_c for the grid navigation task, it is of size N×N×4N_c for the maze navigation task. When necessary, these tensors are transformed into a 4-dimensional form, N×N×4×N_c, and the max-pool or softmax activation is computed along the last dimension.

C.3 Object Grasping

We consider a 2-D implementation of the grasping task based on the POMDP model proposed by Hsiao et al. [13]. Hsiao et al. focused on the difficulty of planning under high uncertainty and solved manually designed POMDPs for single objects. We phrase the problem as a learning task where we have no access to a model and we do not know all objects in advance. In our setting the robot receives an image of the target object and a feasible grasp point, but it does not know its pose relative to the object. We aim to learn a policy on a set of objects that generalizes to similar but unseen objects.
The object and the gripper are represented in a discrete grid. The workspace is a 14×14 grid, and the gripper is a "U" shape in the grid. The gripper moves in the four canonical directions, unless it reaches the boundaries of the workspace or it is touching the object, in which case it stays in place. The gripper fails to move with probability 0.2. The gripper has two fingers with 3 touch sensors on each finger. The touch sensors indicate contact with the object or reaching the limits of the workspace. The sensors produce an incorrect reading with probability 0.1, independently for each sensor.

In each trial an object is placed at the bottom of the workspace at a random location. The initial gripper pose is unknown; the belief over possible states is uniform over a random fraction of the upper half of the workspace. The local observations, o_t, are readings from the touch sensors. The task parameter θ is an image with three channels. The first channel encodes the environment with an object; the second channel encodes the position of the target grasping point; the third channel encodes the initial belief over the gripper position.

We have 30 artificial objects of different sizes up to 6×6 grid cells. Each object has at least one cell on its top that the gripper can grasp. For training we use 20 of the objects. We generate 500 expert trajectories for each object in random configurations. We test the learned policies on 10 new objects in 20 random configurations each. The expert trajectories are obtained by solving a ground-truth POMDP model with the QMDP algorithm. In the ground-truth POMDP the robot receives a reward of 1 for reaching the grasp point and 0 for every other state. In QMDP-net we choose a model with S = 14×14, |A| = 4 and |O| = 16. Note that the underlying task has |O| = 64 possible observations.
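A minimal sketch of how an observation mapping like f_O can compress the 2^6 = 64 raw touch-sensor readings into a distribution over the 16 abstract observations. The weights here are random rather than trained, and the layer sizes are illustrative assumptions rather than the exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# A raw observation: 6 binary touch-sensor readings (2**6 = 64 possibilities).
raw_obs = rng.integers(0, 2, size=6).astype(float)

# f_O sketch: one hidden tanh layer, then a softmax over |O| = 16 abstract
# observations. Weights are random stand-ins for learned parameters.
W1, b1 = rng.normal(size=(16, 6)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
hidden = np.tanh(W1 @ raw_obs + b1)
abstract_obs = softmax(W2 @ hidden + b2)   # distribution over 16 outcomes
```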
The network components are chosen similarly to the grid navigation task, but the first convolution kernel in f_Z is increased to 5×5 to account for more distant observations. We set the number of iterations to K = 20.

C.4 Hallway2

The Hallway2 navigation problem was proposed by Littman et al. [18] and has been used as a benchmark for POMDP planning [27]. It was specifically designed to expose the weakness of the QMDP algorithm resulting from its myopic planning horizon. While QMDP-net embeds the QMDP algorithm, through end-to-end training it was able to learn a model that is significantly more effective given the QMDP algorithm. Hallway2 is a particular instance of the maze problem that involves more complex dynamics and high noise. For details we refer to the original problem definition [18].

We train a QMDP-net on random 8×8 grids generated similarly to the grid navigation case, but using transitions that match the Hallway2 POMDP model. We then execute the learned policy on a particularly difficult instance of this problem that embeds the Hallway2 layout in an 8×8 grid. The initial belief is uniform over the full state space. In each trial the robot starts from a random underlying state. A trial is deemed unsuccessful after 251 steps.

C.5 Navigation on a Large LIDAR Map

We obtain real-world building layouts using 2-D laser data from the Robotics Data Set Repository [12]. More specifically, we use SLAM maps preprocessed to gray-scale images, available online [33]. We downscale the raw images to N×M and classify each pixel as free or an obstacle by simple thresholding. The resulting maps are shown in Fig. 10. We execute policies in simulation where a grid is defined by the preprocessed map. The simulation employs the same dynamics as the grid navigation domain. The initial state and initial belief are chosen identically to the grid navigation case.

Fig. 10: Preprocessed N×M maps.
A, Intel Research Lab, 100×101. B, Freiburg, Building 079, 139×57. C, Belgioioso Castle, 151×35. D, western wing of the MIT CSAIL building, 41×83.

A QMDP-net policy is trained on the 30×30-D grid navigation task on randomly generated environments. For training we set K = 90 in the QMDP-net. We then execute the learned policy on the LIDAR maps. To account for the larger grid size, we increase the number of iterations to K = 450 when executing the policy.

C.6 Architectures for Comparison

We compare QMDP-net with two of its variants where we remove some of the POMDP priors embedded in the network (Untied QMDP-net, LSTM QMDP-net). We also compare with two generic network architectures that do not embed structural priors for decision making (CNN+LSTM, RNN). We also considered additional architectures for comparison, including networks with GRU [7] and ConvLSTM [36] cells. ConvLSTM is a variant of LSTM where the fully connected layers are replaced by convolutions. These architectures performed worse than CNN+LSTM for most of our tasks.

Untied QMDP-net. We obtain Untied QMDP-net by untying the kernel weights in the convolutional layers that implement value iteration in the planner module of QMDP-net. We also remove the softmax activation on the kernel weights. This is equivalent to allowing a different transition model at each iteration of value iteration, and allowing transition probabilities that do not sum to one. In principle, Untied QMDP-net can represent the same policy as QMDP-net, and it has some additional flexibility. However, Untied QMDP-net has more parameters to learn as K increases. The training difficulty increases with more parameters, especially on complex domains or when training with small amounts of data.

LSTM QMDP-net. In LSTM QMDP-net we replace the filter module of QMDP-net with a generic LSTM network but keep the value iteration implementation in the planner.
The output of the LSTM component is a belief estimate, which is input to the planner module of QMDP-net. We first process the task parameter input θ, an image encoding the environment and goal, by a CNN. We separately process the action a_t and observation o_t input vectors by a two-layer fully connected component. These processed inputs are concatenated into a single vector, which is the input to the LSTM layer. The size of the LSTM hidden state and output is chosen to match the number of states in the grid, e.g., N^2 for an N×N grid. We initialize the hidden state of the LSTM using the channel of the input θ that encodes the initial belief.

CNN+LSTM. CNN+LSTM is a state-of-the-art deep convolutional network with LSTM cells. It is similar in structure to DRQN [10], which was used for learning to play partially observable Atari games in a reinforcement learning setting. Note that we train the networks in an imitation learning setting using the same set of expert trajectories, and not with reinforcement learning, so the comparison with QMDP-net is fair. The CNN+LSTM network has more structure for encoding a decision-making policy than a vanilla RNN, and it is also more tailored to our input representation. We process the image input, θ, by a CNN component and the vector inputs, a_t and o_t, by a fully connected network component. The outputs of the CNN and the fully connected component are then combined into a single vector and fed to the LSTM layer.

RNN. The considered RNN architecture is a vanilla recurrent neural network with 512 hidden units and tanh activation. At each step the inputs are transformed into a single concatenated vector. The outputs are obtained by a fully connected layer with softmax activation.

We performed a hyperparameter search on the number of layers and hidden units, and adjusted the learning rate and batch size for all alternative networks. In particular, we ran trials for the deterministic grid navigation task.
For each architecture we chose the best parameterization found. We then used the same parameterization for all tasks.

C.7 Training Technique

We train all networks, QMDP-net and the alternatives, in an imitation learning setting. The loss is defined as the cross-entropy between predicted and demonstrated actions along the expert trajectories. We do not receive supervision on the underlying ground-truth POMDP models. We train the networks with backpropagation through time on mini-batches of 100. The networks are implemented in TensorFlow [1]. We use the RMSProp optimizer [35] with a 0.9 decay rate and 0 momentum. The learning rate was set to 1 × 10⁻³ for QMDP-net and 1 × 10⁻⁴ for the alternative networks. We limit the number of backpropagation steps to 4 for QMDP-net and its untied variant, and to 6 for the other alternatives, which gave slightly better results.

We used a combination of early stopping with patience and exponential learning rate decay of 0.9. In particular, we started to decrease the learning rate if the prediction error did not decrease for 30 consecutive epochs on a validation set, 10% of the training data. We performed 15 iterations of learning rate decay.

We perform multiple rounds of the training method described above. In our partially observable domains, predictions become increasingly difficult along a trajectory, as they require multiple steps of filtering, i.e. integrating information from a long sequence of observations. Therefore, for the first round of training we limit the number of steps along the expert trajectories, for both QMDP-net and the alternatives. After convergence we perform a second round of training on the full-length trajectories. Let L_r be the number of steps along the expert trajectories for training round r. We used two training rounds with L_1 = 4 and L_2 = 100 for training QMDP-net and its untied variant.
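The loss and optimizer update described above can be written out concretely. This is a minimal numpy sketch, not the paper's TensorFlow code: the cross-entropy is taken between predicted action distributions and one-hot expert actions, and the RMSProp rule uses a 0.9 decay rate with zero momentum, as stated.

```python
import numpy as np

def cross_entropy_loss(pred_probs, expert_actions):
    """Mean cross-entropy along one trajectory.
    pred_probs: (T, A) predicted action distributions; expert_actions: (T,) ints."""
    T = pred_probs.shape[0]
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(pred_probs[np.arange(T), expert_actions] + eps))

def rmsprop_update(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """RMSProp with zero momentum: cache is a running average of squared grads."""
    cache = decay * cache + (1.0 - decay) * grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Tiny usage example: made-up predictions for a 3-step trajectory, 4 actions.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.80, 0.05, 0.05]])
experts = np.array([0, 2, 1])
loss = cross_entropy_loss(probs, experts)

w = np.zeros(5)
cache = np.zeros(5)
w, cache = rmsprop_update(w, grad=np.ones(5), cache=cache)
print(round(loss, 3))   # -> 0.655
```

In the actual training, the gradient would come from backpropagation through time, truncated to 4 or 6 steps as described.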
For training the other alternative networks we used L_1 = 6 and L_2 = 100, which gave better results. We also trained policies for the variant of the grid navigation task in which the grid is fixed and only the initial state and goal vary. In this variant we found that a low L_r setting degrades the final performance of the alternative networks, so we used a single training round with L_1 = 100 for this task.
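The schedule described above, trajectory truncation per round plus patience-based learning rate decay, can be sketched as follows. The class and variable names here are hypothetical; this is an illustration of the stated hyperparameters (patience 30, decay 0.9, at most 15 decay iterations), not the authors' implementation.

```python
class LRScheduler:
    """Early stopping with patience plus exponential learning rate decay."""

    def __init__(self, lr, patience=30, decay=0.9, max_decays=15):
        self.lr = lr
        self.patience = patience      # epochs without improvement before decaying
        self.decay = decay            # multiplicative lr decay factor
        self.max_decays = max_decays  # stop after this many decay iterations
        self.best = float("inf")
        self.stall = 0
        self.decays = 0

    def step(self, val_error):
        """Report this epoch's validation error; returns False to stop training."""
        if val_error < self.best:
            self.best, self.stall = val_error, 0
            return True
        self.stall += 1
        if self.stall >= self.patience:
            if self.decays >= self.max_decays:
                return False          # early stop: decay budget exhausted
            self.lr *= self.decay
            self.decays += 1
            self.stall = 0
        return True

# Two-round curriculum with the truncation lengths used for QMDP-net.
rounds = [("round 1", 4), ("round 2", 100)]   # L_1 = 4, L_2 = 100
for name, max_steps in rounds:
    sched = LRScheduler(lr=1e-3)
    # ... the training loop would clip each expert trajectory to max_steps
    #     and call sched.step(val_error) once per epoch ...
    print(name, "truncation:", max_steps, "initial lr:", sched.lr)
```

Restarting the scheduler per round mirrors the text: each round runs to convergence under its own truncation length before the next begins.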
