Markov decision processes capture sequential decision making under uncertainty, where an agent must choose actions so as to optimize long-term reward. The paper studies efficient reasoning mechanisms for Relational Markov Decision Processes (RMDP), where world states have an internal relational structure that can be naturally described in terms of objects and relations among them. Two contributions are presented. First, the paper develops First Order Decision Diagrams (FODD), a new compact representation for functions over relational structures, together with a set of operators to combine FODDs and novel reduction techniques to keep the representation small. Second, the paper shows how FODDs can be used to develop solutions for RMDPs, where reasoning is performed at the abstract level and the resulting optimal policy is independent of domain size (the number of objects) or instantiation. In particular, a variant of the value iteration algorithm is developed using special operations over FODDs, and the algorithm is shown to converge to the optimal policy.
Many real-world problems can be cast as sequential decision making under uncertainty. Consider a simple example in a logistics domain where an agent delivers boxes. The agent can take three types of actions: loading a box onto a truck, unloading a box from a truck, and driving a truck to a city. However, the effects of actions may not be perfectly predictable. For example, the gripper may be slippery, so load actions may not succeed, or the navigation module may be unreliable, so the truck may end up in the wrong location. This uncertainty compounds the already complex problem of planning a course of action to achieve some goals or maximize rewards.
Markov Decision Processes (MDP) have become the standard model for sequential decision making under uncertainty (Boutilier, Dean, & Hanks, 1999). These models also provide a general framework for artificial intelligence (AI) planning, where an agent has to achieve or maintain a well-defined goal. MDPs model an agent interacting with the world. The agent can fully observe the state of the world and takes actions so as to change the state. In doing so, the agent tries to optimize a measure of the long-term reward it can obtain from these actions.
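For concreteness, one standard way to formalize this objective (the infinite-horizon discounted criterion, stated here in generic MDP notation rather than notation specific to this paper) is via the Bellman optimality equation for the value function:

\[
V^*(s) \;=\; \max_{a}\Big[\, R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \,\Big],
\]

where $R$ is the reward function, $P$ the transition model, and $\gamma \in [0,1)$ the discount factor. Value iteration, which the paper later lifts to the relational setting, repeatedly applies this update until the values converge.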
The classical representation and algorithms for MDPs (Puterman, 1994) require enumeration of the state space. For more complex situations we can specify the state space in terms of a set of propositional variables called state attributes, which together determine the world state. Consider a very simple logistics problem with only one box and one truck. Then we can have state attributes such as truck in Paris (TP), box in Paris (BP), box in Boston (BB), etc. If the state space is represented by n binary state attributes, then the total number of states is 2^n. For some problems, however, the domain dynamics and resulting solutions have a simple structure that can be described compactly using the state attributes, and previous work, known as the propositionally factored approach, has developed a suite of algorithms that exploit this structure and avoid state enumeration. For example, one can use dynamic Bayesian networks, decision trees, and algebraic decision diagrams to concisely represent the MDP model. This line of work showed substantial speedups for propositionally factored domains (Boutilier, Dearden, & Goldszmidt, 1995; Boutilier, Dean, & Goldszmidt, 2000; Hoey, St-Aubin, Hu, & Boutilier, 1999).
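To make the 2^n blow-up concrete, the following sketch enumerates the states of a propositionally factored domain. It is purely illustrative: the attribute names TP, BP, and BB come from the example above, while the function all_states is a hypothetical helper, not part of any algorithm in the paper.

```python
from itertools import product

# Hypothetical propositional encoding of the one-box, one-truck logistics
# problem: a state is an assignment of truth values to binary attributes.
ATTRIBUTES = ("TP", "BP", "BB")  # truck in Paris, box in Paris, box in Boston


def all_states(attributes):
    """Enumerate every truth assignment to the attributes.

    With n binary attributes there are 2**n states, which is exactly the
    enumeration that factored (and, later, relational) methods try to avoid.
    """
    for values in product([False, True], repeat=len(attributes)):
        yield dict(zip(attributes, values))


if __name__ == "__main__":
    states = list(all_states(ATTRIBUTES))
    print(len(states))  # 2**3 = 8 states even for this tiny example
```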
The logistics example presented above is very small. Any realistic problem will have a large number of objects and corresponding relations among them. Consider a problem with four trucks and three boxes, where the goal is to have a box in Paris, but it does not matter which box. With the propositionally factored approach, we need one propositional variable for every possible instantiation of the relations in the domain, e.g., box 1 in Paris, box 2 in Paris, box 1 on truck 1, box 2 on truck 1, and so on, and the action space expands in the same way. The goal becomes a ground disjunction over the different instances, stating "box 1 in Paris, or box 2 in Paris, or box 3 in Paris". Thus we get a very large MDP, and at the same time we lose the structure implicit in the relations and the potential computational benefits of this structure. This is the main motivation behind relational or first order MDPs (RMDP).

A first order representation of MDPs can describe domain objects and relations among them, and can use quantification in specifying objectives. In the logistics example, we can introduce three predicates to capture the relations among domain objects, i.e., Bin(Box, City), Tin(Truck, City), and On(Box, Truck), with their obvious meaning. We have three parameterized actions, i.e., load(Box, Truck), unload(Box, Truck), and drive(Truck, City). Now the domain dynamics, reward, and solutions can be described compactly and abstractly using the relational notation. For example, we can define the goal using existential quantification, i.e., ∃b, Bin(b, Paris). Using this goal one can identify an abstract policy that is optimal for every possible instance of the domain. Intuitively, when there are zero steps to go, the agent is rewarded if any box is in Paris. When there is one step to go and there is no box in Paris yet, the agent can take one action to help achieve the goal: if there is a box (say b1) on a truck (say t1) and the truck is in Paris, then the agent can execute the action unload(b1, t1), which may make Bin(b1, Paris) true and thus achieve the goal. When there are two steps to go, if there is a box on a truck that is in Paris, the agent can take the unload action twice (to increase the probability of successfully unloading the box), or if there is a box on a truck that is not in Paris, the agent can first take a drive action followed by unload. The preferred plan will depend on the success probabilities of the different actions.
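The following sketch illustrates this relational view of a state and the existentially quantified goal. It is only an illustration of the encoding: the predicate and constant names follow the example above, while goal_reached and unload_success are hypothetical helpers and do not reflect the paper's FODD-based machinery.

```python
# A relational world state as a set of ground atoms over Bin, Tin, and On.
state = {
    ("On", "box1", "truck1"),
    ("Tin", "truck1", "Paris"),
    ("Bin", "box2", "Boston"),
}


def goal_reached(state, city="Paris"):
    """True iff some box is in the given city: the goal exists b. Bin(b, city)."""
    return any(pred == "Bin" and c == city for (pred, _box, c) in state)


def unload_success(state, box, truck):
    """The 'success' outcome of unload(box, truck); in the MDP this outcome
    occurs only with some probability."""
    new_state = set(state)
    for (pred, t, city) in state:
        if pred == "Tin" and t == truck:
            # The box leaves the truck and ends up in the truck's city.
            new_state.discard(("On", box, truck))
            new_state.add(("Bin", box, city))
    return new_state


print(goal_reached(state))                                    # False
print(goal_reached(unload_success(state, "box1", "truck1")))  # True
```

The same abstract policy (unload a box from a truck that is already in Paris, otherwise drive a loaded truck to Paris first) applies regardless of how many boxes and trucks the instance contains, which is the kind of domain-size independence the paper's FODD representation is designed to exploit.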