Online Active Perception for Partially Observable Markov Decision Processes with Limited Budget


Authors: Mahsa Ghasemi, Ufuk Topcu

Abstract—Active perception strategies enable an agent to selectively gather information in a way that improves its performance. In applications in which the agent does not have prior knowledge about the available information sources, it is crucial to synthesize active perception strategies at runtime. We consider a setting in which, at runtime, an agent is capable of gathering information under a limited budget. We pose the problem in the context of partially observable Markov decision processes. We propose a generalized greedy strategy that selects a subset of information sources with near-optimality guarantees on uncertainty reduction. Our theoretical analysis establishes that the proposed active perception strategy achieves near-optimal performance in terms of expected cumulative reward. We demonstrate the resulting strategies in simulations on a robotic navigation problem.

I. INTRODUCTION

An intelligent system should be able to exploit the available information in its surroundings toward better accomplishment of its task. However, in many applications in robotics and control, a decision-maker (called an agent) is not necessarily aware of the available information sources during a priori planning. For instance, consider an environment in which multiple agents, each with individual plans for their specific tasks, operate together. An agent may have no or only limited access to the behavioral models of other agents, and hence to their observability of the environment and whether they are in communication range. Nevertheless, at runtime, the agents may decide to exchange their information in order to enhance their performance. In practical settings, the ability of an agent to gather information is subject to budget constraints originating from power, communication, or computational limitations.
If an agent decides to employ a sensor, it incurs a cost associated with the required power, or, if an agent decides to communicate with another agent, it incurs a communication cost. Such budget constraints accentuate the need for actively selecting the subset of available information that is most beneficial to the agent. We call this decision-making problem budget-constrained online active perception.

We formulate budget-constrained online active perception for partially observable Markov decision processes (POMDPs). Computing an optimal policy for a POMDP that maximizes the expected cumulative reward is, in general, PSPACE-complete [1]. This complexity result has led to the design of numerous approximation algorithms. A well-known family of these approximate methods relies on point-based value iteration solvers [2]-[4]. Point-based solvers exploit the piecewise linearity and convexity [5] of the value function to approximate it as the maximum of a set of hyperplanes, each associated with a sampled belief point. It is provable that the error due to this approximation is bounded by a factor depending on the density of the sampled belief points [6].

The combinatorial nature of selecting a subset of available information sources subject to budget constraints renders the task of finding an optimal solution NP-hard. We propose an efficient yet near-optimal online active perception strategy for POMDPs that aims to minimize the agent's uncertainty about the state while respecting the constraint. We prove the near-optimality of the proposed algorithm.

Mahsa Ghasemi is with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712 USA. Ufuk Topcu is with the Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, TX 78712 USA. This work was supported in part by ONR grants N00014-19-1-2054 and N00014-18-1-2829, and DARPA grant D19AP00004.
Further, we evaluate the efficacy of the proposed solution on a robotic navigation task in which the robot can communicate with unmanned aerial vehicles (UAVs) to better localize itself.

A. Related Work

Active perception has been studied in many applications, including robotics [7]-[10] and image processing [11], [12]. A body of literature formalizes active perception as a reward-based task of a POMDP, enabling non-myopic decision-making. The reward-based treatment of perception has been employed for active classification [13] and cooperative active perception [14]-[16]. Araya et al. [17] introduce the ρPOMDP model, in which the reward is the entropy of the belief, and Spaan et al. [18] propose POMDP-IR, in which the reward depends on the accuracy of state prediction. In [19], the authors exploit the submodularity of the value function for ρPOMDP and POMDP-IR to design a greedy maximization technique for finding a near-optimal active perception policy. Our setting differs from the existing work in two aspects. First, we consider both planning and perception, where the perception serves the planning objective. Second, we consider settings in which the perception model is only partially known in a priori planning.

An instance of active perception, considered in this paper, is that of dynamically selecting a subset of available information sources. The existing work on subset selection quantifies the usefulness of an information source by information-theoretic utility functions such as scalarizations of the error covariance matrix of the estimated parameter [20], [21], the mutual information between the measurements and the parameter of interest, or the entropy of the selected measurements [22], [23]. Given a specific utility function, selecting an optimal subset of information sources under a constraint is a combinatorial problem [24].
However, if the utility function has properties such as monotonicity or (weak) submodularity, greedy algorithms can achieve near-optimal solutions with only a polynomial number of function evaluations [25]-[27]. We use the mutual information between the current state and the observations as the utility function. We obtain a theoretical guarantee on the performance of the proposed generalized greedy maximization algorithm by exploiting the monotonicity and submodularity of mutual information, as well as the linearity of the cost constraint.

II. PRELIMINARIES AND PROBLEM STATEMENT

In this section, we provide an outline of the related concepts and definitions in order to formally state the problem.

A. Preliminaries

We first overview the necessary background on partially observable Markov decision processes (POMDPs), point-based value iteration solvers, and properties of set functions.

1) POMDP: A POMDP is a tuple P = (S, A, T, Ω, O, R, γ), where S is the finite set of states, A is the finite set of actions, T : S × A × S → [0, 1] is the probabilistic transition function, Ω is the set of observations, O : S × A × Ω → [0, 1] is the probabilistic observation function, R : S × A → ℝ is the reward function, and γ ∈ [0, 1) is the discount factor. At each time step, the environment is in some state s ∈ S. The agent takes an action a ∈ A that causes a transition to a state s′ ∈ S with probability Pr(s′ | s, a) = T(s, a, s′). It then receives an observation ω ∈ Ω with probability Pr(ω | s′, a) = O(s′, a, ω), and a scalar reward R(s, a). The belief of the agent at time step t, denoted by b_t, is the posterior probability distribution over states given the history of previous actions and observations, i.e., h_t = (a_0, ω_1, a_1, ..., a_{t−1}, ω_t). A well-known fact is that, due to the Markov property, the belief is a sufficient statistic for the history of actions and observations [28], [29].
Given the initial belief b_0, the following update relates the previous belief b to the belief b′_{a,ω} after taking action a and receiving observation ω:

$$ b'_{a,\omega}(s') = \frac{O(s',a,\omega)\sum_{s} T(s,a,s')\, b(s)}{\sum_{s'} O(s',a,\omega)\sum_{s} T(s,a,s')\, b(s)}. \tag{1} $$

The agent's objective is to find a pure policy that maximizes its expected discounted cumulative reward, E[∑_{t=0}^∞ γ^t R(s_t, a_t) | b_0]. A pure policy is a mapping from beliefs to actions, π : B → A, where B is the set of belief states. Note that B forms a (|S| − 1)-dimensional probability simplex, which we denote by Δ_B.

2) Point-Based Value Iteration: POMDP solvers apply value iteration [5], a dynamic programming technique, to find the optimal policy. Let V be a value function that maps beliefs to values in ℝ representing the expected discounted cumulative reward for a given belief. The following recursive expression holds for V:

$$ V_t(b) = \max_{a}\left(\sum_{s\in S} b(s)\, R(s,a) + \gamma \sum_{\omega\in\Omega} Pr(\omega \mid b,a)\, V_{t-1}(b'_{a,\omega})\right). \tag{2} $$

The value iteration process converges to the optimal value function, which satisfies Bellman's optimality equation [30]. An optimal policy can then be derived from the optimal value function. An important consequence of (2) is that, at any horizon, the value function is piecewise linear and convex [29] and hence can be represented by a finite set of hyperplanes, each associated with an action. Let α denote the vector of parameters of such a hyperplane, and let Γ_t be the set of α vectors at horizon t. Then,

$$ V_t(b) = \max_{\alpha\in\Gamma_t} \alpha \cdot b, \tag{3} $$

where · denotes the dot product of the two vectors. Additionally, the action corresponding to the maximizing α in (3) determines the optimal action at b.
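As a concrete illustration, the belief update in (1) can be sketched as follows; the array layout and the tiny two-state model are illustrative assumptions, not part of the paper.

```python
import numpy as np

def belief_update(b, a, omega, T, O):
    """Bayes update of a POMDP belief, following (1).

    b : (|S|,) current belief over states
    T : (|S|, |A|, |S|) transition probabilities T[s, a, s']
    O : (|S|, |A|, |Omega|) observation probabilities O[s', a, omega]
    """
    # Predict: b_pred(s') = sum_s T(s, a, s') b(s)
    b_pred = T[:, a, :].T @ b
    # Correct: weight each s' by the observation likelihood O(s', a, omega)
    unnorm = O[:, a, omega] * b_pred
    # Normalize, matching the denominator of (1)
    return unnorm / unnorm.sum()

# Illustrative 2-state, 1-action, 2-observation model
T = np.zeros((2, 1, 2)); T[:, 0, :] = [[0.9, 0.1], [0.2, 0.8]]
O = np.zeros((2, 1, 2)); O[:, 0, :] = [[0.7, 0.3], [0.4, 0.6]]
b = np.array([0.5, 0.5])
b_new = belief_update(b, a=0, omega=0, T=T, O=O)
```

The updated belief stays on the simplex, and the state with the larger observation likelihood gains probability mass.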
This representation of the value function has motivated approximate point-based solvers, which approximate the value function by updating the hyperplanes over a finite set of sampled belief points. Generic point-based solvers consist of three main steps: sampling, backup, and pruning. These steps are applied repeatedly until a desired convergence criterion for the value function is met. For the sampling step, different approaches exist, including discretization of the belief simplex and adaptive sampling techniques [3], [4], [6]. The backup step follows the standard Bellman backup operation. More specifically, one can rewrite (2) using (3) to obtain

$$ V_t(b) = \max_a\left(\sum_{s\in S} b(s)\, R(s,a) + \gamma\sum_{\omega\in\Omega}\max_{\alpha\in\Gamma_{t-1}}\sum_{s\in S}\sum_{s'\in S}\alpha(s')\, O(s',a,\omega)\, T(s,a,s')\, b(s)\right), $$

where Γ_{t−1} is the set of α vectors from the previous iteration. Let B_t denote the current set of sampled belief points. The Bellman backup operator on B_t is performed through the following procedure [6]:

Step 1: For all a ∈ A: Γ_t^{a,∗} ← α^{a,∗}(s) = R(s, a).
Step 2: For all a ∈ A, α ∈ Γ_{t−1}, and ω ∈ Ω: Γ_t^{a,ω} ← α^{a,ω}(s) = γ ∑_{s′∈S} O(s′, a, ω) T(s, a, s′) α(s′).
Step 3: For all a ∈ A and b ∈ B_t: Γ_t^{b,a} ← α^{b,a} = α^{a,∗} + ∑_{ω∈Ω} argmax_{α∈Γ_t^{a,ω}} α · b.
Step 4: For all b ∈ B_t: α^b = argmax_{α∈Γ_t^{b,a}, a∈A} α · b.
Step 5: Γ_t = ∪_{b∈B_t} {α^b},

where Γ_t is the new set of α vectors. Lastly, in the pruning step, the α vectors that are dominated by other α vectors are removed to simplify the next round of computation [17].

3) Properties of Set Functions: Since the proposed active perception algorithm is founded upon theoretical results from the field of submodular optimization of set functions, we overview the necessary definitions here. Let X denote a ground set and f a set function that maps an input set to a real number.

Definition 1.
A set function f : 2^X → ℝ is monotone nondecreasing if f(T_1) ≤ f(T_2) for all T_1 ⊆ T_2 ⊆ X.

Definition 2. A set function f : 2^X → ℝ is submodular if f(T_1 ∪ {i}) − f(T_1) ≥ f(T_2 ∪ {i}) − f(T_2) for all subsets T_1 ⊆ T_2 ⊂ X and i ∈ X \ T_2. The term f_i(T_1) = f(T_1 ∪ {i}) − f(T_1) is the marginal value of adding element i to set T_1.

Monotonicity states that adding elements to a set does not decrease the function value, while submodularity refers to the diminishing-returns property.

B. Problem Statement

In this paper, we consider an agent whose interaction with the environment, i.e., stochastic transitions and observations, is captured by a POMDP. In addition to the a priori known observations captured by the POMDP, during runtime the agent can collect auxiliary observations, e.g., by communicating with other nearby agents. However, there is a budget constraint, such as limited communication bandwidth or limited communication power, on the auxiliary information gathering. Therefore, the agent must pick (or activate) a subset of auxiliary information sources that maximally increases its expected future reward while respecting the constraint. We formally state the problem next.

Problem 1. Consider a POMDP P = (S, A, T, Ω, O, R, γ) with initial belief b_0. Let Ω_t^aux = Ω_1 × Ω_2 × … × Ω_{n_t} denote the joint space of the n_t auxiliary observations available at time step t, with associated costs c_t^1, c_t^2, ..., c_t^{n_t}, and an upper bound c̄_t on the total cost. Also, let I_t = {ι = (i_1, i_2, ..., i_k) | i_j, k ∈ {1, 2, ..., n_t}} represent the power set obtained from Ω_t^aux. In a priori planning, we aim to compute a pure belief-based policy π : B → A that maximizes the expected discounted cumulative reward, i.e.,

$$ \pi^* = \operatorname*{argmax}_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(b_t)) \,\Big|\, b_0\Big]. $$
Furthermore, at runtime, we aim to compute an active perception policy μ_t : B → I_t that, given the current belief b_t, maximizes the expected discounted cumulative reward in the future while respecting the cost constraint, i.e.,

$$ \mu_t^* = \operatorname*{argmax}_{\mu_t} \; \mathbb{E}\Big[\sum_{\tau=t}^{\infty} \gamma^{\tau} R(s_\tau, \pi(b_\tau)) \,\Big|\, b_t\Big] \quad \text{such that} \quad \sum_{i\in\iota} c_t^i \le \bar{c}_t \;\; \text{for the selected } \iota = (i_1, \ldots, i_k) \in I_t. $$

III. ONLINE ACTIVE PERCEPTION WITH LIMITED BUDGET

Problem 1 consists of two stages. The first stage is a priori planning based on the POMDP model. We resort to point-based value iteration (see Section II) to compute a near-optimal policy π̂ for this planning problem. As discussed earlier, various heuristics for adaptive sampling of belief points have been developed. The core idea of these methods is to guide the sampling toward the reachable subspace of the belief simplex Δ_B. Nevertheless, since the reachable belief points depend on the possible observations, and the agent is not aware of the auxiliary observations a priori, we propose a uniform sampling of the belief simplex. While uniform sampling is not as efficient as adaptive sampling for large POMDPs, it ensures coverage of the whole belief space.

The second stage of the problem is the online computation of an optimal subset of information sources with respect to the expected future reward while complying with the cost constraint. To that end, we design a generalized greedy strategy, applied at each time step, that is computationally efficient and achieves near-optimality guarantees. Before introducing the algorithm, we state the following assumption regarding the independence of observations from the auxiliary information sources.

Assumption 1. We assume that the observations from the information sources are mutually independent given the current state and the previous action, i.e., for all I, J ⊆ {1, 2, …
, n} with I ∩ J = ∅:

$$ Pr\Big(\bigcup_{i\in I}\omega_i, \bigcup_{j\in J}\omega_j \,\Big|\, s, a\Big) = Pr\Big(\bigcup_{i\in I}\omega_i \,\Big|\, s, a\Big)\, Pr\Big(\bigcup_{j\in J}\omega_j \,\Big|\, s, a\Big). $$

Let b′_{a,ω} denote the updated belief after taking action a and receiving observation ω. Assume the agent then picks a perception action corresponding to ι = (i_1, i_2, ..., i_k) and receives an auxiliary observation ω̄ = (ω_{i_1}, ω_{i_2}, ..., ω_{i_k}). Then, if Assumption 1 holds, according to Bayes' theorem, the agent's belief is further updated by the following rule:

$$ b''_{a,\iota,\bar{\omega}}(s'') = \frac{\prod_{i\in\iota} O_i(s'', a, \omega_i)\, b'(s'')}{\sum_{s''} \prod_{i\in\iota} O_i(s'', a, \omega_i)\, b'(s'')}, \tag{4} $$

where O_i(s″, a, ω_i) = Pr(ω_i | s″, a, ι).

A. Proposed Generalized Greedy Algorithm

To quantify the utility of information sources, we use the mutual information between the state and the auxiliary observations. The mutual information between two random variables is a nonnegative and symmetric measure of their dependence and is defined as

$$ I(x; y) = \sum_{x,y} p_{x,y}(x, y) \log \frac{p_{x,y}(x, y)}{p_x(x)\, p_y(y)}. $$

Mutual information, due to its monotonicity and submodularity, has inspired many subset selection algorithms [23]. The mutual information between the state and the auxiliary observations is closely related to the change in the entropy of the state after receiving the additional observations, as expressed by the following equation:

$$ I\Big(s; \bigcup_{i\in\iota}\omega_i\Big) = H(s) - H\Big(s \,\Big|\, \bigcup_{i\in\iota}\omega_i\Big). \tag{5} $$

For a discrete random variable x, the entropy is defined as H(x) = −∑_i p(x_i) log p(x_i) and captures the amount of uncertainty. Therefore, intuitively, maximizing the mutual information is equivalent to minimizing the uncertainty in the state. Minimizing the state uncertainty is the goal of perception actions, as it leads to higher expected reward in the future. Notice that the entropy is strictly concave on Δ_B [31].
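A minimal sketch of evaluating the mutual-information utility in (5) under Assumption 1, by enumerating the joint realizations of the selected observations; the three-state belief and the per-source observation models below are illustrative assumptions.

```python
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in nats, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(b, obs_models, iota):
    """I(s; {omega_i : i in iota}) = H(s) - H(s | omega_iota), as in (5).

    b          : (|S|,) current belief
    obs_models : list of (|S|, |Omega_i|) arrays O_i[s, omega_i]; sources are
                 conditionally independent given the state (Assumption 1)
    iota       : tuple of selected source indices
    """
    if not iota:
        return 0.0
    cond = 0.0
    ranges = [range(obs_models[i].shape[1]) for i in iota]
    for omegas in itertools.product(*ranges):
        lik = np.ones_like(b)
        for i, w in zip(iota, omegas):
            lik = lik * obs_models[i][:, w]   # joint likelihood given each state
        p_obs = float(b @ lik)                # probability of this realization
        if p_obs > 0:
            cond += p_obs * entropy(b * lik / p_obs)  # weighted posterior entropy
    return entropy(b) - cond

# Illustrative belief and two observation sources
b = np.array([0.5, 0.3, 0.2])
O1 = np.array([[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]])
O2 = np.array([[0.6, 0.4], [0.6, 0.4], [0.3, 0.7]])
g1 = mutual_information(b, [O1, O2], (0,))
g2 = mutual_information(b, [O1, O2], (1,))
g12 = mutual_information(b, [O1, O2], (0, 1))
```

The values exhibit the properties used later: g12 is at least max(g1, g2) (monotonicity), and the marginal gain g12 − g1 of adding source 2 does not exceed its stand-alone gain g2 (diminishing returns).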
Hence, minimizing the entropy pushes the belief toward the boundary of the simplex, which, due to the convexity of the value function, possesses higher value. Accordingly, in order to select the optimal perception action, we define the objective as the following set function:

$$ f(\iota) = I\Big(s; \bigcup_{i\in\iota}\omega_i\Big) = H(s) - H\Big(s \,\Big|\, \bigcup_{i\in\iota}\omega_i\Big), \tag{6} $$

and aim to compute ι* by solving the following discrete optimization problem:

$$ \iota^* = \operatorname*{argmax}_{\iota} f(\iota) \quad \text{such that} \quad \sum_{i\in\iota} c_t^i \le \bar{c}_t \;\; \text{for the selected } \iota = (i_1, \ldots, i_k) \in I_t. \tag{7} $$

Note that H(s) is constant and does not affect the selection procedure. Furthermore, H(s | ∪_{i∈ι} ω_i) is the expected value of the entropy over all possible realizations of the observations and can be computed via

$$ H\Big(s \,\Big|\, \bigcup_{i\in\iota}\omega_i\Big) = - \sum_{\omega_{i_1}\in\Omega_{i_1}} \cdots \sum_{\omega_{i_k}\in\Omega_{i_k}} \sum_{s\in S} b(s) \prod_{i_j\in\iota} O_{i_j}(s, a, \omega_{i_j}) \log\left( \frac{b(s) \prod_{i_j\in\iota} O_{i_j}(s, a, \omega_{i_j})}{\sum_{s'\in S} b(s') \prod_{i_j\in\iota} O_{i_j}(s', a, \omega_{i_j})} \right). \tag{8} $$

At each time step, there are 2^n possible perception actions ι with their associated costs. Finding an optimal subset of information sources with respect to (7) is a combinatorial optimization problem and is NP-hard [24]. Hence, we propose an approximate solution based on greedy maximization schemes. The proposed greedy algorithm, outlined in Algorithm 1, is founded upon the idea of the generalized greedy algorithm in [32]. The algorithm takes as input the agent's belief and action, along with the current set of available auxiliary information sources.

Fig. 1: The robot aims to reach the target state (starred) while avoiding the obstacles (dark cells) in the map. UAVs periodically patrol the dashed paths and can view their nearby area (shaded area). The robot can ask for information from UAVs to better localize itself.

Algorithm 1 Perception policy as a generalized greedy scheme
1: Input: POMDP P = (S, A, T, Ω, O, R, γ); current belief b; action a; auxiliary information Ω_t^aux with costs c_t^1, c_t^2, ...
, c_t^{n_t}; cost constraint c̄_t; scaling factor β > 0.
2: Output: Perception action ι_t^g.
3: Initialize X = {1, 2, ..., n_t}, X̃ = X, ι̃ = ∅.
4: while X̃ ≠ ∅ do
5:   j* = argmax_{j ∈ X̃ \ ι̃} [H(s | ∪_{i∈ι̃} ω_i) − H(s | ∪_{i∈ι̃∪{j}} ω_i)] / (c_t^j)^β
6:   if ∑_{i∈ι̃} c_t^i + c_t^{j*} ≤ c̄_t then
7:     ι̃ ← ι̃ ∪ {j*}
8:   end if
9:   X̃ ← X̃ \ {j*}
10: end while
11: j_1* = argmax_{j∈X, c_t^j ≤ c̄_t} −H(s | ω_j)
12: ι_t^g = argmax_{ι ∈ {ι̃, {j_1*}}} −H(s | ∪_{i∈ι} ω_i)
13: return ι_t^g

The algorithm iteratively adds elements from the ground set (the set of all information sources) whose marginal gain with respect to f, scaled by the added cost, is maximal, and it terminates when no more elements can be added due to the constraint. The parameter β is a scaling factor of the cost that can be adjusted to calibrate the effect of cost for a particular problem. The output is the better of the constructed subset and the best feasible singleton.

B. Theoretical Analysis

Next, we theoretically analyze the performance of the proposed online active perception algorithm. The following lemma states the properties of the objective function required to prove the near-optimality result.

Lemma 1. Let Ω = {ω_1, ω_2, ..., ω_n} represent a set of observations of the state s for which Assumption 1 holds. Then f(ι), defined in (6), satisfies the following properties: 1) f(∅) = 0; 2) f is monotone nondecreasing; and 3) f is submodular.

The proof of the lemma follows from the submodularity of conditional entropy [33] and its monotonicity. The lemma enables us to establish the approximation factor using the analysis in [32].

Theorem 1. Let ι* denote the optimal subset of observations obtained from the optimization problem in (7), and let ι^g denote the output of Algorithm 1 for β = 1. Then the following performance guarantee holds:

$$ I\Big(s; \bigcup_{i\in\iota^g}\omega_i\Big) \ge \Big(1 - \frac{1}{\sqrt{e}}\Big)\, I\Big(s; \bigcup_{i\in\iota^*}\omega_i\Big). \tag{9} $$
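Algorithm 1 can be sketched in code as below; `utility` plays the role of f in (6), and the toy coverage function and costs in the usage example are illustrative stand-ins for the mutual-information objective.

```python
def generalized_greedy(n, costs, budget, utility, beta=1.0):
    """Generalized greedy selection, following Algorithm 1.

    n       : number of auxiliary information sources
    costs   : costs[j] is the cost of source j
    budget  : upper bound on the total cost
    utility : set function f mapping a tuple of source indices to a real value
    beta    : scaling exponent on the cost of each candidate (line 5)
    """
    chosen = []
    remaining = set(range(n))
    while remaining:
        # Candidate with maximal cost-scaled marginal gain (line 5)
        j_star = max(remaining,
                     key=lambda j: (utility(tuple(chosen) + (j,))
                                    - utility(tuple(chosen))) / costs[j] ** beta)
        # Add it only if the budget still allows (lines 6-8)
        if sum(costs[j] for j in chosen) + costs[j_star] <= budget:
            chosen.append(j_star)
        remaining.discard(j_star)
    # Compare against the best feasible singleton (lines 11-12)
    feasible = [j for j in range(n) if costs[j] <= budget]
    best_single = max(feasible, key=lambda j: utility((j,))) if feasible else None
    if best_single is not None and utility((best_single,)) > utility(tuple(chosen)):
        return (best_single,)
    return tuple(chosen)

# Toy monotone submodular utility: set coverage (illustrative)
cover = [{0, 1}, {1, 2}, {0, 1, 2, 3}]
f = lambda iota: len(set().union(*(cover[i] for i in iota))) if iota else 0
sel = generalized_greedy(3, costs=[1.0, 1.0, 3.0], budget=2.0, utility=f)
```

Here the two cheap sources fit the budget and jointly cover three elements, while the expensive source is excluded by the constraint.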
Fig. 2: The frequency of visiting states when using different online active perception methods: (a) no auxiliary observations; (b) random selection of 2 observations; (c) greedy selection of 2 observations.

Theorem 1 proves that the mutual information obtained by the generalized greedy algorithm is close to that of the optimal solution of (7). Nevertheless, we still need to analyze the near-optimality of the proposed online active perception policy compared to μ_t^* in Problem 1. To that end, we show that the expected distance between the two belief points resulting from the greedy and optimal perception actions is bounded. Using this fact, we prove that the value loss is bounded as well.

Theorem 2. Let b denote the agent's current belief and a its last action. Further, let ι^g and ι* be the greedy perception action and the optimal perception action, respectively. Then it holds that

$$ \mathbb{E}\big[\|b^g - b^*\|_1\big] \le \sqrt{\frac{2}{\sqrt{e}}\, \mathbb{E}_{\cup_{i\in\iota^*}\omega_i}\big[D_{KL}(p^* \,\|\, p')\big]}, $$

where b* and b^g are the updated beliefs according to (4).

Now we can use Theorem 2 to bound the value loss in the objective function of Problem 1.

Theorem 3. Instate the notation and hypotheses of Theorem 2. Additionally, let V be the computed value function for the POMDP. It holds that

$$ \mathbb{E}\big[V(b^g) - V(b^*)\big] \le \delta\, \frac{\max\{|R_{max}|, |R_{min}|\}}{1 - \gamma}, $$

where δ is the right-hand side of the inequality in Theorem 2.

IV. SIMULATION RESULTS

We evaluate the proposed online active perception algorithm in a robotic navigation task. To that end, we implement a simple point-based value iteration solver that uses a fixed set of belief points. The belief points are uniformly distributed over Δ_B, and their associated α vectors are initialized to (1/(1−γ)) min_{s,a} R(s,a) 1_{|S|} [34]. We run the solver until the ℓ_1-norm distance between the value functions in two consecutive iterations falls below a predefined threshold of 0.001 or a maximum of 1000 iterations is reached.
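The point-based backup at the core of such a solver (Steps 1 to 5 in Section II) can be sketched as follows; the array shapes and the tiny two-state model are illustrative assumptions, with the α vectors initialized as described above (here (1/(1−γ)) min R = 0).

```python
import numpy as np

def pbvi_backup(B, Gamma, T, O, R, gamma):
    """One point-based Bellman backup over a fixed belief set (Steps 1-5).

    B     : (m, |S|) array of sampled belief points
    Gamma : (k, |S|) array of alpha vectors from the previous iteration
    T     : (|S|, |A|, |S|) with T[s, a, s'];  O : (|S|, |A|, |Omega|) with O[s', a, w]
    R     : (|S|, |A|) reward table
    Returns the new set of alpha vectors, one per belief point.
    """
    nS, nA = R.shape
    nO = O.shape[2]
    new_alphas = []
    for b in B:
        best_val, best_alpha = -np.inf, None
        for a in range(nA):
            alpha_ab = R[:, a].copy()          # Step 1: immediate-reward vector
            for w in range(nO):
                # Step 2: back-projected vectors, entry (s, j) = gamma * sum_s' O T alpha_j(s')
                proj = gamma * (T[:, a, :] * O[:, a, w][None, :]) @ Gamma.T
                # Step 3: keep the back-projection maximizing alpha . b
                alpha_ab += proj[:, np.argmax(b @ proj)]
            val = b @ alpha_ab                 # Step 4: best action's vector at b
            if val > best_val:
                best_val, best_alpha = val, alpha_ab
        new_alphas.append(best_alpha)
    return np.vstack(new_alphas)               # Step 5: union over belief points

# Illustrative 2-state, 1-action model with an uninformative observation
T = np.zeros((2, 1, 2)); T[0, 0, 0] = T[1, 0, 1] = 1.0   # deterministic self-loops
O = np.ones((2, 1, 1))
R = np.array([[1.0], [0.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
Gamma = np.zeros((1, 2))                                  # (1/(1-gamma)) * min R = 0
for _ in range(2):
    Gamma = pbvi_backup(B, Gamma, T, O, R, gamma=0.9)
```

After two backups, the value of the rewarding state accumulates to 1 + 0.9 = 1.9, as expected for a self-loop with reward 1.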
We implement the proposed generalized greedy selection algorithm as well as a random selection algorithm that selects a subset of information sources uniformly at random. After learning the policy from the solver, we apply the online active perception policies for 50 Monte Carlo simulation runs. The code is available at https://github.com/MahsaGhasemi/greedy-perception-POMDP.

The robotic navigation scenario models a robot in an 8 × 8 grid map whose objective is to reach a goal state while avoiding the obstacles in the environment; see Fig. 1. The goal state has a reward of 10, obstacle cells have a reward of −5, and all other cells have a reward of −1. The navigation actions of the robot are A = {up, right, down, left, stop}. The robot's transitions are probabilistic due to possible actuation errors, with probability 0.7 of executing the intended action. The robot also has an inaccurate sensor that localizes it correctly with probability 0.5. In addition to the robot, there are 12 UAVs patrolling the area in periodic motions. The field of view of each UAV is a 3 × 3 area. At each time step, the robot can select some of the UAVs and ask them to send their information regarding the state of the robot. Note, however, that the observation model of the UAVs is time-varying and changes based on their locations. Moreover, the robot does not know the policies of the UAVs at planning time. We assume that the cost of communicating with each UAV is the same, and at each time step the cost constraint allows communication with at most 2 UAVs.

We first find a planning policy via the implemented point-based solver. Next, we let the robot run for a horizon of 40 steps with no auxiliary information, with random selection of information sources, and with the proposed generalized greedy selection based on mutual information. We terminate the simulations once the robot reaches the goal. Fig.
2 illustrates the normalized frequency of visiting each state for each perception algorithm. Using no auxiliary information leads to the worst performance, as the robot visits the obstacle cells frequently. Random addition of auxiliary information sources improves the performance, since it results in better obstacle avoidance. However, the best obstacle avoidance is achieved by the proposed generalized greedy algorithm, which shows more concentration around the optimal path. Fig. 3 shows the discounted cumulative reward, averaged over 50 Monte Carlo runs, for all policies, i.e., no auxiliary information, random selection of 1 and 2 information sources, and greedy selection of 1 and 2 information sources. It can be seen that the generalized greedy selection scheme obtains the highest reward.

V. CONCLUSION

We studied online active perception for POMDPs where, at each time step, the agent can pick a subset of the available information sources, under a budget constraint, to enhance its belief.

Fig. 3: The average discounted cumulative reward over 50 runs for each perception policy. The solid lines depict the corresponding standard deviations.

We defined a utility function based on the mutual information between the state and the information sources. We developed an efficient generalized greedy scheme that iteratively picks observation sources with the highest marginal gain, scaled by the added cost. We theoretically established the near-optimality of the proposed scheme and further evaluated it on a robotic navigation task. As part of future work, we aim to employ PAC greedy maximization [35] to accelerate the information selection process, since it requires only bounds on the utility function instead of exact computation.

APPENDIX I
PROOF OF LEMMA 1

It is clear that f(∅) = H(s) − H(s) = 0. Let [n] = {1, 2, ..., n}. To prove monotonicity, consider ι_1 ⊂ [n] and j ∈ [n] \ ι_1.
Then,

$$
\begin{aligned}
H\Big(s \,\Big|\, \bigcup_{i\in\iota_1\cup\{j\}}\omega_i\Big)
&\overset{(a)}{=} H\Big(\bigcup_{i\in\iota_1\cup\{j\}}\omega_i \,\Big|\, s\Big) + H(s) - H\Big(\bigcup_{i\in\iota_1\cup\{j\}}\omega_i\Big) \\
&\overset{(b)}{=} H\Big(\bigcup_{i\in\iota_1}\omega_i \,\Big|\, s\Big) + H(\omega_j \mid s) + H(s) - H\Big(\bigcup_{i\in\iota_1}\omega_i\Big) - H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) \\
&\overset{(c)}{=} H\Big(s \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) + H(\omega_j \mid s) - H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) \\
&\overset{(d)}{=} H\Big(s \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) + H\Big(\omega_j \,\Big|\, s, \bigcup_{i\in\iota_1}\omega_i\Big) - H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) \\
&\overset{(e)}{\le} H\Big(s \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) + H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) - H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) = H\Big(s \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big),
\end{aligned}
$$

where (a) and (c) are due to Bayes' rule for entropy, (b) follows from the conditional independence assumption and the definition of joint entropy, (d) is due to the conditional independence assumption, and (e) stems from the fact that conditioning does not increase entropy. Furthermore, from the third line of the above derivation, we can express the marginal gain as

$$ f_j(\iota_1) = H\Big(s \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) - H\Big(s \,\Big|\, \bigcup_{i\in\iota_1\cup\{j\}}\omega_i\Big) = H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) - H(\omega_j \mid s). $$

To prove submodularity, let ι_1 ⊆ ι_2 ⊂ [n] and j ∈ [n] \ ι_2. Then,

$$
\begin{aligned}
f_j(\iota_1) = H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1}\omega_i\Big) - H(\omega_j \mid s)
&\overset{(a)}{\ge} H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_1\cup(\iota_2\setminus\iota_1)}\omega_i\Big) - H(\omega_j \mid s) \\
&\overset{(b)}{=} H\Big(\omega_j \,\Big|\, \bigcup_{i\in\iota_2}\omega_i\Big) - H(\omega_j \mid s) = f_j(\iota_2),
\end{aligned}
$$

where (a) is based on the fact that conditioning does not increase entropy, and (b) results from ι_1 ⊆ ι_2.

APPENDIX II
PROOF OF THEOREM 2

Let p′ := b′_{a,ω} be the updated belief (see (1)) after taking action a and receiving observation ω. Also, let p^g := b″_{a,ι^g,ω̄} and p* := b″_{a,ι*,ω̄} denote the updated beliefs (see (4)) after receiving the auxiliary observations corresponding to the proposed generalized greedy scheme and the optimal selection, respectively.
First, by leveraging the relation between mutual information and Kullback-Leibler (KL) divergence, we establish the following:

$$ I\Big(s; \bigcup_{i\in\iota^g}\omega_i\Big) = \mathbb{E}_{\cup_{i\in\iota^g}\omega_i}\big[D_{KL}(p^g \,\|\, p')\big], \tag{10a} $$
$$ I\Big(s; \bigcup_{i\in\iota^*}\omega_i\Big) = \mathbb{E}_{\cup_{i\in\iota^*}\omega_i}\big[D_{KL}(p^* \,\|\, p')\big]. \tag{10b} $$

In other words, the mutual information between the state and a set of information sources equals the expected KL divergence from the current belief to the posterior belief. Therefore, using (10) along with the result of Theorem 1 yields

$$ \mathbb{E}_{\cup_{i\in\iota^g}\omega_i}\big[D_{KL}(p^g \,\|\, p')\big] \ge \Big(1 - \frac{1}{\sqrt{e}}\Big)\, \mathbb{E}_{\cup_{i\in\iota^*}\omega_i}\big[D_{KL}(p^* \,\|\, p')\big]. \tag{11} $$

Next, we use the Pythagorean theorem for KL divergence [36] and take the expectation over all realizations of the observations to obtain

$$ \mathbb{E}_{\cup_{i\in\iota^*}\omega_i}\big[D_{KL}(p^* \,\|\, p')\big] \ge \mathbb{E}_{\cup_{i\in[n]}\omega_i}\big[D_{KL}(p^* \,\|\, p^g)\big] + \mathbb{E}_{\cup_{i\in\iota^g}\omega_i}\big[D_{KL}(p^g \,\|\, p')\big]. \tag{12} $$

Combining (11) and (12) and rearranging the terms establishes

$$ \mathbb{E}_{\cup_{i\in[n]}\omega_i}\big[D_{KL}(p^* \,\|\, p^g)\big] \le \frac{1}{\sqrt{e}}\, \mathbb{E}_{\cup_{i\in\iota^*}\omega_i}\big[D_{KL}(p^* \,\|\, p')\big], \tag{13} $$

where the right-hand side is a constant. Lastly, we exploit Pinsker's inequality, which relates the total variation distance to the KL divergence, and apply Jensen's inequality for the square-root function (a concave function) to derive the desired result:

$$ \mathbb{E}\big[\|b^g - b^*\|_1\big] \le \sqrt{\frac{2}{\sqrt{e}}\, \mathbb{E}_{\cup_{i\in\iota^*}\omega_i}\big[D_{KL}(p^* \,\|\, p')\big]}. $$

APPENDIX III
PROOF OF THEOREM 3

Let α^g and α* represent the gradients of the value function at b^g and b*, respectively. Let R_max = max_{s,a} R(s,a) and R_min = min_{s,a} R(s,a). Then we can show that

$$
\begin{aligned}
\mathbb{E}\big[V(b^g) - V(b^*)\big] &= \mathbb{E}\big[\alpha^g \cdot b^g - \alpha^* \cdot b^*\big] \\
&= \mathbb{E}\big[\alpha^g \cdot b^g - \alpha^g \cdot b^* + \alpha^g \cdot b^* - \alpha^* \cdot b^*\big] \\
&\overset{(a)}{\le} \mathbb{E}\big[\alpha^g \cdot b^g - \alpha^g \cdot b^* + \alpha^* \cdot b^* - \alpha^* \cdot b^*\big] \\
&= \mathbb{E}\big[\alpha^g \cdot (b^g - b^*)\big] \\
&\overset{(b)}{\le} \mathbb{E}\big[\|\alpha^g\|_\infty \|b^g - b^*\|_1\big] \\
&\overset{(c)}{\le} \delta\, \frac{\max\{|R_{max}|, |R_{min}|\}}{1 - \gamma},
\end{aligned}
$$

where (a) follows from the fact that α* is the gradient of the optimal value function, (b) is due to Hölder's inequality, and (c) is the result of Theorem 2 and the fact that ‖α‖_∞ ≤ max{|R_max|, |R_min|}/(1 − γ) for every α vector.

REFERENCES

[1] C. H. Papadimitriou and J. N. Tsitsiklis, "The complexity of Markov decision processes," Mathematics of Operations Research, vol. 12, no. 3, pp. 441-450, 1987.
[2] H.-T. Cheng, Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British Columbia, 1988.
[3] H. Kurniawati, D. Hsu, and W. S. Lee, "SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces," in Robotics: Science and Systems, Zurich, Switzerland, 2008.
[4] T. Smith and R. Simmons, "Point-based POMDP algorithms: Improved analysis and implementation," arXiv preprint, 2012.
[5] E. J. Sondik, "The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs," Operations Research, vol. 26, no. 2, pp. 282-304, 1978.
[6] J. Pineau, G. Gordon, and S. Thrun, "Anytime point-based approximations for large POMDPs," Journal of Artificial Intelligence Research, vol. 27, pp. 335-380, 2006.
[7] A. Elfes, "Occupancy grids: A stochastic spatial representation for active robot perception," in Proc. Uncertainty in Artificial Intelligence, 1990.
[8] P. Stone, M. Sridharan, D. Stronger, G. Kuhlmann, N. Kohl, P. Fidelman, and N. K. Jong, "From pixels to multi-robot decision-making: A study in uncertainty," Robotics and Autonomous Systems, vol. 54, no. 11, pp. 933-943, 2006.
[9] B. Charrow, N. Michael, and V.
Kumar , “ Activ e control strategies for discovering and localizing devices with range-only sensors, ” in International W orkshop on the Algorithmic F oundations of Robotics XI , pp. 55–71, Springer, 2015. [10] G. Best, O. Cliff, T . Patten, R. Mettu, and R. Fitch, “Decentralised Monte Carlo tree search for activ e perception, ” in International W orkshop on the Algorithmic F oundations of Robotics , Springer , 2016. [11] T . Darrell and A. Pentland, “ Active gesture recognition using partially observable Markov decision processes, ” in Pr oc. International Con- fer ence on P attern Recognition , vol. 13, pp. 984–988, 1996. [12] J. V ogel and K. Murphy , “ A non-myopic approach to visual search, ” in Proc. Computer and Robot V ision , pp. 227–234, IEEE, 2007. [13] A. Guo, “Decision-theoretic acti ve sensing for autonomous agents, ” in Proc. international joint conference on Autonomous agents and multiagent systems , pp. 1002–1003, ACM, 2003. [14] M. T . Spaan, “Cooperativ e activ e perception using POMDPs, ” in workshop on advancements in POMDP solvers , Association for the Advancement of Artificial Intelligence, 2008. [15] M. T . Spaan and P . U. Lima, “ A decision-theoretic approach to dynamic sensor selection in camera networks, ” in Pr oc. International Confer ence on Automated Planning and Scheduling , pp. 279—-304, 2009. [16] P . Natarajan, P . K. Atrey , and M. Kankanhalli, “Multi-camera coordi- nation and control in surveillance systems: A surve y , ” ACM T ransac- tions on Multimedia Computing, Communications, and Applications , vol. 11, no. 4, p. 57, 2015. [17] M. Araya, O. Buffet, V . Thomas, and F . Charpillet, “ A POMDP extension with belief-dependent rewards, ” in Proc. Advances in neural information processing systems , pp. 64–72, 2010. [18] M. T . Spaan, T . S. V eiga, and P . U. Lima, “Decision-theoretic planning under uncertainty with information rewards for acti ve cooperativ e perception, ” Autonomous Agents and Multi-Agent Systems , vol. 
29, no. 6, pp. 1157–1185, 2015. [19] Y . Satsangi, S. Whiteson, F . A. Oliehoek, and M. T . Spaan, “Ex- ploiting submodular value functions for scaling up activ e perception, ” Autonomous Robots , vol. 42, no. 2, pp. 209–233, 2018. [20] M. Shamaiah, S. Banerjee, and H. V ikalo, “Greedy sensor selection: Lev eraging submodularity , ” in Pr oc. IEEE Conference on Decision and Control , pp. 2572–2577, IEEE, 2010. [21] A. Hashemi, M. Ghasemi, H. V ikalo, and U. T opcu, “ A randomized greedy algorithm for near-optimal sensor scheduling in large-scale sensor networks, ” in Proc. American Control Conference , pp. 1027– 1032, IEEE, 2018. [22] A. Krause and C. Guestrin, “Near-optimal observ ation selection using submodular functions, ” in Proc. Association for the Advancement of Artificial Intelligence , vol. 7, pp. 1650–1654, 2007. [23] A. Krause and D. Golovin, “Submodular function maximization, ” in T ractability: Practical Approac hes to Hard Problems , pp. 71–104, Cambridge University Press, 2014. [24] D. P . W illiamson and D. B. Shmoys, The design of approximation algorithms . Cambridge uni versity press, 2011. [25] G. L. Nemhauser, L. A. W olsey , and M. L. Fisher, “ An analysis of approximations for maximizing submodular set functions—I, ” Math- ematical pro gramming , vol. 14, no. 1, pp. 265–294, 1978. [26] Z. W ang, B. Moran, X. W ang, and Q. Pan, “ Approximation for maximizing monotone non-decreasing set functions with a greedy method, ” Journal of Combinatorial Optimization , vol. 31, no. 1, pp. 29–43, 2016. [27] C. Qian, J.-C. Shi, Y . Y u, and K. T ang, “On subset selection with general cost constraints, ” in Pr oc. International Joint Confer ence on Artificial Intelligence , vol. 17, pp. 2613–2619, 2017. [28] K. J. ˚ Astr ¨ om, “Optimal control of Markov processes with incomplete state information, ” Journal of Mathematical Analysis and Applications , vol. 10, no. 1, pp. 174–205, 1965. [29] R. D. Smallwood and E. J. 
Sondik, “The optimal control of partially observable Markov processes ov er a finite horizon, ” Operations r e- sear ch , vol. 21, no. 5, pp. 1071–1088, 1973. [30] R. Bellman, “ A Markovian decision process, ” Journal of Mathematics and Mechanics , pp. 679–684, 1957. [31] T . M. Cover and J. A. Thomas, Elements of information theory . John W iley & Sons, 2012. [32] H. Lin and J. Bilmes, “Multi-document summarization via budgeted maximization of submodular functions, ” in Pr oc. Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologies , pp. 912–920, 2010. [33] C.-W . K o, J. Lee, and M. Queyranne, “ An exact algorithm for maximum entropy sampling, ” Operations Research , vol. 43, no. 4, pp. 684–691, 1995. [34] G. Shani, J. Pineau, and R. Kaplow , “ A survey of point-based POMDP solvers, ” Autonomous Agents and Multi-Agent Systems , v ol. 27, no. 1, pp. 1–51, 2013. [35] Y . Satsangi, S. Whiteson, and F . A. Oliehoek, “P AC greedy maximiza- tion with efficient bounds on information gain for sensor selection, ” in Pr oc. International Joint Confer ence on Artificial Intelligence , pp. 3220–3227, 2016. [36] I. Csisz ´ ar , “I-div ergence geometry of probability distributions and minimization problems, ” Annals of Pr obability , pp. 146–158, 1975.
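The two inequality steps that close the proofs above, Pinsker's inequality (Theorem 2) and the Hölder step (b) with the \(\|\alpha\|_\infty\) bound (Theorem 3), can be sanity-checked numerically. The sketch below is illustrative only: the belief dimension, discount factor, reward bounds, and the randomly drawn beliefs and \(\alpha\)-vector entries are hypothetical choices, not quantities from the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions, using the natural log."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

rng = np.random.default_rng(0)
gamma = 0.95                       # hypothetical discount factor
r_max, r_min = 1.0, -1.0           # hypothetical reward bounds
alpha_bound = max(abs(r_max), abs(r_min)) / (1.0 - gamma)

for _ in range(1000):
    # Two random beliefs standing in for b_g and b^* (full support).
    b_g = rng.dirichlet(np.ones(4))
    b_opt = rng.dirichlet(np.ones(4))
    l1 = float(np.sum(np.abs(b_g - b_opt)))
    # Pinsker's inequality: ||b_g - b^*||_1 <= sqrt(2 D_KL(b_g || b^*)).
    assert l1 <= np.sqrt(2.0 * kl_divergence(b_g, b_opt)) + 1e-9
    # A random alpha-vector whose entries respect the R/(1-gamma) bound.
    alpha = rng.uniform(r_min, r_max, size=4) / (1.0 - gamma)
    # Hoelder step (b): |alpha . (b_g - b^*)| <= ||alpha||_inf ||b_g - b^*||_1.
    assert abs(alpha @ (b_g - b_opt)) <= np.max(np.abs(alpha)) * l1 + 1e-9
    assert np.max(np.abs(alpha)) <= alpha_bound + 1e-9
```

Sampling beliefs from a Dirichlet distribution keeps every component strictly positive, so the KL divergence stays finite and both inequalities can be checked without special-casing zero-probability states.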
