A Dynamic Game Framework for Rational and Persistent Robot Deception With an Application to Deceptive Pursuit-Evasion


Authors: Linan Huang, Quanyan Zhu

Abstract—This paper studies rational and persistent deception among intelligent robots to enhance security and operational efficiency. We present an N-player K-stage game with an asymmetric information structure where each robot's private information is modeled as a random variable or its type. The deception is persistent as each robot's private type remains unknown to other robots for all stages. The deception is rational as robots aim to achieve their deception goals at minimum cost. Each robot forms a dynamic belief of others' types based on intrinsic or extrinsic information. Perfect Bayesian Nash Equilibrium (PBNE) is a natural solution concept for dynamic games of incomplete information. Due to its requirements of sequential rationality and belief consistency, PBNE provides a reliable prediction of players' actions, beliefs, and expected cumulative costs over the entire K stages. The contribution of this work is fourfold. First, we identify the PBNE computation as a nonlinear stochastic control problem and characterize the structures of players' actions and costs under PBNE. We further derive a set of extended Riccati equations with cognitive coupling under the linear-quadratic setting and extrinsic belief dynamics. Second, we develop a receding-horizon algorithm with low temporal and spatial complexity to compute PBNE under intrinsic belief dynamics. Third, we investigate a deceptive pursuit-evasion game as a case study and use numerical experiments to corroborate the results. Finally, we propose metrics, such as deceivability, reachability, and the price of deception, to evaluate the strategy design and the system performance under deception.
Note to Practitioners—Recent advances in automation and adaptive control in multi-agent systems enable robots to use deception to accomplish their objectives. Deception involves intentional information hiding to compromise the security and operational efficiency of robotic systems. This work proposes a dynamic game framework to quantify the impact of deception, understand the robots' behaviors and intentions, and design cost-efficient strategies under deception that persists over stages. Existing research on robot deception has relied on experiments, while this work aims to lay a theoretical foundation of deception with quantitative metrics, such as deceivability and the price of deception. The proposed model has wide applications, including cooperative robots, pursuit and evasion, and human-robot teaming. Pursuit-evasion games are used as case studies to show how the deceiver can amplify the deception by belief manipulation and how the deceived robots can reduce the negative impact of deception by enhanced maneuverability and Bayesian learning. Future work will focus on designing cooperative deception among swarm robotics and robotic systems that are robust to or further benefit from deception.

This paper has been accepted for publication in IEEE Transactions on Automation Science and Engineering. This research is partially supported by awards ECCS-1847056, CNS-1544782, CNS-2027884, and SES-1541164 from the National Science Foundation (NSF), and grant W911NF-19-1-0041 from the Army Research Office (ARO). L. Huang and Q. Zhu are with the Department of Electrical and Computer Engineering, New York University, 370 Jay Street, Brooklyn, NY 11201, USA; E-mail: {lh2328,qz494}@nyu.edu. Digital Object Identifier 10.1109/TASE.2021.3097286

Index Terms—Robot deception, perfect Bayesian equilibrium, pursuit-evasion, linear-quadratic games, discrete-time Riccati equations

I. INTRODUCTION

Deception is a ubiquitous phenomenon in biology [1], military [2], politics and media [3], and cyberspace [4]. In particular, deception plays an increasingly significant role in cyber-physical systems, including autonomous vehicles and robots driven by artificial intelligence (AI). Recent advances in these AI-enabled technologies have not only allowed robots to adapt to the dynamic environment via real-time observations, but also made them deceivable. A deceiver can intentionally hide or reveal selected information to alter the beliefs and behaviors of the target robots for a higher reward. Since deception has many forms and delivery methods, understanding deception in a unified and quantitative framework is an indispensable step toward assessing the outcomes, measuring the impact, and designing strategies.

This work aims to design robots that can interact with others efficiently in deceptive environments. We identify the following challenges and features of robot deception. First, by definition, deception involves at least two participants interacting with each other. An intelligent robot should further consider other participants' rationality, predict their potential deceptive behaviors, and adjust its actions accordingly to alleviate the negative effect of deception. Second, due to the robots' dynamic nature, one-shot deception can exert a subsequent influence. The participating robots need to form long-term objectives to deceive or counter-deceive other robots. The multi-stage interactions also make it possible for the deceiver to apply deception at different stages. Third, each robot contains heterogeneous private information, which results in an asymmetric cognition structure; i.e., robots can form different beliefs over the same piece of unknown information.
Thus, besides the couplings of state dynamics and costs, the multi-agent system further has cognitive coupling; i.e., each robot's behaviors are affected not only by its own belief but also by the beliefs of the others.

To capture these features, we model the deceptive interaction between N strategic robots as a dynamic game of incomplete information. During the finite K stages of interaction, the N robots accomplish non-cooperative tasks such as pursuit-evasion in the battlefield [5] or cooperative tasks such as collective towing [6]. Robots introduce deception in the above interacting scenarios due to antagonism, selfishness, and privacy concerns. Following Harsanyi's approach [7], we capture each robot's private information by a random variable. The realization of the random variable, called the robot's type, is known only to itself, while the support of the random variable, which contains all its possible types, is known to all robots. Take the pursuit-evasion scenario as an example: due to the constraints of weather, terrain, and weapons, both the evading and the pursuing robots know the feasible beachheads for the evader to land on. However, the evader chooses only one beachhead as his true target, and the evader's choice, i.e., his type, is unknown to the pursuer. The pursuer in the battlefield knows the existence of the deception and learns to counter it by forming and updating her belief based on real-time observations.
Since these tasks are usually time-constrained, robots cannot wait and freeze until they have learned the true type. Instead, they have to take concurrent actions while the deceiver's type remains uncertain. We consider two classes of belief dynamics based on whether robots exploit intrinsic information, such as the prediction of other robots' actions, or extrinsic information to update their beliefs. Each robot aims to minimize its expected cumulative cost over K stages. Since the expectation involves its K-stage belief sequence of other players' private types, its actions should be sequentially rational under its belief sequence, and the belief sequence should be consistent with the belief dynamics as well. These two requirements lead to the solution concept of Perfect Bayesian Nash Equilibrium (PBNE), where a player's unilateral deviation from the equilibrium increases his long-run cost. By appending the belief state (i.e., all players' beliefs under all possible types) to the system state, the PBNE computation is equivalent to a multi-agent nonlinear stochastic control problem, and the method of dynamic programming applies. Without loss of generality, we characterize the structure of the action and the cost under PBNE as a feedback function of the belief state and the system state at the current stage. To provide an offline evaluation metric of the equilibrium cost under incomplete information, we use the expected equilibrium cost under complete information as a benchmark and define the Price of Deception (PoD). Due to their tractability and generality, we focus on incomplete-information Linear-Quadratic (LQ) games with extrinsic belief dynamics to obtain the PBNE action that is unique and affine in the system state. We obtain a set of extended Riccati equations, which explicitly characterizes the coupling in the state dynamics, costs, and cognition of all robots.
Under proper decoupling structures, the extended Riccati equations degenerate to the classical Riccati equations for the problems of optimal control or complete-information LQ games. Under the incomplete-information LQ games with intrinsic belief dynamics, the equilibrium action is in general not an affine feedback of the system state. Thus, we adopt a receding-horizon approach to provide a reasonable approximation of PBNE; i.e., instead of planning all K-stage actions offline before the game starts, players recompute their actions based on the real-time observations and their updated beliefs at each new stage during the interaction.

Finally, we investigate a target protection problem where an evader aims to deceptively reach one of the possible targets and simultaneously evade the pursuer. The game has double-sided asymmetric information. The evader's private or hidden information is his true target, while the pursuer's private information is her capability to maneuver, or maneuverability. We propose multi-dimensional metrics, including the stage of truth revelation and the endpoint distance, to assess the deception impact. We define the concept of deceivability to characterize the fundamental limits of deception and investigate how it is affected by the distinguishability of the private information. We compare the proposed control policy with two heuristic policies to demonstrate its efficacy in countering deception at a much lower cost. We show that Bayesian learning can significantly reduce the impact of initial belief manipulation and result in a win-win situation in some cases. An increase in the pursuer's maneuverability improves her control performance under deception yet has a marginal effect. We also find that applying deception to counter deception is not always effective; e.g., it can be beneficial for a less maneuverable pursuer to disguise herself as a more maneuverable pursuer, but not vice versa.
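As a schematic illustration of the receding-horizon idea described above (not the paper's algorithm), the following toy sketch controls a scalar system whose drift depends on an unknown binary type: each stage it re-plans a certainty-equivalent action from its current belief and then Bayes-updates the belief from the observed state. The dynamics, parameters, and all names are illustrative assumptions.

```python
import math
import random

def receding_horizon_demo(theta_true=1.0, K=20, sigma=0.5, seed=0):
    """Toy receding-horizon control under type uncertainty (illustrative).

    Hypothetical scalar dynamics: x[k+1] = x[k] + u[k] + theta + w[k],
    with w ~ N(0, sigma^2) and an unknown binary type theta in {-1, +1}.
    Each stage, the controller re-plans a certainty-equivalent action from
    its current belief b = Pr(theta = +1), then Bayes-updates b from the
    observed next state (intrinsic information).
    """
    rng = random.Random(seed)
    x, b = 5.0, 0.5                      # initial state, uninformative belief
    traj = [x]
    for _ in range(K):
        theta_hat = b * 1.0 + (1.0 - b) * (-1.0)   # belief-averaged drift
        u = -x - theta_hat                          # steer E[x_next] to 0
        x_next = x + u + theta_true + rng.gauss(0.0, sigma)
        # Bayesian update: compare predictions under each candidate type.
        def lik(theta):
            z = x_next - (x + u + theta)
            return math.exp(-z * z / (2.0 * sigma ** 2))
        b = b * lik(1.0) / (b * lik(1.0) + (1.0 - b) * lik(-1.0))
        x = x_next
        traj.append(x)
    return traj, b
```

With theta_true = +1, the belief typically concentrates near 1 within a few stages, after which the re-planned actions cancel the true drift.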
The numerical results corroborate that the PoD can exceed 1; i.e., deception among players may not only benefit the deceiver but also the deceivee.

A. Related Works

The secure and efficient operation of robots, autonomous vehicles, and industrial control systems is vital for recent advances in technologies. Many works [8]–[10] have investigated how to protect these systems from various attacks on sensor measurements [11], communication channels [12], and control signals [13], [14]. Deception is a key feature of sophisticated attacks, with a focus on intentionally hiding private information [15], [16], introducing randomness [17], and manipulating other players' beliefs [18], [19]. Deception in robotic systems can be conducted through visual displays [20], facial expressions and body gestures [21], and trajectories [15], [22]. Existing works on robot deception are largely based on experimental approaches [15], [23], [24]. There is a need for a formal and quantitative framework to assess the deception impact, understand the fundamental limits and tradeoffs of deception, and determine real-time strategies. Compared to the theoretical works on deceptive path planning and goal recognition [25], [26], which focus on identifying the true target behind deception, our work further determines optimal and cost-effective control policies to counteract deception and physically protect the true target; e.g., the pursuer adopts the action sequence of minimum cost to reach and protect the true beachhead selected by the evader. Compared to control-theoretic deception frameworks based on Markov decision processes [17], [18] and stochastic games [27], we adopt a state-space representation to better characterize the physical dynamics of robots and autonomous vehicles.
Game models such as hypergames [28], dynamic Bayesian games [16], partially observable stochastic games [19], [29], and games that involve signaling mechanisms [30], [31] have been adopted as natural analytic paradigms to understand deception between intelligent players. The computation of equilibrium solutions for dynamic games of incomplete information, especially ones with non-classical information structures [32], is often a challenging task. Previous works have adopted conjugate prior assumptions to simplify the Bayesian update and decouple the forward type estimation and backward action optimization under a finite state space and a continuous type space [33], [34]. To solve the coupling between players' belief dynamics and the multi-agent optimal control problem in the context of robotic systems, where states are continuous and constrained by physical dynamics with noises, we adopt a receding-horizon approach to compute PBNE, which yields computationally tractable online strategies for the players. Similar receding-horizon approaches have been used in other contexts, including cyber-physical systems [35], military air operations [36], and autonomous racing [37].

B. Notations and Organization of the Paper

A calligraphic letter A defines a set and |A| represents its cardinality. Define B \ A as the set of elements in B but not in A. The Euclidean norm of a vector x is represented by ||x||_2. Let E_{a∼A}[f(a)] denote the expectation of f(a) over a random variable a whose probability distribution is A. Let ′ represent the matrix transpose and Diag[a_1, ..., a_N] represent a block diagonal matrix with possibly non-square matrices a_i, i ∈ N, on its diagonal.
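The operator Diag[a_1, ..., a_N] with possibly non-square blocks can be realized in a few lines of NumPy (equivalently, scipy.linalg.block_diag); a minimal sketch:

```python
import numpy as np

def blkdiag(*blocks):
    """Diag[a1, ..., aN]: stack possibly non-square blocks on the diagonal,
    zero-filling every off-diagonal position."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

a1 = np.array([[1.0, 2.0, 3.0]])   # 1 x 3 block
a2 = np.array([[4.0], [5.0]])      # 2 x 1 block
D = blkdiag(a1, a2)                # 3 x 4, zeros off the diagonal blocks
```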
Define {a_i}_{i∈N} := {a_1, ..., a_N} as a set of N elements, [a_i]_{i∈N} := [a_1, ..., a_N] as N block matrices with the same number of rows arranged in one row vector, and [a_1; ...; a_N] = [a_1, ..., a_N]′ as N block matrices with the same number of columns arranged in one column vector. Let I_r and 0_{m,n} be the r × r identity matrix and the m × n zero matrix, respectively. The superscript k ∈ K is the stage index and the subscript i ∈ N is the player index. We omit a function's arguments when there is no ambiguity, e.g., S_i^k := S_i^k(β_i^k, θ_i). A piece of information for a group of players is called common knowledge if all players know it, all players know that all players know it, and so on ad infinitum. We summarize the main notations in Table I.

The rest of the paper is organized as follows. Section II introduces the dynamic game of incomplete information and the solution concept of PBNE. To obtain explicit and practical solutions, we consider a class of linear-quadratic problems in Section III and obtain a set of extended Riccati equations. We present a case study of deceptive pursuit-evasion in Section IV, and Section V concludes the paper.

II. DYNAMIC GAME WITH PRIVATE TYPES

We model deception as a K-stage game consisting of N robots as players, where each robot has asymmetric information. Let N := {1, ..., N} be the set of N players and K := {0, 1, 2, ..., K} be the set of K discrete stages. The private information of player i ∈ N, i.e., his type θ_i, is modeled as the realization of a discrete random variable with a finite support Θ_i := {θ_i^1, θ_i^2, ..., θ_i^{N_i}} and a prior probability distribution Ξ_i(·). Hence, N_i is the number of possible types for player i, and Ξ_i(θ_i) is the probability that player i's type is θ_i.

TABLE I: Summary of variables and their meanings.

N := {1, 2, ..., N} | Set of N players in the dynamic game
K := {0, 1, 2, ..., K} | Set of K discrete stages in the dynamic game
Θ_i := {θ_i^1, θ_i^2, ..., θ_i^{N_i}} | Set of N_i possible types for player i ∈ N
θ_i ∈ Θ_i | Type of player i ∈ N
θ := [θ_1, ..., θ_N] | N players' joint type
Θ_{-i} := ∏_{j∈N\{i}} Θ_j | Set of types of all players except for player i
θ_{-i} := [θ_j]_{j∈N\{i}} ∈ Θ_{-i} | Types of all players except for player i
Δ(Θ_{-i}) | Set of probability distributions over set Θ_{-i}
Ξ_i(·) | Probability distribution of player i's type
Ξ = [Ξ_i]_{i∈N} | Probability distribution of the joint type θ
Ξ_w(·) | Probability distribution of noise w^k, ∀k ∈ K
x^k ∈ R^{n×1} | System state of dimension n at stage k
x_i^k ∈ R^{n_i×1} | Player i's state of dimension n_i at stage k
[x̂_i^k(θ_i)]_{k∈K} | Reference trajectory for player i of type θ_i
β_i^k ∈ Λ_i ⊆ [0,1]^{|Θ_{-i}|×|Θ_i|} | Player i's belief state at stage k
β^k = [β_i^k]_{i∈N} ∈ Λ | N players' joint belief state at stage k
h^k := [x^0, ..., x^k] ∈ H^k | State history
f^k | State transition function at stage k
Γ_i^k | Player i's belief transition function at stage k
g_i^k | Player i's cost function at stage k
V_i^k(β^k, x^k, θ_i) | Player i's PBNE cost
V̄_i^k(x^k, θ) | Player i's PBNE cost when all players' types are common knowledge
u_i^k ∈ R^{m_i×1} | Player i's action of dimension m_i at stage k
u^k := [u_1^k, ..., u_N^k] | N players' joint action at stage k
u_i^{k_0:K} := [u_i^{k_0}, ..., u_i^K] | Player i's action sequence from stage k_0 to K
u^{k_0:K} := [u_i^{k_0:K}, u_{-i}^{k_0:K}] | Player i's and all other players' control sequences from stage k_0 to K
l_i^k(θ_{-i} | h^k, θ_i) | Player i's belief at stage k, i.e., the probability of other players' types being θ_{-i} based on player i's available information of h^k, θ_i
Define the shorthand notation Ξ := [Ξ_i]_{i∈N} and let Θ_{-i} := ∏_{j∈N\{i}} Θ_j be the set of types of all players except for player i ∈ N. Each player i knows the value of his own type θ_i, but does not know the values of other players' types θ_{-i} := [θ_j]_{j∈N\{i}} ∈ Θ_{-i} throughout the K stages of the game. The system state dynamics under the N players' joint action u^k := [u_1^k, ..., u_N^k], joint type θ := [θ_1, ..., θ_N], and an additive external noise w^k ∈ R^{n×1} are shown in (1):

x^{k+1} = f^k(x^k, u_1^k, ..., u_N^k, θ_1, ..., θ_N) + w^k,  k ∈ K \ {K}.  (1)

The dynamics in (1) can have different interpretations based on applications. In the pursuit-evasion scenario as in [5], x_i^k ∈ R^{n_i×1} represents robot i's local state such as its location and speed. The system state x^k ∈ R^{n×1} can be explicitly represented by the N robots' joint state [x_1^k, ..., x_N^k] with n = ∑_{i=1}^N n_i. In the application where N robots cooperatively transport a payload, e.g., [6], [38], the system state x^k ∈ R^{n×1} represents the payload's location and posture, which does not explicitly relate to the robots' local states. The noise sequence [w^k]_{k∈K} is assumed to be independent with probability density function Ξ_w(·), i.e., E_{w^k, w^h ∼ Ξ_w}[w^k (w^h)′] = 0, ∀k ∈ K, h ∈ K \ {k}. The noise is not necessarily Gaussian distributed but is assumed to have a zero mean, i.e., E_{w^k ∼ Ξ_w}[w^k] = 0, ∀k ∈ K. We assume that the system dynamics (1) are multi-agent controllable as defined in Definition 1 so that players can design their deceptive actions to reach the entire state space in finite stages.

Definition 1 (Multi-Agent Controllability).
The system dynamics (1) are called multi-agent controllable if, for any target state x^k ∈ R^{n×1} at stage k ∈ K \ {0}, initial state x^0 ∈ R^{n×1}, and joint type θ ∈ Θ, there exists a sequence of finite joint actions u^{0:k} that drives the system state from x^0 to x^k in expectation.

A. Forward Belief Dynamics

At each stage k ∈ K, the information available to player i comprises all players' state history h^k := [x^0, ..., x^k] ∈ H^k as well as his own type value θ_i. Define Δ(Θ_{-i}) as the set of probability distributions over the set Θ_{-i}. Each player i at stage k forms a belief l_i^k : H^k × Θ_i ↦ Δ(Θ_{-i}) based on his available information. Thus, l_i^k(· | h^k, θ_i) is a probability measure of other players' types, i.e., ∑_{θ_{-i} ∈ Θ_{-i}} l_i^k(θ_{-i} | h^k, θ_i) = 1, ∀h^k ∈ H^k, θ_i ∈ Θ_i. Define the vector β_i^k := [l_i^k(θ_{-i} | h^k, θ_i^1), l_i^k(θ_{-i} | h^k, θ_i^2), ..., l_i^k(θ_{-i} | h^k, θ_i^{N_i})]_{θ_{-i} ∈ Θ_{-i}} as player i's belief state at stage k ∈ K. We assume that the set of belief states is independent of the stage, i.e., β_i^k ∈ Λ_i ⊆ [0,1]^{|Θ_{-i}|×|Θ_i|}. Then, we can represent player i's belief dynamics as

β_i^{k+1} := Γ_i^k(β_i^k, u^k, w^k, θ_i),  ∀k ∈ {0, ..., K−1}.  (2)

Note that the belief transition function Γ_i^k can be different for each i and k; i.e., players' belief updates can be heterogeneous and time-varying. Define β^k := [β_i^k]_{i∈N} ∈ Λ := ∏_{i∈N} Λ_i. In this work, we assume that the initial beliefs of all players of all types β^0 and the belief update rules Γ_i^k, ∀i ∈ N, ∀k ∈ {0, ..., K−1}, are common knowledge. In the next two subsections, we provide two specific forms of Γ_i^k that rely on intrinsic and extrinsic information, respectively.

1) Bayesian Belief Dynamics: The most common belief update rule Γ_i^k in (2) for player i at stage k+1 uses Bayesian inference.
Given the knowledge of the sequential state observations x^k, x^{k+1} and all players' actions u^k, each player i of type θ_i ∈ Θ_i at stage k+1 can update his belief as follows: ∀θ_{-i} ∈ Θ_{-i},

l_i^{k+1}(θ_{-i} | h^{k+1}, θ_i) = l_i^k(θ_{-i} | h^k, θ_i) Pr(x^{k+1} | θ_{-i}, x^k, θ_i) / ∑_{θ̄_{-i} ∈ Θ_{-i}} l_i^k(θ̄_{-i} | h^k, θ_i) Pr(x^{k+1} | θ̄_{-i}, x^k, θ_i).  (3)

In (3), we use the Markov property, i.e., Pr(x^{k+1} | θ_{-i}, h^k, θ_i) = Pr(x^{k+1} | θ_{-i}, x^k, θ_i) = Ξ_w(x^{k+1} − f^k(x^k, u^k, θ)). The denominator is positive as w^k ∈ R^{n×1}.

Remark 1 (Actions Reveal Type Information). Even if the state dynamics f^k in (1) are independent of θ_j, ∀j ∈ N \ {i}, player i ∈ N can still learn player j's type via (3) as player j's action u_j^k is a function¹ of his type θ_j.

2) Markov-Chain Belief Dynamics: In Section II-A1, we assume that players can exploit the intrinsic information of the state dynamics f^k, the state observations x^k, x^{k+1}, and the prediction of all players' actions u^k. Since the above intrinsic information may not be available in practice, we consider belief dynamics with extrinsic information in this subsection. In particular, we assume that each player i's belief dynamics β_i^{k+1} := Γ_i^k(β_i^k, w^k, θ_i), ∀k ∈ {0, ..., K−1}, are a discrete-time Markov chain where the extrinsic information at stage k is characterized by the transition function Γ_i^k(·, w^k, θ_i). Note that the transition function only characterizes how players update their beliefs at each stage and does not guarantee that a player can learn the true types of others.

¹ Each player's action is a function of his type as his cost is related to his type and the action aims to minimize his cost.
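For concreteness, one step of the Bayesian rule (3) can be sketched as follows for a scalar system. The paper only requires a zero-mean noise density Ξ_w; the Gaussian form, the drift dynamics, and all names below are illustrative assumptions.

```python
import math

def bayes_update(belief, x_next, x, u, f, sigma):
    """One step of the belief update (3) for a scalar system.

    belief : dict mapping candidate types theta_-i -> current probability
    f      : noise-free dynamics f(x, u, theta) as in (1) (assumed known)
    sigma  : std. dev. of the zero-mean Gaussian noise w^k (assumed form)
    """
    # Likelihood of the observation under each candidate type.
    lik = {th: math.exp(-(x_next - f(x, u, th)) ** 2 / (2.0 * sigma ** 2))
           for th in belief}
    z = sum(belief[th] * lik[th] for th in belief)   # denominator of (3)
    return {th: belief[th] * lik[th] / z for th in belief}

# Hypothetical example: the deceiver's type shifts the state drift.
f = lambda x, u, th: x + u + th
b = {-1.0: 0.5, +1.0: 0.5}                # uninformative prior belief
b = bayes_update(b, x_next=1.8, x=0.0, u=1.0, f=f, sigma=0.5)
# The observation 1.8 is far closer to the theta = +1 prediction (2.0)
# than to the theta = -1 prediction (0.0), so the belief shifts to +1.
```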
The following example illustrates a class of players whose belief dynamics exhibit confirmation bias [39], where players tend to ignore intrinsic evidence such as u^k and preserve their belief update rules Γ_i^k at each stage k.

Example 1. Consider a two-person game N = 2 where the first player has two types, N_1 = 2, Θ_1 = {θ_1^1, θ_1^2}, and the second player has only one type, N_2 = 1, Θ_2 = {θ_2^1}. The second player's belief state β_2^k = [l_2^k(θ_1^1 | θ_2^1), l_2^k(θ_1^2 | θ_2^1)] toward the first player's type belongs to a finite set Λ_2 = {[0.2, 0.8], [0.5, 0.5], [0.8, 0.2]}. The transition function Γ_2^k is independent of k: if the current belief state is [0.5, 0.5], then the belief at the next stage is [0.2, 0.8], [0.5, 0.5], or [0.8, 0.2] with probability 0.4, 0.2, and 0.4, respectively. If the current belief state is [0.8, 0.2] (resp. [0.2, 0.8]), then the belief at the next stage is [0.8, 0.2] (resp. [0.2, 0.8]) or [0.5, 0.5] with probability 0.9 and 0.1, respectively.

The above transition function Γ_2^k means that the second player tends to interpret the extrinsic information about the first player's type based on his current belief. If the second player already believes that the first player is of type θ_1^1 with a high probability of 0.8 at stage k, i.e., β_2^k = [0.8, 0.2], then the second player is more inclined to reinforce his current belief; i.e., his belief state at the next stage, β_2^{k+1}, will remain [0.8, 0.2] with a high probability of 0.9. The above transition function represents the phenomena of attitude polarization and confirmation bias, where players preserve their existing beliefs and the disagreement becomes more extreme at each stage even when players are exposed to the same evidence.

B. Nonzero-Sum Cost Function and Equilibrium Concept

At a non-terminal stage k ∈ K \ {K}, player i's cost function is g_i^k : R^{n×1} × ∏_{j=1}^N R^{m_j×1} × Θ_i ↦ R. The final-stage cost is g_i^K : R^{n×1} × Θ_i ↦ R. Define u_i^{k_0:K−1} := [u_i^{k_0}, ..., u_i^{K−1}] as player i's action sequence from stage k_0 to K−1, and u^{k_0:K−1} := [u_i^{k_0:K−1}, u_{-i}^{k_0:K−1}] as player i's and all other players' action sequences from stage k_0 to K−1. Player i's expected cumulative cost from an arbitrary initial stage k_0 ∈ K to the terminal stage K is defined as

J_i^{k_0}(l_i^{k_0:K−1}, u^{k_0:K−1}, x^{k_0}, θ_i) = E_{w^{K−1} ∼ Ξ_w}[g_i^K(x^K, θ_i)] + ∑_{k=k_0}^{K−1} E_{w^{k−1} ∼ Ξ_w}[ E_{θ_{-i} ∼ l_i^k}[g_i^k(x^k, u^k, θ_i)] ].  (4)

The expectations are taken first over the external noise sequence w^k and then over the other players' internal type uncertainty. We cannot exchange the order of these two expectations as l_i^k is a function of w^{k−1}. Each player i at stage k_0 ∈ K aims to minimize J_i^{k_0} by choosing only his own action sequence u_i^{k_0:K−1}, not the other players' action sequence u_{-i}^{k_0:K−1}. The definition of sequential rationality in Definition 2 guarantees that each player i has no motivation to deviate from the sequentially rational action at any stage k ∈ {k_0, ..., K−1} during the interaction if all other players adopt the sequentially rational actions.

Definition 2 (Sequential Rationality).
An action sequence u^{*,k_0:K−1} := {u_i^{*,k_0:K−1}, u_{-i}^{*,k_0:K−1}} is called sequentially rational for player i under the belief sequence l_i^{k_0:K−1}, state x^{k_0}, and type θ_i if, for any state x^k at stage k ∈ {k_0, ..., K−1}, player i does not benefit from taking any other action sequence u_i^{k:K−1}, i.e., J_i^k(l_i^{k:K−1}, u_i^{*,k:K−1}, u_{-i}^{*,k:K−1}, x^k, θ_i) ≤ J_i^k(l_i^{k:K−1}, u_i^{k:K−1}, u_{-i}^{*,k:K−1}, x^k, θ_i), ∀u_i^{k:K−1}.

Since players' actions may affect their future beliefs, as captured by the belief dynamics Γ_i^k in (2), we further require the equilibrium action u^{*,k_0:K−1} in Definition 2 to be consistent with the belief dynamics, which leads to the following definition of Perfect Bayesian Nash Equilibrium (PBNE).

Definition 3 (Perfect Bayesian Nash Equilibrium). Consider the N-player dynamic game of private types and asymmetric information defined by the state dynamics (1) and the expected cumulative cost (4). The action sequence u^{*,0:K−1} := {u_i^{*,0:K−1}, u_{-i}^{*,0:K−1}} of the N players over K stages comprises the Perfect Bayesian Nash Equilibrium (PBNE) if, regardless of each player i's type θ_i ∈ Θ_i, the following statements hold.
1) Sequential rationality: u^{*,0:K−1} is sequentially rational for each player i ∈ N under his belief sequence l_i^{*,0:K−1};
2) Belief consistency: each player i's belief sequence l_i^{*,0:K−1} is consistent with (2) under u^{*,0:K−1}.

Proposition 1. It is sufficient to represent player i's equilibrium cost J_i^k(l_i^{*,k:K−1}, u^{*,k:K−1}, x^k, θ_i) under the PBNE action u^{*,k:K−1} at stage k ∈ K as a function of β^k, x^k, and θ_i, which is defined as V_i^k(β^k, x^k, θ_i). Under the boundary condition V_i^K(β^K, x^K, θ_i) := g_i^K(x^K, θ_i), the following holds for all k ∈ {0, ..., K−1} and all x^k ∈ R^{n×1}, β^k ∈ Λ:

V_i^k(β^k, x^k, θ_i) = min_{u_i^k} ∑_{θ_{-i}} l_i^k(θ_{-i} | h^k, θ_i) { g_i^k(x^k, u^k, θ_i) + E_{w^k ∼ Ξ_w}[V_i^{k+1}(β^{k+1}, x^{k+1}, θ_i)] },  ∀θ_i ∈ Θ_i, ∀i ∈ N,  (5)

where β^{k+1} and x^{k+1} satisfy (2) and (1), respectively.

Proof. According to the definition of PBNE, at the second-to-last stage k = K−1, each player i's equilibrium action u_i^{*,k} = arg min_{u_i^k} E_{θ_{-i} ∼ l_i^k}[g_i^k(x^k, u^k, θ_i)] + E_{w^k ∼ Ξ_w}[g_i^K(x^K, θ_i)] is in general a function of θ_i, x^k, l_i^{*,k}, u_{-i}^{*,k}. Due to the coupling between u_i^{*,k} and u_{-i}^{*,k}, we need to solve a set of system equations for all i ∈ N and θ_i ∈ Θ_i. Then, u_i^{*,k} is a function of β^k, x^k, θ_i, and we obtain (5) at stage k = K−1. We can repeat the above procedure from k = K−2 to k = 0 to obtain the recursive form in (5).

Proposition 1 characterizes the structure of the equilibrium action u_i^{*,k} and the equilibrium cost V_i^k(β^k, x^k, θ_i) for each player i of type θ_i under the solution concept of PBNE; i.e., both are feedback functions of the belief state β^k, the physical state x^k, and the player's type θ_i. Although J_i^k is a function of the beliefs l_i^{k:K−1} over all the remaining stages, V_i^k(β^k, x^k, θ_i) depends only on the belief state at the current stage k. If all players' types are common knowledge, PBNE still applies, and we can define a new function V̄_i^k(x^k, θ) to represent the resulting equilibrium cost V_i^k(β^k, x^k, θ_i) for all k ∈ K without loss of generality.

C. Offline Evaluation of Equilibrium Cost

If each player i's initial belief conforms to the prior distribution of other players' types, i.e., l_i^0(θ_j | x^0, θ_i) = Ξ_j(θ_j), ∀θ_i ∈ Θ_i, j ∈ N, θ_j ∈ Θ_j, ∀x^0, then each player i at system state x^0 with belief state β^0 can use his expected equilibrium cost E_{θ_i ∼ Ξ_i}[V_i^0(β^0, x^0, θ_i)] over his type uncertainty Ξ_i as an offline performance measure of the equilibrium action u^{*,0:K}. As a comparison, player i's expected equilibrium cost E_{θ ∼ Ξ}[V̄_i^0(x^0, θ)] under the complete-information game serves as a benchmark. Note that player i does not need to know the realization of the joint type θ to compute E_{θ ∼ Ξ}[V̄_i^0(x^0, θ)]. Due to the coupling in dynamics, costs, and cognition among the N players, obtaining more information and knowing the type of another player j ∈ N \ {i} may not always improve player i's performance; i.e., there is no guarantee that E_{θ_i ∼ Ξ_i}[V_i^0(β^0, x^0, θ_i)] ≥ E_{θ ∼ Ξ}[V̄_i^0(x^0, θ)].

Besides the above performance evaluation for an individual player i ∈ N under deception, we may also aim to evaluate the overall performance of multiple players or all N players. We define the Price of Deception (PoD) in Definition 4 with a set of coefficients η_i ∈ [0,1], ∀i ∈ N, ∑_{i∈N} η_i = 1. Since the equilibrium cost can be negative, we let η_0(Ξ) := −min(0, {E_{θ_i ∼ Ξ_i}[V_i^0(β^0, x^0, θ_i)]}_{i∈N}, {E_{θ ∼ Ξ}[V̄_i^0(x^0, θ)]}_{i∈N}) be the normalizing constant that guarantees that p_η(Ξ) is non-negative for all chosen coefficients η_i, i ∈ N.

Definition 4 (Price of Deception).
For a given set of coefficients $\eta := \{\eta_i\}_{i \in \mathcal{N} \cup \{0\}}$, the Price of Deception (PoD) of the $N$-player $K$-stage game defined by (1), (4), and (2) under the prior probability distribution $\Xi = [\Xi_i]_{i \in \mathcal{N}}$ is
$$p_\eta(\Xi) := \frac{\sum_{i \in \mathcal{N}} \eta_i \mathbb{E}_{\theta \sim \Xi}[\bar{V}_i^0(x^0, \theta)] + \eta_0(\Xi)}{\sum_{i \in \mathcal{N}} \eta_i \mathbb{E}_{\theta_i \sim \Xi_i}[V_i^0(\beta^0, x^0, \theta_i)] + \eta_0(\Xi)} \in [0, \infty).$$

The PoD is a crucial evaluation and design metric. We can endow the PoD with different meanings by properly choosing the weighting coefficients $\eta_i, i \in \mathcal{N}$. For example, suppose that besides the $N$ players, there is a central planner who aims to minimize the total cost of all $N$ players under their deceptive interaction. Then we can pick $\eta_i = 1/N, i \in \mathcal{N}$, to represent the overall system performance. Although the central planner cannot control the players' state dynamics, costs, and belief dynamics directly, he can still affect their deceptive interaction if he can design the prior probability distribution $\Xi$ of the joint type $\theta$. If the central planner instead only aims to reduce the cost of one player $j \in \mathcal{N}$, then we can pick $\eta_j = 1$ and $\eta_h = 0, \forall h \in \mathcal{N} \setminus \{j\}$. With a given weighting parameter $\eta$, a larger value of $p_\eta(\Xi)$ indicates a better accomplishment of the above goals. Note that individual deception may improve the system performance, i.e., $p_\eta(\Xi) > 1$.

III. LINEAR-QUADRATIC SPECIFICATION

The Linear-Quadratic (LQ) game is an important class of dynamic games. LQ games can also be applied iteratively to approximate nonlinear stochastic systems with general cost functions and obtain equilibrium actions [40]. In the following sections, we consider the linear state dynamics
$$f^k(x^k, u^k, \theta) := A^k(\theta) x^k + \sum_{i=1}^N B_i^k(\theta_i) u_i^k, \tag{6}$$
with stage-varying matrices $A^k(\theta) \in \mathbb{R}^{n \times n}$, $B_i^k(\theta_i) \in \mathbb{R}^{n \times m_i}$.

Remark 2.
System (6) is multi-agent controllable if and only if the matrices $H_i^k(\theta) := [B_i^{k-1}(\theta_i), \cdots, \prod_{h=2}^{k-1} A^h(\theta) B_i^1(\theta_i), \prod_{h=1}^{k-1} A^h(\theta) B_i^0(\theta_i)], \forall i \in \mathcal{N}, \forall \theta \in \Theta, \forall k \in \mathcal{K}$, are of full rank, since the noise $w^k$ has zero mean and we can obtain $\mathbb{E}[x^k] = \prod_{h=0}^{k-1} A^h(\theta) x^0 + \sum_{r=1}^N H_r^k(\theta)[u_r^{k-1}; \cdots; u_r^0]$ by induction.

Each player $i$'s cost is quadratic in both $x^k$ and $u^k$; i.e.,
$$g_i^k(x^k, u^k, \theta_i) = (x^k - \hat{x}_i^k(\theta_i))' D_i^k(\theta_i)(x^k - \hat{x}_i^k(\theta_i)) + \hat{f}_i^k(\hat{x}_i^k(\theta_i)) + \sum_{j=1}^N (u_j^k)' F_{ij}^k(\theta_i) u_j^k, \quad \forall k \in \mathcal{K}, \tag{7}$$
where $[\hat{x}_i^k(\theta_i)]_{k \in \mathcal{K}}$ is a known type-dependent reference trajectory for player $i \in \mathcal{N}$ and $\hat{f}_i^k$ is a known function of $\hat{x}_i^k(\theta_i)$. The cost matrices $D_i^k(\theta_i) \in \mathbb{R}^{n \times n}$, $F_{ij}^k(\theta_i) \in \mathbb{R}^{m_j \times m_j}, \forall i, j \in \mathcal{N}, k \in \mathcal{K}$, are symmetric. At the final stage, $F_{ij}^K(\theta_i) \equiv 0_{m_j, m_j}, \forall i, j \in \mathcal{N}, \forall \theta_i \in \Theta_i$. We introduce the following three sets of notations for the belief matrix, the extended Riccati equations, and the matrix-form equilibrium action, respectively.

a) Belief Matrix: With a slight abuse of notation, we define the marginal probability $l_i^k(\theta_j \mid h^k, \theta_i) := \sum_{\theta_r \in \Theta_r, r \in \mathcal{N} \setminus \{i,j\}} l_i^k(\theta_{-i} \mid h^k, \theta_i), \forall j \in \mathcal{N} \setminus \{i\}$, as player $i$'s belief about player $j$'s type at stage $k$. Define the belief matrix for all $i \in \mathcal{N}, j \in \mathcal{N} \setminus \{i\}, k \in \{0, \cdots, K-1\}$, as
$$L_{ij}^k := \begin{bmatrix} L_i^k(\theta_j^1 \mid h^k, \theta_i^1) & \cdots & L_i^k(\theta_j^{N_j} \mid h^k, \theta_i^1) \\ L_i^k(\theta_j^1 \mid h^k, \theta_i^2) & \cdots & L_i^k(\theta_j^{N_j} \mid h^k, \theta_i^2) \\ \vdots & \ddots & \vdots \\ L_i^k(\theta_j^1 \mid h^k, \theta_i^{N_i}) & \cdots & L_i^k(\theta_j^{N_j} \mid h^k, \theta_i^{N_i}) \end{bmatrix}, \tag{8}$$
where each block element $L_i^k(\theta_j^r \mid h^k, \theta_i^h) = \mathrm{Diag}[l_i^k(\theta_j^r \mid h^k, \theta_i^h), \cdots, l_i^k(\theta_j^r \mid h^k, \theta_i^h)] \in \mathbb{R}^{n \times n}, \forall r \in \{1, \cdots, N_j\}, \forall h \in \{1, \cdots, N_i\}$.
Since all its elements are positive and all rows sum to one, the belief matrix $L_{ij}^k$ is a right stochastic matrix.

b) Extended Riccati Equations: Let a sequence of symmetric matrices $S_i^k(\beta^k, \theta_i) \in \mathbb{R}^{n \times n}$, vectors $N_i^k(\beta^k, \theta_i) \in \mathbb{R}^{n \times 1}$, and scalars $q_i^k(\beta^k, \theta_i) \in \mathbb{R}$ satisfy the following extended Riccati equations for all $\beta^k \in \Lambda$, $i \in \mathcal{N}$, $\theta_i \in \Theta_i$, $k \in \{0, \cdots, K-1\}$:
$$S_i^k = D_i^k + \mathbb{E}_{\theta_{-i} \sim l_i^k}\Big[\Big(A^k + \sum_{j=1}^N B_j^k \Psi_j^{1,k}\Big)' \mathbb{E}_{w^k \sim \Xi_w}[S_i^{k+1}] \Big(A^k + \sum_{j=1}^N B_j^k \Psi_j^{1,k}\Big) + \sum_{j=1}^N (\Psi_j^{1,k})' F_{ij}^k \Psi_j^{1,k}\Big], \tag{9}$$
$$N_i^k = -2 D_i^k \hat{x}_i^k + \mathbb{E}_{\theta_{-i} \sim l_i^k}\Big[\Big(\sum_{j=1}^N B_j^k \Psi_j^{1,k} + A^k\Big)' \Big(\mathbb{E}_{w^k \sim \Xi_w}[N_i^{k+1}] + 2\,\mathbb{E}_{w^k \sim \Xi_w}[S_i^{k+1}] \sum_{j=1}^N B_j^k \Psi_j^{2,k}\Big) + 2 \sum_{j=1}^N (\Psi_j^{1,k})' F_{ij}^k \Psi_j^{2,k}\Big], \tag{10}$$
$$q_i^k = (\hat{x}_i^k)' D_i^k \hat{x}_i^k + \hat{f}_i^k(\hat{x}_i^k) + \mathbb{E}_{w^k \sim \Xi_w}[(w^k)' S_i^{k+1} w^k + q_i^{k+1}] + \mathbb{E}_{\theta_{-i} \sim l_i^k}\Big[\Big(\sum_{j=1}^N B_j^k \Psi_j^{2,k}\Big)' \mathbb{E}_{w^k \sim \Xi_w}[S_i^{k+1}] \sum_{j=1}^N B_j^k \Psi_j^{2,k} + \Big(\sum_{j=1}^N B_j^k \Psi_j^{2,k}\Big)' \mathbb{E}_{w^k \sim \Xi_w}[N_i^{k+1}] + \sum_{j=1}^N (\Psi_j^{2,k})' F_{ij}^k \Psi_j^{2,k}\Big], \tag{11}$$
where the functions $\Psi_i^{1,k}, \Psi_i^{2,k}, \forall i \in \mathcal{N}$, are defined below. The boundary conditions of the extended Riccati equations are
$$S_i^K = D_i^K; \quad N_i^K = -2 D_i^K \hat{x}_i^K; \quad q_i^K = (\hat{x}_i^K)' D_i^K \hat{x}_i^K + \hat{f}_i^K(\hat{x}_i^K). \tag{12}$$

c) Equilibrium Action in Matrix Form: We need to represent the equilibrium action of all players under all types in matrix form because each player's action is coupled with the other players' actions under PBNE. Since each player $i$ has different equilibrium actions under different types, with a slight abuse of notation, we write each player $i$'s action as a function of his type $\theta_i$ and define two action vectors $u_i^k := [u_i^k(\theta_i^1), \cdots, u_i^k(\theta_i^{N_i})]' \in \mathbb{R}^{m_i N_i \times 1}$ and $u^k := [u_1^k, u_2^k, \cdots, u_N^k]' \in \mathbb{R}^{\sum_{r=1}^N m_r N_r \times 1}$.
For all $i \in \mathcal{N}$, $l_i^k$, $\theta_i \in \Theta_i$, $k \in \{0, \cdots, K-1\}$, define a series of $m_i$-by-$m_i$ square matrices $R_i^k(\beta^k, \theta_i) := F_{ii}^k(\theta_i) + (B_i^k(\theta_i))' S_i^{k+1}(\beta^k, \theta_i) B_i^k(\theta_i)$. Let $\mathbf{B}_i^k := \mathrm{Diag}[B_i^k(\theta_i^1), \cdots, B_i^k(\theta_i^{N_i})]$ be $(N_i n)$-by-$(N_i m_i)$ block matrices and $\mathbf{S}_i^k(\beta^k) := \mathrm{Diag}[S_i^k(\beta^k, \theta_i^1), \cdots, S_i^k(\beta^k, \theta_i^{N_i})]$ be $(N_i n)$-by-$(N_i n)$ block matrices. Finally, define the parameter matrices $W^{1,k}(\beta^k) = [W_1^{1,k}(\beta^k); \cdots; W_N^{1,k}(\beta^k)] \in \mathbb{R}^{\sum_{r=1}^N m_r N_r \times n}$, $W^{2,k}(\beta^k) = [W_1^{2,k}(\beta^k); \cdots; W_N^{2,k}(\beta^k)] \in \mathbb{R}^{\sum_{r=1}^N m_r N_r \times 1}$, and $W^{0,k}(\beta^k) := [W_{ij}^{0,k}(\beta^k) \in \mathbb{R}^{m_i N_i \times m_j N_j}]_{i,j \in \mathcal{N}}$ for any $\beta^k \in \Lambda$. Their elements are given as follows; i.e., $\forall i \in \mathcal{N}, \forall k \in \{0, \cdots, K-1\}$,
$$W_i^{1,k}(\beta^k) = \big[(B_i^k(\theta_i^1))' S_i^{k+1}(\beta^k, \theta_i^1)\, \mathbb{E}_{\theta_{-i} \sim l_i^k}[A^k(\theta_i^1, \theta_{-i})]; \cdots; (B_i^k(\theta_i^{N_i}))' S_i^{k+1}(\beta^k, \theta_i^{N_i})\, \mathbb{E}_{\theta_{-i} \sim l_i^k}[A^k(\theta_i^{N_i}, \theta_{-i})]\big],$$
$$W_i^{2,k}(\beta^k) = \frac{1}{2}\big[(B_i^k(\theta_i^1))' N_i^{k+1}(\beta^k, \theta_i^1); \cdots; (B_i^k(\theta_i^{N_i}))' N_i^{k+1}(\beta^k, \theta_i^{N_i})\big],$$
$$W_{ii}^{0,k}(\beta^k) = \mathrm{Diag}[R_i^k(\beta^k, \theta_i^1), \cdots, R_i^k(\beta^k, \theta_i^{N_i})], \quad W_{ij}^{0,k}(\beta^k) = (\mathbf{B}_i^k)' \mathbf{S}_i^{k+1}(\beta^k) L_{ij}^k \mathbf{B}_j^k, \ \forall j \in \mathcal{N} \setminus \{i\}.$$
Let the matrix $M_i^k(\beta^k, \theta_i^l) \in \mathbb{R}^{m_i \times \sum_{r=1}^N m_r N_r}$, $l \in \{1, 2, \cdots, N_i\}$, $i \in \mathcal{N}$, $k \in \{0, \cdots, K-1\}$, be the truncated row block, i.e., from row $\sum_{r=1}^{i-1} m_r N_r + m_i(l-1)$ to row $\sum_{r=1}^{i-1} m_r N_r + m_i l$, of the matrix $(-W^{0,k}(\beta^k))^{-1}$. Define the shorthand notations $\Psi_i^{1,k}(\beta^k, \theta_i) := M_i^k(\beta^k, \theta_i) W^{1,k}(\beta^k)$ and $\Psi_i^{2,k}(\beta^k, \theta_i) := M_i^k(\beta^k, \theta_i) W^{2,k}(\beta^k)$.

A.
Extrinsic Belief Dynamics and Extended Riccati Equations

In this section, we focus on the extrinsic belief dynamics where $\Gamma_i^k$ is independent of the players' actions $u^k$ for all $i \in \mathcal{N}, k \in \{0, \cdots, K-1\}$. The proof of Theorem 1 generalizes that of classical LQ games (e.g., Chapters 5.5 and 6.2 in [41]); we further incorporate the players' asymmetric belief dynamics into their objective functions to minimize their expected costs under deception. We apply dynamic programming from stage $K-1$ backward to stage $0$ to obtain a closed-form solution of the PBNE.

Theorem 1. An $N$-player $K$-stage LQ game of incomplete information defined by (6), (7), and extrinsic belief dynamics $\beta_i^{k+1} = \Gamma_i^k(\beta_i^k, w^k, \theta_i), \forall i \in \mathcal{N}, \forall k \in \{0, \cdots, K-1\}$, admits a unique state-feedback PBNE
$$u_i^{*,k}(\beta^k, x^k, \theta_i) = \Psi_i^{1,k}(\beta^k, \theta_i) x^k + \Psi_i^{2,k}(\beta^k, \theta_i), \tag{13}$$
if and only if $R_i^k(\beta^k, \theta_i)$ is positive definite and $W^{0,k}(\beta^k)$ is non-singular for all $\beta^k \in \Lambda$, $i \in \mathcal{N}$, $\theta_i \in \Theta_i$, $k \in \{0, \cdots, K-1\}$. The equilibrium cost $V_i^k$ is quadratic in $x^k$, i.e.,
$$V_i^k(\beta^k, x^k, \theta_i) = q_i^k(\beta^k, \theta_i) + (x^k)' N_i^k(\beta^k, \theta_i) + (x^k)' S_i^k(\beta^k, \theta_i) x^k, \quad \forall i \in \mathcal{N}, k \in \mathcal{K}. \tag{14}$$

Proof. We use backward induction to prove the result. At the final stage $K$, the value function $V_i^K(\beta^K, x^K, \theta_i) = (x^K - \hat{x}_i^K(\theta_i))' D_i^K(\theta_i)(x^K - \hat{x}_i^K(\theta_i)) + \hat{f}_i^K(\hat{x}_i^K(\theta_i))$ is quadratic in $x^K$, and we obtain the boundary conditions for $S_i^K, N_i^K, q_i^K$ in (12) by matching the RHS of (14). At any stage $k \in \{0, \cdots, K-1\}$, if (14) holds at stage $k+1$, we can expand $\mathbb{E}_{w^k \sim \Xi_w}[V_i^{k+1}(\beta^{k+1}, x^{k+1}, \theta_i)]$ by plugging in the state dynamics $x^{k+1} = A^k(\theta) x^k + \sum_{i=1}^N B_i^k(\theta_i) u_i^k + w^k$ and the belief dynamics $\beta_i^{k+1} = \Gamma_i^k(\beta_i^k, w^k, \theta_i)$. Then, the Right-Hand Side (RHS) of (5) is quadratic in $u_i^k$ for each player $i$.
If the coefficient matrix $R_i^k$ of the quadratic form $(u_i^k)' R_i^k u_i^k$ is positive definite, then the first-order necessary conditions for minimization are also sufficient, and we obtain the following unique set of equations for the equilibrium action $u^{*,k}$ by differentiating the RHS of (5) and setting it to zero; i.e., $\forall \theta_i \in \Theta_i$,
$$-R_i^k u_i^{*,k}(\theta_i) = (B_i^k)' S_i^{k+1} \mathbb{E}_{\theta_{-i} \sim l_i^k}[A^k] x^k + \frac{1}{2}(B_i^k)' N_i^{k+1} + (B_i^k)' S_i^{k+1} \sum_{j \neq i} \mathbb{E}_{\theta_j \sim l_i^k}[B_j^k(\theta_j) u_j^{*,k}(\theta_j)], \quad \forall i \in \mathcal{N}. \tag{15}$$
Due to the coupling in players' actions and beliefs, we rewrite (15) in matrix form, i.e., $-W^{0,k}(\beta^k) u^{*,k} = W^{1,k}(\beta^k) x^k + W^{2,k}(\beta^k)$, to solve the set of equations. Given the existence of $(-W^{0,k}(\beta^k))^{-1}$, each player $i$'s equilibrium action is an affine function of $x^k$, i.e., $u_i^{*,k}(\beta^k, x^k, \theta_i) = \Psi_i^{1,k}(\beta^k, \theta_i) x^k + \Psi_i^{2,k}(\beta^k, \theta_i)$. Note that the coefficients $\Psi_i^{1,k}, \Psi_i^{2,k}$ for player $i$ are functions of $\beta^k$, i.e., of the beliefs of all players under all types at stage $k$. Finally, after substituting the equilibrium action $u_i^{*,k}(\beta^k, x^k, \theta_i) = \Psi_i^{1,k}(\beta^k, \theta_i) x^k + \Psi_i^{2,k}(\beta^k, \theta_i)$ into the RHS of (5) and representing $V_i^k$ on the Left-Hand Side (LHS) in its quadratic form of $x^k$, we can match the coefficients of the quadratic, linear, and constant terms on the LHS and RHS to obtain the extended Riccati equations (9), (10), and (11).

Remark 3 (Positive Definiteness). If $D_i^k(\theta_i)$ and $F_{ij}^k(\theta_i), \forall j \in \mathcal{N}$, are positive definite for all $k \in \mathcal{K}$, then $R_i^k(\beta^k, \theta_i)$ is positive definite for all $k \in \mathcal{K}, \beta^k \in \Lambda$, because the linear combination of positive definite matrices in (9) preserves positive definiteness. Note that the above condition is only sufficient, not necessary; i.e., $D_i^k$ and $F_{ij}^k$ need not be positive definite for $R_i^k$ to be positive definite, as shown in Section IV.
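Once $W^{0,k}$, $W^{1,k}$, and $W^{2,k}$ are assembled, solving the coupled system (15) for all players and types reduces to a single linear solve. The following minimal sketch (our illustration, using random placeholder matrices rather than game data) shows the step from $-W^{0,k} u^{*,k} = W^{1,k} x^k + W^{2,k}$ to the affine feedback form (13):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6                 # total action dimension sum_r m_r N_r (placeholder)
n = 4                   # state dimension (placeholder)

# Random stand-ins for W^{0,k}, W^{1,k}, W^{2,k}; the diagonal shift keeps
# W0 non-singular, mirroring the theorem's invertibility assumption.
W0 = rng.normal(size=(dim, dim)) + dim * np.eye(dim)
W1 = rng.normal(size=(dim, n))
W2 = rng.normal(size=(dim, 1))
x = rng.normal(size=(n, 1))

# -W0 u* = W1 x + W2  =>  u* = (-W0)^{-1} (W1 x + W2)
u_star = np.linalg.solve(-W0, W1 @ x + W2)

# Equivalently, u* is affine in x: u* = Psi1 x + Psi2, cf. (13).
Psi1 = np.linalg.solve(-W0, W1)
Psi2 = np.linalg.solve(-W0, W2)
assert np.allclose(u_star, Psi1 @ x + Psi2)
```

Extracting each player's row block of $(-W^{0,k})^{-1}$ then yields the per-type gains $\Psi_i^{1,k}, \Psi_i^{2,k}$.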
Remark 4 (Cognitive Coupling). Compared with classical LQ games (e.g., Chapter 6 in [41]), the deception of players' types results in a unique feature of cognitive coupling represented by the belief matrix in (8); i.e., each player's action hinges not only on his own belief but also on all other players' beliefs, as these beliefs can affect their actions and further the outcome of the interaction. Thus, player $i$ can change other players' actions by manipulating their beliefs of his type $\theta_i$, i.e., $l_j^k, \forall j \in \mathcal{N} \setminus \{i\}$, or by making them believe that his belief $l_i^k$ on their types $\theta_{-i}$ has changed.

We introduce matrix block partitions as follows. For each type $\theta_i \in \Theta_i$, we divide $A^k(\theta), D_i^k(\theta_i), S_i^k(\theta_i)$ into $N$-by-$N$ blocks where the $(i,i)$ block is $A_i^k(\theta), \bar{D}_i^k(\theta_i), \bar{S}_i^k(\theta_i) \in \mathbb{R}^{n_i \times n_i}$, respectively. The $i$-th row block of $N_i^k(\theta_i), \hat{x}_i^k(\theta_i)$ is $\bar{N}_i^k(\theta_i), \bar{x}_i^k(\theta_i) \in \mathbb{R}^{n_i \times 1}$, respectively. The $i$-th row block of $B_i^k(\theta_i)$ is $\bar{B}_i^k(\theta_i) \in \mathbb{R}^{n_i \times m_i}$. When the system state $x^k$ can be represented by the players' joint states $[x_i^k]_{i \in \mathcal{N}}$, Corollary 1 shows that the LQ game of asymmetric information degenerates to an LQ control problem if the players have decoupled costs and state dynamics defined as follows.

Definition 5 (Decoupled Dynamics and Cost). Player $i \in \mathcal{N}$ has decoupled dynamics if for all $k \in \mathcal{K}$, $A_i^k(\theta) = \bar{A}_i^k(\theta_i), \forall \theta \in \Theta$, while all other elements in the $i$-th row block and the $i$-th column block of $A^k(\theta)$ are $0$. Besides, all elements of $B_i^k(\theta_i)$ except for the row block $\bar{B}_i^k(\theta_i)$ are required to be $0$. Player $i \in \mathcal{N}$ has a decoupled cost if for all stages $k \in \mathcal{K}$, $F_{ij}^k(\theta_i) = 0_{m_j, m_j}, \forall \theta_i \in \Theta_i, j \in \mathcal{N} \setminus \{i\}$, and all elements of $D_i^k(\theta_i)$ equal $0$ except for $\bar{D}_i^k(\theta_i)$.

Corollary 1 (Degeneration to LQ Control).
If $x^k = [x_i^k]_{i \in \mathcal{N}}$ for all stages $k \in \mathcal{K}$ and player $i$ has both a decoupled cost and decoupled state dynamics, then his action under PBNE is independent of the other players' actions, types, and beliefs, i.e., $u_i^{*,k} = -(R_i^k)^{-1}(\bar{B}_i^k)' \bar{S}_i^{k+1} A_i^k x_i^k - \frac{1}{2}(R_i^k)^{-1}(\bar{B}_i^k)' \bar{N}_i^{k+1}$, where $R_i^k = F_{ii}^k + (\bar{B}_i^k)' \bar{S}_i^{k+1} \bar{B}_i^k$, $(G_i^k)' = I_n - \bar{S}_i^{k+1} \bar{B}_i^k (R_i^k)^{-1} (\bar{B}_i^k)'$, $\bar{S}_i^k = (A_i^k)' (G_i^k)' \bar{S}_i^{k+1} A_i^k + \bar{D}_i^k$, and $\bar{N}_i^k = (A_i^k)' (G_i^k)' \bar{N}_i^{k+1} - 2 \bar{D}_i^k \bar{x}_i^k$.

Proof. We show by induction that $S_i^k, N_i^k, \forall k \in \mathcal{K}$, satisfy the sparsity condition that only the $(i,i)$ block of $S_i^k$ and the $i$-th row block of $N_i^k$ are nonzero. At stage $K$, $S_i^K = D_i^K$ and $N_i^K = -2 D_i^K \hat{x}_i^K$ satisfy the above condition. At stage $k \in \{0, \cdots, K-1\}$, if $S_i^{k+1}, N_i^{k+1}$ satisfy the sparsity condition, $W^{0,k}(\beta^k)$ becomes a block-diagonal matrix where $W_{ij}^{0,k}(\beta^k) = 0_{m_i N_i, m_j N_j}$ and $M_i^k(\beta^k, \theta_i) = -(R_i^k(\beta^k, \theta_i))^{-1}$ for all $\beta^k \in \Lambda$. Then, $S_i^k, N_i^k$ satisfy the condition based on (9) and (10).

B. Intrinsic Belief Dynamics and Receding-Horizon Control

If there exists a player $i \in \mathcal{N}$ whose belief dynamics $\Gamma_i^k$ depend on intrinsic information at some stage $k \in \{0, \cdots, K-1\}$ as shown in (2), then the equilibrium action $u_i^{*,k}$ is in general a nonlinear function of $x^k$ and the equilibrium cost $V_i^k$ is not quadratic in $x^k$, even under the LQ setting of (6) and (7). Besides the static cognitive coupling among the $N$ players in Remark 4, the intrinsic information of $u^k$ in the belief update introduces another dynamic cognitive coupling between the forward belief dynamics via (2) and the backward equilibrium computation via (5), which makes it challenging to compute the PBNE.
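Under the decoupled conditions of Corollary 1, each player's recursion collapses to a standard single-agent LQ backward pass. The following scalar sketch (our illustration; all parameters are made-up numbers, not values from the paper) runs the recursions $R = F + \bar{B}' \bar{S} \bar{B}$, $(G)' = I - \bar{S}\bar{B}R^{-1}\bar{B}'$, $\bar{S} \leftarrow A'G'\bar{S}A + \bar{D}$, $\bar{N} \leftarrow A'G'\bar{N} - 2\bar{D}\bar{x}$ and simulates the resulting feedback law:

```python
# Scalar version of the decoupled recursions in Corollary 1.
# All parameters below are made-up scalars for illustration only.
K = 5
A, B, D, F, xhat = 1.0, 1.0, 1.0, 0.5, 2.0   # reference point xhat = 2

S, Nv = D, -2.0 * D * xhat                   # boundary: S^K = D^K, N^K = -2 D^K xhat
gains = []
for k in reversed(range(K)):
    R = F + B * S * B
    G = 1.0 - S * B * (1.0 / R) * B          # scalar form of (G^k)'
    # feedback action u* = -(1/R) B S A x - (1/(2R)) B N
    gains.append((-B * S * A / R, -B * Nv / (2.0 * R)))
    S, Nv = A * G * S * A + D, A * G * Nv - 2.0 * D * xhat

x = 0.0
for kx, ku in reversed(gains):               # simulate from stage 0 forward
    x = x + B * (kx * x + ku)
print(round(x, 3))                            # the state is steered toward xhat = 2
```

Because the opponents' types and beliefs drop out, no belief matrix appears anywhere in this degenerate case.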
To reduce the computational complexity and obtain implementable actions, we adopt a receding-horizon approach that, at the current stage $k \in \{0, \cdots, K-1\}$, computes the sequentially rational action sequence for all future stages $u^{*,k:K-1}$ under the assumption $\beta^{\bar{k}} = \beta^k, \forall \bar{k} \in \{k, \ldots, K-1\}$, yet implements only the current-stage action $u^{*,k}$. Then, at the new stage $k+1$, each player observes the new system state $x^{k+1}$, updates the belief to $\beta^{k+1}$, and recomputes the entire action sequence $u^{*,k+1:K-1}$ under the assumption $\beta^{\bar{k}} = \beta^{k+1}, \forall \bar{k} \in \{k+1, \ldots, K-1\}$, yet still implements only the new current-stage action $u^{*,k+1}$. The players repeat the above procedure until they reach the final stage of the interaction.

Compared with the PBNE, which produces an offline plan for all future stages under all possible scenarios before the game takes place, the receding-horizon approach enables an online replanning of the actions at the beginning of each new stage as the interaction continues. Although we assume during the equilibrium-computation phase that the players' beliefs at future stages equal the current beliefs, the players can correct and update their beliefs and actions based on the online observation of $x^k$ during each replanning phase. Thus, the receding-horizon approach provides a reasonable approximation of the PBNE action and is more adaptive to unexpected environmental changes in the state dynamics $f^k$ and the cost structures $g_i^k, \forall i \in \mathcal{N}$.

Under the LQ specification in (6) and (7) and the Bayesian belief dynamics in (3), we summarize the computation phase and the online implementation phase in Algorithms 1 and 2, respectively. To investigate the scalability of our algorithms, we analyze the temporal and spatial complexity with respect to $N$, $K$, and $N_i$. To simplify the notation and enhance readability, we focus on the symmetric setting where $N_i = N_0 \in \mathbb{Z}^+, \forall i \in \mathcal{N}$.
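The replan-then-commit loop above can be sketched compactly. In the following illustration (ours, not the authors' code), `plan`, `step`, and `update_belief` are hypothetical placeholders standing in for the PBNE computation, the state dynamics, and the Bayesian update, respectively:

```python
# Skeleton of the receding-horizon procedure: at each stage, recompute a plan
# under frozen beliefs, apply only its first action, then observe and re-update.
# plan(), step(), and update_belief() are hypothetical placeholders.

def plan(belief, x, horizon):
    # Stand-in for Algorithm 1: returns an action sequence for the remaining
    # stages, computed as if the belief stayed frozen at its current value.
    return [-0.5 * x for _ in range(horizon)]

def step(x, u):
    return x + u                          # placeholder state dynamics

def update_belief(belief, x_new):
    return min(1.0, belief + 0.1)         # placeholder Bayesian update

K, x, belief = 5, 4.0, 0.3
trajectory = [x]
for k in range(K):
    u_seq = plan(belief, x, K - k)        # replan over stages k..K-1
    x = step(x, u_seq[0])                 # implement only the first action
    belief = update_belief(belief, x)     # correct the belief with the new observation
    trajectory.append(x)

print(trajectory)  # -> [4.0, 2.0, 1.0, 0.5, 0.25, 0.125]
```

Only `u_seq[0]` is ever executed; the remaining entries exist solely to make the current action sequentially rational over the residual horizon.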
For each player $i \in \mathcal{N}$ of type $\theta_i \in \Theta_i$ at the beginning of the interaction, i.e., $k = 0$, he needs to store the game parameters $A^0, B_r^0(\theta_r), D_r^0(\theta_r), F_{rh}^0(\theta_r), \forall \theta_r \in \Theta_r$, and the belief matrix $L_{rh}^0$ for all $r, h \in \mathcal{N}$, which are common knowledge. The spatial complexity of storing the game parameters and the belief matrix is $O(N^2 N_0)$ and $O(N^2 N_0^2)$, respectively. Note that, in general, player $i$ has coupled cognition, as shown in Remark 4, and has to keep track of not only his own beliefs $L_{i,j}^k, \forall j \in \mathcal{N}$, but also the other players' beliefs $L_{r,h}^k, \forall r \in \mathcal{N} \setminus \{i\}, h \in \mathcal{N}$, to determine his equilibrium action under deception at each stage $k$. During the $K$-stage interaction, each player $i \in \mathcal{N}$ of type $\theta_i \in \Theta_i$ observes the system state $x^k$ and computes his equilibrium action $u_i^{*,k}(\beta^k, x^k, \theta_i)$ at stage $k$ based on Algorithm 1. After all players implement their equilibrium actions at stage $k$, the system state evolves to $x^{k+1}$. Based on the new state observation $x^{k+1}$, each player $i$ updates the belief matrix in (8) via (3). Since player $i$ can delete the game parameters and the belief matrices of the previous stages, the spatial complexity remains constant as the real-time stage index $k$ increases. Thus, our algorithm can handle interactions of long duration. All players repeat the above procedure, stated in lines 14-17 of Algorithm 2, until reaching the terminal stage $k = K$. The computational complexity of the belief matrix update in line 15 of Algorithm 2 is $O(N N_0^N)$.
Algorithm 1: PBNE computation with $\beta^{\bar{k}} = \beta^k, \forall \bar{k} \in \{k, \ldots, K-1\}$, at stage $k \in \{0, \cdots, K-1\}$ for player $i \in \mathcal{N}$ of type $\theta_i \in \Theta_i$
1 Load the game parameters $A^k, B_r^k(\bar{\theta}_r), D_r^k(\bar{\theta}_r), F_{rh}^k(\bar{\theta}_r), \forall \bar{\theta}_r \in \Theta_r$, and the belief matrix $L_{r,h}^k$ for all $r, h \in \mathcal{N}$;
2 Input the state observation $x^k$;
3 for $\bar{k} \leftarrow K-1$ to $k$ do
4   for $j \leftarrow 1$ to $N$ do
5     for $\theta_j \leftarrow \theta_j^1$ to $\theta_j^{N_j}$ do
6       Compute $S_j^{\bar{k}}, N_j^{\bar{k}}$ via (9), (10) with $\beta^{\bar{k}} = \beta^k$;
7     end
8   end
9 end
10 Return his equilibrium action $u_i^{*,k}(l_i^k, x^k, \theta_i)$ via (13);

Algorithm 2: $K$-stage receding-horizon control for player $i \in \mathcal{N}$ of type $\theta_i \in \Theta_i$
11 Initialize $k = 0$;
12 Store the game parameters $A^k, B_r^k(\bar{\theta}_r), D_r^k(\bar{\theta}_r), F_{rh}^k(\bar{\theta}_r), \forall \bar{\theta}_r \in \Theta_r$, and the belief matrix $L_{r,h}^k$ for all $r, h \in \mathcal{N}$;
13 while $k < K$ do
14   Call Algorithm 1 to implement $u_i^{*,k}(l_i^k, x^k, \theta_i)$;
15   Observe the state $x^{k+1}$ and update all elements of the belief matrix via (3) to obtain $L_{r,h}^{k+1}, \forall r, h \in \mathcal{N}$;
16   Delete $A^k, B_r^k(\bar{\theta}_r), D_r^k(\bar{\theta}_r), F_{rh}^k(\bar{\theta}_r), L_{r,h}^k$ and store $A^{k+1}, B_r^{k+1}(\bar{\theta}_r), D_r^{k+1}(\bar{\theta}_r), F_{rh}^{k+1}(\bar{\theta}_r), L_{r,h}^{k+1}$ for all $\bar{\theta}_r \in \Theta_r$ and for all $r, h \in \mathcal{N}$;
17   Update the stage index $k \leftarrow k+1$;
18 end

For any $\beta^k$, computing the term $W^{0,k}(\beta^k)$ has complexity $O(N N_0^N) + O(N_0^3 N^2)$, determined by the belief matrix update and the matrix chain multiplication of $W_{ij}^{0,k}(\beta^k)$, respectively. Then, the computational complexity of $(W^{0,k}(\beta^k))^{-1}$ and $W^{1,k}(\beta^k)$ is $O(N N_0^N) + O(N_0^3 N^3)$ and $O(N N_0^N) + O(N_0^3 N^2)$, respectively.
Given $\beta^k$ and $\theta_i$, the computational complexity of $S_i^k(\beta^k, \theta_i)$ in (9) is $O(N N_0^N) + O(N_0^3 N^3) + O(N_0^3 N^2) + O(N_0 N) = O(\max(N N_0^N, N_0^3 N^3))$, which hinges on the computational complexity of $M_i^k(\beta^k, \theta_i)$ (or $(W^{0,k}(\beta^k))^{-1}$), $W^{1,k}(\beta^k)$, and the matrix chain multiplication in (9). Similarly, $N_i^k(\beta^k, \theta_i)$ and $W^{2,k}(\beta^k)$ both have computational complexity $O(N N_0^N) + O(N_0 N)$. Therefore, player $i$'s temporal complexity at each stage $k \in \{0, 1, \cdots, K-1\}$ is $O((K-k) \cdot N_0 N \cdot \max(N N_0^N, N_0^3 N^3))$. The temporal complexity attains its maximum value of $O(K \cdot \max\{N_0^{N+1} N^2, N_0^4 N^4\})$ at the initial stage $k = 0$, where each player has to plan over the entire $K$ future stages to act optimally under the deception. Since the temporal complexity decreases as the real-time stage index $k$ increases, a player who can compute the equilibrium action within the required time at the initial stage $k = 0$ is guaranteed to meet the real-time requirement in the following stages of the interaction. If the numbers of types and agents are on the same scale, e.g., $N_0 = N$, then $\lim_{N \to \infty} (N_0^{N+1} N^2)/(N_0^4 N^4) \to \infty$ and the belief matrix update plays the dominant role, as each player keeps track of all players' beliefs to obtain the equilibrium action under deception. If $N_0 \ll N$, e.g., $N_0 = N^{1/N}$, then $\lim_{N \to \infty} (N_0^{N+1} N^2)/(N_0^4 N^4) \to 0$ and the inverse of $W^{0,k}(\beta^k)$ becomes the most time-consuming operation due to the coupling in dynamics, costs, and cognition.

Effective deception can prevent or delay other players from learning the deceiver's private type. We define the criterion of successful learning of the deceiver's type in Definition 6, and $\varepsilon$-deceivability and $\varepsilon$-learnability in Definition 7.

Definition 6 (Stage of Truth Revelation). Consider two players $i, j \in \mathcal{N}$ with types $\theta_i$ and $\theta_j$, respectively.
Stage $k_{i,j}^{tr} \in \mathcal{K} \cup \{K+1\}$ is said to be player $i$'s truth-revealing stage with accuracy $\delta \in (0,1]$ (see footnote 2) if it satisfies the following two conditions.
• The bounded mismatch condition: player $i$'s belief mismatch remains less than $\delta$ after stage $k_{i,j}^{tr} \in \mathcal{K}$, i.e.,
$$1 - l_i^k(\theta_j \mid h^k, \theta_i) \leq \delta, \quad \forall k \geq k_{i,j}^{tr}. \tag{16}$$
• The first-hitting-time condition: $k_{i,j}^{tr} \in \mathcal{K}$ is the first stage satisfying (16), i.e., $1 - l_i^{k_{i,j}^{tr}-1}(\theta_j \mid h^{k_{i,j}^{tr}-1}, \theta_i) > \delta$ if $k_{i,j}^{tr} > 1$.
If there does not exist $k_{i,j}^{tr} \in \mathcal{K}$ that satisfies (16), we define $k_{i,j}^{tr} := K+1$. If there are only two players, $N = 2$, we write $k_{i,j}^{tr}$ as $k_i^{tr}$ without ambiguity.

Due to the deceivers' deceptive actions and the external noise, the belief sequence may fluctuate; i.e., there can exist $k < k_{i,j}^{tr}$ such that $1 - l_i^k(\theta_j \mid h^k, \theta_i) \leq \delta$. Thus, as shown in Definition 6, a player should only claim a successful learning of the other players' types if his belief mismatch remains less than $\delta$ for all remaining stages.

Definition 7 (Deceivability and Learnability). Consider players $i, j \in \mathcal{N}$ with types $\theta_i$ and $\theta_j$, thresholds $\delta \in (0,1]$, $\varepsilon \in [0,1]$, and a given stage index $\tilde{k} \in \mathcal{K} \cup \{K+1\}$. Player $i$ is $\tilde{k}$-stage $\varepsilon$-deceivable if the probability $\Pr(k_{i,j}^{tr} < \tilde{k})$, or equivalently $\Pr(l_i^{\tilde{k}}(\theta_j \mid x^{\tilde{k}}, \theta_i) > 1 - \delta)$, is not greater than $\varepsilon$ for all $l_i^0 \in (0,1)$. If the above does not hold, player $j$'s type is said to be $\tilde{k}$-stage $\varepsilon$-learnable by player $i$.

Since robot deception involves only a finite number of stages, it is essential that the deceived robot learns the deceiver's type as quickly as possible so that he has sufficient stages to plan and mitigate the deception impact from the previous stages.
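The first-hitting logic of Definition 6 can be computed directly from a belief sequence. A minimal sketch (ours; the fluctuating belief sequence below is a hypothetical example, not data from the paper):

```python
# Truth-revealing stage from Definition 6: the first stage after which the
# belief mismatch 1 - l stays within delta for ALL remaining stages.
# A momentary dip below the threshold does not count.

def truth_revealing_stage(beliefs, delta):
    K = len(beliefs) - 1
    for k in range(K + 1):
        if all(1.0 - l <= delta for l in beliefs[k:]):
            return k                      # mismatch stays <= delta from stage k on
    return K + 1                          # never revealed within the horizon

beliefs = [0.5, 0.93, 0.7, 0.92, 0.95, 0.97]   # dips back below the threshold at k=2
print(truth_revealing_stage(beliefs, 0.1))      # -> 3, not 1: the fluctuation is ignored
```

The early crossing at stage 1 is discarded because the mismatch later exceeds $\delta$ again, matching the bounded-mismatch condition (16).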
Therefore, the definition of learnability, i.e., non-deceivability in Definition 7, not only requires the deceived player to be capable of learning the deceiver's private information, but also to learn it at a desirable rate, i.e., within $\tilde{k}$ stages. Due to the external noise, $k_{i,j}^{tr}$ is a random variable. Thus, the definition of learnability requires $\Pr(k_{i,j}^{tr} < \tilde{k}) > \varepsilon$; i.e., player $i$ has a large probability of correctly learning the type of player $j$ before stage $\tilde{k}$.

Footnote 2: Since the belief mismatch does not reduce to $0$ in finitely many stages for any initial belief $l_i^0 \in (0,1)$, the accuracy threshold $\delta \neq 0$.

IV. DYNAMIC TARGET PROTECTION UNDER DECEPTION

We investigate a pursuit-evasion scenario that contains two UAVs with decoupled linear time-invariant state dynamics, i.e., $A^k(\theta) = I_4$, $\bar{B}_i^k(\theta_i) = [\tilde{B}_i(\theta_i), 0; 0, \tilde{B}_i(\theta_i)] \in \mathbb{R}^{2 \times 2}, \forall k \in \mathcal{K}$. We use 'she' for UAV 1, the pursuer, and 'he' for UAV 2, the evader. UAV $i$'s state $x_i^k := [x_{i,x}^k, x_{i,y}^k]' \in \mathbb{R}^{2 \times 1}$ represents $i$'s location $(x_{i,x}^k, x_{i,y}^k)$ in the 2D space, and the action $u_i^k = [u_{i,x}^k, u_{i,y}^k] \in \mathbb{R}^{2 \times 1}$ affects $i$'s speed in the $x$ and $y$ directions. UAV 2, the evader, selects either the harbor in 'Normandy' or the one in 'Calais' as his final location based on his type $\theta_2 \in \{\theta_2^g, \theta_2^b\}$. He aims to reach 'Normandy', located at $\gamma(\theta_2^g) := (x_g, y_g)$, in $K = 40$ stages if his type is $\theta_2^g$, and otherwise 'Calais', located at $\gamma(\theta_2^b) := (x_b, y_b)$, if his type is $\theta_2^b$.
UAV 1, the pursuer, can make interfering signals and aims to be close to UAV 2 at the final stage to protect the harbor targeted by the evader; i.e.,
$$g_1^k(x^k, u^k, \theta_1) = d_{12}^k(\theta_1)\big((x_{2,y}^k - x_{1,y}^k)^2 + (x_{2,x}^k - x_{1,x}^k)^2\big) + f_{11}^k(\theta_1)\big((u_{1,x}^k)^2 + (u_{1,y}^k)^2\big) - f_{12}^k(\theta_1)\big((u_{2,x}^k)^2 + (u_{2,y}^k)^2\big), \quad \forall k \in \mathcal{K},$$
where $d_{12}^k(\theta_1) \in \mathbb{R}_{\geq 0}$ penalizes her distance from the evader at stage $k \in \mathcal{K}$, $f_{11}^k(\theta_1) \in \mathbb{R}_{\geq 0}$ prevents her from incurring a high action cost, and $f_{12}^k(\theta_1) \in \mathbb{R}_{\geq 0}$ incites her opponent, i.e., the evader, to take costly actions. We classify UAV 1 into two types, i.e., $\Theta_1 = \{\theta_1^H, \theta_1^L\}$, based on her maneuverability represented by the value of $\tilde{B}_1(\theta_1)$. Given the higher maneuverability $\tilde{B}_1(\theta_1^H) > \tilde{B}_1(\theta_1^L)$, the pursuer of type $\theta_1^H$ can attain a higher speed under the same action $u_1^k$ and thus cover a longer distance.

The evader's goals of deceptive target reaching and pursuit evasion are incorporated into the cost structure
$$g_2^k(x^k, u^k, \theta_2) = d_{2,b}^k(\theta_2)\big((x_{2,y}^k - y_b)^2 + (x_{2,x}^k - x_b)^2\big) + d_{2,g}^k(\theta_2)\big((x_{2,y}^k - y_g)^2 + (x_{2,x}^k - x_g)^2\big) - d_{21}^k(\theta_2)\big((x_{1,y}^k - x_{2,y}^k)^2 + (x_{1,x}^k - x_{2,x}^k)^2\big) + f_{22}^k(\theta_2)\big((u_{2,x}^k)^2 + (u_{2,y}^k)^2\big) - f_{21}^k(\theta_2)\big((u_{1,x}^k)^2 + (u_{1,y}^k)^2\big), \quad \forall k \in \mathcal{K}.$$
Similar to the pursuer's cost parameters, $d_{21}^k(\theta_2) \in \mathbb{R}_{\geq 0}$ represents the evader's level of evasion determination to keep a distance from the pursuer along the trajectory. The action costs of the evader and the pursuer are regulated by $f_{22}^k(\theta_2) \in \mathbb{R}_{\geq 0}$ and $f_{21}^k(\theta_2) \in \mathbb{R}_{\geq 0}$, respectively. The parameters $d_{2,b}^k(\theta_2)$ and $d_{2,g}^k(\theta_2)$ represent the evader's attempt to head toward 'Calais' and 'Normandy', respectively, at stage $k \in \mathcal{K}$ under type $\theta_2 \in \Theta_2$.
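The two stage costs translate line by line into code. The following sketch (our illustration; all weights and positions are placeholder numbers, not the paper's experimental values) evaluates $g_1^k$ and $g_2^k$ for one stage:

```python
# Stage costs g_1^k (pursuer) and g_2^k (evader), written directly from their
# definitions. Weights and positions below are illustrative placeholders.

def pursuer_cost(x1, x2, u1, u2, d12, f11, f12):
    dist2 = (x2[0] - x1[0]) ** 2 + (x2[1] - x1[1]) ** 2
    return d12 * dist2 + f11 * (u1[0] ** 2 + u1[1] ** 2) - f12 * (u2[0] ** 2 + u2[1] ** 2)

def evader_cost(x1, x2, u1, u2, target_b, target_g, d2b, d2g, d21, f22, f21):
    to_b = (x2[0] - target_b[0]) ** 2 + (x2[1] - target_b[1]) ** 2
    to_g = (x2[0] - target_g[0]) ** 2 + (x2[1] - target_g[1]) ** 2
    dist2 = (x1[0] - x2[0]) ** 2 + (x1[1] - x2[1]) ** 2
    return (d2b * to_b + d2g * to_g - d21 * dist2
            + f22 * (u2[0] ** 2 + u2[1] ** 2) - f21 * (u1[0] ** 2 + u1[1] ** 2))

# A maximally ambiguous evader weighs both targets equally (d2b = d2g).
c = evader_cost(x1=(0.0, 5.0), x2=(0.0, 0.0), u1=(0.0, 0.0), u2=(0.0, 1.0),
                target_b=(-10.0, 10.0), target_g=(10.0, 10.0),
                d2b=1.0, d2g=1.0, d21=0.0, f22=0.5, f21=0.0)
print(c)  # -> 400.5
```

Setting `d21 = 0` here corresponds to the decoupled cost structure examined in Section IV-A, where the evader ignores the pursuer's position entirely.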
We use the ratio $d_{2,g}^k(\theta_2)/d_{2,b}^k(\theta_2)$ to represent the evader's level of trajectory deception. Since the pursuer can learn the evader's type based on the real-time observations of the state $x_2^k$, the evader attempts to make his target $\varepsilon_0$-ambiguous at all previous stages, i.e., $|d_{2,b}^k(\theta_2)/d_{2,g}^k(\theta_2) - 1| \leq \varepsilon_0, \forall \theta_2, \forall k \neq K$, and to reveal his true target only at the final stage, i.e., $d_{2,g}^K(\theta_2^b) = 0$ and $d_{2,b}^K(\theta_2^g) = 0$. The evader chooses a small $\varepsilon_0 \geq 0$ and achieves the maximum ambiguity when $\varepsilon_0 = 0$. The two blue lines in Fig. 1a illustrate how the evader manages to remain ambiguous in a cost-effective manner from two different initial locations. Instead of keeping an equal distance to both potential targets, the evader heads toward the midpoint $((x_g + x_b)/2, (y_g + y_b)/2)$ at the early stages to confuse the pursuer. However, the evader starts to head toward the true target at around half of the $K$ stages rather than in the last few stages, so that he can reach the target with a moderate control cost $(u_2^k)' F_{22}^k(\theta_2) u_2^k$. Fig. 1a also shows that, for a given initial location, an evader who adopts a higher level of trajectory deception heads more toward the misleading target at the early stages.

In this case study, we suppose that the evader's true target is Calais and let $\theta_2^b$ be his true type and $\theta_2^g$ be the misleading type. The following two ratios capture the evader's tradeoff of being deceptive, effective, and evasive. On the one hand, the ratio $d_{2,b}^k(\theta_2^b)/d_{2,b}^K(\theta_2^b), k \neq K$, reflects the evader's tradeoff between applying deception along the trajectory and staying close to the true target at the final stage. Fig.
1b shows that as the evader focuses more on a deceptive trajectory, represented by a larger value of $d_{2,b}^k(\theta_2^b)/d_{2,b}^K(\theta_2^b), k \neq K$, his trajectory remains ambiguous for more stages while his final location ends up farther away from the true target. On the other hand, the ratio $d_{21}^k(\theta_2^b)/d_{2,b}^K(\theta_2^b), k \neq K$, reflects the evader's tradeoff between evasion and target reaching. As the evader focuses more on keeping a distance from the pursuer along the trajectory, he takes a bigger detour and stays farther away from his true target at the final stage, as shown in Fig. 1c.

Finally, we transform UAV $i$'s coupled cost $g_i^k$ into the matrix form given in Section III, i.e., $\hat{x}_1^k(\theta_1) = 0_{4,1}$, $\hat{f}_1^k(\hat{x}_1^k(\theta_1)) = 0$, $F_{ii}^k(\theta_i) = f_{ii}^k(\theta_i) \cdot I_2$, $F_{ij}^k(\theta_i) = -f_{ij}^k(\theta_i) \cdot I_2, j \neq i$,
$$D_1^k(\theta_1) = d_{12}^k(\theta_1) \cdot \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 1 \end{bmatrix}, \quad D_2^k(\theta_2) = \begin{bmatrix} -d_{21}^k & 0 & d_{21}^k & 0 \\ 0 & -d_{21}^k & 0 & d_{21}^k \\ d_{21}^k & 0 & d_{2,b}^k + d_{2,g}^k - d_{21}^k & 0 \\ 0 & d_{21}^k & 0 & d_{2,b}^k + d_{2,g}^k - d_{21}^k \end{bmatrix},$$
$$\hat{x}_2^k(\theta_2) = \frac{1}{d_{2,b}^k + d_{2,g}^k} \cdot \big[d_{2,b}^k x_b + d_{2,g}^k x_g; \ d_{2,b}^k y_b + d_{2,g}^k y_g; \ d_{2,b}^k x_b + d_{2,g}^k x_g; \ d_{2,b}^k y_b + d_{2,g}^k y_g\big], \quad \hat{f}_2^k(\hat{x}_2^k(\theta_2)) = \frac{d_{2,b}^k d_{2,g}^k \big((x_b - x_g)^2 + (y_b - y_g)^2\big)}{d_{2,b}^k + d_{2,g}^k}.$$

A. Deceptive Evader with Decoupled Cost Structure

We first investigate the scenario where the evader has a decoupled cost structure (see footnote 3) as defined in Definition 5, i.e., $d_{21}^k(\theta_2) = 0, \forall \theta_2 \in \Theta_2, \forall k \in \mathcal{K}$. According to Corollary 1, the evader's trajectory is then independent of the pursuer's action, type, and belief. Fig. 2 visualizes the pursuer's trajectories. Although the pursuer only aims to be close to the evader at the final stage, she also takes proactive actions in the previous stages to be cost-efficient.
If the pursuer knows the evader's type, she can head toward the true target directly and will not be misled by the evader's trajectory ambiguity at the early stages, as illustrated by the black dashed line in Fig. 2. If the evader's type is private, a larger initial belief mismatch $1 - l_1^0(\theta_2^b \mid x^0, \theta_1^H)$ makes the pursuer head more toward the misleading target at the early stages, as illustrated by the three solid lines in Fig. 2. However, due to the pursuer's online learning, which is compatible, efficient, and robust as shown in Section IV-A1, she manages to approach the evader at the final stage regardless of her initial belief mismatch. Fig. 3 shows the pursuer's $K$-stage belief variation. The evader's ambiguous trajectory results in belief fluctuations at the early stages, yet the pursuer can quickly reduce the belief mismatch once the evader starts to head toward the true target. After the pursuer has corrected her initial belief mismatch at around stage $k = 16$, she can head toward the true target in a cost-efficient way; i.e., she attempts to keep a uniform linear motion under the external noise, as shown in the upper right region of Fig. 2.

Footnote 3: This paper has supplementary downloadable materials available at http://ieeexplore.ieee.org, provided by the authors. They include a video demo of the two UAVs' trajectories and belief updates under the decoupled structure.

1) Finite-Horizon Analysis of the Bayesian Update: In this subsection, we illustrate the compatibility, efficiency, and robustness of the finite-horizon Bayesian update in (3) in reducing the initial belief mismatch. The pursuer is of high maneuverability and the evader's true type is $\theta_2^b$. Define the likelihood functions of $\theta_2^b$ and $\theta_2^g$ as $a^k := \Pr(x^{k+1} \mid \theta_2^b, x^k, \theta_1^H)$ and $c^k := \Pr(x^{k+1} \mid \theta_2^g, x^k, \theta_1^H)$, respectively. As $w^k \in \mathbb{R}^{n \times 1}$, $a^k$ and $c^k$ are positive.
With an initial belief $l_1^0 \in (0,1)$ and a finite likelihood ratio $e^k := c^k / a^k \in (0, \infty)$, we can represent (3) in the following form with three properties:
$$l_1^{k+1} = \frac{l_1^k \cdot a^k}{l_1^k \cdot a^k + (1 - l_1^k) \cdot c^k} = \frac{1}{1 + \left( \frac{1}{l_1^0} - 1 \right) \prod_{\bar{k}=0}^{k} e^{\bar{k}}} \in (0,1).$$
1) (Compatibility): For all $l_1^k \in (0,1)$, the belief update at stage $k$ is compatible with the evidence represented by the ratio $e^k$. In particular, if $e^k < 1$, then $l_1^{k+1} > l_1^k$; if $e^k > 1$, then $l_1^{k+1} < l_1^k$; if $e^k = 1$, then $l_1^{k+1} = l_1^k$.
2) (Efficiency): If the evidence of state observation $x^{k+1}$ indicates that the type is more likely to be the true type $\theta_2^b$, i.e., $e^k < 1$, then the function $l_1^{k+1} / l_1^k = 1 / (l_1^k + (1 - l_1^k) e^k)$ at stage $k$ is monotonically decreasing in $l_1^k$. If the evidence indicates that the type is more likely to be the misleading type $\theta_2^g$, i.e., $e^k > 1$, then the function $l_1^{k+1} / l_1^k$ is monotonically increasing in $l_1^k$.
3) (Robustness): The order of the evidence sequence $e^{\bar{k}}$, $\bar{k} = 0, \cdots, k$, has no impact on the belief $l_1^{k+1}$.
Property one shows that although the external noise can cause fluctuations in the belief update, the belief mismatch $1 - l_1^k$ decreases whenever $e^k < 1$, regardless of the prior belief $l_1^k \in (0,1)$. Property two shows the efficiency of the belief update: the belief changes more under a larger belief mismatch, which results in a quick correction. Property three shows the robustness of the belief update: an erroneous belief update caused by heavy noise can be corrected in later stages when the noise fades.

2) Comparison with Heuristic Policies: We compare the proposed pursuer's control policy with two heuristic ones to demonstrate its efficacy in counter-deception⁴.
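Before turning to the comparison, the likelihood-ratio form of the Bayesian update and its three properties can be verified with a short numerical sketch. The evidence ratios $e^{\bar{k}}$ below are arbitrary illustrative values, not draws from the actual noise model, and the threshold-based truth-revealing stage is our assumed reading of $k^{tr}$ (first stage at which the belief in the true type exceeds a threshold), used only for illustration.

```python
def belief_update(l, e):
    """One step of the Bayesian update in likelihood-ratio form:
    l^{k+1} = l / (l + (1 - l) * e), with e = c^k / a^k."""
    return l / (l + (1.0 - l) * e)

def belief_after(l0, ratios):
    """Closed form: l^{k+1} = 1 / (1 + (1/l0 - 1) * prod(e^kbar))."""
    prod = 1.0
    for e in ratios:
        prod *= e
    return 1.0 / (1.0 + (1.0 / l0 - 1.0) * prod)

def truth_revealing_stage(l0, ratios, tau=0.95):
    """Assumed illustrative definition: first stage at which the belief
    in the true type exceeds the threshold tau."""
    l = l0
    for k, e in enumerate(ratios):
        l = belief_update(l, e)
        if l >= tau:
            return k + 1
    return None

# Compatibility: e < 1 raises the belief in the true type, e > 1 lowers it.
l = 0.3
assert belief_update(l, 0.5) > l
assert belief_update(l, 2.0) < l
assert abs(belief_update(l, 1.0) - l) < 1e-12

# Efficiency: for e < 1, the growth factor l^{k+1}/l^k decreases in l^k,
# so a larger mismatch (smaller l^k) is corrected proportionally faster.
g = lambda l: belief_update(l, 0.5) / l
assert g(0.1) > g(0.5) > g(0.9)

# Robustness: the order of the evidence sequence does not matter.
ratios = [0.7, 1.4, 0.9, 0.6]
shuffled = [1.4, 0.6, 0.7, 0.9]
assert abs(belief_after(0.2, ratios) - belief_after(0.2, shuffled)) < 1e-12

# Sequential updates agree with the closed form.
l = 0.2
for e in ratios:
    l = belief_update(l, e)
assert abs(l - belief_after(0.2, ratios)) < 1e-9
```

The closed form also makes the robustness property transparent: the belief depends on the evidence only through the product of the ratios.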
The first heuristic policy is to repeat the evader's trajectory with a one-stage delay; i.e., the pursuer applies the action so that $x_1^{k+1} = x_2^k$, $\forall k \in \mathcal{K} \setminus \{K\}$. The pursuer does not need to apply Bayesian learning, and we name this policy direct following. The second heuristic policy is for the pursuer to stay at the initial location until her truth-revealing stage $k_1^{tr}$ and then head toward the evader's expected final-stage location in the remaining stages. The second policy is conservative because the pursuer does not take proactive actions until she identifies the evader's type. Let player $i$'s ex-post cumulative cost $\hat{V}_i^{0:k} := \sum_{h=0}^{k} g_i^h$, $\forall k \in \mathcal{K}$, be a real-time evaluation of the online algorithm. Although a pursuer under both heuristic policies manages to stay close to the evader at the final stage, Fig. 4 shows that both heuristic policies are more costly than the proposed equilibrium strategy in the long run. The conservative policy avoids potential trajectory deviations under deception but leaves fewer planning stages for the pursuer to achieve the capture goal.

⁴The supplementary materials include a video demo that compares the proposed policy's trajectory and performance with those of the two heuristic policies.

Fig. 1: The evader's trajectories from $x_2^0 = [0, 0]$ and $x_2^0 = [-5, 2]$ in solid and dashed lines, respectively. The black downward and upward triangles represent the locations of Calais $(x_b, y_b) = (-10, 10)$ and Normandy $(x_g, y_g) = (10, 10)$, respectively. The ratios capture the evader's tradeoff among forming a deceptive trajectory, reaching the true target, and evading the pursuit. (a) Ratio represents $d_{2,g}^k(\theta_2^b)/d_{2,b}^k(\theta_2^b)$. (b) Ratio represents $d_{2,b}^k(\theta_2^b)/d_{2,b}^K(\theta_2^b)$. (c) Ratio represents $d_{21}^k(\theta_2^b)/d_{2,b}^K(\theta_2^b)$.

Fig. 2: The pursuer's trajectories under different initial beliefs.

Fig. 3: The pursuer's belief update over $K$ stages under three different initial beliefs and the same noise sequence $[w^k]_{k \in \mathcal{K}}$. The inset black box magnifies the selected area.

We visualize the accumulation of the pursuer's cost in Fig. 4c. The red lines show that the pursuer who adopts the conservative policy spends no action costs before the truth-revealing stage $k_1^{tr}$, i.e., $(u_1^k)' F_{11}^k(\theta_1) u_1^k = 0$, $\forall k \leq k_1^{tr}$, but huge costs in the remaining stages to fulfill her capture goal. The total cumulative cost $\hat{V}_i^{0:K}$ at the final stage increases exponentially with the value of $k_1^{tr}$, as shown in Fig. 4b. The black line in Fig. 4c illustrates the accumulation of $\hat{V}_i^{0:k}$ when the pursuer directly follows the evader's trajectory. Only under extreme deception scenarios where $k_1^{tr} > 34$ does the direct-following policy result in a lower cost than the conservative policy. Since the initial belief $l_1^0$ affects both the truth-revealing stage and the proposed policy, we plot $\hat{V}_i^{0:K}$ versus $l_1^0$ under the conservative policy and the proposed policy in Fig. 4a. When there is no belief mismatch, i.e., $l_1^0(\theta_2^b | x^0, \theta_1^H) = 1$, we have $k_1^{tr} = 1$ and the conservative policy is equivalent to the proposed policy. As the belief mismatch increases, the cost $\hat{V}_i^{0:K}$ under the proposed policy (resp. the conservative policy) increases due to the larger deviation along the $x$-axis (resp. the larger $k_1^{tr}$). The proposed policy always results in a lower cost $\hat{V}_i^{0:K}$ than the conservative policy. The results in Fig. 4 lead to the following two principles for the pursuer to behave under deception.
First, Bayesian learning is a more effective countermeasure than directly following the evader's deceptive trajectory. Second, if learning the evader's type takes a long time, the pursuer is better off acting proactively based on her current belief than delaying actions until the truth-revealing stage.

Fig. 4: The pursuer's ex-post cumulative cost under two heuristic policies and the proposed policy. (a) The $K$-stage cumulative cost $\hat{V}_i^{0:K}$ versus different initial beliefs. (b) The $K$-stage cumulative cost $\hat{V}_i^{0:K}$ versus $k_1^{tr}$ under the conservative policy. (c) The accumulation of the pursuer's cost $\hat{V}_i^{0:k}$, $\forall k \in \mathcal{K}$, along the stages.

B. Dynamic Game for Deception and Counter-Deception

In this section, the evader has a coupled cost⁵ defined in Definition 5, and the level of evasion determination increases at a constant rate $\alpha > 0$; i.e., $d_{21}^k(\theta_2) = \alpha k$, $\forall \theta_2 \in \Theta_2$, $\forall k \in \mathcal{K}$. The evader deceives the pursuer by hiding his true target. The pursuer can adopt the following two countermeasures to reduce her cost under the evader's deception. Section IV-B1 investigates the effectiveness of adaptive learning. We find that the pursuer manages to approach the true target at the final stage by updating her belief and taking actions accordingly based on the real-time trajectory observation. Section IV-B2 further allows the pursuer to introduce additional deception, i.e., to obfuscate her maneuverability, to counteract the evader's information advantage and his deception impact.

1) Pursuer with a Public Type: When the pursuer's type is common knowledge, we plot both UAVs' trajectories under two initial beliefs and two types of pursuers in Fig. 5. The solid lines show that the evader with the coupled cost detours to stay farther from the pursuer. The initial belief mismatch causes a deviation along the $x$-axis for both high- and low-maneuverability pursuers, as shown in red and blue, respectively. However, the deviation has a smaller magnitude and lasts for fewer stages than the one represented by the red line in Fig. 2 due to the coupled cost structure of the evader. The pursuer with high maneuverability stays closer to the evader at the final stage.

⁵A video demo of two UAVs' real-time trajectories and belief updates under the coupled structure is included in the supplementary materials.

Fig. 5: The $K$-stage trajectories of the evader and the pursuer in solid and dashed lines, respectively. If the evader's type is common knowledge and the pursuer is of high maneuverability, we represent their noise-free trajectories in black. If the evader's type is private and the pursuer's initial belief mismatch is 0.9, the two UAVs' trajectories are in red (resp. blue) when the pursuer's maneuverability is high (resp. low).

2) Deception to Counteract Deception: When the pursuer's type is also private, Fig. 6 shows that she can manipulate the evader's initial belief $l_2^0$ to obtain a smaller $k_1^{tr}$ and a belief update with less fluctuation. The red line with stars is the same as the one in Fig. 3. It shows that the pursuer's belief learning is slower and fluctuates more when she interacts with the evader who has a decoupled cost. The reason is that her manipulation of the initial belief $l_2^0$ does not affect the evader's decision making, as shown in Corollary 1. A comparison between Fig. 6a and Fig. 6b shows that it is beneficial for a low-maneuverability pursuer to disguise herself as a high-maneuverability pursuer, but not vice versa.
Thus, introducing additional deception to counteract existing deception is not always effective.

Fig. 6: The pursuer's belief update over $K$ stages with the same initial belief $l_1^0(\theta_2^b | x^0, \theta_1) = 0.1$. The inset black box magnifies the selected area. (a) Low-maneuverability pursuer's belief update. (b) High-maneuverability pursuer's belief update.

C. Multi-Dimensional Deception Metrics

The impact of the evader's deception can be measured by metrics such as the endpoint distance $x_2^{fd} := \|x_2^K - \gamma(\theta_2)\|_2$ between the evader and the true target, the endpoint distance $x_1^{fd} := \|x_2^K - x_1^K\|_2$ between the two UAVs, both UAVs' truth-revealing stages $k_i^{tr}$, and their ex-post cumulative costs $\hat{V}_i^{0:k}$, $\forall k \in \mathcal{K}$. In this pursuit-evasion case study, we define $\varepsilon$-reachability and $\varepsilon$-capturability in Definition 8. Although $x_i^{fd}$, $\forall i \in \{1, 2\}$, is a random variable, we can obtain a good estimate of the reachability and capturability due to the negligible variance of $x_i^{fd}$, as shown in Fig. 7a and Fig. 8a.

Definition 8 (Reachability and Capturability). Consider the proposed pursuit-evasion scenario with a given $\varepsilon \geq 0$, a threshold $\bar{x}^{fd} \geq 0$, and all initial beliefs $l_i^0 \in (0, 1)$. The target is said to be $\varepsilon$-reachable if $\Pr(x_2^{fd} \geq \bar{x}^{fd}) \leq \varepsilon$. The evader is said to be $\varepsilon$-capturable if $\Pr(x_1^{fd} \geq \bar{x}^{fd}) \leq \varepsilon$.

Fig. 7: The influence of the initial belief mismatch on deception. Error bars represent variances of the random variables. (a) Distance $x_1^{fd}$ with its variance magnified by 100 times. (b) A realization of the pursuer's truth-revealing stage $k_1^{tr}$. (c) The costs $\hat{V}_1^{0:K-1}$ and $\hat{V}_1^{0:K}$ of the pursuer under type $\theta_1^H$. (d) The evader's $K$-stage ex-post cumulative cost $\hat{V}_2^{0:K}$.

In Section IV-C1, we investigate how the evader can manipulate the pursuer's initial belief $l_1^0(\theta_2^b | x^0, \theta_1^H)$ to influence the deception. In Section IV-C2, we investigate how the pursuer's maneuverability plays a role in deception. In both sections, the evader has a coupled cost structure. The pursuer either applies the Bayesian update or not, which is denoted by the blue and red lines, respectively, in both Fig. 7 and Fig. 8. In Section IV-C3, we study other metrics, such as deceivability, distinguishability, and the PoD.

1) The Impact of the Evader's Belief Manipulation: Both UAVs determine their initial beliefs based on the intelligence collected before their interactions. By falsifying the pursuer's intelligence, the evader can manipulate the pursuer's initial belief $l_1^0$ and further influence the deception, as shown in Fig. 7. On the $x$-axis, an initial belief $l_1^0(\theta_2^b | x^0, \theta_1^H)$ closer to 1 indicates a smaller belief mismatch. Fig. 7a shows that the pursuer's distance to the evader at the final stage decreases as the belief mismatch decreases, regardless of the existence of Bayesian learning. However, the initial belief manipulation has much less influence on the endpoint distance $x_1^{fd}$ when Bayesian learning is applied. Fig. 7b shows that for each realization of the noise sequence $w^k$, the pursuer's truth-revealing stage steps down as the belief mismatch decreases when the Bayesian update is applied. Fig. 7c illustrates the pursuer's ex-post cumulative costs $\hat{V}_1^{0:K}$ and $\hat{V}_1^{0:K-1}$ at the last and the second-to-last stage, respectively.
Without the Bayesian update, the evader's deception significantly increases the pursuer's cost at the second-to-last stage due to the large endpoint distance $x_1^{fd}$. The red lines show that the cost increase is higher under a larger belief mismatch. Fig. 7d illustrates the evader's ex-post cumulative cost at the last stage. If the pursuer does not apply Bayesian learning, then the evader can decrease his cost by increasing the pursuer's belief mismatch. If the pursuer applies Bayesian learning, then the evader's cost increases slightly as the pursuer's belief mismatch increases. When the belief mismatch is small (i.e., $1 - l_1^0 \in (0, 0.35)$), we observe a win-win situation; i.e., Bayesian learning reduces not only the pursuer's ex-post cumulative cost but also the evader's.

Fig. 8: The influence of the pursuer's maneuverability on deception. Error bars represent variances of the random variables. (a) Distance $x_1^{fd}$ with its variance magnified by 100 times. (b) The two UAVs' $K$-stage costs $\hat{V}_1^{0:K}$ and $\hat{V}_2^{0:K}$.

2) The Impact of the Pursuer's Maneuverability: The pursuer's maneuverability can also affect deception, as shown in Fig. 8. The pursuer has an initial belief $l_1^0(\theta_2^b | x^0, \theta_1^H) = 0.5$ and the evader knows the pursuer's type. Fig. 8a illustrates that the pursuer can exponentially decrease her distance to the evader at the final stage as her maneuverability increases. Fig. 8b demonstrates that the maneuverability increase can decrease the pursuer's and increase the evader's ex-post cumulative costs at the final stage, respectively. The variance grows as the maneuverability decreases because the pursuer's trajectory becomes largely affected by the external noise.
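The $\varepsilon$-capturability check in Definition 8 reduces to estimating a tail probability of the endpoint distance. A minimal Monte Carlo sketch follows, with a hypothetical endpoint-distance sampler standing in for full $K$-stage game rollouts; the distance distribution below is purely illustrative, not taken from the experiments.

```python
import random

def endpoint_distance_sample(rng):
    """Hypothetical stand-in for one rollout of the K-stage game:
    returns a sample of the endpoint distance x^{fd}_1 = ||x_2^K - x_1^K||_2.
    Here we simply draw a nonnegative noisy distance for illustration."""
    return abs(rng.gauss(1.0, 0.3))

def is_eps_capturable(threshold, eps, n_samples=10_000, seed=0):
    """Estimate Pr(x^{fd}_1 >= threshold) by Monte Carlo and check
    whether it falls below eps, per Definition 8."""
    rng = random.Random(seed)
    exceed = sum(endpoint_distance_sample(rng) >= threshold
                 for _ in range(n_samples))
    return exceed / n_samples <= eps

# With mean distance 1.0 and standard deviation 0.3, essentially no samples
# exceed 3.0, while most exceed 0.5.
print(is_eps_capturable(threshold=3.0, eps=0.05))  # True
print(is_eps_capturable(threshold=0.5, eps=0.05))  # False
```

The same estimator applies to $\varepsilon$-reachability by sampling $x_2^{fd}$ instead; the paper's observation that the variance of $x_i^{fd}$ is negligible is what makes a modest sample size sufficient here.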
In both figures, we observe the phenomenon of the marginal effect; i.e., the change rates of both the endpoint distance $x_1^{fd}$ and the cost $\hat{V}_i^{0:K}$ decrease as the maneuverability increases. Thus, we conclude that higher maneuverability can improve the pursuer's performance under the evader's deception, as measured by the distance $x_1^{fd}$ and the cost $\hat{V}_1^{0:K}$. Moreover, the improvement rate is higher at low maneuverability.

3) Deceivability, Distinguishability, and PoD: Deceivability, defined in Definition 7, is closely related to the distinguishability among different types. In this case study, a larger distance between targets, i.e., $\|\gamma(\theta_2^g) - \gamma(\theta_2^b)\|_2$, makes it easier for the pursuer to distinguish between evaders of type $\theta_2^b$ and type $\theta_2^g$. A larger maneuverability difference $|\tilde{B}_1(\theta_1^H) - \tilde{B}_1(\theta_1^L)|$ makes it easier for the evader to distinguish between pursuers of type $\theta_1^H$ and type $\theta_1^L$. We visualize the two UAVs' truth-revealing stages $k_i^{tr}$ versus the distance between targets and the maneuverability difference in Fig. 9. The evader has a coupled cost and both players' initial belief mismatches are 0.5. The dashed black line indicates $\tilde{B}_1(\theta_1^L) = 0.3$. When the maneuverability difference is negligible, i.e., $\tilde{B}_1(\theta_1^H) \in (0.26, 0.36)$, the pursuer's type cannot be learned correctly in $K$ stages; i.e., the pursuer is $(K+1)$-stage 0-deceivable. When the maneuverability difference is small, i.e., $\tilde{B}_1(\theta_1^H) \in (0.1, 0.5)$, yet not negligible, i.e., $\tilde{B}_1(\theta_1^H) \notin (0.26, 0.36)$, the variance of $k_2^{tr}$ is large.

Fig. 9: The plot of the deceived robot's truth-revealing stage versus the deceiver's type distinguishability. Error bars represent the variances, which are magnified by 5 times.

Let $\theta_2 = \theta_2^b$ be common knowledge and assume that the evader's belief conforms to the prior distribution of the pursuer's type for all stages, i.e., $l_2^k(\theta_1 | h^k, \theta_2^b) = \Xi_1(\theta_1)$, $\forall \theta_1 \in \Theta_1$, $\forall k \in \mathcal{K}$. Then, Fig. 10 illustrates how the prior distribution of the pursuer's type affects the value of the PoD under three scenarios:
• $\eta_1 = 1$, i.e., the central planner only evaluates UAV 1's performance under deception.
• $\eta_1 = 0$, i.e., the central planner only evaluates UAV 2's performance under deception.
• $\eta_1 = 0.5$, i.e., the central planner evaluates the average performance of the two UAVs under deception.

Fig. 10: The PoD versus the prior type distribution for three values of $\eta_1$.

When the pursuer's type is also common knowledge, i.e., $\Xi_1(\theta_1^H) = 0$ (the pursuer has type $\theta_1^L$) or $\Xi_1(\theta_1^H) = 1$ (the pursuer has type $\theta_1^H$), the game is of complete information and the value of the PoD equals 1. Since the PoD takes continuous values over $\Xi_1(\theta_1^H) \in [0, 1]$ and has a value of 1 at the two endpoints for all feasible $\eta_1$, we refer to the plots in Fig. 10 as jump rope plots. They corroborate that the PoD can be bigger than 1; i.e., deception among players may benefit not only the deceiver but also the deceivee.

V. CONCLUSION AND FUTURE WORK

We have investigated a novel class of rational robot deception problems where intelligent robots hide their heterogeneous private information to achieve their objectives in finite stages with minimum costs. We have proposed an $N$-player dynamic game framework to quantify the impact of deception and design long-term optimal actions for deception and counter-deception. Robots form their own initial beliefs about others' private information and update their beliefs at each stage based on extrinsic or intrinsic information. Satisfying the properties of sequential rationality and belief consistency, the perfect Bayesian Nash equilibrium can be used to predict the $N$ robots' actions and costs over the $K$ stages.
We have studied a class of games in the linear-quadratic form with extrinsic belief dynamics to obtain a unique affine state-feedback control policy and a set of extended Riccati equations. The cognitive coupling resulting from the deception of types demonstrates a distinct feature of rational deception: each player's action hinges not only on his own belief but also on all other players' beliefs. The concepts of deceivability, distinguishability, and reachability have been defined to characterize the fundamental limits of deception. Meanwhile, the price of deception serves as a crucial evaluation and design metric.

We have investigated a target protection problem where the evader aims to deceptively reach the true target and the pursuer keeps her maneuverability as private information. The pursuer achieves a lower ex-post cumulative cost under the proposed policy than under the direct-following and conservative policies. We have proposed multi-dimensional metrics, such as the stage of truth revelation and the endpoint distance, to measure the deception impact throughout the stages. We have concluded that Bayesian learning can largely reduce the impact of initial belief manipulation and sometimes result in a win-win situation. An increase in the pursuer's maneuverability can also reduce the endpoint distance and her ex-post cumulative cost, yet with a marginal effect. A robot is more deceivable, i.e., less learnable, when its potential types are less distinguishable. Finally, we have found that introducing additional deception to counteract existing deception is not always effective. Moreover, deception among multiple players may benefit not only the deceiver but also the deceivee.

Linan Huang (S'16) received the B.Eng. degree (Hons.) in Electrical Engineering from the Beijing Institute of Technology, China, in 2016. He is currently pursuing the Ph.D. degree with the Laboratory for Agile and Resilient Complex Systems, Tandon School of Engineering, New York University, NY, USA. His research interests include dynamic decision making in multi-agent systems, mechanism design, artificial intelligence, and security and resilience for cyber-physical systems.

Quanyan Zhu (S'02-M'14) received the B.Eng. in Honors Electrical Engineering from McGill University in 2006, the M.A.Sc. from the University of Toronto in 2008, and the Ph.D. from the University of Illinois at Urbana-Champaign (UIUC) in 2013. After stints at Princeton University, he is currently an associate professor at the Department of Electrical and Computer Engineering, New York University (NYU). He is an affiliated faculty member of the Center for Urban Science and Progress (CUSP) and the Center for Cyber Security (CCS) at NYU. His current research interests include game theory, machine learning, cyber deception, and cyber-physical systems.
