RL-Based Method for Benchmarking the Adversarial Resilience and Robustness of Deep Reinforcement Learning Policies

This paper investigates the resilience and robustness of Deep Reinforcement Learning (DRL) policies to adversarial perturbations in the state space. We first present an approach for the disentanglement of vulnerabilities caused by representation learning of DRL agents from those that stem from the sensitivity of the DRL policies to distributional shifts in state transitions. Building on this approach, we propose two RL-based techniques for quantitative benchmarking of adversarial resilience and robustness in DRL policies against perturbations of state transitions. We demonstrate the feasibility of our proposals through experimental evaluation of resilience and robustness in DQN, A2C, and PPO2 policies trained in the CartPole environment.

Authors: Vahid Behzadan and William Hsu ({behzadan, bhsu}@ksu.edu), Kansas State University

Keywords: Deep Reinforcement Learning · Adversarial Attack · Policy Generalization · Resilience · Robustness · Benchmarking

1 Introduction

Since the reports by Behzadan & Munir [1] and Huang et al. [5], the primary emphasis of the state of the art in DRL security [2] has been on the vulnerability of policies to state-space perturbations. In particular, the manipulation of the policy via adversarial examples [4] has remained the main focus of current literature on this issue. However, this bias towards adversarial example attacks gives rise to a critical shortcoming: the analyses of such attacks fail to disentangle the vulnerability caused by the learned representation from that which is due to the sensitivity of the DRL dynamics to distributional shifts in state transitions. Also, the performance of defenses proposed for adversarial example attacks is inherently limited to the considered attack mechanisms. As the most successful technique for mitigation of adversarial examples, adversarial training is known to enhance the robustness of machine learning models to the type of attack used for generating the training adversarial examples, while leaving the model vulnerable to other types of attacks [8]. Furthermore, the current literature fails to provide solutions and approaches which can be used in practice to evaluate and improve the robustness and resilience of DRL policies to attacks that exploit the sensitivity to state transitions. Also, there remains a need for quantitative approaches to measure and benchmark the resilience and robustness of DRL policies in a reusable and generalizable manner.

In response to these shortcomings, this paper aims to address the problem of quantifying and benchmarking the robustness and resilience of a DRL agent to adversarial perturbations of state transitions at test-time, in a manner that is independent of the attack type. This improves on the generalization of current techniques, which analyze the model against specific adversarial example attacks. Accordingly, the main contributions of this paper are as follows:

1. We present formulations of the resilience and robustness problems that enable the disentanglement of limitations in representation learning from the sensitivity of policies to state transition dynamics.
2. We propose two RL-based techniques and corresponding metrics for the measurement and benchmarking of resilience and robustness of DRL policies to perturbations of state transitions.
3. We demonstrate the feasibility of our proposals through experimental evaluation of their performance on DQN, A2C, and PPO2 agents trained in the CartPole environment.

The remainder of this paper is organized as follows: Section 2 defines and formulates the problems of adversarial resilience and robustness in DRL. Our proposed methods for benchmarking the test-time resilience and robustness of DRL policies are presented in Sections 3 and 4. Section 5 provides the details of the experimental setup for evaluating the performance of our proposals, with the corresponding results presented in Section 6. The paper concludes in Section 7 with a summary of findings and remarks on future directions of research.

2 Problem Formulation

We consider the generic problem of RL in the setting of a Markov Decision Process (MDP), described by the tuple MDP := <S, A, R, P>, where S is the set of reachable states in the process, A is the set of available actions, R is the mapping of transitions to the immediate reward, and P represents the transition probabilities (i.e., state dynamics), which are initially unknown to RL agents. At any given time-step t, the MDP is at a state s_t ∈ S. The RL agent's choice of action at time t, a_t ∈ A, causes a transition from s_t to a state s_{t+1} according to the transition probability P(s_{t+1} | s_t, a_t). The agent receives a reward r_{t+1} = R(s_t, a_t, s_{t+1}) for choosing the action a_t at state s_t. Interactions of the agent with the MDP are determined by the policy π. When such interactions are deterministic, the policy π : S → A is a mapping between the states and their corresponding actions. A stochastic policy π(s) represents the probability distribution of implementing any action a ∈ A at state s. The goal of RL is to learn a policy that maximizes the expected discounted return E[R_t], where R_t = Σ_{k=0..∞} γ^k r_{t+k}, with r_t denoting the instantaneous reward received at time t and γ ∈ [0, 1] denoting the discount factor.

To facilitate the formal statement of adversarial resilience and robustness, we first introduce the following definitions:

- Adversarial Regret at time T is the difference between the return obtained by the nominal (unperturbed) agent at time T and the return obtained by the perturbed agent at time T. Formally: R̂_adv(T) = R_nominal(T) − R_perturbed(T). The time T may represent either the terminal timestep of an episode or the time-horizon of interest in the analysis.

- Adversarial Budget is defined by one or more of the following parameters: the maximum number of features that can be perturbed in the observations (O_max ∈ [0, ∞]), the maximum number of observations that can be perturbed (N_max ∈ [0, ∞]), and the probability of perturbing each observation (P(perturb) ∈ [0, 1]).
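
These two definitions reduce to straightforward bookkeeping in code. The following is a minimal Python sketch of ours, not the paper's implementation: the names `adversarial_regret` and `AdversarialBudget` are illustrative, and the budget class simply tracks the perturbation counter against N_max while storing O_max and P(perturb) for completeness.

```python
from dataclasses import dataclass

def adversarial_regret(nominal_return: float, perturbed_return: float) -> float:
    """Adversarial regret at horizon T: R_adv(T) = R_nominal(T) - R_perturbed(T)."""
    return nominal_return - perturbed_return

@dataclass
class AdversarialBudget:
    """Tracks the three budget parameters defined above (illustrative names)."""
    o_max: float = float("inf")   # max features perturbed per observation (O_max)
    n_max: float = float("inf")   # max number of perturbed observations (N_max)
    p_perturb: float = 1.0        # probability of perturbing each observation
    used: int = 0                 # observations perturbed so far

    def can_perturb(self) -> bool:
        return self.used < self.n_max

    def record(self) -> None:
        self.used += 1
```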

Building on these two concepts, we define the problems of adversarial resilience and robustness as follows:

1. Test-Time Resilience: The minimum number of state perturbations required to incur the maximum reduction to the total return at time T (denoted by R̂_adv(T)) for an agent driven by a policy π(s) in an environment with transition dynamics P.

2. Test-Time Robustness: The maximum adversarial regret R̂_adv(T) = ε_max achievable via a maximum of δ_max state perturbations for an agent driven by a policy π(s) in an environment with transition dynamics P.

The following sections provide the details of our proposed solutions to each of the aforementioned problem settings.

3 Benchmarking of Test-Time Resilience

This problem can be modeled as that of finding an optimal adversarial policy π_adv(s) that minimizes the cost incurred to the adversary, C_adv, in order to impose the maximum adversarial regret R̂_adv(T), the worst-case value of which is the highest cumulative reward achieved by the target policy, R_max. Our proposed approach is through the formulation of this problem in the setting of reinforcement learning. The state space in the corresponding MDP is the set of states in the target MDP, augmented with the action of the target in that state, i.e., S' = {∀s ∈ S : (s, π(s))}. For the purpose of measuring a lower bound for the resilience, we consider the worst-case white-box adversary, which is able to impose targeted state perturbations with 100% success rate, to induce any action within the permissible action set A of the target which has the lowest Q-value at any state s according to the target's optimal state-action value function Q*. In this case, the set of permissible adversarial actions at any state s is given by:

A_adv(s) = {No Action} ∪ A \ π*(s)    (1)

where A is the action set of the targeted agent, and π* : S → A is the policy of the targeted agent. In the proposed approach, the adversarial reward value is determined via the procedure detailed in Algorithm 1:

Algorithm 1: Reward Assignment of RL Agent for Measuring Adversarial Resilience

Require: Target policy π*, perturbation cost function c_adv(·,·), maximum achievable score R_max, optimal state-action value function Q*(·,·), current adversarial policy π_adv, current state s_t, current score R_t

    ToPerturb ← π_adv(s_t)
    if ToPerturb is False then
        a_t ← π*(s_t)
        Reward ← 0
    else
        a'_t ← argmin_a Q*(s_t, a)
        Reward ← −c_adv(s_t, a'_t)
    end if
    if s_t or the resulting state s_{t+1} is terminal then
        Reward += (R_max − R_t)
    end if

where c_adv(s_t, a'_t) is the cost of imposing the state perturbation which induces the adversarial action a'_t at state s_t. It is noteworthy that if the value of c_adv(s_t, a'_t) is invariant with respect to a'_t, the adversarial action set reduces to:

A_adv(s) = {No Action, Induce argmin_a Q(s, a)}    (2)
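
To make the reward logic concrete, here is a minimal Python rendering of Algorithm 1. The function name and its arguments are our stand-ins rather than the paper's code: `q_star` is assumed to return the target's vector of Q-values for a state, and `c_adv` is the perturbation cost function.

```python
import numpy as np

def resilience_reward(perturb: bool, s_t, q_star, c_adv,
                      r_max: float, score_t: float, done: bool):
    """One-step reward assignment for the resilience-measuring adversary
    (a sketch of Algorithm 1)."""
    induced_action = None
    if not perturb:
        reward = 0.0                                  # target follows pi*(s_t); no cost
    else:
        induced_action = int(np.argmin(q_star(s_t)))  # action with lowest Q-value
        reward = -c_adv(s_t, induced_action)          # adversary pays the perturbation cost
    if done:
        reward += r_max - score_t                     # terminal bonus = regret imposed
    return reward, induced_action
```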

To obtain the test-time resilience of policy π* to state perturbations, we propose the following procedure:

1. If the state-action value function of the target Q* is not available (i.e., black-box testing), approximate Q* via policy imitation [6].
2. Train the adversarial agent against the target following π* in its training environment; report the optimal adversarial return R*_perturbed and the maximum adversarial regret R*_adv(T).
3. Apply the adversarial policy against the target in N episodes, and record the total cost C_adv for each episode.
4. Report the average of C_adv over the N episodes as the mean test-time resilience of π* in the given environment (see the sketch at the end of this section).

This procedure introduces three metrics for the quantification of test-time resilience: the optimal adversarial return R*_perturbed achieved in the training process of the adversarial policy, the maximum adversarial regret R*_adv(T) achieved during training, and the per-episode mean of the total cost C_adv. These metrics provide the means to benchmark and compare the test-time resilience of different policies trained to optimize the agent's performance in a given environment.

For the purpose of measuring resilience, we consider convergence to be reached if the average adversarial regret over 200 episodes remains constant. This definition relaxes the instabilities that may arise due to the configuration and architecture of the DRL training process. It is noteworthy that, depending on the training algorithm and design parameters, this procedure is not guaranteed to converge to the global optimum. However, by reporting the number of iterations and configuring random number generators with a constant seed, the reported results present a reproducible loose lower bound on the adversarial resilience of the target. Also, the trained adversarial policy can be used to test other policies for comparison of such lower bounds under the same adversarial strategy.
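
Steps 3-4 amount to the following evaluation loop, sketched here under the assumptions of a classic Gym-style API and deterministic target and adversary policies; all names are illustrative. Note that the adversary observes the augmented state (s, π(s)), matching the state space S' defined above.

```python
import numpy as np

def measure_resilience(adversary, target, env, q_star, c_adv, n_episodes=100):
    """Mean per-episode perturbation cost of a trained adversary (steps 3-4)."""
    episode_costs = []
    for _ in range(n_episodes):
        s, done, cost = env.reset(), False, 0.0
        while not done:
            if adversary((s, target(s))):          # adversary sees (s, pi(s))
                a = int(np.argmin(q_star(s)))      # induce the worst-case action
                cost += c_adv(s, a)
            else:
                a = target(s)                      # target acts unperturbed
            s, _, done, _ = env.step(a)
        episode_costs.append(cost)
    return float(np.mean(episode_costs))           # mean test-time resilience
```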

4 Benchmarking of Test-Time Robustness

For this problem, we propose a modified version of the procedure developed for benchmarking the test-time resilience. Accordingly, the reward function is adjusted to account for the lack of a target ε, as well as the addition of an adversarial budget constraint δ_max. The reward measurement of this process is outlined in Algorithm 2:

Algorithm 2: Reward Assignment of RL Agent for Measuring Adversarial Robustness

Require: Maximum perturbation budget δ_max, perturbation cost function c_adv(·,·), maximum achievable score R_max, optimal state-action value function Q*(·,·), current adversarial policy π_adv, current state s, current count of adversarial actions AdvCount, current score R_t

    AdversarialAction ← π_adv(s)
    if AdversarialAction is NoAction then
        Reward ← 0
    else if AdvCount ≥ δ_max then
        Reward ← −c_adv(s, AdversarialAction) × δ_max
        AdvCount += 1
    else
        Reward ← −c_adv(s, AdversarialAction)
        AdvCount += 1
    end if
    if s is terminal then
        Reward += (R_max − R_t)
        AdvCount ← 0
    end if

The proposed procedure for measuring the test-time robustness of a given DRL policy to adversarial state perturbations is as follows (a sketch of Algorithm 2's reward logic follows the list):

1. If the state-action value function of the target Q* is not available (i.e., black-box testing settings), approximate Q* from the policy using imitation learning (e.g., [6]).
2. Train the adversarial agent against the target policy π* in its training environment, and report the maximum adversarial regret R*_adv(T) for time T achieved at adversarial optimality.
3. Apply the adversarial policy against the target for N episodes, and record the adversarial regret at the end of each episode, R_adv(T).
4. Report the average of R_adv(T) over the N episodes as the mean per-episode test-time robustness of π* in the given environment.
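
As with Algorithm 1, the following minimal Python sketch mirrors Algorithm 2's reward assignment; the names are ours. The notable difference is the amplified penalty once the per-episode perturbation budget δ_max is exhausted, and the reset of the counter at episode termination.

```python
def robustness_reward(action, s, c_adv, r_max: float, score_t: float,
                      adv_count: int, delta_max: int, done: bool):
    """One-step reward assignment for the robustness-measuring adversary
    (a sketch of Algorithm 2). Returns the reward and the updated AdvCount."""
    if action is None:                              # "No Action"
        reward = 0.0
    elif adv_count >= delta_max:
        reward = -c_adv(s, action) * delta_max      # amplified penalty over budget
        adv_count += 1
    else:
        reward = -c_adv(s, action)
        adv_count += 1
    if done:
        reward += r_max - score_t                   # terminal bonus = regret imposed
        adv_count = 0                               # reset budget counter per episode
    return reward, adv_count
```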

5 Experiment Setup

Environment and Target Policies: To demonstrate the performance of the proposed procedures for benchmarking the test-time robustness and resilience of DRL policies, we present the analysis of the aforementioned measurements for policies trained in the CartPole environment in OpenAI Gym [3]. The considered policies are chosen to represent a commonly-adopted state-of-the-art method from each class of DRL algorithms. From value-iteration approaches, we consider DQN with prioritized replay. From policy gradient approaches, we consider PPO2. As for actor-critic methods, we investigate the A2C method. Table 1 presents the specifications of the CartPole environment, and Tables 2-4 provide the parameter settings of each target policy.

Table 1. Specifications of the CartPole Environment

    Observation Space:
        Cart Position          [-4.8, +4.8]
        Cart Velocity          [-inf, +inf]
        Pole Angle             [-24 deg, +24 deg]
        Pole Velocity at Tip   [-inf, +inf]
    Action Space:
        0: Push cart to the left
        1: Push cart to the right
    Reward: +1 for every step taken
    Termination:
        Pole Angle is more than 12 degrees
        Cart Position is more than 2.4
        Episode length is greater than 500

Table 2. Parameters of the DQN Policy

    No. Timesteps: 10^5
    γ: 0.99
    Learning Rate: 10^-3
    Replay Buffer Size: 50000
    First Learning Step: 1000
    Target Network Update Freq.: 500
    Prioritized Replay: True
    Exploration: Parameter-Space Noise
    Exploration Fraction: 0.1
    Final Exploration Prob.: 0.02
    Max. Total Reward: 500

Table 3. Parameters of the A2C Policy

    No. Timesteps: 5 × 10^5
    γ: 0.99
    Learning Rate: 7 × 10^-4
    Entropy Coefficient: 0.0
    Value Function Coefficient: 0.25
    Max. Total Reward: 500

Table 4. Parameters of the PPO2 Policy

    No. Environments: 8
    No. Timesteps: 10^6
    No. Runs per Environment per Update: 2048
    No. Minibatches per Update: 32
    Bias-Variance Trade-Off Factor: 0.95
    No. Surrogate Epochs: 10
    γ: 0.99
    Learning Rate: 3 × 10^-4
    Entropy Coefficient: 0.0
    Value Function Coefficient: 0.5
    Max. Total Reward: 500

Adversarial Agent: In these experiments, the adversarial agent is a DQN agent with the hyperparameters provided in Table 5. We consider a homogeneous perturbation cost function for all state perturbations, that is, ∀s, a' : c_adv(s, a') = c_adv. For both the resilience and robustness measurements, we set c_adv = 1 (i.e., each perturbation incurs a cost of 1 to the adversary). The training process is terminated when the adversarial regret is maximized and the 100-episode average of the number of adversarial perturbations is quasi-stable for 200 episodes.

Table 5. Parameters of the Adversarial DQN Agent

    Max. Timesteps: 10^5
    γ: 0.99
    Learning Rate: 10^-3
    Replay Buffer Size: 50000
    First Learning Step: 1000
    Target Network Update Freq.: 500
    Experience Selection: Prioritized Replay
    Exploration: Parameter-Space Noise
    Exploration Fraction: 0.1
    Final Exploration Prob.: 0.02
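
For reference, the DQN target of Table 2 corresponds to a configuration along the following lines. The paper does not name its DRL library; the algorithm names (DQN, A2C, PPO2) and hyperparameter names match the stable-baselines (v2) API assumed in this sketch, so treat it as illustrative rather than the authors' code.

```python
import gym
from stable_baselines import DQN  # assumed library; see lead-in above

env = gym.make("CartPole-v1")     # 500-step episodes, +1 reward per step
target = DQN(
    "MlpPolicy", env,
    gamma=0.99, learning_rate=1e-3,              # Table 2
    buffer_size=50000, learning_starts=1000,
    target_network_update_freq=500,
    prioritized_replay=True, param_noise=True,   # parameter-space noise exploration
    exploration_fraction=0.1, exploration_final_eps=0.02,
)
target.learn(total_timesteps=int(1e5))           # No. Timesteps = 10^5
```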

6 Results

6.1 Resilience Benchmarks

We consider the white-box setting in the training of adversarial agents for resilience measurement. For the DQN target, the optimal state-action value function Q* of the target is directly utilized. As for the A2C and PPO2 targets, the state-action value function is calculated from the internally-available state value estimates V*(s) according to the following transformation:

Q*(s_t, a) = r(s_t, a) + γ V*(s_{t+1})    (3)

where s_{t+1} is the state resulting from a transition out of state s_t by implementing action a.
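
Equation (3) is a one-step lookahead, which in code reduces to the following sketch. Here `env_model` (a resettable copy of CartPole's deterministic dynamics returning a reward and next state) and `v_star` (the critic's value estimate) are our stand-in names, not API calls from the paper.

```python
def q_from_v(env_model, v_star, s_t, actions, gamma=0.99):
    """Recover Q*(s_t, a) from state-value estimates via Eq. (3)."""
    q_values = {}
    for a in actions:
        r, s_next = env_model(s_t, a)        # deterministic one-step transition
        q_values[a] = r + gamma * v_star(s_next)
    return q_values
```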

Training Results: The training progress plots of the adversarial DQN policy against the three target policies are presented in Figs. 1-3. It can be seen that all three policies converge to the same optima. However, for the adversary targeting the DQN policy, convergence is achieved at a higher number of training steps. It is noteworthy that for all three policies, the 100-episode mean of the minimum number of perturbations at convergence is almost similar (as reported in Table 6), with A2C having the largest value of 7.69 perturbations, PPO2 at 7.49 perturbations, and DQN having the lowest value of 7.13. Also, the test-time performance of these trained policies indicates similar results, with DQN requiring 6.95 perturbations to incur an adversarial regret of 491.15, PPO2 requiring 7.72 perturbations for an adversarial regret of 490.47, and A2C requiring 8.71 perturbations for an adversarial regret of 488.16. Accordingly, we can interpret these results as follows: the DQN policy has the lowest adversarial resilience among the three, followed by the PPO2 policy. Within the context of this comparison, the A2C policy is found to be the most resilient to state-space perturbation attacks.

[Fig. 1. Adversarial Training Progress for Resilience Benchmarking of the DQN Policy]
[Fig. 2. Adversarial Training Progress for Resilience Benchmarking of the A2C Policy]
[Fig. 3. Adversarial Training Progress for Resilience Benchmarking of the PPO2 Policy]

6.2 Test-Time Step-Perturbation Distribution

To investigate the state-transition vulnerability of each policy, we also study the frequency of perturbing states at each timestep of an episode for the three adversarial policies. The results, presented in Figs. 4-6, illustrate that for all three policies, the initial timesteps have been the subject of most perturbations. This result is noteworthy, as it contradicts the assumption of Lin et al. [7] that the most effective adversarial perturbations are those that are mounted towards the terminal state of the environment.

Table 6. Comparison of Test-Time and Training-Time Resilience Measurements for DQN, A2C, and PPO2 Policies

    Target Policy | Max. Regret | Avg. Regret (Training) | Avg. No. Perturbations (Training) | Avg. Regret (Test) | Avg. No. Perturbations (Test)
    DQN           | 492         | 491.24                 | 7.13                              | 491.15             | 6.95
    A2C           | 492         | 491.44                 | 7.69                              | 488.16             | 8.71
    PPO2          | 492         | 491.72                 | 7.49                              | 490.47             | 7.72

[Fig. 4. Perturbation Count per Episodic Timestep in 100 Runs Targeting the DQN Policy]
[Fig. 5. Perturbation Count per Episodic Timestep in 100 Runs Targeting the A2C Policy]
[Fig. 6. Perturbation Count per Episodic Timestep in 100 Runs Targeting the PPO2 Policy]

6.3 Robustness Benchmarks

To demonstrate the performance of our proposed technique for benchmarking the robustness of DRL policies, we provide the training-time results for the two cases of δ_max = 10 and δ_max = 5 for the DQN, A2C, and PPO2 policies. As illustrated in Figs. 7-9, all three adversarial policies converge with similar minimum perturbation counts as those obtained in the resilience analysis. This is expected, as the resilience analysis established that the minimum number of actions required for maximum regret is roughly 7.5, which is less than the available budget of δ_max = 10. As for the case of δ_max = 5, Figs. 10-12 demonstrate significant differences between the three policies. In Fig. 10, it can be seen that at 5 actions, convergence occurs with an adversarial regret of 462.5 for DQN, while for A2C, the best 5-action indication of convergence occurs at an adversarial regret of 224. As for PPO2, this value is 268.2. These results indicate a similar ranking of the robustness of these policies, with DQN being the least robust to a maximum of 5 perturbations, and A2C prevailing as the most robust policy to a maximum of 5 perturbations.

6.4 Case 1: δ_max = 10

[Fig. 7. Adversarial Training Progress for Robustness Benchmarking - DQN, δ_max = 10]
[Fig. 8. Adversarial Training Progress for Robustness Benchmarking - A2C, δ_max = 10]
[Fig. 9. Adversarial Training Progress for Robustness Benchmarking - PPO2, δ_max = 10]

6.5 Case 2: δ_max = 5

[Fig. 10. Adversarial Training Progress for Robustness Benchmarking - DQN, δ_max = 5]
[Fig. 11. Adversarial Training Progress for Robustness Benchmarking - A2C, δ_max = 5]
[Fig. 12. Adversarial Training Progress for Robustness Benchmarking - PPO2, δ_max = 5]

7 Conclusion

We presented two RL-based techniques for benchmarking the resilience and robustness of DRL policies to adversarial perturbations of state transition dynamics. Experimental evaluation of our proposals demonstrates the feasibility of these techniques for quantitative analysis of policies with regard to their sensitivity to state transition dynamics. A promising avenue of further exploration is to study and extend the proposed methodologies for the evaluation of generalization in DRL policies.

References

1. Behzadan, V., Munir, A.: Vulnerability of deep reinforcement learning to policy induction attacks. In: International Conference on Machine Learning and Data Mining in Pattern Recognition, pp. 262-275. Springer (2017)
2. Behzadan, V., Munir, A.: The faults in our pi stars: Security issues and open challenges in deep reinforcement learning. arXiv preprint arXiv:1810.10369 (2018)
3. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
4. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
5. Huang, S., Papernot, N., Goodfellow, I., Duan, Y., Abbeel, P.: Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284 (2017)
6. Hussein, A., Gaber, M.M., Elyan, E., Jayne, C.: Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR) 50(2), 21 (2017)
7. Lin, Y.C., Hong, Z.W., Liao, Y.H., Shih, M.L., Liu, M.Y., Sun, M.: Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748 (2017)
8. Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204 (2017)
