Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control

Armin Lederer, Technical University of Munich, armin.lederer@tum.de
Jonas Umlauft, Technical University of Munich, jonas.umlauft@tum.de
Sandra Hirche, Technical University of Munich, hirche@tum.de

Abstract

Data-driven models are subject to model errors due to limited and noisy training data. Key to the application of such models in safety-critical domains is the quantification of their model error. Gaussian processes provide such a measure, and uniform error bounds have been derived which allow safe control based on these models. However, existing error bounds require restrictive assumptions. In this paper, we employ the Gaussian process distribution and continuity arguments to derive a novel uniform error bound under weaker assumptions. Furthermore, we demonstrate how this distribution can be used to derive probabilistic Lipschitz constants and analyze the asymptotic behavior of our bound. Finally, we derive safety conditions for the control of unknown dynamical systems based on Gaussian process models and evaluate them in simulations of a robotic manipulator.

1 Introduction

The application of machine learning techniques in control tasks bears significant promise. The identification of highly nonlinear systems through supervised learning techniques [1] and automated policy search in reinforcement learning [2] enable the control of complex unknown systems. Nevertheless, applications in safety-critical domains, like autonomous driving, robotics, or aviation, remain rare. Even though the data-efficiency and performance of self-learning controllers are impressive, engineers still hesitate to rely on learning approaches if the physical integrity of systems is at risk, in particular if humans are involved. Empirical evaluations, e.g.
for autonomous driving [3], are available; however, this might not be sufficient to reach the desired level of reliability and autonomy.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Limited and noisy training data lead to imperfections in data-driven models [4]. This makes the quantification of the uncertainty in the model and the knowledge about a model's ignorance key for the utilization of learning approaches in safety-critical applications. Gaussian process (GP) models provide this measure of their own imprecision and have therefore gained attention in the control community [5, 6, 7]. These approaches heavily rely on error bounds for Gaussian process regression and are therefore limited by the strict assumptions made in previous works on GP uniform error bounds [8, 9, 10, 11].

The main contribution of this paper is therefore the derivation of a novel GP uniform error bound, which requires less prior knowledge and fewer assumptions than previous approaches and is therefore applicable to a wider range of problems. Furthermore, we derive a Lipschitz constant for the samples of GPs and investigate the asymptotic behavior in order to demonstrate that arbitrarily small error bounds can be guaranteed with sufficient computational resources and data. The proposed GP bounds are employed to derive safety guarantees for unknown dynamical systems which are controlled based on a GP model. By employing Lyapunov theory [12], we prove that the closed-loop system (here we take a robotic manipulator as example) converges to a small fraction of the state space and can therefore be considered safe.

The remainder of this paper is structured as follows: We briefly introduce Gaussian process regression and discuss related error bounds in Section 2. The novel GP uniform error bound, the probabilistic Lipschitz constant, and the asymptotic analysis are presented in Section 3.
In Section 4 we show safety of a GP model based controller and evaluate it on a robotic manipulator in Section 5.

2 Background

2.1 Gaussian Process Regression and Uniform Error Bounds

Gaussian process regression is a Bayesian machine learning method based on the assumption that any finite collection of random variables $y_i \in \mathbb{R}$ follows a joint Gaussian distribution with prior mean $0$ and covariance kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ [13]. Therefore, the variables $y_i$ are observations of a sample function $f: \mathbb{X} \subset \mathbb{R}^d \to \mathbb{R}$ of the GP distribution, perturbed by zero-mean Gaussian noise with variance $\sigma_n^2 \in \mathbb{R}_{+,0}$. By concatenating $N$ input data points $x_i$ in a matrix $X_N$, the elements of the GP kernel matrix $K(X_N, X_N)$ are defined as $K_{ij} = k(x_i, x_j)$, $i, j = 1, \ldots, N$, and $k(X_N, x)$ denotes the kernel vector, which is defined accordingly. The probability distribution of the GP at a point $x$ conditioned on the training data concatenated in $X_N$ and $y_N$ is then given as a normal distribution with mean

$$\nu_N(x) = k(x, X_N)\left(K(X_N, X_N) + \sigma_n^2 I_N\right)^{-1} y_N$$

and variance

$$\sigma_N^2(x, x') = k(x, x') - k(x, X_N)\left(K(X_N, X_N) + \sigma_n^2 I_N\right)^{-1} k(X_N, x').$$

A major reason for the popularity of GPs and related approaches in safety-critical applications is the existence of uniform error bounds for the regression error, which is defined as follows.

Definition 2.1. Gaussian process regression exhibits a uniformly bounded error on a compact set $\mathbb{X} \subset \mathbb{R}^d$ if there exists a function $\eta(x)$ such that

$$|\nu_N(x) - f(x)| \leq \eta(x) \quad \forall x \in \mathbb{X}. \tag{1}$$

If this bound holds with probability of at least $1 - \delta$ for some $\delta \in (0, 1)$, it is called a probabilistic uniform error bound.

2.2 Related Work

For many methods closely related to Gaussian process regression, uniform error bounds are very common. When dealing with noise-free data, i.e.
in the interpolation of multivariate functions, results from the field of scattered data approximation with radial basis functions can be applied [14]. In fact, many of the results from interpolation with radial basis functions can be directly applied to noise-free GP regression with stationary kernels. The classical result in [15] employs Fourier transform methods to derive an error bound for functions in the reproducing kernel Hilbert space (RKHS) attached to the interpolation kernel. By additionally exploiting properties of the RKHS, a uniform error bound with increased convergence rate is derived in [16]. Typically, this form of bound crucially depends on the so-called power function, which corresponds to the posterior standard deviation of Gaussian process regression under certain conditions [17]. In [18], an $L^p$ error bound for data distributed on a sphere is developed, while the bound in [19] extends existing approaches to functions from Sobolev spaces. Bounds for anisotropic kernels and the derivatives of the interpolant are developed in [20]. A Sobolev-type error bound for interpolation with Matérn kernels is derived in [21]. Moreover, it is shown that convergence of the interpolation error implies convergence of the GP posterior variance.

Notation: Lower/upper case bold symbols denote vectors/matrices, and $\mathbb{R}_+$ / $\mathbb{R}_{+,0}$ all real positive/non-negative numbers. $\mathbb{N}$ denotes all natural numbers, $I_n$ the $n \times n$ identity matrix, the dot in $\dot{x}$ the derivative of $x$ with respect to time, and $\|\cdot\|$ the Euclidean norm. A function $f(x)$ is said to admit a modulus of continuity $\omega: \mathbb{R}_+ \to \mathbb{R}_+$ if and only if $|f(x) - f(x')| \leq \omega(\|x - x'\|)$. The $\tau$-covering number $M(\tau, \mathbb{X})$ of a set $\mathbb{X}$ (with respect to the Euclidean metric) is defined as the minimum number of spherical balls with radius $\tau$ required to completely cover $\mathbb{X}$. Big O notation is used to describe the asymptotic behavior of functions.

Regularized kernel regression is a method which extends many ideas from scattered data interpolation to noisy observations, and it is highly related to Gaussian process regression, as pointed out in [17]. In fact, the GP posterior mean function is identical to kernel ridge regression with squared cost function [13]. Many error bounds such as [22] depend on the empirical $L^2$ covering number and the norm of the unknown function in the RKHS attached to the regression kernel. In [23], the effective dimension of the feature space, in which regression is performed, is employed to derive a probabilistic uniform error bound. The effect of approximations of the kernel, e.g. with the Nyström method, on the regression error is analyzed in [24]. Tight error bounds using empirical $L^2$ covering numbers are derived under mild assumptions in [25]. Finally, error bounds for general regularization are developed in [26], which depend on the regularization and the RKHS norm of the function.

Using similar RKHS-based methods for Gaussian process regression, probabilistic uniform error bounds depending on the maximal information gain and the RKHS norm have been developed in [8]. These constants pose a high hurdle which has prevented the rigorous application of this work in control, and typically heuristic constants without theoretical foundation are applied, see e.g. [27]. While regularized kernel regression allows a wide range of observation noise distributions, the bound in [8] only holds for bounded sub-Gaussian noise. Based on this work, an improved bound is derived in [9] in order to analyze the regret of an upper confidence bound algorithm in multi-armed bandit problems. Although these bounds are frequently used in safe reinforcement learning and control, they suffer from several issues.
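The stated equivalence between the GP posterior mean and kernel ridge regression can be checked numerically: the objective $\|y - K\alpha\|^2 + \sigma_n^2\, \alpha^T K \alpha$ is convex and has the GP mean weights $\alpha = (K + \sigma_n^2 I)^{-1} y$ as a stationary point, hence as its minimizer. A small sketch (illustrative data and hyperparameters, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(15, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(15)
sn2 = 0.01

# SE Gram matrix K_ij = k(x_i, x_j)
K = np.exp(-0.5 * (X[:, None, 0] - X[None, :, 0]) ** 2)

# GP posterior mean weights: nu_N(x) = k(x, X_N)^T alpha
alpha = np.linalg.solve(K + sn2 * np.eye(15), y)

# kernel ridge objective with squared cost and RKHS-norm regularizer
def J(a):
    res = y - K @ a
    return res @ res + sn2 * (a @ K @ a)

# alpha is a stationary point of the convex J, so random perturbations
# never decrease the objective
worst = min(J(alpha + 0.1 * rng.standard_normal(15)) - J(alpha)
            for _ in range(100))
```

Up to floating-point error, `worst` is non-negative, confirming that the GP mean weights minimize the kernel ridge objective.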
On the one hand, they depend on constants which are very difficult to calculate. While this is no problem for theoretical analysis, it prohibits the integration of these bounds into algorithms, and often estimates of the constants must be used. On the other hand, they suffer from the general problem of RKHS approaches: the space of functions for which the bounds hold becomes smaller the smoother the kernel is [19]. In fact, the RKHS attached to a covariance kernel is usually small compared to the support of the prior distribution of a Gaussian process [28].

The latter issue has been addressed by considering the support of the prior distribution of the Gaussian process as belief space. Based on bounds for the suprema of GPs [29] and existing error bounds for interpolation with radial basis functions, a probabilistic uniform error bound for Kriging (an alternative term for GP regression with noise-free training data) is derived in [30]. However, to the best of our knowledge, the uniform error of Gaussian process regression with noisy observations has not been analyzed with the help of the prior GP distribution.

3 Probabilistic Uniform Error Bound

While probabilistic uniform error bounds for the cases of noise-free observations and the restriction to subspaces of a RKHS are widely used, they often rely on constants which are hard to determine and are typically limited to unnecessarily small function spaces. The inherent probability distribution of GPs, whose support is the largest possible function space for regression with a certain GP, has not been exploited to derive uniform error bounds for Gaussian process regression with noisy observations. Under the weak assumption of Lipschitz continuity of the covariance kernel and the unknown function, a directly computable probabilistic uniform error bound is derived in Section 3.1.
We demonstrate how Lipschitz constants for unknown functions directly follow from the assumed distribution over the function space in Section 3.2. Finally, we show that an arbitrarily small error bound can be reached with sufficiently many and well-distributed training data in Section 3.3.

3.1 Exploiting Lipschitz Continuity of the Unknown Function

In contrast to the RKHS-based approaches in [8, 9], we make use of the inherent probability distribution over the function space defined by Gaussian processes. We achieve this through the following assumption.

Assumption 3.1. The unknown function $f(\cdot)$ is a sample from a Gaussian process $\mathcal{GP}(0, k(x, x'))$, and observations $y = f(x) + \epsilon$ are perturbed by zero-mean i.i.d. Gaussian noise $\epsilon$ with variance $\sigma_n^2$.

This assumption includes abundant information about the regression problem. The space of sample functions $\mathcal{F}$ is limited through the choice of the kernel $k(\cdot, \cdot)$ of the Gaussian process. Using Mercer's decomposition [31] $\phi_i(x)$, $i = 1, \ldots, \infty$, of the kernel $k(\cdot, \cdot)$, this space is defined through

$$\mathcal{F} = \left\{ f(x) : \exists\, \lambda_i,\, i = 1, \ldots, \infty \text{ such that } f(x) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \right\}, \tag{2}$$

which contains all functions that can be represented in terms of the kernel $k(\cdot, \cdot)$. By choosing a suitable class of covariance functions $k(\cdot, \cdot)$, this space can be designed in order to incorporate prior knowledge of the unknown function $f(\cdot)$. For example, for covariance kernels $k(\cdot, \cdot)$ which are universal in the sense of [32], continuous functions can be learned with arbitrary precision. Moreover, for the squared exponential kernel, the space of sample functions corresponds to the space of continuous functions on $\mathbb{X}$, while its RKHS is limited to analytic functions [28]. Furthermore, Assumption 3.1 defines a prior GP distribution over the sample space $\mathcal{F}$, which is the basis for the calculation of the posterior probability.
The prior distribution is typically shaped by the hyperparameters of the covariance kernel $k(\cdot, \cdot)$; e.g., slowly varying functions can be assigned a higher probability than functions with high derivatives. Finally, Assumption 3.1 allows Gaussian observation noise, which is in contrast to the bounded noise required e.g. in [8, 9].

In addition to Assumption 3.1, we need Lipschitz continuity of the kernel $k(\cdot, \cdot)$ and the unknown function $f(\cdot)$. We define the Lipschitz constant of a differentiable covariance kernel $k(\cdot, \cdot)$ as

$$L_k := \max_{x, x' \in \mathbb{X}} \left\| \begin{bmatrix} \frac{\partial k(x, x')}{\partial x_1} & \cdots & \frac{\partial k(x, x')}{\partial x_d} \end{bmatrix}^T \right\|. \tag{3}$$

Since most of the practically used covariance kernels $k(\cdot, \cdot)$, such as the squared exponential and Matérn kernels, are Lipschitz continuous [13], this is a weak restriction on covariance kernels. However, it allows us to prove continuity of the posterior mean function $\nu_N(\cdot)$ and the posterior standard deviation $\sigma_N(\cdot)$, which is exploited to derive a probabilistic uniform error bound in the following theorem. The proofs of all following theorems can be found in the supplementary material.

Theorem 3.1. Consider a zero-mean Gaussian process defined through the continuous covariance kernel $k(\cdot, \cdot)$ with Lipschitz constant $L_k$ on the compact set $\mathbb{X}$. Furthermore, consider a continuous unknown function $f: \mathbb{X} \to \mathbb{R}$ with Lipschitz constant $L_f$ and $N \in \mathbb{N}$ observations $y_i$ satisfying Assumption 3.1. Then, the posterior mean function $\nu_N(\cdot)$ and standard deviation $\sigma_N(\cdot)$ of a Gaussian process conditioned on the training data $\{(x_i, y_i)\}_{i=1}^N$ are continuous with Lipschitz constant $L_{\nu_N}$ and modulus of continuity $\omega_{\sigma_N}(\cdot)$ on $\mathbb{X}$ such that

$$L_{\nu_N} \leq L_k \sqrt{N} \left\| \left( K(X_N, X_N) + \sigma_n^2 I_N \right)^{-1} y_N \right\|, \tag{4}$$

$$\omega_{\sigma_N}(\tau) \leq \sqrt{2 \tau L_k \left( 1 + N \left\| \left( K(X_N, X_N) + \sigma_n^2 I_N \right)^{-1} \right\| \max_{x, x' \in \mathbb{X}} k(x, x') \right)}. \tag{5}$$

Moreover, pick $\delta \in (0, 1)$, $\tau \in \mathbb{R}_+$ and set

$$\beta(\tau) = 2 \log \frac{M(\tau, \mathbb{X})}{\delta}, \tag{6}$$

$$\gamma(\tau) = (L_{\nu_N} + L_f)\, \tau + \sqrt{\beta(\tau)}\, \omega_{\sigma_N}(\tau). \tag{7}$$

Then, it holds that

$$P\left( |f(x) - \nu_N(x)| \leq \sqrt{\beta(\tau)}\, \sigma_N(x) + \gamma(\tau), \ \forall x \in \mathbb{X} \right) \geq 1 - \delta. \tag{8}$$

The parameter $\tau$ is in fact the grid constant of a grid used in the derivation of the theorem. The error on the grid can be bounded by exploiting properties of the Gaussian distribution [8], resulting in a dependency on the number of grid points. Eventually, this leads to the constant $\beta(\tau)$ defined in (6), since the covering number $M(\tau, \mathbb{X})$ is the minimum number of points in a grid over $\mathbb{X}$ with grid constant $\tau$. By employing the Lipschitz constant $L_{\nu_N}$ and the modulus of continuity $\omega_{\sigma_N}(\cdot)$, which are trivially obtained due to Lipschitz continuity of the covariance kernel $k(\cdot, \cdot)$, as well as the Lipschitz constant $L_f$, the error bound is extended to the complete set $\mathbb{X}$, which results in (8).

Note that most of the expressions in Theorem 3.1 can be directly evaluated. Although our expression for $\beta(\tau)$ depends on the covering number of $\mathbb{X}$, which is in general difficult to calculate, upper bounds can be computed trivially. For example, for a hypercubic set $\mathbb{X} \subset \mathbb{R}^d$ the covering number can be bounded by

$$M(\tau, \mathbb{X}) \leq \left( 1 + \frac{r}{\tau} \right)^d, \tag{9}$$

where $r$ is the edge length of the hypercube. Furthermore, (4) and (5) depend only on the training data and kernel expressions, which can in general be calculated analytically. Therefore, (8) can be computed for fixed $\tau$ and $\delta$ if an upper bound for the Lipschitz constant $L_f$ of the unknown function $f(\cdot)$ is known. Prior bounds on the Lipschitz constant $L_f$ are often available for control systems, e.g. based on simplified first-order physical models. However, we demonstrate a method to obtain probabilistic Lipschitz constants from Assumption 3.1 in Section 3.2.
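The directly computable quantities of Theorem 3.1 can be assembled as follows. This is a hedged sketch, assuming a hypercubic $\mathbb{X}$ (so the covering bound (9) applies), a squared exponential kernel, and a known Lipschitz bound $L_f$; all names and the toy data are illustrative, not from the paper's code.

```python
import numpy as np

def error_bound(X_N, y_N, sigma_N_x, L_k, L_f, edge, d, k_max, sn2,
                delta=0.01, tau=1e-8):
    """Right-hand side of (8) at a point with posterior std sigma_N_x."""
    N = len(y_N)
    # regularized SE Gram matrix K(X_N, X_N) + sn2 * I (kernel is an assumption)
    K_reg = np.exp(-0.5 * ((X_N[:, None, :] - X_N[None, :, :]) ** 2).sum(-1)) \
            + sn2 * np.eye(N)
    # bound (4) on the Lipschitz constant of the posterior mean
    L_nu = L_k * np.sqrt(N) * np.linalg.norm(np.linalg.solve(K_reg, y_N))
    # bound (5) on the modulus of continuity of the posterior std
    omega = np.sqrt(2 * tau * L_k *
                    (1 + N * np.linalg.norm(np.linalg.inv(K_reg), 2) * k_max))
    # (6) with the hypercube covering bound (9), and (7)
    beta = 2 * np.log((1 + edge / tau) ** d / delta)
    gamma = (L_nu + L_f) * tau + np.sqrt(beta) * omega
    return np.sqrt(beta) * sigma_N_x + gamma

rng = np.random.default_rng(0)
X_N = rng.uniform(0, 3, size=(20, 2))
y_N = np.sin(X_N[:, 0]) + 0.1 * rng.standard_normal(20)
eta = error_bound(X_N, y_N, sigma_N_x=0.1, L_k=1.0, L_f=2.0,
                  edge=3.0, d=2, k_max=1.0, sn2=0.01)
```

As discussed above, the $\tau$-dependent term $\gamma(\tau)$ is negligible for small $\tau$, so the bound is dominated by $\sqrt{\beta(\tau)}\, \sigma_N(x)$.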
Therefore, it is trivial to compute all expressions in Theorem 3.1, or upper bounds thereof, which emphasizes the high applicability of Theorem 3.1 in safe control of unknown systems. Moreover, it should be noted that $\tau$ can be chosen arbitrarily small, such that the effect of the constant $\gamma(\tau)$ can always be reduced to an amount which is negligible compared to $\sqrt{\beta(\tau)}\, \sigma_N(x)$. Even conservative approximations of the Lipschitz constants $L_{\nu_N}$ and $L_f$ and a loose modulus of continuity $\omega_{\sigma_N}(\tau)$ do not affect the error bound (8) much, since (6) grows merely logarithmically with diminishing $\tau$. In fact, even the bounds (4) and (5), which grow in the order of $\mathcal{O}(N)$ and $\mathcal{O}(N^{\frac{1}{2}})$, respectively, as shown in the proof of Theorem 3.3, and thus are unbounded, can be compensated, such that a vanishing uniform error bound can be proven under weak assumptions in Section 3.3.

3.2 Probabilistic Lipschitz Constants for Gaussian Processes

If little prior knowledge of the unknown function $f(\cdot)$ is given, it might not be possible to directly derive a Lipschitz constant $L_f$ on $\mathbb{X}$. However, we indirectly assume a certain distribution of the derivatives of $f(\cdot)$ with Assumption 3.1. Therefore, it is possible to derive a probabilistic Lipschitz constant $L_f$ from this assumption, as described in the following theorem.

Theorem 3.2. Consider a zero-mean Gaussian process defined through the covariance kernel $k(\cdot, \cdot)$ with continuous partial derivatives up to the fourth order and partial derivative kernels

$$k^{\partial i}(x, x') = \frac{\partial^2}{\partial x_i \partial x'_i} k(x, x') \quad \forall i = 1, \ldots, d. \tag{10}$$

Let $L_k^{\partial i}$ denote the Lipschitz constants of the partial derivative kernels $k^{\partial i}(\cdot, \cdot)$ on the set $\mathbb{X}$ with maximal extension $r = \max_{x, x' \in \mathbb{X}} \|x - x'\|$.
Then, a sample function $f(\cdot)$ of the Gaussian process is almost surely continuous on $\mathbb{X}$, and with probability of at least $1 - \delta_L$,

$$L_f = \left\| \begin{bmatrix} \sqrt{2 \log \frac{2d}{\delta_L}}\, \max_{x \in \mathbb{X}} \sqrt{k^{\partial 1}(x, x)} + 12\sqrt{6}\, d \max\left\{ \max_{x \in \mathbb{X}} \sqrt{k^{\partial 1}(x, x)},\ \sqrt{r L_k^{\partial 1}} \right\} \\ \vdots \\ \sqrt{2 \log \frac{2d}{\delta_L}}\, \max_{x \in \mathbb{X}} \sqrt{k^{\partial d}(x, x)} + 12\sqrt{6}\, d \max\left\{ \max_{x \in \mathbb{X}} \sqrt{k^{\partial d}(x, x)},\ \sqrt{r L_k^{\partial d}} \right\} \end{bmatrix} \right\| \tag{11}$$

is a Lipschitz constant of $f(\cdot)$ on $\mathbb{X}$.

Note that a higher differentiability of the covariance kernel $k(\cdot, \cdot)$ is required compared to Theorem 3.1. The reason for this is that the proof of Theorem 3.2 exploits the fact that the partial derivative $k^{\partial i}(\cdot, \cdot)$ of a differentiable kernel is again a covariance function, which defines a derivative Gaussian process [33]. In order to obtain continuity of the samples of these derivative processes, the derivative kernels $k^{\partial i}(\cdot, \cdot)$ must be continuously differentiable [34]. Using the metric entropy criterion [34] and the Borell–TIS inequality [35], we exploit the continuity of sample functions and bound their maximum value, which directly translates into the probabilistic Lipschitz constant (11).

Note that all values required in (11) can be directly computed. The maxima of the derivative kernels $k^{\partial i}(\cdot, \cdot)$ as well as their Lipschitz constants $L_k^{\partial i}$ can be calculated analytically for many kernels. Therefore, the Lipschitz constant obtained with Theorem 3.2 can be directly used in Theorem 3.1 through application of the union bound. Since the Lipschitz constant $L_f$ has only a logarithmic dependence on the probability $\delta_L$, small error probabilities for the Lipschitz constant can easily be achieved.

Remark 3.1. The work in [36] also derives estimates for Lipschitz constants. However, it only considers the Lipschitz constant of the posterior mean function, which neglects the probabilistic nature of the GP and thereby underestimates the Lipschitz constants of samples of the GP.
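For an isotropic squared exponential kernel, the per-dimension terms of (11) coincide, since $k^{\partial i}(x, x) = \sigma_f^2 / \ell^2$ for every $i$. The sketch below evaluates (11) under this assumption; the bound `L_dk` on the derivative-kernel Lipschitz constant is an assumed placeholder for illustration, not a value derived in the paper.

```python
import numpy as np

def lipschitz_from_gp(d, r, delta_L, sf=1.0, ell=1.0):
    """Evaluate (11) for an isotropic SE kernel (illustrative sketch)."""
    k_di_max = sf**2 / ell**2        # max of the derivative kernel k_di(x, x)
    L_dk = 3.0 * sf**2 / ell**3      # assumed bound on its Lipschitz constant
    term = (np.sqrt(2 * np.log(2 * d / delta_L)) * np.sqrt(k_di_max)
            + 12 * np.sqrt(6) * d * max(np.sqrt(k_di_max), np.sqrt(r * L_dk)))
    # identical per-dimension entries for the isotropic kernel; (11) is the
    # Euclidean norm of the resulting d-vector
    return np.linalg.norm(np.full(d, term))

# maximal extension of X = [-6, 4] x [-4, 4] as in Section 5.1
L_f = lipschitz_from_gp(d=2, r=np.sqrt(10**2 + 8**2), delta_L=0.01)
```

Consistent with the discussion above, halving $\delta_L$ increases the constant only logarithmically.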
3.3 Analysis of Asymptotic Behavior

In safe reinforcement learning and control of unknown systems, an important question regards the existence of lower bounds for the learning error, because they limit the achievable control performance. It is clear that the available data and constraints on the computational resources pose such lower bounds in practice. However, it is not clear under which conditions, e.g. requirements on computational power, an arbitrarily low uniform error can be guaranteed. The asymptotic analysis of the error bound, i.e. investigation of the bound (8) in the limit $N \to \infty$, can clarify this question. The following theorem is the result of this analysis.

Theorem 3.3. Consider a zero-mean Gaussian process defined through the continuous covariance kernel $k(\cdot, \cdot)$ with Lipschitz constant $L_k$ on the set $\mathbb{X}$. Furthermore, consider an infinite data stream of observations $(x_i, y_i)$ of an unknown function $f: \mathbb{X} \to \mathbb{R}$ with Lipschitz constant $L_f$ and maximum absolute value $\bar{f} \in \mathbb{R}_+$ on $\mathbb{X}$, which satisfies Assumption 3.1. Let $\nu_N(\cdot)$ and $\sigma_N(\cdot)$ denote the mean and standard deviation of the Gaussian process conditioned on the first $N$ observations. If there exists an $\epsilon > 0$ such that the standard deviation satisfies $\sigma_N(x) \in \mathcal{O}\big(\log(N)^{-\frac{1}{2} - \epsilon}\big)$, $\forall x \in \mathbb{X}$, then it holds for every $\delta \in (0, 1)$ that

$$P\left( \sup_{x \in \mathbb{X}} |\nu_N(x) - f(x)| \in \mathcal{O}\big(\log(N)^{-\epsilon}\big) \right) \geq 1 - \delta. \tag{12}$$

In addition to the conditions of Theorem 3.1, the absolute value of the unknown function is required to be bounded by a value $\bar{f}$. This is necessary to bound the Lipschitz constant $L_{\nu_N}$ of the posterior mean function $\nu_N(\cdot)$ in the limit of infinite training data. Even if no such constant is known, it can be derived from properties of the GP under weak conditions, similarly to Theorem 3.2.
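The mechanism behind Theorem 3.3 (the posterior standard deviation shrinking as data accumulates) is easy to observe empirically. A minimal sketch, assuming an SE kernel with unit hyperparameters and training inputs clustered around the query point; all values are illustrative:

```python
import numpy as np

def post_var(X_N, x_star, sn2=0.01):
    """Posterior variance at scalar x_star for 1-D inputs X_N (SE kernel)."""
    d2 = (X_N[:, None, 0] - X_N[None, :, 0]) ** 2
    K = np.exp(-0.5 * d2) + sn2 * np.eye(len(X_N))
    k_s = np.exp(-0.5 * (X_N[:, 0] - x_star) ** 2)
    return 1.0 - k_s @ np.linalg.solve(K, k_s)

rng = np.random.default_rng(0)
# more and more samples concentrated near x = 0 drive sigma_N(0)^2 to zero
vars_ = [post_var(rng.normal(0.0, 0.1, size=(N, 1)), 0.0)
         for N in (5, 50, 500)]
```

This matches the condition discussed below Theorem 3.3: the variance at a point vanishes when the number of samples in a shrinking ball around it grows without bound.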
Based on this restriction, it can be shown that the bound on the Lipschitz constant $L_{\nu_N}$ grows at most with rate $\mathcal{O}(N)$, using the triangle inequality and the fact that the squared norm of the observation noise $\|\epsilon\|^2$ follows a $\chi_N^2$ distribution with probabilistically bounded maximum value [37]. Therefore, we pick $\tau(N) \in \mathcal{O}(N^{-2})$ such that $\gamma(\tau(N)) \in \mathcal{O}(N^{-1})$ and $\beta(\tau(N)) \in \mathcal{O}(\log(N))$, which implies (12).

The condition on the convergence rate of the posterior standard deviation $\sigma_N(\cdot)$ in Theorem 3.3 can be seen as a condition on the distribution of the training data, which depends on the structure of the covariance kernel. In [38, Corollary 3.2], the condition is formulated as follows: let $B_\rho(x)$ denote a set of training points around $x$ with radius $\rho > 0$; then the posterior variance converges to zero if there exists a function $\rho(N)$ for which $\rho(N) \leq k(x, x)/L_k$ $\forall N$, $\lim_{N \to \infty} \rho(N) = 0$, and $\lim_{N \to \infty} |B_{\rho(N)}(x)| = \infty$ hold. This is achieved, e.g., if a constant fraction of all samples lies on the point $x$. In fact, it is straightforward to derive a similar condition for the uniform error bounds in [8, 9]. However, due to their dependence on the maximal information gain, the required decrease rates depend on the covariance kernel $k(\cdot, \cdot)$ and are typically higher. For example, the posterior standard deviation of a Gaussian process with a squared exponential kernel must satisfy $\sigma_N(\cdot) \in \mathcal{O}\big(\log(N)^{-\frac{d}{2} - 2\epsilon}\big)$ for [8] and $\sigma_N(\cdot) \in \mathcal{O}\big(\log(N)^{-\frac{d+1}{2}}\big)$ for [9].

4 Safety Guarantees for Control of Unknown Dynamical Systems

Safety guarantees for dynamical systems, in terms of upper bounds for the tracking error, are becoming more and more relevant as learning controllers are applied in safety-critical applications like autonomous driving or robots working in close proximity to humans [39, 40, 4].
We therefore show how the results in Theorem 3.1 can be applied to safely control unknown dynamical systems. In Section 4.1 we propose a tracking control law for systems which are learned with GPs. The stability of the resulting controller is analyzed in Section 4.2.

4.1 Tracking Control Design

Consider the nonlinear control-affine dynamical system

$$\dot{x}_1 = x_2, \quad \dot{x}_2 = f(x) + u, \tag{13}$$

with state $x = [x_1\ x_2]^T \in \mathbb{X} \subset \mathbb{R}^2$ and control input $u \in \mathbb{U} \subseteq \mathbb{R}$. While the structure of the dynamics (13) is known, the function $f(\cdot)$ is not. However, we assume that it is a sample from a GP with kernel $k(\cdot, \cdot)$. Systems of the form (13) cover a large range of applications, including Lagrangian dynamics and many physical systems.

The task is to define a policy $\pi: \mathbb{X} \to \mathbb{U}$ for which the output $x_1$ tracks the desired trajectory $x_d(t)$ such that the tracking error $e = [e_1\ e_2]^T = x - x_d$ with $x_d = [x_d\ \dot{x}_d]^T$ vanishes over time, i.e. $\lim_{t \to \infty} \|e\| = 0$. For notational simplicity, we introduce the filtered state $r = \lambda e_1 + e_2$, $\lambda \in \mathbb{R}_+$. A well-known method for tracking of control-affine systems is feedback linearization [12], which aims for a model-based compensation of the nonlinearity $f(\cdot)$ using an estimate $\hat{f}(\cdot)$ and then applies linear control principles for the tracking. The feedback linearizing policy reads as

$$u = \pi(x) = -\hat{f}(x) + \nu, \tag{14}$$

where the linear control law $\nu$ is the PD controller

$$\nu = \ddot{x}_d - k_c r - \lambda e_2, \tag{15}$$

with control gain $k_c \in \mathbb{R}_+$. This results in the dynamics of the filtered state

$$\dot{r} = f(x) - \hat{f}(x) - k_c r. \tag{16}$$

Assuming training data of the real system $y_i = f(x_i) + \epsilon$, $i = 1, \ldots, N$, $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$, are available, we utilize the posterior mean function $\nu_N(\cdot)$ for the model estimate $\hat{f}(\cdot)$. This implies that observations of $\dot{x}_2$ are corrupted by noise, while $x$ is measured free of noise.
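The control structure (14)-(16) can be illustrated in a short simulation. The sketch below uses the true nonlinearity as the model estimate $\hat{f}$ to isolate the controller (in the paper, $\hat{f}$ is the GP posterior mean $\nu_N$), with the synthetic $f$ used later in Section 5.1; gains and step size are illustrative.

```python
import numpy as np

# with a perfect model f_hat = f, (16) reduces to r_dot = -k_c r,
# so the tracking error decays exponentially
f = lambda x: 1 - np.sin(x[0]) + 1 / (1 + np.exp(-x[1]))
f_hat = f                               # perfect model, for illustration only
kc, lam, dt = 2.0, 1.0, 1e-3

x = np.array([1.0, 0.0])                # initial state off the reference
for k in range(int(30 / dt)):
    t = k * dt
    xd, xd_d, xd_dd = 2*np.sin(t), 2*np.cos(t), -2*np.sin(t)
    e1, e2 = x[0] - xd, x[1] - xd_d
    r = lam * e1 + e2                   # filtered state
    u = -f_hat(x) + xd_dd - kc*r - lam*e2   # policy (14) with PD law (15)
    x = x + dt * np.array([x[1], f(x) + u]) # Euler step of the plant (13)

final_err = abs(x[0] - 2 * np.sin(30.0))
```

With a GP model instead of the perfect estimate, the residual $f - \nu_N$ enters (16) as a bounded disturbance, which is exactly what the stability analysis in Section 4.2 quantifies.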
This is of course debatable, but in practice, measuring the time derivative is usually realized with finite difference approximations, which inject significantly more noise than a direct measurement.

4.2 Stability Analysis

Due to safety constraints, e.g. for robots interacting with humans, it is usually necessary to verify that the model $\hat{f}(\cdot)$ is sufficiently precise and that the parameters of the controller $k_c$, $\lambda$ are chosen properly. These safety certificates can be achieved if there exists an upper bound for the tracking error, as defined in the following.

Definition 4.1 (Ultimate Boundedness). The trajectory $x(t)$ of a dynamical system $\dot{x} = f(x, u)$ is globally ultimately bounded if there exists a positive constant $b \in \mathbb{R}_+$ such that for every $a \in \mathbb{R}_+$ there is a $T = T(a, b) \in \mathbb{R}_+$ such that $\|x(t_0)\| \leq a \Rightarrow \|x(t)\| \leq b$, $\forall t \geq t_0 + T$.

Since the solutions $x(t)$ cannot be computed analytically, a stability analysis is necessary, which allows conclusions regarding the closed-loop behavior without running the policy on the real system [12].

Theorem 4.1. Consider a control-affine system (13), where $f(\cdot)$ admits a Lipschitz constant $L_f$ on $\mathbb{X} \subset \mathbb{R}^d$. Assume that $f(\cdot)$ and the observations $y_i$, $i = 1, \ldots, N$, satisfy the conditions of Assumption 3.1. Then, the feedback linearizing controller (14) with $\hat{f}(\cdot) = \nu_N(\cdot)$ guarantees with probability $1 - \delta$ that the tracking error $e$ converges to the set

$$B = \left\{ x \in \mathbb{X} \,:\, \|e\| \leq \frac{\sqrt{\beta(\tau)}\, \sigma_N(x) + \gamma(\tau)}{k_c \sqrt{\lambda^2 + 1}} \right\}, \tag{17}$$

with $\beta(\tau)$ and $\gamma(\tau)$ defined in Theorem 3.1.

Based on Lyapunov theory, it can be shown that the tracking error converges if the feedback term $|k_c r|$ dominates the model error $|f(\cdot) - \hat{f}(\cdot)|$. As Theorem 3.1 bounds the model error, the set on which $|k_c r| > \sqrt{\beta(\tau)}\, \sigma_N(x) + \gamma(\tau)$ holds can be computed.
It can directly be seen that the ultimate bound can be made arbitrarily small by increasing the gains $\lambda$, $k_c$, or with more training points to decrease $\sigma_N(\cdot)$. Computing the set $B$ allows checking whether the controller (14) adheres to the safety requirements.

5 Numerical Evaluation

We evaluate our theoretical results in two simulations (Matlab code is available online: https://gitlab.lrz.de/ga68car/GPerrorbounds4safecontrol). In Section 5.1, we investigate the effect of applying Theorem 3.2 to determine a probabilistic Lipschitz constant for an unknown synthetic system. Furthermore, we analyze the effect of unevenly distributed training samples on the tracking error bound from Theorem 4.1. In Section 5.2, we apply the feedback linearizing controller (14) to a tracking problem with a robotic manipulator.

Figure 1: Snapshots of the state trajectory (blue) as it approaches the desired trajectory (green). In low-uncertainty areas (yellow background), the set B (red) is significantly smaller than in high-uncertainty areas (blue background).

Figure 2: When the ultimate bound (red) is large, the tracking error (blue) increases due to the less precise model.

5.1 Synthetic System with Unknown Lipschitz Constant $L_f$

As an example for a system of the form (13), we consider $f(x) = 1 - \sin(x_1) + \frac{1}{1 + \exp(-x_2)}$. The training set is formed of 81 points on a uniform grid over $[0, 3] \times [-3, 3]$ with $\sigma_n^2 = 0.01$. The reference trajectory is a circle $x_d(t) = 2\sin(t)$, and the controller gains are $k_c = 2$ and $\lambda = 1$. We choose probabilities of failure $\delta = 0.01$, $\delta_L = 0.01$ and set $\tau = 10^{-8}$. The state space is the rectangle $\mathbb{X} = [-6, 4] \times [-4, 4]$.
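This setup is easy to reconstruct: 81 training points on a uniform grid, noisy observations of $f$, and the GP posterior from Section 2.1. The kernel hyperparameters below are assumed (the paper optimizes them), so only qualitative behavior is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda X: 1 - np.sin(X[:, 0]) + 1 / (1 + np.exp(-X[:, 1]))

# 9 x 9 uniform grid over [0, 3] x [-3, 3] as in Section 5.1
g1, g2 = np.meshgrid(np.linspace(0, 3, 9), np.linspace(-3, 3, 9))
X_N = np.column_stack([g1.ravel(), g2.ravel()])
sn2 = 0.01
y_N = f(X_N) + np.sqrt(sn2) * rng.standard_normal(81)

def k(A, B, ell=1.0, sf=1.5):        # assumed SE hyperparameters
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

K = k(X_N, X_N) + sn2 * np.eye(81)
# test points inside the training region
Xs = np.column_stack([rng.uniform(0.5, 2.5, 200), rng.uniform(-2.5, 2.5, 200)])
ks = k(X_N, Xs)
mu = ks.T @ np.linalg.solve(K, y_N)                       # posterior mean
var = k(Xs, Xs).diagonal() - (ks * np.linalg.solve(K, ks)).sum(0)

max_err = np.max(np.abs(mu - f(Xs)))
```

Inside the training region, the model error stays well below the prior signal scale, consistent with the small set $B$ observed over low-uncertainty areas in Fig. 1.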
A squared exponential kernel with automatic relevance determination is utilized, for which $L_k$ and $\max_{x, x' \in \mathbb{X}} k(x, x')$ are derived analytically for the optimized hyperparameters. We make use of Theorem 3.2 to estimate the Lipschitz constant $L_f$, which turns out to be a conservative bound (by a factor of 10 to 100). However, this is not crucial, because $\tau$ can be chosen arbitrarily small and $\gamma(\tau)$ is dominated by $\sqrt{\beta(\tau)}\, \omega_{\sigma_N}(\tau)$. As Theorems 3.1 and 3.2 are subsequently utilized in this example, the union bound can be applied to combine $\delta$ and $\delta_L$. The results are shown in Figs. 1 and 2. Both plots show that the safety bound here is rather conservative, which also results from the fact that the violation probability was set to 1%.

5.2 Robotic Manipulator with 2 Degrees of Freedom

We consider a planar robotic manipulator in the $z_1$-$z_2$-plane with 2 degrees of freedom (DoFs), with unit lengths and unit masses/inertias for all links. For this example, we consider $L_f$ to be known and extend Theorem 3.1 to the multidimensional case using the union bound. The state space is here four-dimensional, $[q_1\ \dot{q}_1\ q_2\ \dot{q}_2]$, and we consider $\mathbb{X} = [-\pi, \pi]^4$. The 81 training points are distributed in $[-1, 1]^4$, and the control gain is $k_c = 7$, while the other constants remain the same as in Section 5.1. The desired trajectories for both joints are again sinusoidal, as shown in Fig. 3 on the right side. The robot dynamics are derived according to [41, Chapter 4].

Figure 3: The task space of the robot (left) shows the robot is guaranteed to remain in B (red) after a transient phase. Hence, the remaining state space X \ B (green) can be considered safe. The joint angles and velocities (right) converge to the desired trajectories (dashed lines) over time.
Theorem 3.1 allows us to derive an error bound in the joint space of the robot according to Theorem 4.1, which can be transformed into the task space as shown in Fig. 3 on the left. Thus, based on the learned (initially unknown) dynamics, it can be guaranteed that the robot will not leave the depicted area, which can thereby be considered safe. Previous error bounds for GPs are not applicable to this practical setting, because they i) do not allow the observation noise on the training data to be Gaussian [8], which is a common assumption in robotics, ii) utilize constants which cannot be computed efficiently (e.g., the maximal information gain in [42]), or iii) make assumptions that are difficult to verify in practice (e.g., the RKHS norm of the unknown dynamical system [6]).

6 Conclusion

This paper presents a novel uniform error bound for Gaussian process regression. By exploiting the inherent probability distribution of Gaussian processes instead of the reproducing kernel Hilbert space attached to the covariance kernel, a wider class of functions can be considered. Furthermore, we demonstrate how probabilistic Lipschitz constants can be estimated from the GP distribution and derive sufficient conditions to reach arbitrarily small uniform error bounds. We employ the derived results to show safety bounds for a tracking control algorithm and evaluate them in simulation for a robotic manipulator.

Acknowledgments

Armin Lederer gratefully acknowledges financial support from the German Academic Scholarship Foundation.

References

[1] P. M. Nørgård, O. Ravn, N. K. Poulsen, and L. K. Hansen, Neural Networks for Modelling and Control of Dynamical Systems - A Practitioner's Handbook. London: Springer, 2000.

[2] M. P. Deisenroth, "A Survey on Policy Search for Robotics," Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1-142, 2013.

[3] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R.
Cheng-Yue, F. Mujica, A. Coates, and A. Y. Ng, "An Empirical Evaluation of Deep Learning on Highway Driving," pp. 1-7, 2015. [Online]. Available: http://arxiv.org/abs/1504.01716

[4] J. Umlauft, Y. Fanger, and S. Hirche, "Bayesian Uncertainty Modeling for Programming by Demonstration," in Proceedings of the IEEE Conference on Robotics and Automation, 2017, pp. 6428-6434.

[5] T. Beckers, D. Kulić, and S. Hirche, "Stable Gaussian Process based Tracking Control of Euler-Lagrange Systems," Automatica, vol. 103, no. 23, pp. 390-397, 2019.

[6] F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, "Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes," in Proceedings of the IEEE Conference on Decision and Control, 2016, pp. 4661-4666.

[7] Y. Fanger, J. Umlauft, and S. Hirche, "Gaussian Processes for Dynamic Movement Primitives with Application in Knowledge-based Cooperation," in Proceedings of the IEEE Conference on Intelligent Robots and Systems, 2016, pp. 3913-3919.

[8] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, "Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting," IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250-3265, 2012.

[9] S. R. Chowdhury and A. Gopalan, "On Kernelized Multi-armed Bandits," in Proceedings of the International Conference on Machine Learning, 2017, pp. 844-853.

[10] J. Umlauft, L. Pöhler, and S. Hirche, "An Uncertainty-Based Control Lyapunov Approach for Control-Affine Systems Modeled by Gaussian Process," IEEE Control Systems Letters, vol. 2, no. 3, pp. 483-488, 2018.

[11] J. Umlauft, T. Beckers, and S. Hirche, "Scenario-based Optimal Control for Gaussian Process State Space Models," in Proceedings of the European Control Conference, 2018.

[12] H. K. Khalil, Nonlinear Systems, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 2002.

[13] C. E. Rasmussen and C. K. I.
Williams, Gaussian Processes for Machine Learning. Cambridge, MA: The MIT Press, 2006.

[14] H. Wendland, Scattered Data Approximation. Cambridge University Press, 2004.

[15] Z. M. Wu and R. Schaback, "Local Error Estimates for Radial Basis Function Interpolation of Scattered Data," IMA Journal of Numerical Analysis, vol. 13, no. 1, pp. 13-27, 1993.

[16] R. Schaback, "Improved Error Bounds for Scattered Data Interpolation by Radial Basis Functions," Mathematics of Computation, vol. 68, no. 225, pp. 201-217, 2002.

[17] M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur, "Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences," pp. 1-64, 2018. [Online]. Available: http://arxiv.org/abs/1807.02582

[18] S. Hubbert and T. M. Morton, "Lp-Error Estimates for Radial Basis Function Interpolation on the Sphere," Journal of Approximation Theory, vol. 129, no. 1, pp. 58-77, 2004.

[19] F. J. Narcowich, J. D. Ward, and H. Wendland, "Sobolev Error Estimates and a Bernstein Inequality for Scattered Data Interpolation via Radial Basis Functions," Constructive Approximation, vol. 24, no. 2, pp. 175-186, 2006.

[20] R. Beatson, O. Davydov, and J. Levesley, "Error Bounds for Anisotropic RBF Interpolation," Journal of Approximation Theory, vol. 162, no. 3, pp. 512-527, 2010.

[21] A. M. Stuart and A. L. Teckentrup, "Posterior Consistency for Gaussian Process Approximations of Bayesian Posterior Distributions," Mathematics of Computation, vol. 87, no. 310, pp. 721-753, 2018.

[22] S. Mendelson, "Improving the Sample Complexity using Global Data," IEEE Transactions on Information Theory, vol. 48, no. 7, pp. 1977-1991, 2002.

[23] T. Zhang, "Learning Bounds for Kernel Regression using Effective Data Dimensionality," Neural Computation, vol. 17, no. 9, pp. 2077-2098, 2005.

[24] C. Cortes, M. Mohri, and A.
Talwalkar, "On the Impact of Kernel Approximation on Learning Accuracy," in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, vol. 9, 2010, pp. 113-120.

[25] L. Shi, "Learning Theory Estimates for Coefficient-based Regularized Regression," Applied and Computational Harmonic Analysis, vol. 34, no. 2, pp. 252-265, 2013.

[26] L. H. Dicker, D. P. Foster, and D. Hsu, "Kernel Ridge vs. Principal Component Regression: Minimax Bounds and the Qualification of Regularization Operators," Electronic Journal of Statistics, vol. 11, no. 1, pp. 1022-1047, 2017.

[27] F. Berkenkamp, A. P. Schoellig, M. Turchetta, and A. Krause, "Safe Model-based Reinforcement Learning with Stability Guarantees," in Advances in Neural Information Processing Systems, 2017.

[28] A. van der Vaart and H. van Zanten, "Information Rates of Nonparametric Gaussian Process Methods," Journal of Machine Learning Research, vol. 12, pp. 2095-2119, 2011.

[29] R. Adler and J. Taylor, Random Fields and Geometry. Springer Science & Business Media, 2007.

[30] W. Wang, R. Tuo, and C. F. J. Wu, "On Prediction Properties of Kriging: Uniform Error Bounds and Robustness," Journal of the American Statistical Association, pp. 1-38, 2019.

[31] J. Mercer, "Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 209, no. 441-458, pp. 415-446, 1909.

[32] I. Steinwart, "On the Influence of the Kernel on the Consistency of Support Vector Machines," Journal of Machine Learning Research, vol. 2, pp. 67-93, 2001.

[33] S. Ghosal and A. Roy, "Posterior Consistency of Gaussian Process Prior for Nonparametric Binary Regression," The Annals of Statistics, vol. 34, no. 5, pp. 2413-2429, 2006.

[34] R. M.
Dudley, "The Sizes of Compact Subsets of Hilbert Space and Continuity of Gaussian Processes," Journal of Functional Analysis, vol. 1, no. 3, pp. 290-330, 1967.

[35] M. Talagrand, "Sharper Bounds for Gaussian and Empirical Processes," The Annals of Probability, vol. 22, no. 1, pp. 28-76, 1994.

[36] J. González, Z. Dai, P. Hennig, and N. D. Lawrence, "Batch Bayesian Optimization via Local Penalization," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2016, pp. 648-657.

[37] B. Laurent and P. Massart, "Adaptive Estimation of a Quadratic Functional by Model Selection," The Annals of Statistics, vol. 28, no. 5, pp. 1302-1338, 2000.

[38] A. Lederer, J. Umlauft, and S. Hirche, "Posterior Variance Analysis of Gaussian Processes with Application to Average Learning Curves," 2019. [Online]. Available: http://arxiv.org/abs/1906.01404

[39] J. Umlauft, T. Beckers, M. Kimmel, and S. Hirche, "Feedback Linearization using Gaussian Processes," in Proceedings of the IEEE Conference on Decision and Control, 2017, pp. 5249-5255.

[40] J. Umlauft, A. Lederer, and S. Hirche, "Learning Stable Gaussian Process State Space Models," in Proceedings of the American Control Conference, 2017, pp. 1499-1504.

[41] R. M. Murray, Z. Li, and S. Shankar Sastry, A Mathematical Introduction to Robotic Manipulation. CRC Press, 1994.

[42] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design," in Proceedings of the International Conference on Machine Learning, 2010, pp. 1015-1022.

[43] S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor, "Regret Bounds for Gaussian Process Bandit Problems," Journal of Machine Learning Research, vol. 9, pp. 273-280, 2010.

A Proof of Theorem 3.1

Proof of Theorem 3.1.
We first prove the Lipschitz constant of the posterior mean $\nu_N(x)$ and the modulus of continuity of the standard deviation $\sigma_N(x)$, before we derive the bound on the regression error. The norm of the difference between the posterior mean $\nu_N(\cdot)$ evaluated at two different points is given by

$\| \nu_N(x) - \nu_N(x') \| = \| (k(x, X_N) - k(x', X_N)) \alpha \|$ with $\alpha = (K(X_N, X_N) + \sigma_n^2 I_N)^{-1} y_N$. (18)

Due to the Cauchy-Schwarz inequality and the Lipschitz continuity of the kernel, we obtain $\| \nu_N(x) - \nu_N(x') \| \le L_k \sqrt{N} \|\alpha\| \|x - x'\|$, which proves Lipschitz continuity of the mean $\nu_N(x)$. In order to calculate a modulus of continuity for the posterior standard deviation $\sigma_N(x)$, observe that the difference of the variance at two points $x, x' \in \mathbb{X}$ can be expressed as

$| \sigma_N^2(x) - \sigma_N^2(x') | = | \sigma_N(x) - \sigma_N(x') | \, | \sigma_N(x) + \sigma_N(x') |$. (19)

Since the standard deviation is non-negative, we have

$| \sigma_N(x) + \sigma_N(x') | \ge | \sigma_N(x) - \sigma_N(x') |$ (20)

and hence we obtain

$| \sigma_N^2(x) - \sigma_N^2(x') | \ge | \sigma_N(x) - \sigma_N(x') |^2$. (21)

Therefore, it is sufficient to bound the difference of the variance at two points $x, x' \in \mathbb{X}$ and take the square root of the resulting expression. Due to the Cauchy-Schwarz inequality and the Lipschitz continuity of $k(\cdot, \cdot)$, the absolute value of the difference of the variance can be bounded by

$| \sigma_N^2(x) - \sigma_N^2(x') | \le 2 L_k \|x - x'\| + \| k(x, X_N) - k(x', X_N) \| \, \| (K(X_N, X_N) + \sigma_n^2 I_N)^{-1} \| \, \| k(X_N, x) + k(X_N, x') \|$. (22)

On the one hand, we have

$\| k(x, X_N) - k(x', X_N) \| \le \sqrt{N} L_k \|x - x'\|$ (23)

due to the Lipschitz continuity of $k(x, x')$. On the other hand, we have

$\| k(x, X_N) + k(x', X_N) \| \le 2 \sqrt{N} \max_{x, x' \in \mathbb{X}} k(x, x')$. (24)

The modulus of continuity $\omega_{\sigma_N}(\tau)$ follows from substituting (23) and (24) into (22) and taking the square root of the resulting expression.
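As a numerical sanity check (an illustration, not part of the proof), the mean Lipschitz bound $L_k \sqrt{N} \|\alpha\|$ can be verified on toy one-dimensional data. The SE kernel with unit length scale, for which $L_k = e^{-1/2}$ is the maximal kernel gradient, and the training data below are assumptions for this sketch.

```python
import numpy as np

# 1-D SE kernel with unit length scale; its Lipschitz constant in one argument
# is max_u |u| exp(-u^2/2) = exp(-1/2).
def k(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

# Toy regression data (assumed for illustration).
X_N = np.linspace(0.0, 3.0, 20)
y_N = np.sin(X_N)
sigma_n2 = 0.01
alpha = np.linalg.solve(k(X_N, X_N) + sigma_n2 * np.eye(len(X_N)), y_N)

def nu(x):
    """Posterior mean nu_N(x)."""
    return k(np.atleast_1d(x), X_N) @ alpha

# Lipschitz bound L_k * sqrt(N) * ||alpha|| from the proof of Theorem 3.1.
L_k = np.exp(-0.5)
L_nu = L_k * np.sqrt(len(X_N)) * np.linalg.norm(alpha)

# The empirical slope over random point pairs never exceeds the bound.
rng = np.random.default_rng(1)
xs = rng.uniform(0, 3, size=(500, 2))
slopes = np.abs(nu(xs[:, 0]) - nu(xs[:, 1])) / np.abs(xs[:, 0] - xs[:, 1])
assert slopes.max() <= L_nu
```

The bound is typically loose, since the Cauchy-Schwarz step discards cancellation between the kernel differences, which is consistent with the conservatism observed in Section 5.1.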
Finally, we prove the probabilistic uniform error bound by exploiting the fact that for every grid $\mathbb{X}_\tau$ with $|\mathbb{X}_\tau|$ grid points and

$\max_{x \in \mathbb{X}} \min_{x' \in \mathbb{X}_\tau} \|x - x'\| \le \tau$ (25)

it holds with probability of at least $1 - |\mathbb{X}_\tau| e^{-\beta(\tau)/2}$ that [8]

$| f(x) - \nu_N(x) | \le \sqrt{\beta(\tau)} \sigma_N(x) \quad \forall x \in \mathbb{X}_\tau$. (26)

Choosing $\beta(\tau) = 2 \log( |\mathbb{X}_\tau| / \delta )$,

$| f(x) - \nu_N(x) | \le \sqrt{\beta(\tau)} \sigma_N(x) \quad \forall x \in \mathbb{X}_\tau$ (27)

holds with probability of at least $1 - \delta$. Due to the continuity of $f(x)$, $\nu_N(x)$ and $\sigma_N(x)$, we obtain

$\min_{x' \in \mathbb{X}_\tau} | f(x) - f(x') | \le \tau L_f \quad \forall x \in \mathbb{X}$ (28)

$\min_{x' \in \mathbb{X}_\tau} | \nu_N(x) - \nu_N(x') | \le \tau L_{\nu_N} \quad \forall x \in \mathbb{X}$ (29)

$\min_{x' \in \mathbb{X}_\tau} | \sigma_N(x) - \sigma_N(x') | \le \omega_{\sigma_N}(\tau) \quad \forall x \in \mathbb{X}$. (30)

Moreover, the minimum number of grid points satisfying (25) is given by the covering number $M(\tau, \mathbb{X})$. Hence, we obtain

$P\big( | f(x) - \nu_N(x) | \le \sqrt{\beta(\tau)} \sigma_N(x) + \gamma(\tau) \ \forall x \in \mathbb{X} \big) \ge 1 - \delta$, (31)

where

$\beta(\tau) = 2 \log( M(\tau, \mathbb{X}) / \delta )$ (32)

$\gamma(\tau) = (L_f + L_{\nu_N}) \tau + \sqrt{\beta(\tau)} \omega_{\sigma_N}(\tau)$. (33)

B Proof of Theorem 3.2

In order to prove Theorem 3.2, several auxiliary results are necessary, which are derived in the following. The first lemma concerns the expected supremum of a Gaussian process.

Lemma B.1. Consider a Gaussian process with a continuously differentiable covariance function $k(\cdot, \cdot)$ and let $L_k$ denote its Lipschitz constant on the set $\mathbb{X}$ with maximum extension $r = \max_{x, x' \in \mathbb{X}} \|x - x'\|$. Then, the expected supremum of a sample function $f(x)$ of this Gaussian process satisfies

$E\big[ \sup_{x \in \mathbb{X}} f(x) \big] \le 12 \sqrt{6 d} \max\big\{ \max_{x \in \mathbb{X}} \sqrt{k(x, x)}, \sqrt{r L_k} \big\}$. (34)

Proof. We prove this lemma by making use of the metric entropy criterion for the sample continuity of some version of a Gaussian process [34].
This criterion allows us to bound the expected supremum of a sample function $f(x)$ by

$E\big[ \sup_{x \in \mathbb{X}} f(x) \big] \le 12 \int_0^{\max_{x \in \mathbb{X}} \sqrt{k(x, x)}} \sqrt{\log( N(\varrho, \mathbb{X}) )} \, \mathrm{d}\varrho$, (35)

where $N(\varrho, \mathbb{X})$ is the $\varrho$-packing number of $\mathbb{X}$ with respect to the covariance pseudo-metric

$d_k(x, x') = \sqrt{ k(x, x) + k(x', x') - 2 k(x, x') }$. (36)

Instead of bounding the $\varrho$-packing number, we bound the $\varrho/2$-covering number, which is known to be an upper bound. The covering number can easily be bounded by transforming the problem of covering $\mathbb{X}$ with respect to the pseudo-metric $d_k(\cdot, \cdot)$ into a covering problem in the original metric of $\mathbb{X}$. For this reason, define

$\psi(\varrho') = \sup_{x, x' \in \mathbb{X}, \, \|x - x'\|_\infty \le \varrho'} d_k(x, x')$, (37)

which is continuous due to the continuity of the covariance kernel $k(\cdot, \cdot)$. Consider the inverse function

$\psi^{-1}(\varrho) = \inf\{ \varrho' > 0 : \psi(\varrho') > \varrho \}$. (38)

Continuity of $\psi(\cdot)$ implies $\varrho = \psi(\psi^{-1}(\varrho))$. In particular, this means that we can guarantee $d_k(x, x') \le \varrho/2$ if $\|x - x'\| \le \psi^{-1}(\varrho/2)$. Due to this relationship, it is sufficient to construct a uniform grid with grid constant $2 \psi^{-1}(\varrho/2)$ in order to obtain a $\varrho/2$-covering net of $\mathbb{X}$. Furthermore, the cardinality of this grid is an upper bound for the $\varrho/2$-covering number, i.e.,

$M(\varrho/2, \mathbb{X}) \le \left\lceil \frac{r}{2 \psi^{-1}(\varrho/2)} \right\rceil^d$. (39)

Therefore, it follows that

$N(\varrho, \mathbb{X}) \le \left\lceil \frac{r}{2 \psi^{-1}(\varrho/2)} \right\rceil^d$. (40)

Due to the Lipschitz continuity of the covariance function, we can bound $\psi(\cdot)$ by

$\psi(\varrho') \le \sqrt{2 L_k \varrho'}$. (41)

Hence, the inverse function satisfies

$\psi^{-1}\left( \frac{\varrho}{2} \right) \ge \left( \frac{\varrho}{2 \sqrt{2 L_k}} \right)^2$ (42)

and consequently

$N(\varrho, \mathbb{X}) \le \left( 1 + \frac{4 r L_k}{\varrho^2} \right)^d$ (43)

holds, where the ceiling operator is resolved through the addition of $1$. Substituting this expression into the metric entropy bound (35) yields

$E\big[ \sup_{x \in \mathbb{X}} f(x) \big] \le 12 \sqrt{d} \int_0^{\max_{x \in \mathbb{X}} \sqrt{k(x, x)}} \sqrt{ \log\left( 1 + \frac{4 r L_k}{\varrho^2} \right) } \, \mathrm{d}\varrho$. (44)
As shown in [43], this integral can be bounded by

$\int_0^{\max_{x \in \mathbb{X}} \sqrt{k(x, x)}} \sqrt{ \log\left( 1 + \frac{4 r L_k}{\varrho^2} \right) } \, \mathrm{d}\varrho \le \sqrt{6} \max\big\{ \max_{x \in \mathbb{X}} \sqrt{k(x, x)}, \sqrt{r L_k} \big\}$, (45)

which proves the lemma.

Based on the expected supremum of a Gaussian process, it is possible to derive a high-probability bound for the supremum of a sample function.

Lemma B.2. Consider a Gaussian process with a continuously differentiable covariance function $k(\cdot, \cdot)$ and let $L_k$ denote its Lipschitz constant on the set $\mathbb{X}$ with maximum extension $r = \max_{x, x' \in \mathbb{X}} \|x - x'\|$. Then, with probability of at least $1 - \delta_L$, the supremum of a sample function $f(x)$ of this Gaussian process is bounded by

$\sup_{x \in \mathbb{X}} f(x) \le \sqrt{ 2 \log\left( \frac{1}{\delta_L} \right) } \max_{x \in \mathbb{X}} \sqrt{k(x, x)} + 12 \sqrt{6 d} \max\big\{ \max_{x \in \mathbb{X}} \sqrt{k(x, x)}, \sqrt{r L_k} \big\}$. (46)

Proof. We prove this lemma by exploiting the wide theory of concentration inequalities to derive a bound for the supremum of the sample function $f(x)$. We apply the Borell-TIS inequality [35]

$P\left( \sup_{x \in \mathbb{X}} f(x) - E\big[ \sup_{x \in \mathbb{X}} f(x) \big] \ge c \right) \le \exp\left( - \frac{c^2}{2 \max_{x \in \mathbb{X}} k(x, x)} \right)$. (47)

Due to Lemma B.1 we have

$E\big[ \sup_{x \in \mathbb{X}} f(x) \big] \le 12 \sqrt{6 d} \max\big\{ \max_{x \in \mathbb{X}} \sqrt{k(x, x)}, \sqrt{r L_k} \big\}$. (48)

The lemma follows from substituting (48) into (47) and choosing $c = \sqrt{2 \log(1/\delta_L)} \max_{x \in \mathbb{X}} \sqrt{k(x, x)}$.

Finally, we exploit the fact that the derivative of a sample function is a sample function of another Gaussian process to prove the high-probability Lipschitz constant in Theorem 3.2.

Proof of Theorem 3.2. Continuity of the sample function $f(x)$ follows directly from [33, Theorem 5]. Furthermore, this theorem guarantees that the derivative functions $\frac{\partial}{\partial x_i} f(x)$ are samples of derivative Gaussian processes with covariance functions

$k^{\partial i}(x, x') = \frac{\partial^2}{\partial x_i \partial x'_i} k(x, x')$. (49)
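The bound of Lemma B.2, and the probabilistic Lipschitz constant that Theorem 3.2 obtains by applying it to each derivative process, can be evaluated numerically. This is a hedged sketch: the kernel constants (the maxima of $\sqrt{k(x,x)}$, the Lipschitz constants, the extension $r$) are assumed placeholder values, not quantities computed from a fitted model.

```python
import math

def expected_sup(d, r, L_k, k_max):
    # Lemma B.1, bound (34)/(48): expected supremum of a sample function.
    return 12.0 * math.sqrt(6.0 * d) * max(k_max, math.sqrt(r * L_k))

def sup_bound(delta, d, r, L_k, k_max):
    # Lemma B.2, bound (46): Borell-TIS with c = sqrt(2 log(1/delta)) * k_max.
    return math.sqrt(2.0 * math.log(1.0 / delta)) * k_max + expected_sup(d, r, L_k, k_max)

def probabilistic_lipschitz(delta_L, d, r, k_max, L_k):
    # Theorem 3.2: apply the supremum bound to each derivative process with
    # confidence delta_L/(2d) per one-sided bound (cf. (51)), then take the
    # Euclidean norm of the per-dimension constants.
    comps = [sup_bound(delta_L / (2.0 * d), d, r, L_k[i], k_max[i]) for i in range(d)]
    return math.sqrt(sum(c * c for c in comps))

# Placeholder constants for a 2-D setting with unit derivative-kernel maxima.
L_f = probabilistic_lipschitz(delta_L=0.01, d=2, r=12.8, k_max=[1.0, 1.0], L_k=[1.0, 1.0])
```

Note that tightening $\delta_L$ increases the resulting Lipschitz constant only logarithmically, which is why the conservatism reported in Section 5.1 is benign.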
Therefore, we can apply Lemma B.2 to each of the derivative processes and obtain, with probability of at least $1 - \delta_L / d$,

$- L_f^{\partial i} \le \sup_{x \in \mathbb{X}} \frac{\partial}{\partial x_i} f(x) \le L_f^{\partial i}$, (50)

where

$L_f^{\partial i} = \sqrt{ 2 \log\left( \frac{2 d}{\delta_L} \right) } \max_{x \in \mathbb{X}} \sqrt{k^{\partial i}(x, x)} + 12 \sqrt{6 d} \max\Big\{ \max_{x \in \mathbb{X}} \sqrt{k^{\partial i}(x, x)}, \sqrt{r L_k^{\partial i}} \Big\}$ (51)

and $L_k^{\partial i}$ is the Lipschitz constant of the derivative kernel $k^{\partial i}(x, x')$. Applying the union bound over all partial derivative processes $i = 1, \ldots, d$ finally yields the result.

C Proof of Theorem 3.3

Proof of Theorem 3.3. Due to Theorem 3.1 with $\beta_N(\tau) = 2 \log\left( \frac{M(\tau, \mathbb{X}) \pi^2 N^2}{3 \delta} \right)$ and the union bound over all $N > 0$, it follows that

$\sup_{x \in \mathbb{X}} | f(x) - \nu_N(x) | \le \sqrt{\beta_N(\tau)} \sigma_N(x) + \gamma_N(\tau) \quad \forall N > 0$ (52)

with probability of at least $1 - \delta/2$. A trivial bound for the covering number can be obtained by considering a uniform grid over the cube containing $\mathbb{X}$. This approach leads to

$M(\tau, \mathbb{X}) \le \left( 1 + \frac{r}{\tau} \right)^d$, (53)

where $r = \max_{x, x' \in \mathbb{X}} \|x - x'\|$. Therefore, we have

$\beta_N(\tau) \le 2 d \log\left( 1 + \frac{r}{\tau} \right) + 4 \log(\pi N) - 2 \log(3 \delta)$. (54)

Furthermore, the Lipschitz constant $L_{\nu_N}$ is bounded by

$L_{\nu_N} \le L_k \sqrt{N} \left\| (K(X_N, X_N) + \sigma_n^2 I_N)^{-1} y_N \right\|$ (55)

due to Theorem 3.1. Since the Gram matrix $K(X_N, X_N)$ is positive semidefinite and $f(\cdot)$ is bounded by $\bar{f}$, we can bound $\| (K(X_N, X_N) + \sigma_n^2 I_N)^{-1} y_N \|$ by

$\left\| (K(X_N, X_N) + \sigma_n^2 I_N)^{-1} y_N \right\| \le \frac{ \| y_N \| }{ \rho_{\min}( K(X_N, X_N) + \sigma_n^2 I_N ) } \le \frac{ \sqrt{N} \bar{f} + \| \xi_N \| }{ \sigma_n^2 }$, (56)

where $\xi_N$ is a vector of $N$ i.i.d. zero-mean Gaussian random variables with variance $\sigma_n^2$. Therefore, it follows that $\| \xi_N \|^2 / \sigma_n^2 \sim \chi_N^2$. Due to [37], with probability of at least $1 - \exp(-\eta_N)$ we have

$\| \xi_N \|^2 \le \left( 2 \sqrt{N \eta_N} + 2 \eta_N + N \right) \sigma_n^2$. (57)

Setting $\eta_N = \log\left( \frac{\pi^2 N^2}{3 \delta} \right)$ and applying the union bound over all $N > 0$ yields

$\left\| (K(X_N, X_N) + \sigma_n^2 I_N)^{-1} y_N \right\| \le \frac{ \sqrt{N} \bar{f} + \sqrt{ 2 \sqrt{N \eta_N} + 2 \eta_N + N } \, \sigma_n }{ \sigma_n^2 } \quad \forall N > 0$ (58)

with probability of at least $1 - \delta/2$.
Hence, the Lipschitz constant of the posterior mean function $\nu_N(\cdot)$ satisfies, with probability of at least $1 - \delta/2$,

$L_{\nu_N} \le L_k \frac{ N \bar{f} + \sqrt{ N \left( 2 \sqrt{N \eta_N} + 2 \eta_N + N \right) } \, \sigma_n }{ \sigma_n^2 } \quad \forall N > 0$. (59)

Since $\eta_N$ grows logarithmically with the number of training samples $N$, it holds that $L_{\nu_N} \in \mathcal{O}(N)$ with probability of at least $1 - \delta/2$. The modulus of continuity $\omega_{\sigma_N}(\cdot)$ of the posterior standard deviation can be bounded by

$\omega_{\sigma_N}(\tau) \le \sqrt{ 2 L_k \tau \left( \frac{ N \max_{\tilde{x}, \tilde{x}' \in \mathbb{X}} k(\tilde{x}, \tilde{x}') }{ \sigma_n^2 } + 1 \right) }$ (60)

because $\| (K(X_N, X_N) + \sigma_n^2 I_N)^{-1} \| \le 1/\sigma_n^2$. Due to the union bound, (52) holds with probability of at least $1 - \delta$ with

$\gamma_N(\tau) \le \sqrt{ 2 L_k \tau \beta_N(\tau) \left( \frac{ N \max_{\tilde{x}, \tilde{x}' \in \mathbb{X}} k(\tilde{x}, \tilde{x}') }{ \sigma_n^2 } + 1 \right) } + L_f \tau + L_k \frac{ N \bar{f} + \sqrt{ N \left( 2 \sqrt{N \eta_N} + 2 \eta_N + N \right) } \, \sigma_n }{ \sigma_n^2 } \tau$. (61)

This function must converge to $0$ for $N \to \infty$ in order to guarantee a vanishing regression error. This is only ensured if $\tau(N)$ decreases faster than $\mathcal{O}((N \log(N))^{-1})$. Therefore, set $\tau(N) \in \mathcal{O}(N^{-2})$ in order to guarantee

$\lim_{N \to \infty} \gamma_N(\tau(N)) = 0$. (62)

However, this choice of $\tau(N)$ implies that $\beta_N(\tau(N)) \in \mathcal{O}(\log(N))$ due to (54). Since there exists an $\epsilon > 0$ such that $\sigma_N(x) \in \mathcal{O}\big( \log(N)^{-\frac{1}{2} - \epsilon} \big)$ for all $x \in \mathbb{X}$ by assumption, we have

$\sqrt{\beta_N(\tau(N))} \sigma_N(x) \in \mathcal{O}\big( \log(N)^{-\epsilon} \big) \quad \forall x \in \mathbb{X}$, (63)

which concludes the proof.

D Proof of Theorem 4.1

Lyapunov theory provides the following statement [12].

Lemma D.1. A dynamical system $\dot{x} = f(x, u)$ is globally ultimately bounded to a set $\mathbb{B} \subset \mathbb{X}$ containing the origin if there exists a positive definite (so-called Lyapunov) function $V : \mathbb{X} \to \mathbb{R}_{+,0}$ for which $\dot{V}(x) < 0$ for all $x \in \mathbb{X} \setminus \mathbb{B}$.

This allows us to prove Theorem 4.1 as follows. Proof of Theorem 4.1.
Consider the Lyapunov function $V(x) = \frac{1}{2} r^2$. Its derivative along trajectories satisfies

$\dot{V}(x) = \frac{\partial V}{\partial r} \dot{r} = r \left( f(x) - \hat{f}(x) - k_c r \right) \le |r| \, | f(x) - \nu_N(x) | - k_c |r|^2 \le 0 \quad \forall |r| > \frac{ | f(x) - \nu_N(x) | }{ k_c }.$

Based on Theorem 3.1, the model error is bounded with high probability, which allows us to conclude

$P\big( \dot{V}(x) < 0 \ \forall x \in \mathbb{X} \setminus \mathbb{B} \big) \ge 1 - \delta.$

The global ultimate boundedness of the closed-loop system is thereby shown according to Lemma D.1.

E Report on Computational Complexity of the Numerical Evaluation

Simulations are performed in MATLAB 2019a on an i5-6200U CPU at 2.3 GHz with 8 GB RAM. The simulation in Section 5.1 took 77 s and used 1 MB of workspace memory. The simulation in Section 5.2 took 39 s and used 134 MB of workspace memory. The code is available as supplementary material.
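The asymptotic argument of Theorem 3.3 can also be illustrated numerically: with the grid constant $\tau(N) = N^{-2}$, the correction term $\gamma_N(\tau(N))$ of the uniform error bound vanishes as $N$ grows. This is a hedged sketch; all constants ($L_k$, $L_f$, $\bar{f}$, $\sigma_n$, $r$, $d$, unit kernel maximum) are placeholder values, not taken from the paper's experiments.

```python
import math

def gamma_N(N, L_k=1.0, L_f=1.0, f_bar=1.0, sigma_n=0.1, r=10.0, d=2, delta=0.01):
    tau = N ** -2.0                                                  # tau(N) = N^{-2}
    eta = math.log(math.pi ** 2 * N ** 2 / (3.0 * delta))            # eta_N
    beta = (2.0 * d * math.log(1.0 + r / tau)
            + 4.0 * math.log(math.pi * N) - 2.0 * math.log(3.0 * delta))  # bound (54)
    chi = 2.0 * math.sqrt(N * eta) + 2.0 * eta + N                   # chi^2 bound (57)
    L_nu = L_k * (N * f_bar + math.sqrt(N * chi) * sigma_n) / sigma_n ** 2    # (59)
    omega = math.sqrt(2.0 * L_k * tau * (N / sigma_n ** 2 + 1.0))    # modulus (60)
    return math.sqrt(beta) * omega + (L_f + L_nu) * tau              # gamma_N, cf. (61)

# gamma_N decreases toward zero as the number of training points increases,
# even though L_nu grows linearly and beta_N grows logarithmically in N.
assert gamma_N(10 ** 6) < gamma_N(10 ** 3) < gamma_N(10)
```

The design choice $\tau(N) \in \mathcal{O}(N^{-2})$ is visible here: the $(L_f + L_{\nu_N})\tau$ term decays like $1/N$ despite $L_{\nu_N} \in \mathcal{O}(N)$, and the $\sqrt{\beta_N}\,\omega_{\sigma_N}$ term decays up to logarithmic factors like $N^{-1/2}$.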