The Use of Gaussian Processes in System Identification


Authors: Simo Särkkä

To appear in Encyclopedia of Systems and Control, 2nd edition

1 Abstract

Gaussian processes are used in machine learning to learn input-output mappings from observed data. Gaussian process regression is based on imposing a Gaussian process prior on the unknown regressor function and statistically conditioning it on the observed data. In system identification, Gaussian processes are used to form time series prediction models such as non-linear finite-impulse response (NFIR) models as well as non-linear autoregressive (NARX) models. Gaussian process state-space models (GPSS) can be used to learn the dynamic and measurement models for a state-space representation of the input-output data. Temporal and spatio-temporal Gaussian processes can be directly used to form regressors on the data in the time domain. The aim of this article is to briefly outline the main directions in system identification methods using Gaussian processes.

2 Keywords

Gaussian process regression, non-linear system identification, GP-NFIR model, GP-NARX model, GP-NOE model, Gaussian process state-space model, temporal Gaussian process, state-space Gaussian process

3 Introduction

Gaussian process regression (Rasmussen and Williams, 2006) refers to a statistical methodology where we use Gaussian processes as prior models for regression functions that we fit to observed data. This kind of methodology is particularly popular in machine learning, although the origins of the basic ideas can be traced to geostatistics (Cressie, 1993). In geostatistics, the corresponding methodology is called "kriging", named after the South African mining engineer D. G. Krige. In system identification, Gaussian processes can be used to identify (or "learn" in machine learning terms) the input-output relationships from observed data.
Even when there is no explicit input in the system, Gaussian processes can be used to identify a model for an observed time series of outputs, which is a specific form of a system identification problem. Overviews of the use of Gaussian processes in system identification can be found, for example, in the monograph of Kocijan (2016) and the PhD theses of McHutchon (2015) and Frigola (2016).

4 Gaussian processes in system identification

4.1 Gaussian process regression

Gaussian process regression is concerned with the following problem: Given a set of observed (training) input-output data $\mathcal{D} = \{ (\mathbf{z}_k, y_k) : k = 1, \ldots, N \}$ from an unknown function $y = f(\mathbf{z})$, predict the values of the function at new (test) inputs $\{ \mathbf{z}^*_k : k = 1, \ldots, M \}$. That is, the problem is a classical regression problem. However, the classical solution to the problem usually amounts to fixing a parametric function class $f(\mathbf{z}; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a set of parameters, and then fitting the parameters to the observed data. In Gaussian process regression, we take a different route; instead of fixing a parametric class of functions, we put a Gaussian process prior measure on the whole regression function and condition on the observed data using Bayes' rule (Rasmussen and Williams, 2006).

4.1.1 Gaussian process regression problem

Mathematically, the Gaussian process regression problem can be written as
$$
f(\mathbf{z}) \sim \mathcal{GP}(m(\mathbf{z}), k(\mathbf{z}, \mathbf{z}')), \tag{1a}
$$
$$
y_k = f(\mathbf{z}_k) + \varepsilon_k, \quad \varepsilon_k \sim \mathcal{N}(0, \sigma_n^2), \tag{1b}
$$
where Equation (1a) tells that, a priori, the function is a Gaussian process with mean function $m(\mathbf{z}) = \mathrm{E}[f(\mathbf{z})]$ and covariance function (or kernel)
$$
k(\mathbf{z}, \mathbf{z}') = \mathrm{Cov}[f(\mathbf{z}), f(\mathbf{z}')] = \mathrm{E}[(f(\mathbf{z}) - m(\mathbf{z}))(f(\mathbf{z}') - m(\mathbf{z}'))].
$$
Equation (1b) tells that we observe the function values at the points $\mathbf{z}_k$, $k = 1, \ldots, N$, and that they are corrupted by (independent) Gaussian noises with variance $\sigma_n^2$.
The mean and covariance functions define the regressor function class and they, or at least their parametric classes, need to be selected a priori. The mean function can typically be selected to be identically zero, $m(\mathbf{z}) = 0$. The covariance function defines the smoothness properties of the functions, and a typical choice in machine learning is the squared exponential covariance function
$$
k(\mathbf{z}, \mathbf{z}') = s^2 \exp\left( -\frac{\| \mathbf{z} - \mathbf{z}' \|^2}{2 \ell^2} \right), \tag{2}
$$
which produces infinitely differentiable (i.e., analytic) regressor functions. The parameters $s$ and $\ell$ in the aforementioned covariance function define the magnitude and length scales of the regressor functions, respectively. Other common choices of covariance functions are, for example, the Matérn class of covariance functions (Matérn, 1960; Rasmussen and Williams, 2006).

4.1.2 Gaussian process regression equations

Given the mean and covariance functions as well as the measurements, we can form the Gaussian process regressor. Assuming that the noises are independent of the function values, we can write the joint distribution of the observed values and the unknown function values as follows:
$$
\begin{bmatrix} \mathbf{y} \\ f(\mathbf{Z}^*) \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} m(\mathbf{Z}) \\ m(\mathbf{Z}^*) \end{bmatrix},
\begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{k}^\top(\mathbf{Z}^*) \\ \mathbf{k}(\mathbf{Z}^*) & \mathbf{k}(\mathbf{Z}^*, \mathbf{Z}^*) \end{bmatrix}
\right), \tag{3}
$$
where $\mathbf{y} = \begin{bmatrix} y_1 & \cdots & y_N \end{bmatrix}^\top$, $m(\mathbf{Z}) = \begin{bmatrix} m(\mathbf{z}_1) & \cdots & m(\mathbf{z}_N) \end{bmatrix}^\top$, $m(\mathbf{Z}^*) = \begin{bmatrix} m(\mathbf{z}^*_1) & \cdots & m(\mathbf{z}^*_M) \end{bmatrix}^\top$, and $\mathbf{K}$ and $\mathbf{k}(\mathbf{Z}^*)$ denote matrices with element $(i, j)$ given as $k(\mathbf{z}_i, \mathbf{z}_j)$ and $k(\mathbf{z}^*_i, \mathbf{z}_j)$, respectively. By conditioning this joint Gaussian distribution on the measurements $\mathbf{y}$, we get that the conditional (i.e., posterior) distribution of the function values $f(\mathbf{Z}^*) = \begin{bmatrix} f(\mathbf{z}^*_1) & \cdots & f(\mathbf{z}^*_M) \end{bmatrix}^\top$ is Gaussian with the mean and covariance
$$
\mathrm{E}[f(\mathbf{Z}^*) \mid \mathbf{y}] = m(\mathbf{Z}^*) + \mathbf{k}(\mathbf{Z}^*) \left( \mathbf{K} + \sigma_n^2 \mathbf{I} \right)^{-1} (\mathbf{y} - m(\mathbf{Z})), \tag{4}
$$
$$
\mathrm{Cov}[f(\mathbf{Z}^*) \mid \mathbf{y}] = \mathbf{k}(\mathbf{Z}^*, \mathbf{Z}^*) - \mathbf{k}(\mathbf{Z}^*) \left( \mathbf{K} + \sigma_n^2 \mathbf{I} \right)^{-1} \mathbf{k}^\top(\mathbf{Z}^*).
$$
These are the fundamental equations of Gaussian process regression. An example of Gaussian process regression with the squared exponential covariance function is shown in Figure 1.

4.1.3 Hyperparameter learning

Even though Gaussian process regression is a non-parametric method, for which we do not need to fix a parametric class of functions, the mean and covariance functions can have unknown hyperparameters $\boldsymbol{\varphi}$ which can be estimated from data. For example, the squared exponential covariance function in Equation (2) has the hyperparameters $\boldsymbol{\varphi} = (s, \ell)$.

Figure 1: Example of Gaussian process regression with the squared exponential covariance function. The true function is a sinusoid which is observed only at 10 points that are corrupted by Gaussian noise. The quantiles provide error bars for the predicted function values.

A common way to estimate the parameters is to maximize the marginal likelihood (also called the evidence) $p(\mathbf{y} \mid \boldsymbol{\varphi})$ of the measurements, or equivalently, minimize the negative log-likelihood of the measurements
$$
-\log p(\mathbf{y} \mid \boldsymbol{\varphi}) = \frac{1}{2} \log \left| 2 \pi \left( \mathbf{K}_{\boldsymbol{\varphi}} + \sigma_n^2 \mathbf{I} \right) \right| + \frac{1}{2} (\mathbf{y} - m_{\boldsymbol{\varphi}}(\mathbf{Z}))^\top \left( \mathbf{K}_{\boldsymbol{\varphi}} + \sigma_n^2 \mathbf{I} \right)^{-1} (\mathbf{y} - m_{\boldsymbol{\varphi}}(\mathbf{Z})). \tag{5}
$$
The gradient of this function with respect to the hyperparameters is also available (see, e.g., Rasmussen and Williams, 2006), which allows for the use of gradient-based optimization methods to estimate the parameters.

Instead of using the maximum likelihood method to estimate the parameters, it is also possible to use a Bayesian approach to the problem and consider the posterior distribution of the hyperparameters
$$
p(\boldsymbol{\varphi} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \boldsymbol{\varphi}) \, p(\boldsymbol{\varphi})}{\int p(\mathbf{y} \mid \boldsymbol{\varphi}) \, p(\boldsymbol{\varphi}) \, d\boldsymbol{\varphi}}, \tag{6}
$$
where $p(\boldsymbol{\varphi})$ is the prior distribution of the hyperparameters.
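As a concrete illustration, the regression equations (4) with the squared exponential covariance (2) and a zero mean function can be implemented in a few lines of NumPy. This is a minimal sketch; the data and hyperparameter values loosely mirror the setting of Figure 1 but are otherwise arbitrary choices, not taken from the article.

```python
import numpy as np

def sqexp(Z1, Z2, s=1.0, ell=1.0):
    """Squared exponential covariance k(z, z') = s^2 exp(-||z - z'||^2 / (2 ell^2))."""
    d2 = (Z1[:, None] - Z2[None, :]) ** 2  # pairwise squared distances (1-D inputs)
    return s**2 * np.exp(-d2 / (2.0 * ell**2))

def gp_posterior(Z, y, Zs, s=1.0, ell=1.0, sigma_n=0.1):
    """Posterior mean and covariance of f(Z*) given noisy observations y (Eq. 4),
    with a zero prior mean m(z) = 0."""
    K = sqexp(Z, Z, s, ell) + sigma_n**2 * np.eye(len(Z))
    Ks = sqexp(Zs, Z, s, ell)      # k(Z*): M x N cross-covariance
    Kss = sqexp(Zs, Zs, s, ell)    # k(Z*, Z*)
    L = np.linalg.cholesky(K)      # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ alpha
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V
    return mean, cov

# Noisy samples of a sinusoid, as in Figure 1
rng = np.random.default_rng(0)
Z = np.linspace(0, 10, 10)
y = np.sin(Z) + 0.1 * rng.standard_normal(10)
Zs = np.linspace(0, 10, 50)
mean, cov = gp_posterior(Z, y, Zs, sigma_n=0.1)
std = np.sqrt(np.clip(np.diag(cov), 0, None))  # error bars ("quantiles" in Figure 1)
```

The Cholesky factorization avoids forming the explicit inverse of $\mathbf{K} + \sigma_n^2 \mathbf{I}$, which is both cheaper and numerically more stable.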
We can, for example, compute the maximum a posteriori estimate of the parameters by finding the maximum of this distribution, or use Markov chain Monte Carlo (MCMC) methods (Brooks et al, 2011) to estimate the statistics of the distribution.

In what follows, to avoid notational clutter, we drop the hyperparameters from the Gaussian process formulations and inference methods, although they are commonly estimated as part of the Gaussian process learning.

4.1.4 Reduction of computational complexity

A limitation of Gaussian process regression in its explicit form is that the computational complexities of the regression Equations (4) and the likelihood Equation (5) are cubic, $O(N^3)$, in the number of measurements $N$. This is due to the $N \times N$ matrix inversion appearing in the equations which, even when implemented with Cholesky or LU decompositions, needs a cubic number of computational steps.

Ways of solving the computational complexity problem are, for example, sparse approximations using inducing points (Quiñonero-Candela and Rasmussen, 2005; Rasmussen and Williams, 2006; Titsias, 2009), approximating the problem with a discrete Gaussian random field model (Lindgren et al, 2011), or the use of random or deterministic basis/spectral expansions (Quiñonero-Candela et al, 2010; Solin and Särkkä, 2018).

4.2 GP-NFIR, GP-NARX, GP-NOE, and related models

In system identification, we can use Gaussian processes to model unknown input-output relationships in time series. Several different model architectures are available for this purpose. Let us assume that we have a system with input sequence $u_1, u_2, \ldots$ and output sequence $y_1, y_2, \ldots$, and the aim is to predict the outputs from the inputs. We also assume that we have been given a set of training data consisting of known inputs and (noisy) outputs. In the following, we present some typically used architectures that have been proposed for this purpose.
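To make the hyperparameter estimation of Section 4.1.3 concrete, the sketch below evaluates the negative log marginal likelihood (5) (with zero mean function) and selects a length scale by a coarse grid search; in practice, one would use the gradient-based optimizers mentioned above. The grid values and data here are illustrative assumptions.

```python
import numpy as np

def nll(Z, y, s, ell, sigma_n):
    """Negative log marginal likelihood (Eq. 5) for the squared
    exponential covariance with zero mean function."""
    N = len(Z)
    d2 = (Z[:, None] - Z[None, :]) ** 2
    K = s**2 * np.exp(-d2 / (2 * ell**2)) + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|2 pi K| = N log(2 pi) + 2 sum(log diag(L))
    return 0.5 * (N * np.log(2 * np.pi) + 2 * np.sum(np.log(np.diag(L))) + y @ alpha)

rng = np.random.default_rng(1)
Z = np.linspace(0, 10, 30)
y = np.sin(Z) + 0.1 * rng.standard_normal(30)

# Coarse grid search over the length scale; too-small and too-large
# length scales are both penalized by the marginal likelihood.
ells = [0.1, 0.3, 1.0, 3.0, 10.0]
best_ell = min(ells, key=lambda e: nll(Z, y, s=1.0, ell=e, sigma_n=0.1))
```

The marginal likelihood automatically trades off data fit against model complexity, which is why the selected length scale lands at an intermediate value rather than at either extreme of the grid.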
More details can be found in the monograph of Kocijan (2016).

4.2.1 GP-NFIR model

The Gaussian process non-linear finite impulse response (GP-NFIR) model (Ackermann et al, 2011; Kocijan, 2016) has the form (see Figure 2)
$$
\hat{y}_k = f(u_{k-1}, \ldots, u_{k-m}), \tag{7}
$$
where $f(\cdot)$ is an unknown mapping which we model as a Gaussian process, and $\hat{y}_k$ denotes the estimate produced by the regressor. In this model, we form a Gaussian process regressor that predicts the current output from a finite number of previous inputs. This model can be identified by reducing it to a Gaussian process regression model
$$
y_k = f(\mathbf{z}_k) + \varepsilon_k, \tag{8a}
$$
where $\mathbf{z}_k = \begin{bmatrix} u_{k-1} & \cdots & u_{k-m} \end{bmatrix}^\top$, and by using standard Gaussian process regression methods on it.

Figure 2: In the GP-NFIR model, the Gaussian process regressor is used to predict the next output from the previous inputs.

4.2.2 GP-NARX model

The Gaussian process nonlinear autoregressive model with exogenous input (GP-NARX) (Kocijan et al, 2005; Kocijan, 2016) is a model of the form (see Figure 3)
$$
y_k = f(y_{k-1}, \ldots, y_{k-n}, u_{k-1}, \ldots, u_{k-m}) + \varepsilon_k, \tag{9}
$$
where $\varepsilon_k$ is a Gaussian random variable. This model can be reduced to a Gaussian process regression problem by setting $\mathbf{z}_k = \begin{bmatrix} y_{k-1} & \cdots & y_{k-n} & u_{k-1} & \cdots & u_{k-m} \end{bmatrix}^\top$ in Equation (8a).

Figure 3: In the GP-NARX model, the Gaussian process is used to predict the next output from the previous inputs and outputs.

4.2.3 GP-NOE model

In the Gaussian process nonlinear output error (GP-NOE) model (Kocijan and Petelin, 2011; Kocijan, 2016), we form a Gaussian process regressor for the problem (see Figure 4)
$$
y_k = f(\hat{y}_{k-1}, \ldots, \hat{y}_{k-n}, u_{k-1}, \ldots, u_{k-m}) + \varepsilon_k, \tag{10}
$$
where $\hat{y}_{k-1}, \ldots, \hat{y}_{k-n}$ are the Gaussian process regressor predictions from the previous steps.
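Reducing the GP-NARX model to standard Gaussian process regression amounts to stacking the lagged outputs and inputs into the regressor vectors $\mathbf{z}_k$. A minimal sketch of that reduction follows; the simulated first-order system and the lag orders $n = m = 2$ are illustrative choices, not from the article.

```python
import numpy as np

def narx_regressors(y, u, n=2, m=2):
    """Build GP-NARX training pairs (z_k, y_k) with
    z_k = [y_{k-1}, ..., y_{k-n}, u_{k-1}, ..., u_{k-m}] (Eq. 9)."""
    k0 = max(n, m)
    Z, targets = [], []
    for k in range(k0, len(y)):
        # Most recent lags first, matching the ordering in Eq. (9)
        Z.append(np.concatenate([y[k-n:k][::-1], u[k-m:k][::-1]]))
        targets.append(y[k])
    return np.array(Z), np.array(targets)

# Simulate a simple first-order linear system as illustrative training data
rng = np.random.default_rng(2)
u = rng.standard_normal(100)
y = np.zeros(100)
for k in range(1, 100):
    y[k] = 0.8 * y[k-1] + 0.5 * u[k-1] + 0.01 * rng.standard_normal()

Z, targets = narx_regressors(y, u, n=2, m=2)
```

The resulting pairs `(Z, targets)` can be fed directly into any standard Gaussian process regression routine, such as the one sketched in Section 4.1.2.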
Figure 4: The GP-NOE model uses the previous inputs and the previous outputs of the Gaussian process regressor to predict the next output. In the figure, $q^{-n}$ denotes an $n$-step delay operator.

Learning in this kind of model requires further approximations because the predictions of the Gaussian process are directly used as inputs in the next step.

4.2.4 Other model architectures

As discussed in Kocijan (2016), it is also possible to extend these architectures to, for example, GP-NARMAX (nonlinear autoregressive and moving average model with exogenous input) models and NBJ (nonlinear Box-Jenkins) models.

4.3 Gaussian process state-space (GPSS) models

Another approach to system identification is to form a state-space model where the dynamic and measurement models are identified using Gaussian process regression methods. This leads to so-called Gaussian process state-space models.

4.3.1 General GPSS model

A Gaussian process state-space (GPSS) model (see Figure 5) has the mathematical form (e.g. Kocijan, 2016)
$$
\mathbf{x}_{k+1} = f(\mathbf{x}_k, \mathbf{u}_k) + \mathbf{w}_k, \tag{11a}
$$
$$
\mathbf{y}_k = g(\mathbf{x}_k, \mathbf{u}_k) + \boldsymbol{\varepsilon}_k, \tag{11b}
$$
where the state vector $\mathbf{x}_k$, $k = 0, 1, 2, \ldots, N$, contains the current state of the system, $\mathbf{u}_1, \mathbf{u}_2, \ldots$ is the input sequence, and $\mathbf{y}_1, \mathbf{y}_2, \ldots$ is the output sequence. In the model, $\mathbf{w}_k$ is Gaussian distributed process noise. The aim is now to learn the functions $f(\mathbf{x}_k, \mathbf{u}_k)$ and $g(\mathbf{x}_k, \mathbf{u}_k)$, which are modeled as Gaussian processes, given the input and output sequences, or in some cases, also given direct observations of the state vector.

4.3.2 Learning with fully observed state

When the state vector $\mathbf{x}_k$ is fully observed, then both the dynamic model (11a) and the measurement model (11b) become standard Gaussian process regression models.
In the dynamic model (11a), the training set consists of measurements $\mathbf{x}_{k+1}$ with the corresponding inputs $(\mathbf{x}_k, \mathbf{u}_k)$, and in the measurement model (11b) the measurements are $\mathbf{y}_k$ with the corresponding inputs $(\mathbf{x}_k, \mathbf{u}_k)$. This kind of fully observed model is important in many applications such as robotics (Deisenroth et al, 2015).

Figure 5: In the GPSS model we learn Gaussian process regressors for approximating the dynamic and measurement models in a state-space model. In this figure, $q^{-1}$ denotes a one-step delay operator.

After conditioning on the training data, the functions $f$ and $g$ will still be Gaussian processes, and their mean and covariance functions are given by (multivariate generalizations of) Equations (4). State estimation in this kind of model has been considered by Ko and Fox (2009) and Deisenroth et al (2011), and it turns out that it is possible to construct closed-form Gaussian approximation (moment matching) based filters and smoothers for these models (Deisenroth et al, 2011). Control problems related to this kind of model have been considered, for example, by Deisenroth et al (2015).

4.3.3 Marginalization of the GP

When the states $\mathbf{x}_k$ are not observed, then we need to treat both the states and the Gaussian processes as unknown. There are a few different ways to cope with the model in that case, and one approach is the marginalization approach of Frigola et al (2013, 2014b,a) and Frigola (2016). First note that if we have a method to learn $f$, we can learn both $f$ and $g$ using a state-augmentation trick (Frigola, 2016): we define an augmented state $\tilde{\mathbf{x}} = \begin{bmatrix} \mathbf{x} & \boldsymbol{\gamma} \end{bmatrix}^\top$, a Gaussian process $h(\tilde{\mathbf{x}}, \mathbf{u}) = \begin{bmatrix} f(\mathbf{x}, \mathbf{u}) & g(\mathbf{x}, \mathbf{u}) \end{bmatrix}^\top$, and an augmented process noise $\tilde{\mathbf{w}} = \begin{bmatrix} \mathbf{w}_k & \mathbf{0} \end{bmatrix}^\top$, which reduces the model to
$$
\tilde{\mathbf{x}}_{k+1} = h(\tilde{\mathbf{x}}_k, \mathbf{u}_k) + \tilde{\mathbf{w}}_k, \tag{12a}
$$
$$
\mathbf{y}_k = \boldsymbol{\gamma}_k + \boldsymbol{\varepsilon}_k. \tag{12b}
$$
This is now a model with an unknown dynamic model, but with a given linear Gaussian measurement model $p(\mathbf{y}_k \mid \tilde{\mathbf{x}}_k, \mathbf{u}_k) = \mathcal{N}(\mathbf{y}_k \mid \boldsymbol{\gamma}_k, \sigma_n^2)$.

Thus, without loss of generality, we can focus on models with an unknown dynamic model $f$ and a known measurement model:
$$
\mathbf{x}_{k+1} = f(\mathbf{x}_k, \mathbf{u}_k) + \mathbf{w}_k, \tag{13a}
$$
$$
\mathbf{y}_k \sim p(\mathbf{y}_k \mid \mathbf{x}_k, \mathbf{u}_k). \tag{13b}
$$
The aim is to learn the function $f$ from a sequence of measurement data $\mathbf{y}_k$.

One way to understand this model (Frigola, 2016) is that we could hypothetically generate data from it by sampling the (infinite-dimensional) function $f(\mathbf{x}, \mathbf{u})$, and then, starting from $\mathbf{x}_0$, sequentially producing each of $\{ \mathbf{f}_1, \mathbf{x}_1, \ldots, \mathbf{f}_N, \mathbf{x}_N \}$, where we have denoted $\mathbf{f}_k = f(\mathbf{x}_{k-1}, \mathbf{u}_{k-1})$ for $k = 1, 2, \ldots, N$. Each of the conditional distributions
$$
p(\mathbf{f}_k \mid \mathbf{f}_{1:k-1}, \mathbf{x}_{0:k-1}) \tag{14}
$$
turns out to be Gaussian. Note that above, we have introduced the short-hand notation $\mathbf{f}_{1:k} = (\mathbf{f}_1, \ldots, \mathbf{f}_k)$, which we will also use in the rest of this article.

The above observation allows us to integrate out (i.e., marginalize) the Gaussian process from the model in closed form. The result is the following representation (Frigola et al, 2013):
$$
p(\mathbf{x}_{0:N}) = \prod_{k=1}^{N} \mathcal{N}(\mathbf{x}_k \mid \boldsymbol{\mu}_k(\mathbf{x}_{0:k-1}), \boldsymbol{\Sigma}_k(\mathbf{x}_{0:k-1})), \tag{15}
$$
where the means $\boldsymbol{\mu}_k(\mathbf{x}_{0:k-1})$ and covariances $\boldsymbol{\Sigma}_k(\mathbf{x}_{0:k-1})$ are (quite complicated) functions of the prior mean and covariance functions of the Gaussian process $f$, which are evaluated on the whole previous state history. The above equation defines a non-Markovian prior model for the state sequence $\mathbf{x}_k$, $k = 1, \ldots, N$.
For a given $\mathbf{x}_{0:N}$, the distribution $p(f(\mathbf{x}^*) \mid \mathbf{x}^*, \mathbf{x}_{0:N})$ for a test point $\mathbf{x}^*$ can be computed by using the conventional Gaussian process prediction equations as follows:
$$
p(f(\mathbf{x}^*) \mid \mathbf{x}^*, \mathbf{y}_{1:N}) = \int p(f(\mathbf{x}^*) \mid \mathbf{x}^*, \mathbf{x}_{0:N}) \, p(\mathbf{x}_{0:N} \mid \mathbf{y}_{1:N}) \, d\mathbf{x}_{0:N}, \tag{16}
$$
which can be numerically approximated, provided that we use a convenient (e.g. Monte Carlo) approximation for $p(\mathbf{x}_{0:N} \mid \mathbf{y}_{1:N})$.

Given the model (15), it is then possible to use, for example, particle Markov chain Monte Carlo methods (Frigola et al, 2013) to sample state trajectories from the posterior distribution $p(\mathbf{x}_{0:N} \mid \mathbf{y}_{1:N})$ jointly with the parameters of the model, which provides a Monte Carlo approximation to the above integral. Other proposed approaches are, for example, particle stochastic approximation expectation-maximization (EM, Frigola et al, 2014b), which uses a Monte Carlo approximation to the EM algorithm aiming at computing the maximum likelihood estimates of the parameters while handling the states as missing data.

4.3.4 Approximation of the GP

Another way to approach the problem where both the states and Gaussian processes are unknown is to approximate the Gaussian process as a finite-dimensional parametric model and apply conventional parameter estimation methods to the model.

One possible approximation, considered in Svensson et al (2016), is to employ a Karhunen-Loève type of basis function expansion of the Gaussian process as follows:
$$
f(\mathbf{x}, \mathbf{u}) = \sum_{i=1}^{S} c_i \, \phi_i(\mathbf{x}, \mathbf{u}), \tag{17}
$$
where $\phi_i(\mathbf{x}, \mathbf{u})$ are deterministic basis functions (e.g. sinusoidals) and $c_i$ are Gaussian random variables. With this approximation, the model in Equation (13) becomes
$$
\mathbf{x}_{k+1} = \sum_{i=1}^{S} c_i \, \phi_i(\mathbf{x}_k, \mathbf{u}_k) + \mathbf{w}_k, \tag{18a}
$$
$$
\mathbf{y}_k \sim p(\mathbf{y}_k \mid \mathbf{x}_k, \mathbf{u}_k), \tag{18b}
$$
where learning of the Gaussian process $f$ reduces to the estimation of the finite number of parameters $\mathbf{c} = \begin{bmatrix} c_1 & \cdots & c_S \end{bmatrix}^\top$ in the state-space model.
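To give a feeling for the basis function expansion (17), the sketch below approximates a scalar Gaussian process using the sinusoidal (Laplacian eigenfunction) basis of the reduced-rank method of Solin and Särkkä (2018), in which the prior variances of the weights $c_i$ are given by the spectral density of the covariance function evaluated at the basis frequencies. The domain half-length, truncation level, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

# Domain half-length and number of basis functions (illustrative values)
Ldom, S = 4.0, 10
freqs = np.arange(1, S + 1) * np.pi / (2 * Ldom)  # frequencies sqrt(lambda_i)

def phi(x):
    """Deterministic sinusoidal basis functions phi_i(x) on [-Ldom, Ldom]."""
    return np.sin(freqs * (x + Ldom)) / np.sqrt(Ldom)

def se_spectral_density(w, s=1.0, ell=1.0):
    """Spectral density of the squared exponential covariance."""
    return s**2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * w) ** 2)

# Prior weights c_i ~ N(0, S(sqrt(lambda_i))); a draw from this prior gives
# an approximate sample path of f(x) = sum_i c_i phi_i(x)
rng = np.random.default_rng(3)
c = np.sqrt(se_spectral_density(freqs)) * rng.standard_normal(S)
f_approx = np.array([phi(x) @ c for x in np.linspace(-3, 3, 7)])

# Sanity check: the truncated expansion approximately reproduces the
# prior variance k(0, 0) = s^2 = 1 away from the domain boundary
var0 = np.sum(se_spectral_density(freqs) * phi(0.0) ** 2)
```

In the GPSS setting of Svensson et al (2016), the basis functions act on the joint input $(\mathbf{x}, \mathbf{u})$ rather than on a scalar, but the construction of the weight priors is the same in spirit.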
The states and parameters in this model can now be determined using, for example, particle Markov chain Monte Carlo (PMCMC) methods (see Svensson et al, 2016).

Another possibility is to use inducing points (typically denoted with $\mathbf{u}$, but here we denote them with $\mathbf{f}_u$ to avoid confusion with the input sequence). In those approaches, the idea is to first perform Gaussian process inference on the inducing points alone, that is, compute $p(\mathbf{f}_u \mid \mathbf{y}_{1:N})$, and then compute the (approximate) predictions by conditioning on the inducing points instead of the original data. This leads to the approximation
$$
p(f(\mathbf{x}^*) \mid \mathbf{y}_{1:N}) = \int p(f(\mathbf{x}^*) \mid \mathbf{x}^*, \mathbf{x}_{0:N}) \, p(\mathbf{x}_{0:N} \mid \mathbf{y}_{1:N}) \, d\mathbf{x}_{0:N}
\approx \int p(f(\mathbf{x}^*) \mid \mathbf{x}^*, \mathbf{f}_u) \, p(\mathbf{f}_u \mid \mathbf{y}_{1:N}) \, d\mathbf{f}_u. \tag{19}
$$
In Turner et al (2010), the inducing points are learned using the expectation-maximization (EM) algorithm. Frigola et al (2014a) propose a method for variational learning of (or integration over) the inducing points by forming a variational Bayesian approximation $q(\mathbf{f}_u) \approx p(\mathbf{f}_u \mid \mathbf{y}_{1:N})$, which further results in
$$
p(f(\mathbf{x}^*) \mid \mathbf{y}_{1:N}) \approx \int p(f(\mathbf{x}^*) \mid \mathbf{x}^*, \mathbf{f}_u) \, q(\mathbf{f}_u) \, d\mathbf{f}_u \tag{20}
$$
and turns out to be analytically tractable, as the optimal variational distribution $q(\mathbf{f}_u)$ is Gaussian.

4.4 Spatio-temporal Gaussian process models

4.4.1 Temporal Gaussian processes

Another way of modeling time series using Gaussian processes is by considering them as functions of time (Hartikainen and Särkkä, 2010) which are sampled at certain time instants $t_1, t_2, \ldots$:
$$
f(t) \sim \mathcal{GP}(m(t), k(t, t')), \qquad
y_k = f(t_k) + \varepsilon_k. \tag{21}
$$
That is, instead of attempting to form a predictor from the previous measurements or inputs, the idea is to condition the temporal Gaussian process on its observed values and use the conditional Gaussian process for predicting values at new time points.
Unfortunately, due to the cubic computational scaling of Gaussian process regression, this quickly becomes intractable as the time series length increases. However, temporal Gaussian process regression is closely related to the classical Kalman filtering and Rauch-Tung-Striebel smoothing problems (e.g. Särkkä, 2013), which can be used to reduce the required computations. It turns out that, provided that the Gaussian process is stationary, that is, the covariance function only depends on the time difference, $k(t, t') = k(t - t')$, then, under certain restrictions, the Gaussian process regression problem is essentially equivalent to state estimation in a model of the form
$$
\frac{d\mathbf{x}(t)}{dt} = \mathbf{A} \mathbf{x}(t) + \mathbf{B} \eta(t), \qquad
y_k = \mathbf{C} \mathbf{x}(t_k) + \varepsilon_k, \tag{22}
$$
where $\eta(t)$ is a white noise process and the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are selected suitably to match the original covariance function. For example, the Matérn covariance functions with half-integer smoothness parameters have exact representations as state-space models (Hartikainen and Särkkä, 2010). The Gaussian process regression problem can now be solved by applying a Kalman filter and Rauch-Tung-Striebel smoother to this problem. These methods have the fortunate property that their complexity is linear, $O(N)$, with respect to the number of measurements $N$, as opposed to the cubic complexity of the direct Gaussian process regression solution.

4.4.2 Spatio-temporal Gaussian processes

A similar state-space approach also works for spatio-temporal Gaussian process models with a covariance function of the stationary form $k(\mathbf{z}, t; \mathbf{z}', t') = k(\mathbf{z}, \mathbf{z}'; t - t')$. In that case, the state-estimation problem becomes infinite-dimensional, that is, a distributed parameter system
$$
\frac{d\mathbf{x}(\mathbf{z}, t)}{dt} = \mathcal{A} \mathbf{x}(\mathbf{z}, t) + \mathcal{B} \eta(\mathbf{z}, t), \qquad
y_k = \mathcal{C} \mathbf{x}(\mathbf{z}, t_k) + \varepsilon_k, \tag{23}
$$
where $\mathcal{A}$ is a matrix of operators and $\mathcal{C}$ is a matrix of functionals (Särkkä et al, 2013).
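As an illustration of the state-space route, the Matérn covariance with smoothness parameter 1/2, $k(t, t') = s^2 \exp(-|t - t'| / \ell)$, corresponds to a scalar (Ornstein-Uhlenbeck) model of the form (22), and the following sketch runs the resulting $O(N)$ Kalman filter; a Rauch-Tung-Striebel smoother pass would be added for the full regression solution. The data and hyperparameters are illustrative assumptions.

```python
import numpy as np

def matern12_kf(t, y, s=1.0, ell=1.0, sigma_n=0.1):
    """O(N) Kalman filter for GP regression with the Matern-1/2 covariance
    k(t, t') = s^2 exp(-|t - t'| / ell), i.e. the scalar state-space model
    dx/dt = -(1/ell) x + white noise (a special case of Eq. 22)."""
    m, P = 0.0, s**2  # stationary prior mean and variance
    means, variances = [], []
    t_prev = t[0]
    for tk, yk in zip(t, y):
        # Prediction step: exact discretization of the OU dynamics
        a = np.exp(-(tk - t_prev) / ell)
        m, P = a * m, a**2 * P + s**2 * (1 - a**2)
        # Update step with the measurement y_k = x(t_k) + eps_k
        S_k = P + sigma_n**2
        K = P / S_k
        m = m + K * (yk - m)
        P = (1 - K) * P
        means.append(m)
        variances.append(P)
        t_prev = tk
    return np.array(means), np.array(variances)

rng = np.random.default_rng(4)
t = np.linspace(0, 10, 200)
y = np.sin(t) + 0.1 * rng.standard_normal(200)
m_filt, P_filt = matern12_kf(t, y)
```

Each measurement is processed in constant time, so the total cost is linear in $N$, in contrast to the cubic cost of the batch solution of Equations (4).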
The solution of the Gaussian process regression problem in this form requires the use of methods from partial differential equations, but in many cases we can obtain an exact $O(N)$ inference procedure via this route.

4.4.3 Latent force models

In so-called latent force models (Álvarez et al, 2013), the idea is to infer a latent force $\xi(t)$ in a differential equation model such as
$$
\frac{d^2 x(t)}{dt^2} + \gamma \frac{dx(t)}{dt} + \nu^2 x(t) = \xi(t), \tag{24}
$$
where $\xi(t)$ is an unknown function which is modeled as a Gaussian process. The inference in this model can be recast as Gaussian process regression with a modified covariance function. The idea can be further generalized to partial differential equation models and to non-linear models when approximation methods such as the Laplace approximation are used.

The inference in this kind of model can be further re-stated in state-space form (Särkkä et al, 2019), which also allows for the study of control problems on latent force models. This formulation also allows the analysis of the observability and controllability properties of latent force models.

5 Summary and Future Directions

In this article, we have briefly outlined the main directions in system identification using Gaussian processes. The methods can be divided into three classes: (1) GP-NFIR, GP-NARX, and GP-NOE type models, which directly aim at learning the function from the previous inputs and outputs to the current output; (2) Gaussian process state-space models, which aim to learn the dynamic and measurement models in the state-space model; and (3) Gaussian process regression methods for spatio-temporal time series by direct or state-space Gaussian process regression.

Active research problems in this area appear to be, for example, joint learning and control in all the model types outlined here. Another direction of further study is the analysis of observability, identifiability, and controllability.
In practice, Gaussian process-based system identification methods are very similar to classical methods, and hence they can be expected to inherit many limitations and theoretical properties of the classical methods.

An emerging research area in Gaussian process-based models is so-called deep Gaussian processes (Damianou and Lawrence, 2013), which borrow the idea of deep neural networks by forming hierarchies of Gaussian processes. This kind of model could also turn out to be useful in system identification. Furthermore, Gaussian processes are also easily combined with first-principles models (cf. the latent force models described above), which allows for flexible gray-box modeling using Gaussian processes.

One of the main obstacles in Gaussian process regression is still the computational scaling in the number of measurements, which is also inherited by all the new developments. Although several good approaches to tackle this problem have been proposed, the problem is that they inherently replace the original model with an approximation. New and better approaches to this problem are likely to appear in the near future.

6 Cross References

References

Ackermann ER, De Villiers JP, Cilliers P (2011) Nonlinear dynamic systems modeling using Gaussian processes: Predicting ionospheric total electron content over South Africa. Journal of Geophysical Research: Space Physics 116(10)

Álvarez MA, Luengo D, Lawrence ND (2013) Linear latent force models using Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11):2693–2705

Brooks S, Gelman A, Jones GL, Meng XL (2011) Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, Boca Raton, FL

Cressie NAC (1993) Statistics for Spatial Data. Wiley

Damianou AC, Lawrence ND (2013) Deep Gaussian processes.
In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp 207–215

Deisenroth MP, Turner RD, Huber MF, Hanebeck UD, Rasmussen CE (2011) Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control 57(7):1865–1871

Deisenroth MP, Fox D, Rasmussen CE (2015) Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(2):408–423

Frigola R (2016) Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge

Frigola R, Lindsten F, Schön TB, Rasmussen CE (2013) Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In: Advances in Neural Information Processing Systems, pp 3156–3164

Frigola R, Chen Y, Rasmussen CE (2014a) Variational Gaussian process state-space models. In: Advances in Neural Information Processing Systems, pp 3680–3688

Frigola R, Lindsten F, Schön TB, Rasmussen CE (2014b) Identification of Gaussian process state-space models with particle stochastic approximation EM. IFAC Proceedings Volumes, Proceedings of the 19th IFAC World Congress 47(3):4097–4102

Hartikainen J, Särkkä S (2010) Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In: IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pp 379–384

Ko J, Fox D (2009) GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots 27(1):75–90

Kocijan J (2016) Modelling and control of dynamic systems using Gaussian process models. Springer

Kocijan J, Petelin D (2011) Output-error model training for Gaussian process models. In: International Conference on Adaptive and Natural Computing Algorithms, Springer, pp 312–321

Kocijan J, Girard A, Banko B, Murray-Smith R (2005) Dynamic systems identification with Gaussian processes.
Mathematical and Computer Modelling of Dynamical Systems 11(4):411–424

Lindgren F, Rue H, Lindström J (2011) An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B 73(4):423–498

Matérn B (1960) Spatial variation. Tech. rep., Meddelanden från Statens Skogsforskningsinstitut, band 49, Nr 5

McHutchon AJ (2015) Nonlinear modelling and control using Gaussian processes. PhD thesis, University of Cambridge

Quiñonero-Candela J, Rasmussen CE (2005) A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6:1939–1959

Quiñonero-Candela J, Rasmussen CE, Figueiras-Vidal AR, et al (2010) Sparse spectrum Gaussian process regression. Journal of Machine Learning Research 11(Jun):1865–1881

Rasmussen CE, Williams CK (2006) Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA

Särkkä S (2013) Bayesian Filtering and Smoothing. Cambridge University Press

Särkkä S, Solin A, Hartikainen J (2013) Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Processing Magazine 30(4):51–61

Särkkä S, Álvarez MA, Lawrence ND (2019) Gaussian process latent force models for learning and stochastic control of physical systems. IEEE Transactions on Automatic Control (to appear)

Solin A, Särkkä S (2018) Hilbert space methods for reduced-rank Gaussian process regression. arXiv:1401.5508

Svensson A, Solin A, Särkkä S, Schön T (2016) Computationally efficient Bayesian learning of Gaussian process state space models. In: Artificial Intelligence and Statistics, pp 213–221

Titsias M (2009) Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, pp 567–574

Turner R, Deisenroth M, Rasmussen C (2010) State-space inference and learning with Gaussian processes.
In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 868–875
