Kernel Regression by Mode Calculation of the Conditional Probability Distribution
Steffen Kühn
Technische Universität Berlin
Chair of Electronic Measurement and Diagnostic Technology
Einsteinufer 17, 10585 Berlin, Germany
steffen.kuehn@tu-berlin.de

Abstract

The most direct way to express arbitrary dependencies in datasets is to estimate the joint distribution and afterwards to apply the argmax function to obtain the mode of the corresponding conditional distribution. This method is difficult in practice, because it requires a global optimization of a complicated function, the joint distribution for fixed input variables. This article proposes a method for finding global maxima if the joint distribution is modeled by a kernel density estimation. Some experiments show advantages and shortcomings of the resulting regression method in comparison to the standard Nadaraya-Watson regression technique, which approximates the optimum by the expectation value.

1 Introduction

Regression is a very important method in engineering and science for estimating the dependencies between two or more variables on the basis of some given sample points. The best-known regression method is certainly the parametric regression technique of Legendre and Gauss, which minimizes the squared error between a model, often a polynomial, and the samples. The least squares method is fast and well suited for strongly linearly correlated data, but seldom appropriate for high-dimensional problems with difficult, unknown, and non-linear dependencies. For these problems, non-parametric regression techniques, such as kernel or Nadaraya-Watson regression methods, are more suitable (Nadaraya [1964], Watson [1964]). The first step of Nadaraya-Watson regression is to estimate the unknown joint density distribution $p_{X,Y}(x, y)$ of the given sample data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$
by a kernel density estimator [Scott, 1992]. The resulting model distribution has in the most general case the form

$$\tilde{p}_{X,Y}(x, y) = \sum_{i=1}^{m} a_i \,\phi_i(x - x_i)\,\psi_i(y - y_i) \qquad (1)$$

with $m \le n$ and $\sum_{i=1}^{m} a_i = 1$. Furthermore, the kernel functions $\phi_i$ and $\psi_i$ have to be normalized so that the integrals over all values of $x$ and $y$ are one. With this model, the conditional distribution $\tilde{p}_{Y|X}(y \mid x)$ can be easily derived:

$$\tilde{p}_{Y|X}(y \mid x) = \frac{\tilde{p}_{X,Y}(x, y)}{\tilde{p}_X(x)} = \frac{\tilde{p}_{X,Y}(x, y)}{\int_y \tilde{p}_{X,Y}(x, y)\,\mathrm{d}y} = \frac{\sum_{i=1}^{m} a_i \,\phi_i(x - x_i)\,\psi_i(y - y_i)}{\sum_{i=1}^{m} a_i \,\phi_i(x - x_i)}. \qquad (2)$$

This distribution represents the relative probabilities for realizations of $y$ given a vector $x$. But for a regression, we do not need a probability distribution, but a single vector. The most intuitive choice is, of course, the mode of the conditional distribution, that is, the value $y$ for which $\tilde{p}_{Y|X}$ becomes maximal. For this case, the regression function $\tilde{f}(x)$ takes the specific form

$$\tilde{y} = \tilde{f}(x) = \operatorname*{argmax}_y \{\tilde{p}_{Y|X}(y \mid x)\}. \qquad (3)$$

The difficulty is, however, that the maximization is not easy to calculate, because expression (2) is highly non-linear. On the other hand, the expected value of $\tilde{p}_{Y|X}(y \mid x)$ regarding $y$ is easy to calculate:

$$\int_y y\,\tilde{p}_{Y|X}(y \mid x)\,\mathrm{d}y = \frac{\sum_{i=1}^{m} a_i\, y_i\, \phi_i(x - x_i)}{\sum_{i=1}^{m} a_i\, \phi_i(x - x_i)}. \qquad (4)$$

The idea of Nadaraya and Watson was to approximate expression (3) by

$$\tilde{y} \approx \frac{\sum_{i=1}^{m} a_i\, y_i\, \phi_i(x - x_i)}{\sum_{i=1}^{m} a_i\, \phi_i(x - x_i)}, \qquad (5)$$

Figure 1: Joint distributions with unambiguous and ambiguous relations between $x$ and $y$. Panels (A) and (B) show $p_{X,Y}(x, y)$ together with the conditional distributions $p_{Y|X}(y \mid 3)$ and $p_{X|Y}(x \mid -0.6)$.

which is often sufficiently good.
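As an illustration (not part of the paper's code), expression (5) with Gaussian kernels and equal weights $a_i = 1/n$ can be sketched in a few lines of Python; the function name `nadaraya_watson` and the bandwidth parameter `h` are chosen here for this sketch only:

```python
import numpy as np

def nadaraya_watson(x_query, xs, ys, h):
    """Approximate E[y | x] via expression (5) with Gaussian kernels.

    xs, ys: 1-D arrays of sample points; h: kernel bandwidth (std. dev.).
    With equal weights a_i = 1/n the weights cancel in the quotient,
    as does the Gaussian normalization constant.
    """
    w = np.exp(-0.5 * ((x_query - xs) / h) ** 2)  # phi_i(x - x_i)
    return np.sum(w * ys) / np.sum(w)

# Noise-free samples of y = 2x: the estimate stays close to the true value.
xs = np.linspace(0.0, 1.0, 101)
ys = 2.0 * xs
print(nadaraya_watson(0.5, xs, ys, h=0.05))  # ≈ 1.0 by symmetry
```

For linear data the kernel-weighted mean reproduces the underlying function away from the boundary; the interesting behavior discussed next appears only for ambiguous data.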
But there are also some potential problems. Figure 1 demonstrates this for two different joint distributions. Case (A) is uncritical and describes essentially a hyperbolic tangent function. Case (B), however, causes problems. Here, the variables $x$ and $y$ are non-linearly correlated too, but the underlying dependency cannot be described by a function. The difficulty becomes obvious when considering the conditional distributions $p_{X|Y}(x \mid -0.6)$ and $p_{Y|X}(y \mid 3)$, which are no longer unimodal. A computation of the expected value would yield zero in both cases, which is far away from probable values for $y = -0.6$ or $x = 3$.

In contrast to the Nadaraya-Watson method, the calculation of expression (3) should not lead to problems. The argmax calculation cannot resolve the ambiguity, of course, because two global maxima exist, but it is better to return only one maximum than a completely incorrect value. Especially for high-dimensional tasks, this shortcoming of the Nadaraya-Watson method can be annoying, because the occurrence of ambiguity is difficult to recognize. To overcome this "curse of compromise", the next section proposes a method for solving expression (3) directly if the density estimation is given in the form (1).

2 Finding Global Maxima for Kernel Density Estimations

The fundamental idea of the method proposed here for finding the global maximum of a probability density function is to utilize its special properties. In general, the global maximum of an arbitrary function, which is, for example, given as a piece of code, can be found only by trial and error. In principle, this can also be applied to find the global maximum of a probability density function. But this "blind" search would be very inefficient, and the remaining likelihood of finding still better values does not become zero, regardless of how long the algorithm runs.
But for a probability density $h(y)$, this remaining likelihood can be reduced very fast by using $h$-distributed sample points instead of evenly distributed samples. Why is this so? To answer this question, we assume that $q$ accordingly distributed sample points $y_i$ with $i = 1, \dots, q$ have been generated. For each $y_i$, there is a percentage $\alpha_i$ of more improbable realizations $y$. Let $\alpha_i$ be 99%. The probability that all other $q - 1$ generated samples have lower $\alpha$-values is $(99\%/100\%)^{q-1}$. For $q = 10000$, this probability is only $2.27 \cdot 10^{-44}$! In practice, this means that it is impossible not to come close to the global maximum with 10000 $h$-distributed sample points. That is all.

Fortunately, the generation of accordingly distributed samples is not very difficult for kernel density estimations like expression (1). In the first step, we insert the given value $x$ into expression (3) and get

$$\tilde{y} = \operatorname*{argmax}_y \, h(y) \qquad (6)$$

with

$$h(y) := \sum_{i=1}^{m} b_i \,\psi_i(y - y_i) \qquad (7)$$

and

$$b_i := \frac{a_i\, \phi_i(x - x_i)}{\sum_{j=1}^{m} a_j\, \phi_j(x - x_j)}. \qquad (8)$$

Note that $h$ fulfills the requirements for a probability density function. Furthermore, the $b_i$ can be interpreted as probabilities because of $\sum_{j=1}^{m} b_j = 1$. (Many $b_i$ are very low for a given value $x$; the corresponding kernels should be omitted to improve the computation speed.) In the next step, we generate a dataset

$$D' = \{y'_1, \dots, y'_q\} \qquad (9)$$

of $h$-distributed random samples. For this purpose, we can utilize the distribution function $H$ of the density function $h$:

$$H(y) = \int_{-\infty}^{y} \sum_{i=1}^{m} b_i\, \psi_i(z - y_i)\,\mathrm{d}z = \sum_{i=1}^{m} b_i \int_{-\infty}^{y} \psi_i(z - y_i)\,\mathrm{d}z = \sum_{i=1}^{m} b_i\, \Psi_i(y - y_i). \qquad (10)$$

The distribution functions $\Psi_i$ for the kernels $\psi_i$ are usually known or at least easy to calculate. The generation of the $k = 1, \dots, q$ random samples (9) can be performed in three stages:
1. Choose randomly one of the $m$ kernels $\psi_i$ according to the probabilities $b_i$.

2. Let $j$ be the choice of the first stage. Generate a $\psi_j$-distributed random value using the distribution function $\Psi_j$.

3. Add the kernel center $y_j$ to the random value from stage two to get a random sample $y'_k$.

After that, we calculate the function values $h(y'_k)$ for all $k = 1, \dots, q$ of dataset (9). The argument $y'_k$ for which $h(y'_k)$ becomes maximal is then a good starting point for a local optimization method such as gradient ascent [Duda et al., 2000].

3 Implementation Example

The following Matlab code snippet (tested with Matlab 7.1) implements the described method for multidimensional Gaussian kernels with diagonal covariance matrices.

function [xm,pxm] = findMax(para,q)
  m = length(para.a);
  d = length(para.x(1,:));
  % cumulative weights for choosing a kernel (stage one)
  cdf = zeros(1,m);
  for i = 2:m
    cdf(i) = cdf(i-1) + para.a(i-1);
  end
  xr = zeros(q,d);
  yr = zeros(q,1);
  for i = 1:q
    % stage one: select the kernel index ri according to the weights
    rv = rand(1);
    lvi = find(cdf < rv);
    ri = lvi(length(lvi));
    % stages two and three: draw a Gaussian sample around the center
    % (para.s holds standard deviations, so they scale randn directly)
    xr(i,:) = randn(1,d).*para.s(ri,:) + para.x(ri,:);
    yr(i) = KDE(xr(i,:),para);
  end
  [pxm,xi] = max(yr);
  xm = xr(xi,:);

The parameters of expression (7) are combined into the structure para with three elements: para.a are the weights $b_i$, para.s the standard deviations, and para.x the centers of the Gaussian kernels. The function KDE calculates the estimated density value for a given x:

function y = KDE(x,para)
  m = length(para.a);
  y = 0;
  for i = 1:m
    y = y + para.a(i)*gauss(x,para.x(i,:),para.s(i,:));
  end

function y = gauss(x,m,s)
  y = prod(1./(sqrt(2*pi)*s)).*exp(sum(-(x-m).^2./(2*s.^2)));

The gradient ascent is not performed in this example. Figure 2 shows the results of an experiment with the function findMax for different values of q.
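For readers without Matlab, the same three-stage scheme can be sketched in Python; this is an illustrative port, not the paper's code, and the names `find_max`, `kde`, and the example mixture below are assumptions of this sketch:

```python
import numpy as np

def kde(x, weights, centers, stds):
    """Density of a Gaussian mixture with diagonal covariance, as in (7)."""
    norm = np.prod(1.0 / (np.sqrt(2.0 * np.pi) * stds), axis=1)
    expo = np.exp(-0.5 * np.sum(((x - centers) / stds) ** 2, axis=1))
    return float(np.sum(weights * norm * expo))

def find_max(weights, centers, stds, q, rng):
    """Stage one: pick kernels according to the weights b_i.
    Stages two and three: draw Gaussian samples around the chosen centers.
    Return the sample with the highest density value."""
    m, d = centers.shape
    idx = rng.choice(m, size=q, p=weights)
    samples = rng.standard_normal((q, d)) * stds[idx] + centers[idx]
    values = np.array([kde(s, weights, centers, stds) for s in samples])
    best = int(np.argmax(values))
    return samples[best], values[best]

# A bimodal example: two well-separated kernels near (2, 2) and (-2, -2).
weights = np.array([0.5, 0.5])
centers = np.array([[2.0, 2.0], [-2.0, -2.0]])
stds = np.array([[0.5, 0.5], [0.5, 0.5]])
rng = np.random.default_rng(0)
xm, pxm = find_max(weights, centers, stds, q=1000, rng=rng)
# xm lies close to one of the two modes; which one may vary per run.
```

As in the Matlab version, no gradient ascent is performed; the best sample itself already lies close to a mode.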
The parameters of the probability density function $h(y)$ were

para.a = [0.45,0.45,0.1];
para.x = [[1,1];[-1,-1];[-1.5,1.5]];
para.s = [[1,1];[1,1];[0.5,0.5]];

That means that there are two global maxima, at approximately $(1\ 1)^T$ and $-(1\ 1)^T$. For this reason, both could be the result returned by the algorithm, but only one of these possibilities is returned per run. The plots also show that the distribution of the computed points becomes more compact with increasing $q$. Figure 3 shows the result for a more complex density with 80 kernels and several local maxima.

Figure 2: Contour plots for a two-dimensional kernel density estimation with two global maxima at about $(1\ 1)^T$ and $-(1\ 1)^T$. Both were found by the algorithm (black marks). For the left-hand plot, $q$ was 100, and for the right-hand plot, 1000.

4 Regression Experiments

This section investigates the properties of the described method. The first experiment compares the standard Nadaraya-Watson method and the proposed method with computation of the mode with regard to their ability to estimate a clear functional dependency between $x$ and $y$. For this purpose, a dataset of $n = 1000$ random sample points of the function $y = \sin(x^{8/5})$ in the interval $[0, 2\pi]$ was generated. Furthermore, slight Gaussian-distributed noise with a standard deviation of $\sigma_N = 0.2$ was added to the $y$-values. The dataset is shown on the left of Figure 4.

Before applying the two regression methods, the distribution of the data has to be modeled by a kernel density. Different types of kernels can be applied. One of the simplest is the $d$-dimensional Gaussian kernel

$$g(x, s) = \prod_{k=1}^{d} \frac{1}{\sqrt{2\pi}\, s_k} \exp\left(-\frac{x_k^2}{2 s_k^2}\right) \qquad (11)$$

with $s = (s_1 \dots s_d)^T$ as the only free parameter.
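The product kernel (11) is straightforward to evaluate; the following Python snippet is an illustrative sketch (not from the paper), with the name `gauss_kernel` chosen here for convenience:

```python
import numpy as np

def gauss_kernel(x, s):
    """d-dimensional Gaussian product kernel, expression (11).

    x: offset vector (x_1 ... x_d); s: per-dimension standard deviations.
    """
    x, s = np.asarray(x, float), np.asarray(s, float)
    norm = np.prod(1.0 / (np.sqrt(2.0 * np.pi) * s))
    return norm * np.exp(-0.5 * np.sum((x / s) ** 2))

# At the origin the kernel attains its maximum, prod_k 1/(sqrt(2*pi) s_k);
# for s = (0.1, 0.1) this equals 1/(2*pi*0.01), about 15.92.
print(gauss_kernel([0.0, 0.0], [0.1, 0.1]))
```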
Its application to the two-dimensional problem of Figure 4 yields

$$\tilde{p}(x) = \frac{1}{n} \sum_{i=1}^{n} g(x - x_i, s) \qquad (12)$$

with $x = (x\ y)^T$. For high-dimensional problems, the smoothness $s$ has to be optimized automatically with respect to a certain quality measure, such as the self-contribution [Duin, 1976], for example. Another method is plug-in estimation. A good overview of this topic is given by Turlach [1993] or Scott [1992].

Figure 3: The result for a more complex density with 80 kernels and several local maxima. $q$ was 100 (left) and 1000 (right).

Figure 4: Two datasets.

For the two-dimensional dataset in Figure 4, it is still possible to estimate the smoothness parameter visually. For $s = (0.1\ 0.1)^T$, the resulting density is drawn as a contour plot at the top of Figure 5. Furthermore, the figure provides the result of the Nadaraya-Watson regression (left) and of the proposed method (right) as white dotted lines. Every point represents the outcome for a single given value $x$.

Figure 5: The results of the Nadaraya-Watson method (left) and of the optimization method (right). The density estimation above shows a clear dependency between $x$ and $y$, contrary to the one below.

The figure also demonstrates that the Nadaraya-Watson regression performs clearly better here. This is surprising only at first glance. Formally, both regression methods should give the same results, because expected value and maximum are identical for the true conditional probability distribution

$$p_{Y|X}(y \mid x) = g(y - \sin(x^{8/5}), \sigma_N). \qquad (13)$$
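This identity is easy to verify numerically. The following sketch (not part of the paper) evaluates the mean and the mode of the unimodal conditional density (13) on a grid for one value of $x$ and confirms that they coincide:

```python
import numpy as np

# Conditional density (13) for x = 2.0 and sigma_N = 0.2, on a fine grid.
x, sigma_n = 2.0, 0.2
ys = np.linspace(-2.0, 2.0, 4001)
dens = np.exp(-0.5 * ((ys - np.sin(x ** 1.6)) / sigma_n) ** 2)

w = dens / dens.sum()          # normalized weights on the grid
mean = float(np.sum(ys * w))   # expected value (Nadaraya-Watson target)
mode = float(ys[np.argmax(dens)])  # maximum of the density (mode target)
print(mean, mode)              # both near sin(2^{8/5})
```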
But both regression techniques have different susceptibilities to estimation errors, and the property of the expected value to average out noise leads to a much smoother curve progression. This advantage becomes a shortcoming if the dependencies within the data are ambiguous. To demonstrate this, a second dataset was generated (Figure 4, right). Now, for most values of $x$, two values of $y$ with different emphasis are reasonable. The Nadaraya-Watson regression calculates a "compromise" for every $x$. This can lead to very improbable values for $y$. The optimization method, however, always chooses the most probable value and is therefore immune to this effect.

5 Conclusion

If a dependency between some data is clear and unambiguous, the standard Nadaraya-Watson method or, still better, the locally linearizing Nadaraya-Watson approach [Cleveland, 1979] should be preferred for the modeling. But for numerous real-life applications, it cannot be guaranteed that this condition is fulfilled, because, for example, the data may be collected online. Another difficulty is high dimensionality. The dependency may be simple and unambiguous between some of the vector components, but between others it may not. Every input-output combination would have to be checked, which is mostly impracticable.

For such general and complex cases, the proposed method is more suitable, because the assumption of unambiguousness is not necessary. The approach always returns a prediction that is probable with respect to the knowledge given by the sample data. For some applications, this property is more important than the continuity of the curve and its smoothness. An example of such a scenario is machine control. The data are in this case measurements from actuators and sensors. The controller continuously has to solve the problem of which actuator values lead to the desired sensor values.
For this purpose, one good setting is already sufficient, regardless of whether several possibilities exist or not. An average value or a "compromise", however, is mostly a bad decision.

References

W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 1979.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2000.

R. P. W. Duin. On the choice of the smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, Vol. C-25, No. 11:1175-1179, 1976.

E. A. Nadaraya. On estimating regression. Theory of Probability and its Applications, Vol. 9:141-142, 1964.

David W. Scott. Multivariate Density Estimation. John Wiley & Sons, Inc., 1992.

Berwin A. Turlach. Bandwidth selection in kernel density estimation: A review. CORE and Institut de Statistique, 1993.

G. S. Watson. Smooth regression analysis. Sankhya, Series A, 26:359-372, 1964.