Kernel Regression by Mode Calculation of the Conditional Probability Distribution
Steffen Kühn
Technische Universität Berlin
Chair of Electronic Measurement and Diagnostic Technology
Einsteinufer 17, 10585 Berlin, Germany
steffen.kuehn@tu-berlin.de

Abstract

The most direct way to express arbitrary dependencies in datasets is to estimate the joint distribution and afterwards to apply the argmax function to obtain the mode of the corresponding conditional distribution. This method is difficult in practice, because it requires a global optimization of a complicated function, the joint distribution for fixed input variables. This article proposes a method for finding global maxima if the joint distribution is modeled by a kernel density estimation. Some experiments show advantages and shortcomings of the resulting regression method in comparison to the standard Nadaraya-Watson regression technique, which approximates the optimum by the expectation value.

1 Introduction

Regression is a very important method in engineering and science for estimating the dependencies between two or more variables on the basis of some given sample points. The best-known regression method is certainly the parametric regression technique of Legendre and Gauss, which minimizes the squared error between a model, often a polynomial, and the samples. The least squares method is fast and well suited for strongly linearly correlated data, but seldom appropriate for high-dimensional problems with difficult, unknown, and non-linear dependencies. For these problems, non-parametric regression techniques, such as kernel or Nadaraya-Watson regression methods, are more suitable (Nadaraya [1964], Watson [1964]). The first step of Nadaraya-Watson regression is to estimate the unknown joint density distribution $p_{X,Y}(x, y)$ of the given sample data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$
by a kernel density estimator [Scott, 1992]. The resulting model distribution has in the most general case the form

$$\tilde{p}_{X,Y}(x, y) = \sum_{i=1}^{m} a_i \,\phi_i(x - x_i)\,\psi_i(y - y_i) \qquad (1)$$

with $m \le n$ and $\sum_{i=1}^{m} a_i = 1$. Furthermore, the kernel functions $\phi_i$ and $\psi_i$ have to be normalized so that the integrals over all values of $x$ and $y$ are one. With this model, the conditional distribution $\tilde{p}_{Y|X}(y \mid x)$ can be easily derived:

$$\tilde{p}_{Y|X}(y \mid x) = \frac{\tilde{p}_{X,Y}(x, y)}{\tilde{p}_X(x)} = \frac{\tilde{p}_{X,Y}(x, y)}{\int_y \tilde{p}_{X,Y}(x, y)\,\mathrm{d}y} = \frac{\sum_{i=1}^{m} a_i \,\phi_i(x - x_i)\,\psi_i(y - y_i)}{\sum_{i=1}^{m} a_i \,\phi_i(x - x_i)}. \qquad (2)$$

This distribution represents the relative probabilities for realizations of $y$ given a vector $x$. But for a regression, we do not need a probability distribution, but a single vector. The most intuitive choice is, of course, the mode of the conditional distribution, that is, the value $y$ for which $\tilde{p}_{Y|X}$ becomes maximal. For this case, the regression function $\tilde{f}(x)$ takes the specific form

$$\tilde{y} = \tilde{f}(x) = \operatorname*{argmax}_y \{\tilde{p}_{Y|X}(y \mid x)\}. \qquad (3)$$

The difficulty is, however, that the maximization is not easy to calculate, because expression (2) is highly non-linear. On the other hand, the expected value of $\tilde{p}_{Y|X}(y \mid x)$ regarding $y$ is easy to calculate:

$$\int_y y\,\tilde{p}_{Y|X}(y \mid x)\,\mathrm{d}y = \frac{\sum_{i=1}^{m} a_i\, y_i\, \phi_i(x - x_i)}{\sum_{i=1}^{m} a_i\, \phi_i(x - x_i)}. \qquad (4)$$

The idea of Nadaraya and Watson was to approximate expression (3) by

$$\tilde{y} \approx \frac{\sum_{i=1}^{m} a_i\, y_i\, \phi_i(x - x_i)}{\sum_{i=1}^{m} a_i\, \phi_i(x - x_i)}, \qquad (5)$$

Figure 1: Joint distributions with unambiguous and ambiguous relations between $x$ and $y$. Panels (A) and (B) show $p_{X,Y}(x, y)$ together with the conditional distributions $p_{Y|X}(y \mid 3)$ and $p_{X|Y}(x \mid -0.6)$.

which is often sufficiently good.
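As an illustration (not part of the paper's code), expression (5) with Gaussian kernels and equal weights $a_i = 1/n$ can be sketched in a few lines of Python; the function name `nadaraya_watson` and the bandwidth parameter `h` are chosen here for this sketch only:

```python
import numpy as np

def nadaraya_watson(x_query, xs, ys, h):
    """Approximate E[y | x] via expression (5) with Gaussian kernels.

    xs, ys: 1-D arrays of sample points; h: kernel bandwidth (std. dev.).
    With equal weights a_i = 1/n the weights cancel in the quotient,
    as does the Gaussian normalization constant.
    """
    w = np.exp(-0.5 * ((x_query - xs) / h) ** 2)  # phi_i(x - x_i)
    return np.sum(w * ys) / np.sum(w)

# Noise-free samples of y = 2x: the estimate stays close to the true value.
xs = np.linspace(0.0, 1.0, 101)
ys = 2.0 * xs
print(nadaraya_watson(0.5, xs, ys, h=0.05))  # ≈ 1.0 by symmetry
```

For linear data the kernel-weighted mean reproduces the underlying function away from the boundary; the interesting behavior discussed next appears only for ambiguous data.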
But there are also some potential problems. Figure 1 demonstrates this for two different joint distributions. Case (A) is uncritical and describes essentially a hyperbolic tangent function. Case (B), however, causes problems. Here, the variables $x$ and $y$ are non-linearly correlated too, but the underlying dependency cannot be described by a function. The difficulty becomes obvious when considering the conditional distributions $p_{X|Y}(x \mid -0.6)$ and $p_{Y|X}(y \mid 3)$, which are no longer unimodal. A computation of the expected value would yield zero in both cases, which is far away from probable values for $y = -0.6$ or $x = 3$.

In contrast to the Nadaraya-Watson method, the calculation of expression (3) should not lead to problems. The argmax calculation cannot resolve the ambiguity, of course, because two global maxima exist, but it is better to return only one maximum than a completely incorrect value. Especially for high-dimensional tasks, this shortcoming of the Nadaraya-Watson method can be annoying, because the occurrence of ambiguity is difficult to recognize. To overcome this "curse of compromise", the next section proposes a method for solving expression (3) directly if the density estimation is given in the form (1).

2 Finding Global Maxima for Kernel Density Estimations

The fundamental idea of the method proposed here for finding the global maximum of a probability density function is to utilize its special properties. In general, the global maximum of an arbitrary function, which is, for example, given as a piece of code, can be found only by trial and error. In principle, this can also be applied to find the global maximum of a probability density function. But this "blind" search would be very inefficient, and the remaining likelihood of finding still better values does not become zero, regardless of how long the algorithm runs.
But for a probability density $h(y)$, this remaining likelihood can be reduced very fast by using $h$-distributed sample points instead of evenly distributed samples. Why is this so? To answer this question, we assume that $q$ accordingly distributed sample points $y_i$ with $i = 1, \dots, q$ have been generated. For each $y_i$, there is a percentage $\alpha_i$ of more improbable realizations $y$. Let $\alpha_i$ be 99%. The probability that all other $q - 1$ generated samples have lower $\alpha$-values is $(99\%/100\%)^{q-1}$. For $q = 10000$, this probability is only $2.27 \cdot 10^{-44}$! In practice, this means that it is impossible not to come close to the global maximum with 10000 $h$-distributed sample points. That is all.

Fortunately, the generation of accordingly distributed samples is not very difficult for kernel density estimations like expression (1). In the first step, we insert the given value $x$ into expression (3) and get

$$\tilde{y} = \operatorname*{argmax}_y \, h(y) \qquad (6)$$

with

$$h(y) := \sum_{i=1}^{m} b_i \,\psi_i(y - y_i) \qquad (7)$$

and

$$b_i := \frac{a_i\, \phi_i(x - x_i)}{\sum_{j=1}^{m} a_j\, \phi_j(x - x_j)}. \qquad (8)$$

Note that $h$ fulfills the requirements for a probability density function. Furthermore, the $b_i$ can be interpreted as probabilities because of $\sum_{j=1}^{m} b_j = 1$. (Many $b_i$ are very low for a given value $x$; the corresponding kernels should be omitted to improve the computation speed.) In the next step, we generate a dataset

$$D' = \{y'_1, \dots, y'_q\} \qquad (9)$$

of $h$-distributed random samples. For this purpose, we can utilize the distribution function $H$ of the density function $h$:

$$H(y) = \int_{-\infty}^{y} \sum_{i=1}^{m} b_i\, \psi_i(z - y_i)\,\mathrm{d}z = \sum_{i=1}^{m} b_i \int_{-\infty}^{y} \psi_i(z - y_i)\,\mathrm{d}z = \sum_{i=1}^{m} b_i\, \Psi_i(y - y_i). \qquad (10)$$

The distribution functions $\Psi_i$ for the kernels $\psi_i$ are usually known or at least easy to calculate. The generation of the $k = 1, \dots, q$ random samples (9) can be performed in three stages:
1. Choose randomly one of the $m$ kernels $\psi_i$ according to the probabilities $b_i$.

2. Let $j$ be the choice of the first stage. Generate a $\psi_j$-distributed random value using the distribution function $\Psi_j$.

3. Add the kernel center $y_j$ to the random value from stage two to get a random sample $y'_k$.

After that, we calculate the function values $h(y'_k)$ for all $k = 1, \dots, q$ of dataset (9). The argument $y'_k$ for which $h(y'_k)$ becomes maximal is then a good starting point for a local optimization method such as gradient ascent [Duda et al., 2000].

3 Implementation Example

The following Matlab code snippet (tested with Matlab 7.1) implements the described method for multidimensional Gaussian kernels with diagonal covariance matrices.

function [xm,pxm] = findMax(para,q)
  m = length(para.a);
  d = length(para.x(1,:));
  % cumulative weights for choosing a kernel (stage one)
  cdf = zeros(1,m);
  for i = 2:m
    cdf(i) = cdf(i-1) + para.a(i-1);
  end
  xr = zeros(q,d);
  yr = zeros(q,1);
  for i = 1:q
    % stage one: select the kernel index ri according to the weights
    rv = rand(1);
    lvi = find(cdf < rv);
    ri = lvi(length(lvi));
    % stages two and three: draw a Gaussian sample around the center
    % (para.s holds standard deviations, so they scale randn directly)
    xr(i,:) = randn(1,d).*para.s(ri,:) + para.x(ri,:);
    yr(i) = KDE(xr(i,:),para);
  end
  [pxm,xi] = max(yr);
  xm = xr(xi,:);

The parameters of expression (7) are combined into the structure para with three elements: para.a are the weights $b_i$, para.s the standard deviations, and para.x the centers of the Gaussian kernels. The function KDE calculates the estimated density value for a given x:

function y = KDE(x,para)
  m = length(para.a);
  y = 0;
  for i = 1:m
    y = y + para.a(i)*gauss(x,para.x(i,:),para.s(i,:));
  end

function y = gauss(x,m,s)
  y = prod(1./(sqrt(2*pi)*s)).*exp(sum(-(x-m).^2./(2*s.^2)));

The gradient ascent is not performed in this example. Figure 2 shows the results of an experiment with the function findMax for different values of q.
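For readers without Matlab, the same three-stage scheme can be sketched in Python; this is an illustrative port, not the paper's code, and the names `find_max`, `kde`, and the example mixture below are assumptions of this sketch:

```python
import numpy as np

def kde(x, weights, centers, stds):
    """Density of a Gaussian mixture with diagonal covariance, as in (7)."""
    norm = np.prod(1.0 / (np.sqrt(2.0 * np.pi) * stds), axis=1)
    expo = np.exp(-0.5 * np.sum(((x - centers) / stds) ** 2, axis=1))
    return float(np.sum(weights * norm * expo))

def find_max(weights, centers, stds, q, rng):
    """Stage one: pick kernels according to the weights b_i.
    Stages two and three: draw Gaussian samples around the chosen centers.
    Return the sample with the highest density value."""
    m, d = centers.shape
    idx = rng.choice(m, size=q, p=weights)
    samples = rng.standard_normal((q, d)) * stds[idx] + centers[idx]
    values = np.array([kde(s, weights, centers, stds) for s in samples])
    best = int(np.argmax(values))
    return samples[best], values[best]

# A bimodal example: two well-separated kernels near (2, 2) and (-2, -2).
weights = np.array([0.5, 0.5])
centers = np.array([[2.0, 2.0], [-2.0, -2.0]])
stds = np.array([[0.5, 0.5], [0.5, 0.5]])
rng = np.random.default_rng(0)
xm, pxm = find_max(weights, centers, stds, q=1000, rng=rng)
# xm lies close to one of the two modes; which one may vary per run.
```

As in the Matlab version, no gradient ascent is performed; the best sample itself already lies close to a mode.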
The parameters of the probability density function $h(y)$ were

para.a = [0.45,0.45,0.1];
para.x = [[1,1];[-1,-1];[-1.5,1.5]];
para.s = [[1,1];[1,1];[0.5,0.5]];

That means that there are two global maxima, at approximately $(1\ 1)^T$ and $-(1\ 1)^T$. For this reason, both could be the result returned by the algorithm, but only one of these possibilities is returned per run. The plots also show that the distribution of the computed points becomes more compact with increasing $q$. Figure 3 shows the result for a more complex density with 80 kernels and several local maxima.

Figure 2: Contour plots for a two-dimensional kernel density estimation with two global maxima at about $(1\ 1)^T$ and $-(1\ 1)^T$. Both were found by the algorithm (black marks). For the left-hand plot, $q$ was 100, and for the right-hand plot, 1000.

4 Regression Experiments

This section investigates the properties of the described method. The first experiment compares the standard Nadaraya-Watson method and the proposed method with computation of the mode with regard to their ability to estimate a clear functional dependency between $x$ and $y$. For this purpose, a dataset of $n = 1000$ random sample points of the function $y = \sin(x^{8/5})$ in the interval $[0, 2\pi]$ was generated. Furthermore, slight Gaussian-distributed noise with a standard deviation of $\sigma_N = 0.2$ was added to the $y$-values. The dataset is shown on the left of Figure 4.

Before applying the two regression methods, the distribution of the data has to be modeled by a kernel density. Different types of kernels can be applied. One of the simplest is the $d$-dimensional Gaussian kernel

$$g(x, s) = \prod_{k=1}^{d} \frac{1}{\sqrt{2\pi}\, s_k} \exp\left(-\frac{x_k^2}{2 s_k^2}\right) \qquad (11)$$

with $s = (s_1 \dots s_d)^T$ as the only free parameter.
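The product kernel (11) is straightforward to evaluate; the following Python snippet is an illustrative sketch (not from the paper), with the name `gauss_kernel` chosen here for convenience:

```python
import numpy as np

def gauss_kernel(x, s):
    """d-dimensional Gaussian product kernel, expression (11).

    x: offset vector (x_1 ... x_d); s: per-dimension standard deviations.
    """
    x, s = np.asarray(x, float), np.asarray(s, float)
    norm = np.prod(1.0 / (np.sqrt(2.0 * np.pi) * s))
    return norm * np.exp(-0.5 * np.sum((x / s) ** 2))

# At the origin the kernel attains its maximum, prod_k 1/(sqrt(2*pi) s_k);
# for s = (0.1, 0.1) this equals 1/(2*pi*0.01), about 15.92.
print(gauss_kernel([0.0, 0.0], [0.1, 0.1]))
```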
Its application to the two-dimensional problem of Figure 4 yields

$$\tilde{p}(x) = \frac{1}{n} \sum_{i=1}^{n} g(x - x_i, s) \qquad (12)$$

with $x = (x\ y)^T$. For high-dimensional problems, the smoothness $s$ has to be optimized automatically with respect to a certain quality measure, such as the self-contribution [Duin, 1976], for example. Another method is plug-in estimation. A good overview of this topic is given by Turlach [1993] or Scott [1992].

Figure 3: The result for a more complex density with 80 kernels and several local maxima. $q$ was 100 (left) and 1000 (right).

Figure 4: Two datasets.

For the two-dimensional dataset in Figure 4, it is still possible to estimate the smoothness parameter visually. For $s = (0.1\ 0.1)^T$, the resulting density is drawn as a contour plot at the top of Figure 5. Furthermore, the figure provides the result of the Nadaraya-Watson regression (left) and of the proposed method (right) as white dotted lines. Every point represents the outcome for a single given value $x$.

Figure 5: The results of the Nadaraya-Watson method (left) and of the optimization method (right). The density estimation above shows a clear dependency between $x$ and $y$, contrary to the one below.

The figure also demonstrates that the Nadaraya-Watson regression performs clearly better here. This is surprising only at first glance. Formally, both regression methods should give the same results, because expected value and maximum are identical for the true conditional probability distribution

$$p_{Y|X}(y \mid x) = g(y - \sin(x^{8/5}), \sigma_N). \qquad (13)$$
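This identity is easy to verify numerically. The following sketch (not part of the paper) evaluates the mean and the mode of the unimodal conditional density (13) on a grid for one value of $x$ and confirms that they coincide:

```python
import numpy as np

# Conditional density (13) for x = 2.0 and sigma_N = 0.2, on a fine grid.
x, sigma_n = 2.0, 0.2
ys = np.linspace(-2.0, 2.0, 4001)
dens = np.exp(-0.5 * ((ys - np.sin(x ** 1.6)) / sigma_n) ** 2)

w = dens / dens.sum()          # normalized weights on the grid
mean = float(np.sum(ys * w))   # expected value (Nadaraya-Watson target)
mode = float(ys[np.argmax(dens)])  # maximum of the density (mode target)
print(mean, mode)              # both near sin(2^{8/5})
```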
But both regression techniques have different susceptibilities to estimation errors, and the property of the expected value to average out noise leads to a much smoother curve progression. This advantage becomes a shortcoming if the dependencies within the data are ambiguous. To demonstrate this, a second dataset was generated (Figure 4, right). Now, for most values of $x$, two values of $y$ with different emphasis are reasonable. The Nadaraya-Watson regression calculates a "compromise" for every $x$. This can lead to very improbable values for $y$. The optimization method, however, always chooses the most probable value and is therefore immune to this effect.

5 Conclusion

If a dependency between some data is clear and unambiguous, the standard Nadaraya-Watson method or, still better, the locally linearizing Nadaraya-Watson approach [Cleveland, 1979] should be preferred for the modeling. But for numerous real-life applications, it cannot be guaranteed that this condition is fulfilled, because, for example, the data may be collected online. Another difficulty is high dimensionality. The dependency may be simple and unambiguous between some of the vector components, but between others it may not. Every input-output combination would have to be checked, which is mostly impracticable.

For such general and complex cases, the proposed method is more suitable, because the assumption of unambiguousness is not necessary. The approach always returns a prediction that is probable with respect to the knowledge given by the sample data. For some applications, this property is more important than the continuity of the curve and its smoothness. An example of such a scenario is machine control. The data are in this case measurements from actuators and sensors. The controller continuously has to solve the problem of which actuator values lead to the desired sensor values.
For this purpose, one good setting is already sufficient, regardless of whether several possibilities exist or not. An average value or a "compromise", however, is mostly a bad decision.

References

W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 1979.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2000.

R. P. W. Duin. On the choice of the smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, Vol. C-25, No. 11:1175-1179, 1976.

E. A. Nadaraya. On estimating regression. Theory of Probability and its Applications, Vol. 9:141-142, 1964.

David W. Scott. Multivariate Density Estimation. John Wiley & Sons, Inc., 1992.

Berwin A. Turlach. Bandwidth selection in kernel density estimation: A review. CORE and Institut de Statistique, 1993.

G. S. Watson. Smooth regression analysis. Sankhya, Series A, 26:359-372, 1964.