Multiclass Semi-Supervised Learning on Graphs using Ginzburg-Landau Functional Minimization

Multiclass Semi-Sup ervised Learning on Graphs using Ginzburg-Landau F unctional Minimization Cristina Garcia-Cardona 1 , Arjuna Flenner 2 , and Allon G. P ercus 1 1 Claremon t Graduate Univ ersity , Institute of Mathematical Sciences, Claremon t, CA 91711, USA, cristina.cgarcia@gmail.com, allon.p ercus@cgu.edu 2 Na v al Air W arfare Center, Physics and Computational Sciences, China Lake, CA 93555, USA Abstract. W e presen t a graph-based v ariational algorithm for classiﬁ- cation of high-dimensional data, generalizing the binary diﬀuse interface mo del to the case of multiple classes. Motiv ated by total v ariation tech- niques, the metho d inv olv es minimizing an energy functional made up of three terms. The ﬁrst t wo terms promote a step wise contin uous clas- siﬁcation function with sharp transitions betw een classes, while preserv- ing symmetry among the class lab els. The third term is a data ﬁdelity term, allowing us to incorp orate prior information into the mo del in a semi-sup ervised framework. The performance of the algorithm on syn- thetic data, as w ell as on the COIL and MNIST b enc hmark datasets, is comp etitiv e with state-of-the-art graph-based multiclass segmentation metho ds. Keyw ords: diﬀuse in terfaces, learning on graphs, semi-sup ervised meth- o ds 1 In tro duction Man y tasks in pattern recognition and machine learning rely on the abilit y to quan tify lo cal similarities in data, and to infer meaningful global structure from suc h lo cal characteristics [8]. In the classiﬁcation framework, the desired global structure is a descriptiv e partition of the data in to categories or classes. Man y studies hav e b een devoted to the binary classiﬁcation problems. The m ultiple- class case, where data are partitioned into more than t wo clusters, is more c hal- lenging. One approac h is to treat the problem as a series of binary classiﬁca- tion problems [1]. In this pap er, w e develop an alternative metho d, inv olving a m ultiple-class extension of the diﬀuse in terface mo del introduced in [4]. The diﬀuse interface mo del by Bertozzi and Flenner com bines metho ds for diﬀusion on graphs with eﬃcien t partial diﬀerential equation techniques to solve binary segmentation problems. As with other metho ds inspired b y physical phe- nomena [3,17,21], it requires the minimization of an energy expression, sp eciﬁ- cally the Ginzburg-Landau (GL) energy functional. The formulation generalizes the GL functional to the case of functions deﬁned on graphs, and its minimiza- tion is related to the minimization of weigh ted graph cuts [4]. In this sense, it 2 Garcia-Cardona et al. parallels other techniques based on inference on graphs via diﬀusion op erators or function estimation [8,7,31,26,28,5,25,15]. Multiclass segmen tation methods that cast the problem as a series of binary classiﬁcation problems use a num b er of diﬀerent strategies: (i) deal directly with some binary co ding or indicator for the lab els [9,28], (ii) build a hierarch y or com bination of classiﬁers based on the one-vs-all approach or on class rank- ings [14,13] or (iii) apply a recursive partitioning scheme consisting of succes- siv ely sub dividing clusters, until the desired num b er of classes is reached [25,15]. While there are adv an tages to these approaches, such as p ossible robustness to mislab eled data, there can b e a considerable num b er of classiﬁers to compute, and p erformance is aﬀected b y the num b er of classes to partition. In contrast, we prop ose an extension of the diﬀuse in terface mo del that ob- tains a sim ultaneous segmentation into m ultiple classes. The m ulticlass extension is built by mo difying the GL energy functional to remov e the prejudicial eﬀect that the order of the lab elings, given by integer v alues, has in the smo othing term of the original binary diﬀuse interface mo del. A new term that promotes homogenization in a multiclass setup is introduced. The expression penalizes data points that are lo cated close in the graph but are not assigned to the same class. This p enalt y is applied indep endently of ho w diﬀerent the in teger v alues are, represen ting the class lab els. In this w ay , the c haracteristics of the mul- ticlass classiﬁcation task are incorp orated directly in to the energy functional, with a measure of smoothness indep enden t of label order, allowing us to obtain high-qualit y results. Alternative multiclass metho ds minimize a Kullback-Leibler div ergence function [23] or expressions inv olving the discrete Laplace op erator on graphs [30,28]. This paper is organized as follows. Section 2 reviews the diﬀuse interface mo del for binary classiﬁcation, and describ es its application to semi-sup ervised learning. Section 3 discusses our prop osed multiclass extension and the corre- sp onding computational algorithm. Section 4 presents results obtained with our metho d. Finally , section 5 dra ws conclusions and delineates future work. 2 Data Segmen tation with the Ginzburg-Landau Mo del The diﬀuse interface model [4] is based on a con tinuous approac h, using the Ginzburg-Landau (GL) energy functional to measure the qualit y of data seg- men tation. A go od segmentation is c haracterized by a state with small energy . Let u ( x ) b e a scalar ﬁeld deﬁned ov er a space of arbitrary dimensionality , and represen ting the state of the system. The GL energy is written as the functional GL( u ) =  2 Z |∇ u | 2 d x + 1  Z Φ ( u ) d x , (1) with ∇ denoting the spatial gradient operator,  > 0 a real constant v alue, and Φ a double w ell p oten tial with minima at ± 1: Φ ( u ) = 1 4  u 2 − 1  2 . (2) Multiclass SSL on Graphs using Ginzburg-Landau F unctional Minimization 3 Segmen tation requires minimizing the GL functional. The norm of the gra- dien t is a smo othing term that p enalizes v ariations in the ﬁeld u . The potential term, on the other hand, comp els u to adopt the discrete lab els of +1 or − 1, clustering the state of the system around tw o classes. Jointly minimizing these t wo terms pushes the system domain tow ards homogeneous regions with v alues close to the minima of the double w ell potential, making the mo del appropriate for binary segmen tation. The smoothing term and p oten tial term are in conﬂict at the in terface b e- t ween the tw o regions, with the ﬁrst term fa voring a gradual transition, and the second term p enalizing deviations from the discrete lab els. A compromise b et ween these conﬂicting goals is established via the constan t  . A small v alue of  denotes a small length transition and a sharp er interface, while a large  w eights the gradient norm more, leading to a slow er transition. The result is a diﬀuse in te rface betw een regions, with sharpness regulated by  . It can b e sho wn that in the limit  → 0 this function approximates the total v ariation (TV) form ulation in the sense of functional ( Γ ) con vergence [18], pro- ducing piecewise constan t solutions but with greater computational eﬃciency than conv entional TV minimization methods. Th us, the diﬀuse interface mo del pro vides a framework to compute piecewise constan t functions with diﬀuse tran- sitions, approaching the ideal of the TV formulation, but with the adv antage that the smooth energy functional is more tractable numerically and can be minimized b y simple n umerical metho ds suc h as gradient descen t. The GL energy has been used to approximate the TV norm for image seg- men tation [4] and image inpainting [3,10]. F urthermore, a calculus on graphs equiv alen t to TV has b een introduced in [12,25]. Application of Diﬀuse In terface Mo dels to Graphs An undirected, w eighted neighborho o d graph is used to represent the local rela- tionships in the data set. This is a common tec hnique to segmen t classes that are not linearly separable. In the N -neigh b orho od graph mo del, eac h vertex v i ∈ V of the graph corresponds to a data p oin t with feature v ector x i , while the w eight w ij is a measure of similarit y b etw een v i and v j . Moreo ver, it satisﬁes the sym- metry prop ert y w ij = w j i . The neighborho o d is deﬁned as the set of N closest p oin ts in the feature space. Accordingly , edges exist b etw een eac h vertex and the v ertices of its N -nearest neigh b ors. F ollo wing the approach of [4], we calculate w eights using the lo cal scaling of Zelnik-Manor and P erona [29], w ij = exp  − || x i − x j || 2 τ ( x i ) τ ( x j )  . (3) Here, τ ( x i ) = || x i − x M i || deﬁnes a local v alue for eac h x i , where x M i is the p osition of the M th closest data p oin t to x i , and M is a global parameter. It is con v enient to express calculations on graphs via the graph Laplacian matrix, denoted b y L . The pro cedure w e use to build the graph Laplacian is as follo ws. 4 Garcia-Cardona et al. 1. Compute the similarity matrix W with comp onen ts w ij deﬁned in (3). As the neighborho od relationship is not symmetric, the resulting matrix W is also not symmetric. Make it a symmetric matrix b y connecting vertices v i and v j if v i is among the N -nearest neighbors of v j or if v j is among the N -nearest neighbors of v i [27]. 2. Deﬁne D as a diagonal matrix whose i th diagonal elemen t represents the degree of the v ertex v i , ev aluated as d i = X j w ij . (4) 3. Calculate the graph Laplacian: L = D − W . Generally , the graph Laplacian is normalized to guaran tee sp ectral con vergence in the limit of large sample size [27]. The symmetric normalized graph Laplacian L s is deﬁned as L s = D − 1 / 2 L D − 1 / 2 = I − D − 1 / 2 W D − 1 / 2 . (5) Data segmentation can now b e carried out through a graph-based formula- tion of the GL energy . T o implement this task, a ﬁdelity term is added to the functional as initially suggested in [11]. This enables the sp eciﬁcation of a priori information in the system, for example the kno wn labels of certain points in the data set. This kind of setup is called semi-sup ervised learning (SSL). The discrete GL energy for SSL on graphs can b e written as [4]: GL SSL ( u ) =  2 h u , L s u i + 1  X v i ∈ V Φ ( u ( v i )) + X v i ∈ V µ ( v i ) 2 ( u ( v i ) − ˆ u ( v i )) 2 (6) =  4 X v i ,v j ∈ V w ij u ( v i ) √ d i − u ( v j ) p d j ! 2 + 1  X v i ∈ V Φ ( u ( v i )) + X v i ∈ V µ ( v i ) 2 ( u ( v i ) − ˆ u ( v i )) 2 . (7) In the discrete formulation, u is a v ector whose comp onen t u ( v i ) represen ts the state of the vertex v i ,  > 0 is a real constan t characterizing the smo othness of the transition b etw een classes, and µ ( v i ) is a ﬁdelity w eight taking v alue µ > 0 if the lab el ˆ u ( v i ) (i.e. class) of the data point asso ciated with vertex v i is known b eforehand, or µ ( v i ) = 0 if it is not kno wn (semi-sup ervised). Minimizing the functional sim ulates a diﬀusion pro cess on the graph. The information of the few labels known is propagated through the discrete structure b y means of the smo othing term, while the p oten tial term clusters the vertices around the states ± 1 and the ﬁdelit y term enforces the kno wn labels. The energy minimization process itself attempts to reduce the in terface regions. Note that in the absence of the ﬁdelit y term, the pro cess could lead to a trivial steady-state solution of the diﬀusion equation, with all data p oin ts assigned the same lab el. The ﬁnal state u ( v i ) of each v ertex is obtained by thresholding, and the resulting homogeneous regions with lab els of +1 and − 1 constitute the t wo-class data segmen tation. Multiclass SSL on Graphs using Ginzburg-Landau F unctional Minimization 5 3 Multiclass Extension The double-w ell potential in the diﬀuse interface model for SSL drives the state of the system tow ards t wo deﬁnite lab els. Multiple-class segmen tation requires a more general p otenti al function Φ M ( u ) that allo ws clusters around more than t wo lab els. F or this purp ose, we use the p erio dic-w ell p oten tial suggested by Li and Kim [21], Φ M ( u ) = 1 2 { u } 2 ( { u } − 1) 2 , (8) where { u } denotes the fractional part of u , { u } = u − b u c , (9) and b u c is the largest in teger not greater than u . This p eriodic p oten tial well promotes a multiclass solution, but the graph Laplacian term in Equation (6) also requires mo diﬁcation for eﬀective calcula- tions due to the ﬁxed ordering of class lab els in the multiple class setting. The graph Laplacian term p enalizes large c hanges in the spatial distribution of the system state more than smaller gradual changes. In a multiclass framework, this implies that the p enalty for tw o spatially contiguous classes with diﬀeren t lab els ma y v ary according to the (arbitrary) ordering of the labels. This phenomenon is shown in Fig. 1. Supp ose that the goal is to segment the image into three classes: class 0 composed b y the blac k region, class 1 composed b y the gray region and class 2 comp osed by the white region. It is clear that the horizontal interfaces comprise a jump of size 1 (analogous to a tw o class segmen tation) while the v ertical in terface implies a jump of size 2. Accordingly , the smo othing term will assign a higher cost to the v ertical in terface, ev en though from the p oin t of view of the classiﬁcation, there is no speciﬁc reason for this. In this example, the problem cannot b e solved with a diﬀerent lab el assignment. There will alw ays b e an interface with higher costs than others indep enden t of the in tege r v alues used. Th us, the multiclass approach breaks the symmetry among classes, inﬂuenc- ing the diﬀuse interface evolution in an undesirable manner. Eliminating this in- con venience requires restoring the symmetry , so that the diﬀerence b et ween tw o classes is alwa ys the same, regardless of their lab els. This ob jectiv e is achiev ed b y introducing a new class diﬀerence measure. Fig. 1. Three-class segmen tation. Blac k: class 0. Gra y: class 1. White: class 2. 6 Garcia-Cardona et al. 3.1 Generalized Diﬀerence F unction The ﬁnal class labels are determined by thresholding each vertex u ( v i ), with the lab el y i set to the nearest in teger: y i =  u ( v i ) + 1 2  . (10) The boundaries b etw een classes then occur at half-integer v alues corresp ond- ing to the unstable equilibrium states of the p otential w ell. Deﬁne the function ˆ r ( x ) to represen t the distance to the nearest half-in teger: ˆ r ( x ) =     1 2 − { x }     . (11) A schematic of ˆ r ( x ) is depicted in Fig. 2. The ˆ r ( x ) function is used to deﬁne a generalized diﬀerence function b etw een classes that restores symmetry in the energy functional. Deﬁne the generalized diﬀerence function ρ as: ρ ( u ( v i ) , u ( v j )) =    ˆ r ( u ( v i )) + ˆ r ( u ( v j )) y i 6 = y j | ˆ r ( u ( v i )) − ˆ r ( u ( v j )) | y i = y j (12) Th us, if the vertices are in diﬀeren t classes, the diﬀerence ˆ r ( x ) betw een eac h state’s v alue and the nearest half-integer is added, whereas if they are in the same class, these diﬀerences are subtracted. The function ρ ( x, y ) corresponds to the tree distance (see Fig. 2). Strictly speaking, ρ is not a metric since it do es not satisfy ρ ( x, y ) = 0 ⇒ x = y . Nevertheless, the cost of interfaces betw een classes b ecomes the same regardless of class labeling when this generalized distance function is implemen ted. t     @ @ @ @ a a a Half-in teger In teger    @ @ ˆ r ( x ) Fig. 2. Schematic interpretation of generalized diﬀerence: ˆ r ( x ) measures distance to nearest half-integer, and ρ is a tree distance measure. The GL energy functional for SSL, using the new generalized diﬀerence func- tion ρ and the p eriodic p oten tial, is expressed as MGL SSL ( u ) =  2 X v i ∈ V X v j ∈ V w ij p d i d j [ ρ ( u ( v i ) , u ( v j )) ] 2 + Multiclass SSL on Graphs using Ginzburg-Landau F unctional Minimization 7 1 2  X v i ∈ V { u ( v i ) } 2 ( { u ( v i ) } − 1) 2 + X v i ∈ V µ ( v i ) 2 ( u ( v i ) − ˆ u ( v i )) 2 . (13) Note that the smo othing term in this functional is comp osed of an operator that is not just a generalization of the normalized symmetric Laplacian L s . The new smo othing op eration, written in terms of the generalized distance function ρ , constitutes a non-linear op erator that is a symmetrization of a diﬀerent nor- malized Laplacian, the random walk Laplacian L w = D − 1 L = I − D − 1 W [27]. The reason is as follo ws. The Laplacian L satisﬁes ( L u ) i = X j w ij ( u i − u j ) and L w satisﬁes ( L w u ) i = X j w ij d i ( u i − u j ) . No w replace w ij /d i in the latter express ion with the symmetric form w ij / p d i d j . This is equiv alent to constructing a rew eighted graph with w eights b w ij giv en b y: b w ij = w ij p d i d j . The corresp onding rew eighted Laplacian b L satisﬁes: ( b L u ) i = X j b w ij ( u i − u j ) = X j w ij p d i d j ( u i − u j ) , (14) and h u , b L u i = 1 2 X i,j w ij p d i d j ( u i − u j ) 2 . (15) While b L = b D − c W is not a standard normalized Laplacian, it does hav e the desirable prop erties of stability and consistency with increasing sample size of the data set, and of satisfying the conditions for Γ -con vergence to TV in the  → 0 limit [2]. It also generalizes to the tree distance more easily than does L s . Replacing the diﬀerence ( u i − u j ) 2 with the generalized diﬀerence [ ρ ( u i , u j )] 2 then gives the new smo othing multiclass term of equation (13). Empirically , this new term seems to perform well ev en though the normalization pro cedure diﬀers from the binary case. By implementing the generalized diﬀerence function on a tree, the cost of in terfaces b et ween classes becomes the same regardless of class lab eling. 8 Garcia-Cardona et al. 3.2 Computational Algorithm The GL energy functional giv en b y (13) ma y b e minimized iterativ ely , using gradien t desc en t: u n +1 i = u n i − dt  δ MGL SSL δ u i  , (16) where u i is a shorthand for u ( v i ), dt represents the time step and the gradient direction is giv en b y: δ MGL SSL δ u i =  ˆ R ( u n i ) + 1  Φ 0 M ( u n i ) + µ i ( u n i − ˆ u i ) (17) ˆ R ( u n i ) = X j w ij p d i d j  ˆ r ( u n i ) ± ˆ r ( u n j )  ˆ r 0 ( u n i ) (18) Φ 0 M ( u n i ) = 2 { u n i } 3 − 3 { u n i } 2 + { u n i } (19) The gradient of the generalized diﬀerence function ρ is not deﬁned at half in teger v alues. Hence, we mo dify the metho d using a greedy strategy: after de- tecting that a vertex changes class, the new class that minimizes the smo othing term is selected, and the fractional part of the state computed b y the gradient descen t up date is preserved. Consequen tly , the new state of v ertex i is the re- sult of gradien t descent, but if this causes a c hange in class, then a new state is determined. Sp eciﬁcally , let k represen t an in tege r in the range of the problem, i.e. k ∈ [0 , K − 1], where K is the n umber of classes in the problem. Giv en the fractional part { u } resulting from the gradien t descen t up date, ﬁnd the integer k that mini- mizes P j w ij √ d i d j [ ρ ( k + { u i } , u j ) ] 2 , the smoothing term in the energy functional, and use k + { u i } as the new vertex state. A summary of the pro cedure is shown Algorithm 1 Calculate u Require: , dt, N D , n max , K Ensure: out = u end for i = 1 → N D do u 0 i ← r and ((0 , K )) − 1 2 . If µ i > 0 , u 0 i ← ˆ u i end for for n = 1 → n max do for i = 1 → N D do u n +1 i ← u n i − dt   ˆ R ( u n i ) + 1  Φ 0 M ( u n i ) + µ i ( u n i − ˆ u i )  if Lab el( u n +1 i ) 6 = Lab el( u n i ) then ˆ k = arg min 0 ≤ k

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment