Scalable Semidefinite Relaxation for Maximum A Posterior Estimation

Qixing Huang (HUANGQX@STANFORD.EDU), Department of Computer Science, Stanford University, Stanford, CA 94305, USA
Yuxin Chen (YXCHEN@STANFORD.EDU), Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
Leonidas Guibas (GUIBAS@CS.STANFORD.EDU), Department of Computer Science, Stanford University, Stanford, CA 94305, USA

Abstract

Maximum a posteriori (MAP) inference over discrete Markov random fields is a fundamental task spanning a wide spectrum of real-world applications, and is known to be NP-hard for general graphs. In this paper, we propose a novel semidefinite relaxation formulation (referred to as SDR) to estimate the MAP assignment. Algorithmically, we develop an accelerated variant of the alternating direction method of multipliers (referred to as SDPAD-LR) that can effectively exploit the special structure of the new relaxation. Encouragingly, the proposed procedure allows solving SDR for large-scale problems, e.g., problems on a grid graph comprising hundreds of thousands of variables with multiple states per node. Compared with prior SDP solvers, SDPAD-LR is capable of attaining comparable accuracy while exhibiting remarkably improved scalability, in contrast to the commonly held belief that semidefinite relaxation can only be applied to small-scale MRF problems. We have evaluated the performance of SDR on various benchmark datasets, including OPENGM2 and PIC, in terms of both the quality of the solutions and computation time. Experimental results demonstrate that for a broad class of problems, SDPAD-LR outperforms state-of-the-art algorithms in producing better MAP assignments in an efficient manner.
1. Introduction

Computing the maximum a posteriori (MAP) assignment in a graphical model is a central inference task spanning a wide scope of scenarios (Wainwright & Jordan, 2008), ranging from traditional applications in graph matching, stereo reconstruction, object detection, error-correcting codes, gene mapping, etc., to a more recent application in estimating consistent object orientations from noisy pairwise measurements (Crandall et al., 2011). For general graphs, this problem is well-known to be NP-hard (Shimony, 1994). However, due in part to its importance in practice, a large body of algorithms have been proposed to approximate MAP estimates by solving various convex relaxation formulations.

Among those methods based on convex surrogates, semidefinite relaxation usually strictly dominates other formulations based on linear programming or quadratic programming in terms of solution quality. Despite its superiority in obtaining more accurate estimates, however, the most significant challenge limiting the applicability of any semidefinite relaxation paradigm on real problems is efficiency. So far, existing general-purpose SDP solvers can only handle problems of small dimensionality.

In this paper, we propose a novel semidefinite relaxation approach (referred to as SDR) for second-order MAP inference in pairwise undirected graphical models. Our key observation is that the marginalization constraints in a typical linear programming relaxation (cf. (Kumar et al., 2009)) can be subsumed by combining a semidefinite conic constraint with a small set of linear constraints.

(Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).)
As a result, SDR admits a concise set of nicely decoupled constraints, which allows us to develop an accelerated variant (referred to as SDPAD-LR) of the alternating direction method of multipliers (ADMM) that is scalable to very large-scale problems. On a standard PC, we have successfully applied SDR to dense problems of dimension (# states × # variables) up to five thousand, and to grid-structured problems with up to 10^5 variables, each with dozens of states per node.

Practically, SDPAD-LR performs remarkably well on a variety of problems. We have evaluated SDPAD-LR on two collections of benchmark datasets: OPENGM2 (Kappes et al., 2013a) and the Probabilistic Inference Challenge (PIC, 2011). Each benchmark consists of multiple categories of problems derived from various MAP estimation tasks. Experimental results demonstrate that SDPAD-LR outperforms the state-of-the-art algorithms in computational speed, while often obtaining better MAP estimates.

1.1. Background

There is a vast literature concerning MAP estimation over discrete undirected graphical models, and it is beyond the scope of this paper to discuss all existing algorithms. Interested readers are referred to (Wainwright & Jordan, 2008) for an in-depth introduction to this topic. In the following, we focus on methods that involve convex relaxation, which are the most relevant to our approach.

Many prior convex relaxation techniques are derived from the original graph structure underlying the MAP estimation problem, among which linear programming relaxation (LPR) methods (Chekuri et al., 2004; Wainwright et al., 2005) are the most popular. In addition to LPR, researchers have considered alternative convex relaxations, e.g., quadratic relaxation (QP-RL) (Ravikumar et al., 2010) and second-order cone relaxation (SOCP-MS) (Kumar et al., 2009). In the seminal work of (Kumar et al., 2009), the authors evaluate various convex relaxation approaches and assert that LPR dominates QP-RL and SOCP-MS. However, as will be shown later, LPR is in turn dominated by a standard SDP relaxation (Wainwright & Jordan, 2008), which is one of the main foci of this paper.

A recent line of approaches has aimed at obtaining tighter convex relaxations by incrementally adding higher-order interactions to enforce proper marginalization over groups of variables (Sontag et al., 2012; Komodakis & Paragios, 2008; Batra et al., 2011). Despite the practical success of these approaches, it remains an open problem to analyze their behavior, for example, to decide whether a polynomial number of clusters is sufficient.

There have been several attempts at applying semidefinite relaxation to obtain MAP assignments (Torr, 2003; Olsson et al., 2007; Wang et al., 2013; Peng et al., 2012). However, most of these methods are primarily designed for binary MAP estimation problems. In a recent work, (Peng et al., 2012) considered a general MAP estimation problem where each variable has multiple states. The key difference between the proposed formulation and that of (Peng et al., 2012) is that we utilize the semidefinite cone constraint to prune redundant linear marginalization constraints. This leads to a concise set of loosely decoupled constraints, which is important in developing effective optimization paradigms.

1.2. Notation

Before proceeding, we introduce a few notations that will be used throughout the paper. For any linear operator A, we let A^⋆ represent its conjugate operator. Denote by R^{N×M}_+ the set of N × M matrices with nonnegative entries, and by (·)_+ : R^{N×M} → R^{N×M}_+ the projection operator onto R^{N×M}_+.
For any symmetric matrix M, we use [M]_⪰0 to represent the projection of M onto the positive semidefinite cone. Finally, we denote by ‖X‖_F the Frobenius norm of a matrix X.

2. MAP Estimation and SDP Relaxation

We start with state configurations over n discrete random variables X = {x_1, ..., x_n}. Without loss of generality, assume that each x_i takes values in a discrete state set {1, ..., m}. Consider a pairwise Markov random field (MRF) G parameterized by the potentials (or sufficient statistics) w_i(x_i) for all vertices and w_ij(x_i, x_j) for all edges (i, j) ∈ G. The energy (or log-likelihood) associated with this MRF is given by

    f(X) = Σ_{i=1}^n w_i(x_i) + Σ_{(i,j)∈E} w_ij(x_i, x_j).    (1)

The goal of MAP estimation is then to compute the configuration of states that maximizes the energy, i.e., the most probable state assignment X_M.

2.1. Semidefinite Programming Relaxation (SDR)

MAP estimation over discrete sets is an NP-hard combinatorial problem, and can be cast as an integer quadratic program (IQP). Denote by x_i = (x_{i,1}, ..., x_{i,m})^⊤ ∈ {0,1}^m a binary vector such that x_{i,j} = 1 if and only if x_i = j. Then MAP estimation is equivalent to the following integer program:

    (IQP): maximize_{x ∈ {0,1}^{nm}}  Σ_{i=1}^n ⟨w_i, x_i⟩ + Σ_{(i,j)∈G} ⟨W_ij, x_i x_j^⊤⟩
           subject to  1^⊤ x_i = 1,  1 ≤ i ≤ n,    (2)

where w_i and W_ij encode the corresponding potentials. The hardness of the above IQP arises from two aspects: (i) x is binary-valued, and (ii) the objective function is a quadratic function of these binary variables. These motivate us to relax the constraints in some appropriate manner, leading to our semidefinite relaxation. In the sequel, we present the proposed relaxation in a step-by-step fashion.

1) In the same spirit as existing convex formulations (e.g., (Kumar et al., 2009; Peng et al., 2012)), we introduce a binary block matrix X := xx^⊤ ∈ {0,1}^{nm×nm} to accommodate the quadratic objective terms:

       X = [ Diag(x_1)   X_12        ...   X_1n
             X_12^⊤      Diag(x_2)   ...   ...
             ...         ...         ...   ...
             X_1n^⊤      ...         ...   Diag(x_n) ],

   which apparently exhibits the following property:

       X_ii = x_i x_i^⊤ = Diag(x_i),  1 ≤ i ≤ n.    (3)

2) The non-convex constraint X = xx^⊤ is then relaxed and replaced by X ⪰ xx^⊤, which by the Schur complement condition is equivalent to the following semidefinite conic constraint:

       [ 1   x^⊤
         x   X   ] ⪰ 0.    (4)

3) The binary constraints x ∈ {0,1}^{nm} and X ∈ {0,1}^{nm×nm} are replaced by the weaker linear constraints X ≥ 0. Note that the constraints 0 ≤ x ≤ 1 and X ≤ 1·1^⊤ are essentially subsumed by the constraints (2), (3), and (4) taken together. For the sake of numerical efficiency, we further relax the non-negativity constraint X ≥ 0 to

       X_ij ≥ 0,  (i, j) ∈ G.    (5)

   As we will see later, this relaxation is crucial in accelerating SDP solvers for large-scale problems.

Remark 1. The non-negativity constraints described in (5) are necessary, since otherwise SDR becomes loose even for submodular functions. Below is an example with 2 variables, each having 2 states:

    w_1 = [2, 0]^⊤,  w_2 = [-3, 0]^⊤,  W_12 = [ 0  2
                                                2  0 ].

It is clear that W_12 satisfies the submodularity property. However, the optimizer of SDR after dropping the constraint X_ij ≥ 0 is given by

    x_1 = (1/3)·[1, 2]^⊤,  x_2 = (1/9)·[8, 1]^⊤,  X_12 = (1/9)·[ 4  -1
                                                                 4   2 ],

which does not obey the non-negativity constraint on X.
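Remark 1's counterexample can be checked numerically. A minimal sketch (numpy only; the matrices are transcribed from the remark, and the assembled 5 × 5 lifted matrix is our own bookkeeping) verifies that the point is feasible for SDR without (5), yet carries a negative entry in X_12:

```python
import numpy as np

# Variables from Remark 1: n = 2 variables, m = 2 states each.
x1 = np.array([1.0, 2.0]) / 3.0
x2 = np.array([8.0, 1.0]) / 9.0
X12 = np.array([[4.0, -1.0], [4.0, 2.0]]) / 9.0

x = np.concatenate([x1, x2])
# Assemble the lifted matrix [[1, x^T], [x, X]] with X_ii = Diag(x_i).
X = np.block([[np.diag(x1), X12], [X12.T, np.diag(x2)]])
M = np.block([[np.ones((1, 1)), x[None, :]], [x[:, None], X]])

# Feasibility for SDR after dropping the non-negativity constraint (5):
assert np.linalg.eigvalsh(M).min() > -1e-8          # PSD constraint (4)
assert abs(x1.sum() - 1) < 1e-9 and abs(x2.sum() - 1) < 1e-9  # (2)
assert np.allclose(X12 @ np.ones(2), x1)            # marginalization holds
assert np.allclose(X12.T @ np.ones(2), x2)
# ...but the entry-wise constraint (5) is violated:
assert X12.min() < 0
```

Note that the off-diagonal block still marginalizes correctly (X_12·1 = x_1, X_12^⊤·1 = x_2), so only the entry-wise non-negativity rules this point out.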
The feasibility constraints (2), (3), (4) and (5) taken collectively give rise to the following semidefinite relaxation (SDR) formulation for MAP estimation:

    (SDR): maximize_{x, X}  Σ_{i=1}^n ⟨w_i, x_i⟩ + Σ_{(i,j)∈G} ⟨W_ij, X_ij⟩
           subject to  [ 1  x^⊤
                         x  X   ] ⪰ 0,    (6)
                       X_ii = Diag(x_i),  1 ≤ i ≤ n,    (7)
                       1^⊤ x_i = 1,  1 ≤ i ≤ n,    (8)
                       X_ij ≥ 0,  (i, j) ∈ G.    (9)

2.2. Comparison with Prior Relaxation Heuristics

2.2.1. Superiority over LP relaxations

Careful readers will remark that there might exist other convex constraints on X and x that we could enforce to tighten the proposed semidefinite relaxation. One alternative is the following set of marginalization constraints, which have been widely invoked in LP relaxations for MAP estimation:

    X_ij 1 = x_i,  X_ij^⊤ 1 = x_j,  1 ≤ i < j ≤ n.    (10)

Somewhat unexpectedly, these constraints turn out to be redundant, as asserted in the following theorem.

Theorem 1. Any feasible solution X to SDR (i.e., any X obeying the feasibility constraints of SDR) necessarily satisfies

    X_ij 1 = x_i,  X_ij^⊤ 1 = x_j,  1 ≤ i < j ≤ n.    (11)

Proof. See the supplemental material.

Intuitively, this property arises from the following features of x and X_ii:

    x_i^⊤ 1 = 1,  X_ii 1 = x_i,  X_ii^⊤ 1 = x_i,  1 ≤ i ≤ n.

These intrinsic properties are then propagated to all off-diagonal blocks by the semidefinite constraint.

2.2.2. Invariance under variable reparameterization

Pioneered by the beautiful relaxation proposed for the MAX-CUT problem (Goemans & Williamson, 1995), many SDP approaches developed for combinatorial problems employ the integer indicator y ∈ {-1, 1}^{nm}, related to x by x = (1/2)(1 + y), to parameterize binary variables (e.g., (Torr, 2003; Kumar et al., 2009)).
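As a sanity check on the constraint set (6)-(9) and the identities in Theorem 1, one can verify both on the rank-1 lift of an arbitrary hard assignment. This illustrates, but does not prove, the theorem; the sketch and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
# A hard assignment: each variable picks exactly one state.
states = rng.integers(0, m, size=n)
x = np.zeros(n * m)
for i, s in enumerate(states):
    x[i * m + s] = 1.0
X = np.outer(x, x)  # rank-1 lift X = x x^T

# Constraints (6)-(9) hold for the integral lift...
M = np.block([[np.ones((1, 1)), x[None, :]], [x[:, None], X]])
assert np.linalg.eigvalsh(M).min() > -1e-9            # (6)
for i in range(n):
    xi = x[i*m:(i+1)*m]
    assert np.allclose(X[i*m:(i+1)*m, i*m:(i+1)*m], np.diag(xi))  # (7)
    assert abs(xi.sum() - 1) < 1e-12                  # (8)
assert X.min() >= 0                                   # (9)

# ...and the marginalization identities (11) come for free:
for i in range(n):
    for j in range(n):
        Xij = X[i*m:(i+1)*m, j*m:(j+1)*m]
        assert np.allclose(Xij @ np.ones(m), x[i*m:(i+1)*m])
        assert np.allclose(Xij.T @ np.ones(m), x[j*m:(j+1)*m])
```

For the rank-1 lift the identities are immediate (X_ij = x_i x_j^⊤ and x_j^⊤ 1 = 1); Theorem 1 asserts they persist for every feasible, not necessarily integral, point.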
If one applies the matrix lifting Y = yy^⊤ and follows a similar relaxation procedure, the resulting semidefinite relaxation (referred to as SDR2) can be derived as follows:

    maximize_{y, Y}  Σ_{i=1}^n ⟨w̄_i, y_i⟩ + (1/2) Σ_{(i,j)∈G} ⟨W_ij, Y_ij⟩
    subject to  [ 1  y^⊤
                  y  Y   ] ⪰ 0,
                1^⊤ y_i = 2 − m,  1 ≤ i ≤ n,
                Y_ij + 1·y_j^⊤ + y_i·1^⊤ + 1·1^⊤ ≥ 0,  (i, j) ∈ G,
                (1·1^⊤ + y_i·1^⊤ + 1·y_i^⊤ + Y_ii) / 2 = Diag(1 + y_i),  1 ≤ i ≤ n,    (12)

where the w̄_i are defined as

    w̄_i = w_i + (1/2) ( Σ_{j:(i,j)∈G} W_ij 1 + Σ_{j:(j,i)∈G} W_ji^⊤ 1 ).

In fact, SDR2 is identical to SDR, as formally stated below.

Theorem 2. (x^⋆, X^⋆) is the solution to SDR if and only if

    y^⋆ := 2x^⋆ − 1,  Y^⋆ := 4X^⋆ − 2(x^⋆·1^⊤ + 1·x^{⋆⊤}) + 1·1^⊤

is the solution to SDR2.

Proof. See the supplemental material.

Despite the theoretical equivalence between SDR2 and SDR, from a numerical perspective, solving SDR2 is much harder than solving SDR. The difficulty arises from the complicated form of the linear constraints enforced by SDR2 (i.e., (12)). Note that the advantage of SDR2 is that all diagonal entries of Y equal 1:

    diag(Y_ii) = 2(1 + y_i) − 1 − y_i − y_i = 1,  1 ≤ i ≤ n.

Nevertheless, none of the prior SDP algorithms takes full advantage of this property in accelerating the computation.

3. Scalable Optimization Algorithm

The curse of dimensionality poses inevitable numerical challenges when applying general-purpose SDP solvers to SDR. Despite their superior accuracy, primal-dual interior point methods (IPM) such as SDPT3 (Toh et al., 1999) are limited to small-scale problems (e.g., nm < 150 on a regular PC). More scalable solvers such as CSDP (Helmberg & Rendl, 2000) and DSDP (Benson & Ye, 2008) propose to solve the dual problem.
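Theorem 2's change of variables can be checked mechanically. A sketch (numpy; we apply the transform to the rank-1 lift of a hard assignment, a feasible SDR point, and confirm that SDR2's constraints (12) hold for the image):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
# Any hard assignment and its rank-1 lift give a feasible SDR point.
states = rng.integers(0, m, size=n)
x = np.zeros(n * m)
for i, s in enumerate(states):
    x[i * m + s] = 1.0
X = np.outer(x, x)

one = np.ones(n * m)
# Theorem 2's change of variables.
y = 2 * x - one
Y = 4 * X - 2 * (np.outer(x, one) + np.outer(one, x)) + np.outer(one, one)

# SDR2's constraints hold for the transformed point:
for i in range(n):
    yi = y[i*m:(i+1)*m]
    Yii = Y[i*m:(i+1)*m, i*m:(i+1)*m]
    assert abs(yi.sum() - (2 - m)) < 1e-12      # 1^T y_i = 2 - m
    assert np.allclose(np.diag(Yii), 1.0)       # unit diagonal
    lhs = (np.ones((m, m)) + np.outer(yi, np.ones(m))
           + np.outer(np.ones(m), yi) + Yii) / 2
    assert np.allclose(lhs, np.diag(1 + yi))    # last constraint in (12)
# Lifted PSD constraint [[1, y^T], [y, Y]] >= 0:
Mp = np.block([[np.ones((1, 1)), y[None, :]], [y[:, None], Y]])
assert np.linalg.eigvalsh(Mp).min() > -1e-9
```

Expanding Y_ij + 1·y_j^⊤ + y_i·1^⊤ + 1·1^⊤ under the transform yields exactly 4·X_ij, so SDR2's non-negativity constraints are just (9) in disguise; only their algebraic form is more complicated.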
However, since the non-negativity constraints X_ij ≥ 0 produce numerous dual variables, these solvers are still far too restrictive for our program: none of them can solve SDR on a standard PC when nm exceeds 1000.

The limited scalability of interior point methods has inspired a flurry of activity in developing first-order methods, among which the alternating direction method of multipliers (ADMM) (Wen et al., 2010; Boyd et al., 2011) proves well suited for large-scale problems. In this section, we propose an efficient variant of ADMM, referred to as SDPAD-LR (SDP Alternating Direction method for Low Rank structure), which is tailored to the special structure of SDR (including low rank and sparsity) and enables us to solve problems of very large dimensionality.

3.1. Alternating Direction Augmented Lagrangian Method (ADMM)

For convenience of presentation, we denote (with slight abuse of notation)

    X := [ 1  x^⊤
           x  X   ],

and rewrite SDR in operator form, with the dual variable of each constraint listed on the right:

    minimize  ⟨C, X⟩
    subject to  A(X) = b,      (dual variable y)
                P(X) ≥ 0,      (dual variable z ≥ 0)
                X ⪰ 0,         (dual variable S ⪰ 0)    (13)

where C encodes all w_i and W_ij, A(X) = b collects the equality constraints, and P(X) gathers the element-wise non-negativity constraints. We let y, z, and S represent the dual variables of the respective constraints. In the sequel, we start by reviewing SDPAD, i.e., the original alternating direction method introduced in (Wen et al., 2010), and then present the key modification underlying the proposed efficient variant SDPAD-LR.

3.1.1. SDPAD: Procedures and Convergence

SDPAD considers the following augmented Lagrangian:

    L(y, z, S, X) = ⟨b, y⟩ + ⟨P^⋆(z) + S − C − A^⋆(y), X⟩ + (2μ)^{−1} ‖P^⋆(z) + S − C − A^⋆(y)‖_F^2,

where the penalty parameter μ controls the strength of the quadratic term. As suggested by (Boyd et al., 2011), we initialize μ with a small value and gradually increase it throughout the optimization process.

Let the superscript (k) indicate the value of a variable in the k-th iteration. Each iteration of SDPAD consists of a dual optimization step, followed by a primal update step given by

    X^{(k)} = X^{(k−1)} + ( P^⋆(z^{(k)}) + S^{(k)} − C − A^⋆(y^{(k)}) ) / μ.    (14)

Instead of jointly optimizing all dual variables, the key idea of SDPAD is to decouple the dual optimization step into several sub-problems or, more specifically, to optimize y, z, S in order with the other variables fixed. This leads to closed-form solutions for each sub-problem:

    y^{(k)} = (A A^⋆)^{−1} [ A( S^{(k−1)} − C + μ X^{(k−1)} ) − μ b ],
    z^{(k)} = [ P( C − S^{(k−1)} − μ X^{(k−1)} ) ]_+,
    S^{(k)} = [ C + A^⋆(y^{(k)}) − P^⋆(z^{(k)}) − μ X^{(k−1)} ]_⪰0.

Similar to (Wen et al., 2010), our stopping criterion involves measuring both the primal feasibility ‖A(X^{(k)}) − b‖ and the dual feasibility ‖μ(X^{(k)} − X^{(k−1)})‖.

Convergence property. In general, convergence properties of SDPAD are known only when equality constraints alone are present (Wen et al., 2010). However, the inequality constraints of SDR are special in the following two aspects: (i) they are element-wise non-negativity constraints, and (ii) they are essentially decoupled from the other linear constraints. Property (ii) arises because all equality constraints concern the diagonal blocks of X, while all linear inequality constraints are enforced only on its off-diagonal blocks. This special structure leads to theoretical convergence guarantees for SDPAD, as stated in the following theorem.

Theorem 3. The SDPAD method presented above converges to the optimizer of SDR.

Proof. See the supplemental material.

3.1.2. SDPAD-LR: Accelerated Method

Apparently, the most computationally expensive step of SDPAD is the update of S, which involves the eigendecomposition of an nm × nm matrix. This limits the applicability of SDPAD to large-scale problems (e.g., nm > 10^4). To bypass this numerical bottleneck, we modify SDPAD and present an efficient heuristic called SDPAD-LR, which exploits the low-rank structure of X.

First, we observe that S can be alternatively expressed as

    S^{(k)} = C + A^⋆(y^{(k)}) − P^⋆(z^{(k)}) − μ ( X^{(k−1)} − X^{(k)} ).

This allows us to present SDPAD without invoking S. The detailed steps of SDPAD can now be summarized as in Algorithm 1.

Algorithm 1: SDPAD for solving SDR
    input: k_max = 1000, ε = 10^{−4}, μ_min = 10^{−3}, ρ = 1.005.
    initialize: X^{(0)} = X^{(−1)} = 0, y^{(0)} = 0, z^{(0)} = 0.
    repeat
        X_temp^{(k)} = 2 X^{(k−1)} − X^{(k−2)}
        t_temp^{(k)} = (A A^⋆)^{−1} ( A(X_temp^{(k)}) − b )
        y^{(k)} = y^{(k−1)} + μ t_temp^{(k)}
        z^{(k)} = [ z^{(k−1)} − μ P(X_temp^{(k)}) ]_+
        X^{(k)} = [ X^{(k−1)} − ( C + A^⋆(y^{(k)}) − P^⋆(z^{(k)}) ) / μ ]_⪰0    (15)
        k ← k + 1;  μ = ρμ
    until min( μ ‖X^{(k)} − X^{(k−1)}‖_F, ‖A(X^{(k)}) − b‖ ) ≤ ε or k > k_max

It is straightforward to see that the bottleneck of Algorithm 1 lies in computing and storing the primal variable X. To derive an efficient solver, we make the assumption that the optimal solution X^⋆ is low-rank. This is motivated by the empirical evidence that, for a variety of problems (see the experimental section for details), SDR is exact, meaning rank(X^⋆) = 1. Moreover, in the general case, the rank of X^⋆ is expected to be much smaller than its dimension (e.g., (Burer & Monteiro, 2003)), i.e.,

    rank(X^⋆) ( rank(X^⋆) + 1 ) ≤ 2M,

where M is the number of constraints of SDR.^1
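The projection in (15) is a full eigendecomposition, which SDPAD-LR replaces with a rank-r computation as in (16). A sketch contrasting the two operations (scipy's `eigsh` stands in for the Lanczos process; the function names and the test matrix are ours):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def psd_project(V):
    """Full projection onto the PSD cone: O(N^3) eigendecomposition."""
    w, U = np.linalg.eigh(V)
    w = np.clip(w, 0.0, None)     # drop negative eigenvalues
    return (U * w) @ U.T

def psd_project_lowrank(V, r):
    """Rank-r surrogate: keep only the top-r eigenpairs (Lanczos-style)
    and return the factor Y with Y Y^T approximating the projection."""
    w, U = eigsh(V, k=r, which='LA')   # largest algebraic eigenvalues
    w = np.clip(w, 0.0, None)
    return U * np.sqrt(w)              # Y = U * Sigma_+^{1/2}

rng = np.random.default_rng(2)
N, r = 60, 4
# Build a symmetric matrix with exactly r positive eigenvalues (4, 3, 2, 1)
# and N - r negative ones, so both projections agree exactly.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
V = Q[:, :r] @ np.diag(np.arange(r, 0, -1, dtype=float)) @ Q[:, :r].T \
    - Q[:, r:] @ Q[:, r:].T

Y = psd_project_lowrank(V, r)
assert np.allclose(Y @ Y.T, psd_project(V), atol=1e-6)
```

When V has more than r positive eigenvalues, the rank-r factor captures only the top ones; that gap is precisely the heuristic step that motivates doubling r in Algorithm 2.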
Based on this assumption, the key idea of SDPAD-LR is to invoke a low-rank matrix Y ∈ R^{(nm+1)×r} for some small r and encode X = Y Y^⊤ throughout the iterative process. This allows us to keep all the variables in memory even for large-scale problems. In this case, (15) is modified to

    Y^{(k)} = U^{(k)} [Σ_+]^{1/2},

where Σ = Diag(σ_1, ..., σ_r) and U = (u_1, ..., u_r) collect the top r eigenvalues and corresponding eigenvectors of

    V^{(k)} = Y^{(k−1)} Y^{(k−1)⊤} − ( C + A^⋆(y^{(k)}) − P^⋆(z^{(k)}) ) / μ.    (16)

Although V^{(k)} is a dense matrix, its top eigenvectors can be computed efficiently using the Lanczos process (Cullum & Willoughby, 2002), whose efficiency is dictated by the complexity of the matrix-vector multiplication operator V^{(k)}: u ∈ R^{nm+1} → V^{(k)} u ∈ R^{nm+1}. As SDR only involves the constraints X_ij ≥ 0, (i, j) ∈ E, the matrix C + A^⋆(y^{(k)}) − P^⋆(z^{(k)}) turns out to share the same sparsity pattern as G. Thus, the complexity of computing V^{(k)} u is at most O(nmr^2 + m^2 |E|).

Theoretically, it is extremely challenging to derive an upper bound on r that ensures the exactness of the modified algorithm. To address this issue, we design SDPAD-LR so that it iteratively doubles the value of r and reapplies the modified algorithm until it returns the optimal solution. For most of our experiments, we found that r = 8 is sufficient.

^1 Practically, many non-negativity constraints are redundant.

Algorithm 2: SDPAD-LR for solving SDR
    input: k_max = 5000, ε = 10^{−4}, μ_min = 10^{−3}, ρ = 1.005, δ = 10^{−2}, r_max = 32, r = 4.
    initialize: X^{(0)} = X^{(−1)} = 0, y^{(0)} = 0, z^{(0)} = 0.
    repeat
        X_temp^{(k)} = 2 X^{(k−1)} − X^{(k−2)}
        t_temp^{(k)} = (A A^⋆)^{−1} ( A(X_temp^{(k)}) − b )
        y^{(k)} = y^{(k−1)} + μ t_temp^{(k)}
        z^{(k)} = [ z^{(k−1)} − μ P(X_temp^{(k)}) ]_+
        Compute X^{(k)} according to (16)
        k ← k + 1;  μ = ρμ
        if mod(k, 1000) = 0 and λ_min(X^{(k)}) > δ λ_max(X^{(k)}) then
            r = min(r_max, 2r);  μ = μ_min
        end if
    until k > k_max, or λ_min(X^{(k)}) ≤ δ λ_max(X^{(k)}) and min( μ ‖X^{(k)} − X^{(k−1)}‖_F, ‖A(X^{(k)}) − b‖ ) ≤ ε

The pseudo-code of SDPAD-LR is summarized in Algorithm 2.

3.2. Iterative Rounding

Similar to other ADMM methods (Boyd et al., 2011), SDPAD-LR converges rapidly to moderate accuracy within the first 400 iterations, and slows down significantly afterwards. Thus, rather than continuing until SDPAD-LR converges, it is more efficient to shrink the problem size by fixing those variables whose optimal states are likely to have been revealed. Specifically, after each round of SDPAD-LR, we fix the state of a variable x_i to j if x_{i,j} > t_max (with t_max = 0.99 for all the examples) or x_{i,j} = max_{1≤i≤n, 1≤j≤m} x_{i,j}. We then reapply the iterative procedure to the reduced problem. In practice, we find that, due to the tightness of SDR, the reduced problems are significantly smaller than the original problem, and one iterative rounding pass is usually sufficient.

4. Experimental Results

In this section, we evaluate SDPAD-LR on several benchmark datasets and compare its performance against existing SDP solvers and state-of-the-art MAP inference algorithms.

4.1. Benchmark Datasets

We perform experimental evaluation on MAP estimation problems from three popular benchmark datasets (see Table 2): OPENGM2 (Kappes et al., 2013a), PIC (PIC, 2011), and a new dataset ORIENT for the task of estimating consistent camera orientations (Crandall et al., 2011).

Table 2. Statistics of the datasets evaluated in this paper. G: graph structure of the MAP problems in each category; n: number of variables; m: number of states; probs: number of instances; t: average running time of SDPAD-LR.

    category      G        n        m       probs   t
    PIC-Object    full     60       11-21   37      5m32s
    PIC-Folding   mixed    2K       2-503   21      21m42s
    PIC-Align     dense    30-400   20-93   19      37m63s
    GM-Label      sparse   1K       7       324     6m32s
    GM-Char       sparse   5K-18K   2       100     1h13m
    GM-Montage    grid     100K     5, 7    3       9h32m
    GM-Matching   dense    19       19      4       2m21s
    ORIENT        sparse   1K       16      10      10m21s

OPENGM2 comprises 19 categories of mostly sparse MAP problems. We choose four representative categories for evaluation: Geometric Surface Labeling (GM-Label), Chinese Characters (GM-Char), MRF Photomontage (GM-Montage) and Matching (GM-Matching). The first three categories, GM-Label, GM-Char and GM-Montage, are sparse MAP estimation problems of increasing scale. GM-Matching is a special category where our convex relaxation is not tight. PIC comprises 10 categories of MAP inference problems of various structures. As we already include sparse MAP inference problems from OPENGM2, we pick 3 representative dense categories from PIC: Object Detection (PIC-Object), Image Alignment (PIC-Align) and Folding (PIC-Folding).

4.2. SDP Solver Evaluation

Baseline algorithms. We evaluate the proposed SDPAD-LR against the following existing large-scale SDP solvers.

• SDPAD: the original ADMM method presented in (Wen et al., 2010).
• SDPNAL: the Newton-CG (conjugate gradient) augmented Lagrangian method proposed in (Zhao et al., 2010).
• IPM-NC: the nonconvex interior point method that attempts to solve a direct relaxation of the MAP inference problem (Burer & Monteiro, 2003):

      minimize ⟨C, xx^⊤⟩  subject to  1^⊤ x_i = 1,  x_i ≥ 0,  1 ≤ i ≤ n.

  This method serves as an alternative low-rank heuristic to the proposed SDPAD-LR. Without loss of generality, we set the initial values to x_i = (1/m)·1, 1 ≤ i ≤ n.
• MOSEK: a cutting-edge interior point method. To apply it to large-scale SDRs, we add the non-negativity constraints in an incremental fashion, i.e., at each iteration, we detect the 100 smallest negative entries and add them to the constraint set.
• MUL-Update: an approximate online SDP solver based on multiplicative weight updates (Arora et al., 2012).

Table 1. Comparison of SDP solvers on representative problems. N: dimension of the matrix; M: number of constraints. Each entry reports cpu time / duality gap (gap) / primal-dual infeasibility (inf).

    deer_0034.K10.F100 (dense), N = 661, M = 218791:
        SDPAD-LR    4:33      / 7.2e-4 / 1.3e-6
        SDPAD       8:29      / 8.2e-5 / 4.3e-7
        SDPNAL      10:55     / 8.1e-5 / 1.3e-6
        IPM-NC      1:27      / 2.3e3  / na
        MOSEK       21:33:10  / 2.3e-6 / 1.3e-9
        MUL-Update  6:13:56   / 8.1e-3 / 2.7e-5
    file_30markers (sparse), N = 862, M = 218791:
        SDPAD-LR    7:33      / 2.2e-4 / 5.3e-6
        SDPAD       10:33     / 9.4e-5 / 1.3e-7
        SDPNAL      9:42      / 6.2e-5 / 2.1e-6
        IPM-NC      2:37      / 4.1e-7 / na
        MOSEK       na;  MUL-Update  na
    folding_2BE6 (dense), N = 3836, M = 218791:
        SDPAD-LR    2:44:36   / 2.3e-4 / 5.3e-7
        SDPAD       25:56:37  / 2.3e-4 / 3.7e-6
        SDPNAL      18:33:11  / 5.2e-5 / 4.7e-7
        IPM-NC      10:23     / 4.5e2  / na
        MOSEK       na;  MUL-Update  na
    gm275 (sparse), N = 5201, M = 218791:
        SDPAD-LR    21:33     / 5.1e-4 / 1.3e-6
        SDPAD       41:33:21  / 1.2e-4 / 3.1e-6
        SDPNAL      21:34:35  / 9.7e-5 / 4.5e-7
        IPM-NC      21:56     / 3.5e-6 / na
        MOSEK       na;  MUL-Update  na

Problem sets. For evaluation, we consider four categories on which most baseline algorithms are applicable: PIC-Object, PIC-Align, PIC-Folding and GM-Label. For simplicity, we pick a representative problem from each category. The dimensions of these problem sets range from 600 to 5000, and they contain both dense and sparse problems (see Table 1).
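The duality gap and infeasibility measures reported in Table 1 (and defined precisely in the evaluation protocol below) can be computed along the following lines; the 3 × 3 toy problem and the operator `A_op` are hypothetical:

```python
import numpy as np

def duality_gap(b, y, C, X):
    """Normalized duality gap, as reported in Table 1."""
    p, d = float(np.sum(C * X)), float(np.dot(b, y))
    return abs(d - p) / (1.0 + abs(d) + abs(p))

def primal_infeasibility(A_op, b, X):
    """||A(X) - b|| / (1 + ||b||): equality-constraint residual."""
    r = A_op(X) - b
    return np.linalg.norm(r) / (1.0 + np.linalg.norm(b))

# Toy usage: one variable with two states, constraint 1^T x = 1.
C = np.diag([0.0, 2.0, 0.0])    # lifted cost matrix (hypothetical)
X = np.diag([1.0, 1.0, 0.0])    # lifted solution [[1, x^T], [x, X]]
b = np.array([1.0])
y = np.array([2.0])             # dual variable with b.y = <C, X>
A_op = lambda X: np.array([X[1, 1] + X[2, 2]])  # A(X) = sum of x entries
assert primal_infeasibility(A_op, b, X) < 1e-12
assert duality_gap(b, y, C, X) < 1e-12
```

A genuine solver would also report the dual residual ‖C + A^⋆(y) − P^⋆(z) − S‖_F / (1 + ‖C‖_F); for brevity the sketch covers only the gap and the primal part.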
Evaluation protocol. Following the standard protocol for assessing convex programs, we evaluate the duality gap and the primal/dual infeasibility of each algorithm:

    gap = |⟨b, y⟩ − ⟨C, X⟩| / (1 + |⟨b, y⟩| + |⟨C, X⟩|),

    inf = max{ ( ‖A(X) − b‖_2 + ‖min(P(X), 0)‖_2 ) / (1 + ‖b‖_2),
               ‖C + A^⋆(y) − P^⋆(z) − S‖_F / (1 + ‖C‖_F) }.

As IPM-NC solves a different optimization problem, we instead report the gap between its optimal solutions and the ground-truth optimal solutions.

Analysis of results. We run each algorithm until the duality gap falls below 10^{−4} or the maximum number of iterations is reached. Table 1 shows the running time, duality gap and maximum primal/dual infeasibility of each algorithm on each problem. We can see that SDPAD-LR generates results comparable to SDPAD and SDPNAL. However, SDPAD-LR turns out to be remarkably more efficient than SDPAD and SDPNAL on large-scale or sparse datasets. This is because SDPAD-LR only requires computing the top eigenvalues, which is both memory- and computation-efficient.

Both interior point methods (i.e., IPM-NC and MOSEK) have provable guarantees to generate more accurate results than the other methods. However, MOSEK does not scale to large datasets, as reported in Table 1. IPM-NC is scalable to large-scale problems, as the number of variables involved is small. However, as IPM-NC solves a non-convex optimization problem, it may easily get trapped in local minima (e.g., on deer_0034.K10.F100 and folding_2BE6).

Finally, the multiplicative weight update method MUL-Update turns out to be inefficient at solving SDRs of MAP inference problems. This is because MUL-Update is an approximate solver and requires a large number of iterations to obtain an accurate solution.

4.3. MAP Inference Evaluation

Experimental setup.
We compare SDR with the top-performing algorithms from OPENGM2 (Kappes et al., 2013a). These algorithms include (i) BRAOBB (Otten & Dechter, 2012), which is based on combinatorial search, (ii) α-expansion (Szeliski et al., 2008), a move-making method, (iii) MCBC (Kappes et al., 2013b), which is based on a highly optimized max-cut solver, (iv) TRWS-LF2 (Kolmogorov, 2006), tree-reweighted message passing, (v) ogm-TRBP, tree-reweighted belief propagation (Szeliski et al., 2008), and (vi) ficolofo (Cooper et al., 2010), the top-performing method on the dense problems of PIC.

We use two measures to assess the performance of each method. The first measure evaluates, for each method, the mean objective value f of the resulting MAP assignments on each category. For consistency with (Kappes et al., 2013a), we report −f, meaning that the smaller the value, the better the algorithm. The second measure reports the percentage of instances on which each method achieves the best solution among all existing methods (not necessarily the global optimum). The higher the percentage, the better the algorithm.

Performance. Table 3 summarizes the performance of SDPAD-LR versus state-of-the-art MAP inference algorithms on each type of problem. In each block, the top element describes −f for each method on each category, and the bottom element describes the percentage of obtaining the best solution. We can see that the overall performance of SDPAD-LR is superior to every other individual algorithm. Except on GM-Matching, SDPAD-LR is the top performer on every other dataset. In contrast, each existing method either does not apply or generates poor results on one or several datasets. This shows the advantage of solving a strong convex relaxation of the MAP inference problem. Below we break down the performance on each benchmark.

• ORIENT.
Table 3. Results on benchmark datasets. For each category, the first row gives −f and the second row gives the percentage of obtaining the best solution.

              SDPAD-LR    Ficolofo    BRAOBB      α-expand    TRWS-LF2    ogm-TRBP    MCBC        A-star
ORIENT        -7834.6     na          -3059.2     -7695.4     -7592.4     -7553.8     na          na
              100%        na          0%          0%          0%          0%          na          na
PIC-Object    -19316.12   -19308.94   -19113.87   -10106.8    -19020.82   -18900.81   na          na
              97.3%       91.9%       24.3%       0%          59.5%       32.2%       na          na
PIC-Folding   -5963.68    -5963.68    -5927.01    -5652.76    -5905.01    -5907.24    na          na
              100%        100%        42.9%       14.2%       38.1%       42.9%       na          na
PIC-Align     2285.23     2285.34     2285.34     2285.34     2286.64     2289.12     na          na
              100%        90%         90%         90%         80%         70%         na          na
GM-Label      -476.95     na          na          -476.95     -476.95     486.42      na          na
              100%        na          na          100%        99.67%      40%         na          na
GM-Char       -59550.67   na          na          na          -49519.44   -49507.98   -49550.10   na
              86.1%       na          na          na          11%         6%          89.1%       na
GM-Montage    168298.00   na          na          168220.00   735193.0    235611.00   na          na
              66.3%       na          na          33.3%       0%          0%          na          na
GM-Matching   44.19       na          21.22       na          32.38       5.5e10      na          21.22
              0%          na          100%        na          0%          0%          na          100%

• ORIENT. SDPAD-LR is the leading method on ORIENT. The problems in ORIENT exhibit a specific structure, i.e., the pairwise potentials consist of approximately shifted permutation matrices. Experimentally, we found that SDR is usually tight on these problems. This explains the superior performance of SDPAD-LR. In contrast, linear programming relaxations are not tight on ORIENT, and thus TRBP and TRWS only deliver moderate performance. Moreover, this structural pattern leads to huge search spaces for combinatorial algorithms (e.g., BRAOBB), and they can easily get stuck in local optima.

• Dense problems. SDPAD-LR also outperforms the other methods on the three dense categories from PIC. It achieves the best mean energy value as well as the highest percentage of obtaining the best solution. This again arises because SDR is tight on these problems.

• Sparse problems. SDR yields results comparable to state-of-the-art algorithms on the three sparse categories from OPENGM2. GM-Label consists of problems where the standard LP relaxation is tight.
On GM-Char, which consists of large-scale binary problems, SDR is comparable to MCBC in the sense that SDR achieves a better mean energy value while MCBC attains a higher percentage of being the best solution. This arises because MCBC is a highly optimized solver designed for binary quadratic problems. On the other hand, SDPAD-LR is only an approximate SDP solver which, in some cases, may not converge to the global optimum due to numerical issues.

• GM-Matching. SDR only yields moderate results on GM-Matching. This occurs because SDR is not tight on GM-Matching. In contrast, as GM-Matching is a small-scale problem, combinatorial optimization techniques such as BRAOBB and A-star are capable of finding globally optimal solutions.

Running Times. The running time of SDPAD-LR (including the rounding procedure) is of the same scale as that of other convex relaxation techniques. As shown in Table 2, our preliminary Matlab implementation takes less than 10 minutes on small-scale problems (i.e., those in PIC-Object, GM-Matching and PIC-Label). On medium-size problems, i.e., those in PIC-Folding, PIC-Align, GM-Char and ORIENT, the running time of SDPAD-LR ranges from 20 minutes to 1 hour. On large-scale problems from GM-Montage, SDPAD-LR takes around 8 hours per problem. However, there is still huge room for improvement. One alternative is to use the eigenvalues computed in the previous iteration to accelerate the eigen-decomposition at the current iteration, which is left for future work.

5. Conclusions

In this paper, we have presented a novel semidefinite relaxation for second-order MAP estimation and proposed an efficient ADMM solver. We have extensively compared the proposed SDP solver with various state-of-the-art SDP solvers.
Experimental results confirm that our SDP solver is much more scalable than prior approaches when applied to various MAP estimation problems, which enables us to apply SDR on large-scale datasets. Owing to the power of semidefinite relaxation, SDR proves superior to other top-performing MAP inference algorithms on a variety of benchmark datasets.

There are plenty of opportunities for future research. First, we would like to extend SDR to higher-order MAP problems. Moreover, it would be interesting to integrate SDR and combinatorial optimization techniques, which has the potential to boost the power of both. From the theoretical side, theoretical support for exact estimation with SDR would be one exciting direction for investigation. This would offer justification of the presented low-rank heuristic. On the other hand, as many combinatorial optimization problems can be formulated as MAP inference problems, such exact estimation conditions can shed light on the original combinatorial optimization problems.

Acknowledgments

This work has been supported in part by NSF grants FODAVA 808515 and CCF 1011228, AFOSR grant FA9550-12-1-0372, ONR MURI N00014-13-1-0341, and a Google research award.

References

Probabilistic inference challenge, 2011. http://www.cs.huji.ac.il/project/PASCAL/index.php.

Arora, Sanjeev, Hazan, Elad, and Kale, Satyen. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Batra, D., Nowozin, S., and Kohli, P. Tighter relaxations for MAP-MRF inference: A local primal-dual gap based separation algorithm. AISTATS'11, 15:146–154, 2011.

Benson, S. and Ye, Y. DSDP5: software for semidefinite programming. ACM Trans. Math. Softw., 34(3):16:1–16:20, May 2008.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J.
Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Burer, Samuel and Monteiro, Renato D. C. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program., 95(2):329–357, 2003.

Chekuri, C., Khanna, S., Naor, J., and Zosin, L. A linear programming formulation and approximation algorithms for the metric labeling problem. SIAM J. Discrete Math., 18(3):608–625, 2004.

Cooper, M. C., de Givry, S., Sanchez, M., Schiex, T., Zytnicki, M., and Werner, T. Soft arc consistency revisited. Artif. Intell., 174(7-8):449–478, 2010.

Crandall, D., Owens, A., Snavely, N., and Huttenlocher, D. SfM with MRFs: discrete-continuous optimization for large-scale structure from motion. CVPR'11, pp. 3001–3008, 2011.

Cullum, J. K. and Willoughby, R. A. Lanczos Algorithms for Large Symmetric Eigenvalue Computations. Number 41. SIAM, 2002.

Goemans, M. and Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. JACM, 1995.

Helmberg, C. and Rendl, F. A spectral bundle method for semidefinite programming. SIAM Journal on Optimization, 10(3):673–696, 2000.

Kappes, J. H., Andres, B., Hamprecht, F. A., Schnörr, C., Nowozin, S., Batra, D., Kim, S., Kausler, B. X., Lellmann, J., Komodakis, N., and Rother, C. A comparative study of modern inference techniques for discrete energy minimization problems. In CVPR'13, June 2013a.

Kappes, J. H., Speth, M., Reinelt, G., and Schnörr, C. Towards efficient and exact MAP-inference for large scale discrete computer vision problems via combinatorial optimization. In CVPR, 2013b.

Kolmogorov, V. Convergent tree-reweighted message passing for energy minimization. IEEE PAMI
, 28:1568–1583, October 2006.

Komodakis, N. and Paragios, N. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In ECCV (3), pp. 806–820, 2008.

Kumar, M., Kolmogorov, V., and Torr, P. An analysis of convex relaxations for MAP estimation of discrete MRFs. JMLR, 10:71–106, 2009.

Olsson, C., Eriksson, A., and Kahl, F. Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In CVPR'07, 2007.

Otten, Lars and Dechter, Rina. Anytime and/or depth-first search for combinatorial optimization. AI Commun., 25(3):211–227, 2012.

Peng, Jian, Hazan, Tamir, Srebro, Nathan, and Xu, Jinbo. Approximate inference by intersecting semidefinite bound and local polytope. In AISTATS, pp. 868–876, 2012.

Ravikumar, P., Agarwal, A., and Wainwright, M. J. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. The Journal of Machine Learning Research, 11:1043–1080, 2010.

Shimony, S. E. Finding MAPs for belief networks is NP-hard. Artif. Intell., 68(2):399–410, August 1994.

Sontag, D., Meltzer, T., Globerson, A., Jaakkola, T. S., and Weiss, Y. Tightening LP relaxations for MAP using message passing. arXiv preprint arXiv:1206.3288, 2012.

Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., and Rother, C. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. PAMI, 30(6):1068–1080, June 2008.

Toh, K. C., Todd, M. J., and Tutuncu, R. H. SDPT3 – a Matlab software package for semidefinite programming. Opt. Methods and Software, 11(12):545–581, 1999.

Torr, Philip. Solving Markov random fields using semidefinite programming. In AISTATS'03, 2003.

Wainwright, M., Jaakkola, T., and Willsky, A.
MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. Info. Theory, 2005.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 2008.

Wang, P., Shen, C., and van den Hengel, A. A fast semidefinite approach to solving binary quadratic problems. In CVPR'13, pp. 1312–1319, 2013.

Wen, Z., Goldfarb, D., and Yin, W. Alternating direction augmented Lagrangian methods for semidefinite programming. Math. Prog. Comp., 2(3-4):203–230, 2010.

Zhao, Xin-Yuan, Sun, Defeng, and Toh, Kim-Chuan. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. on Optimization, 20(4):1737–1765, January 2010. ISSN 1052-6234.