Tight Error Bounds for Structured Prediction


Authors: Amir Globerson, Tim Roughgarden, David Sontag, Cafer Yildirim

Amir Globerson, Tim Roughgarden, David Sontag, Cafer Yildirim

November 23, 2021

Abstract

Structured prediction tasks in machine learning involve the simultaneous prediction of multiple labels. This is typically done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise elements, each depending on two specific labels. Intuitively, the more pairwise terms are used, the better the expected accuracy. However, there is currently no theoretical account of this intuition. This paper takes a significant step in this direction. We formulate the problem as classifying the vertices of a known graph G = (V, E), where the vertices and edges of the graph are labelled and correlate semi-randomly with the ground truth. We show that the prospects for achieving low expected Hamming error depend on the structure of the graph G in interesting ways. For example, if G is a very poor expander, like a path, then large expected Hamming error is inevitable. Our main positive result shows that, for a wide class of graphs including 2D grid graphs common in machine vision applications, there is a polynomial-time algorithm with small and information-theoretically near-optimal expected error. Our results provide a first step toward a theoretical justification for the empirical success of the efficient approximate inference algorithms that are used for structured prediction in models where exact inference is intractable.

1 Introduction

An increasing number of problems in machine learning are being solved using structured prediction [13, 30, 37]. Examples of structured prediction include dependency parsing for natural language processing, part-of-speech tagging, named entity recognition, and protein folding.
In this setting, the input X is some observation (e.g., an image, a sentence) and the output is a set of labels (e.g., whether each pixel in the image is foreground or background, or the parse tree for the sentence). The advantage of performing structured prediction is that one can specify features that encourage sets of labels to take some value (e.g., a feature that encourages two neighboring pixels to take different foreground/background states whenever there is a big difference in their colors). The feature vector can then be used within an exponential family distribution over the space of labels, conditioned on the input. The parameters are learned using maximum likelihood estimation (as with conditional random fields [30]) or using structured SVMs [2, 37].

In the applications above, performance is typically quantified as the discrepancy between the correct "ground truth" labels Y and the predicted labels Ŷ. The most common performance measure, which we study in this paper, is Hamming error, the number of disagreements between Y and Ŷ. The optimal decision strategy for minimizing Hamming error is to use marginal inference, namely Ŷ_i ← argmax_{Y_i} p(Y_i | X) for each i, where p is the true generating distribution. However, in practice MAP inference is more often used: namely, the assignment maximizing p(Y | X) is returned. One advantage of using MAP inference is computational, as the partition function (normalization constant) no longer needs to be estimated during training or at test time. However, in the worst case, even MAP inference can be NP-hard, such as for binary pairwise Markov random fields with arbitrary potential functions.
It is now widely understood from a practical perspective that better performance (measured in terms of Hamming error) can be obtained by using a more complex model incorporating a strong set of features than by using a simple model for which exact inference can be performed. Despite the worst-case intractability of inference in these models, heuristic MAP inference algorithms often work well in practice, including those based on linear programming relaxations and dual decomposition [29, 35], policy-based search [16], graph cuts [28], and branch-and-bound [36]. By "work well in practice", we mean that they obtain high accuracy predicting the true labels on test data, measured in terms of the actual loss function of interest such as Hamming error.

However, the theoretical understanding of this setup is fairly limited. For example, for many applications even the state-of-the-art structured prediction models are unable to achieve zero labeling error, and there is no characterization of the choice of feature sets and the generative settings for which high prediction accuracy can be expected, even ignoring computational limitations. Moreover, the good performance of these heuristic algorithms indicates that real-world instances are far from the theoretical worst case, and it is a major open problem to better characterize the complexity of inference problems so as to distinguish those that are in fact easy to solve from those that are computationally intractable. Finally, it is not well understood why MAP inference can provide such good results for these structured prediction problems and how much accuracy is lost relative to marginal inference.

The goal of this paper is to initiate the theoretical study of structured prediction for obtaining small Hamming error. Such an analysis must define a generative process for the X, Y pairs, in order to properly define expected Hamming error.
Our model assumes that the observed X is a noisy version of Y in the following sense: X_i is a noisy version of the true Y_i, and X_{i,j} is a noisy version of the variable I[Y_i = Y_j]. The resulting posterior for Y given X is then very similar to the data and smoothness terms used for structured prediction in machine vision. Motivated by machine vision applications, we also focus on the case where the i, j pairs correspond to a two-dimensional grid graph [38]. We also provide results for classes of non-grid and non-planar graphs.

As noted earlier, prediction is often performed by taking marginals of the posterior or its maximum. Both of these turn out to be computationally intractable in our setting. We are thus also interested in analyzing algorithms that are polynomial time and have guarantees on the expected Hamming error. Our main result is that there exists a polynomial-time algorithm that achieves the information-theoretic lower bound on the expected Hamming error, and is thus optimal (up to multiplicative constants). The algorithm is a two-step procedure which ignores the node evidence in the first step, solving a MaxCut problem on a grid (which can be done in polynomial time), and in the second step uses node observations to break symmetry. We use combinatorial arguments to provide a worst-case upper bound on the error of this algorithm. Our analysis is validated via experimental results on 2D grid graphs.

2 Related Work

Our goal is to recover a set of unobserved variables Y from a set of noisy observations X. As such, it is related to various statistical recovery settings, but distinct from those in several important aspects. Below we review some of the related problems.

Channel Coding: This is a classic recovery problem (e.g., see [4]) where the goal is to exactly recover Y (i.e., with zero error).
Here Y is augmented with a set of "error-correcting" bits, deterministic functions of Y, and the complete set of bits is sent through a noisy channel. In our model, X_{i,j} is a noisy version of the parity of Y_i and Y_j. Thus our setting may be viewed as communication with an error-correcting code where each error-correcting bit involves two bits of the original message Y, and each Y_i appears in d_i check bits, where d_i is the number of edge observations involving Y_i. Such codes cannot be used for errorless transmission (e.g., see our lower bound in Section 4). As a result, the techniques and results from channel coding do not appear to apply to our setting.

Correlation Clustering (CC): There are numerous variants of this problem, but in the typical setting Y is a partition of N variables into an unknown number of clusters, and X_{u,v} specifies whether Y_u and Y_v are in the same cluster (with some probability of error as in [25], or adversarially as in [32]). The goal is to find Y from X. Most CC works assume an unrestricted number of clusters [7, 25], although a few consider a fixed number of clusters (e.g., see [21]). Our results apply to the case of two clusters. The most significant difference is that most of the CC works study the objective of minimizing the number of edge disagreements. It is not obvious how to translate the guarantees provided in these works into a non-trivial bound on Hamming error (i.e., the number of node disagreements) for our analysis framework.

Approximately Stable Clusterings: Work on approximation stability, initiated by Balcan et al. [5] and Bilu and Linial [9], also seeks polynomial-time algorithms with low Hamming error with respect to a ground truth clustering. Instead of assuming that the input is derived from the ground truth by a random process, these papers make an incomparable assumption: that all near-optimal solutions w.r.t.
some objective function have low error w.r.t. the ground truth clustering. Approximation stable instances of correlation clustering problems were studied by Balcan and Braverman [6], who gave positive results when G is the complete graph and stated the problem of understanding general graphs as an open question.

Figure 1: Statistical recovery on a grid graph. (a) Ground truth, which we want to recover. (b) A possible set of noisy node and edge observations. (c) Approximate recovery (prediction), in this case with Hamming error 2.

Recovery Algorithms in Other Settings: The high-level goal of recovering ground truth from a noisy input has been studied in numerous other application domains. In the overwhelming majority of these settings, the focus is on maximizing the probability of exactly recovering the ground truth, a manifestly impossible goal in our setting. This is the case with, for example, planted cliques and graph partitions (e.g., [14, 19, 33]), detecting hidden communities [3], and phylogenetic tree reconstruction [15]. A notable exception is work by Braverman and Mossel [10] on sorting from noisy information, who give polynomial-time algorithms for the approximate recovery of a ground truth total ordering given noisy pairwise comparisons. Their approach, similar to the present work, is to compute the maximum likelihood ordering given the data, and prove that the expected distance between this ordering and the ground truth ordering is small.

Recovery on Random Graphs: Two very recent works [1, 11] have addressed the case where noisy pairwise observations of Y are obtained for edges in a graph. In both of these, the focus is mainly on guarantees for random graphs (e.g., Erdős–Rényi graphs).
Furthermore, the analysis is of perfect recovery (in the limit as n → ∞) and its relation to the graph ensemble. The goal of our analysis is considerably more challenging, as we are interested in the Hamming error for finite N. Abbe et al. [1] explicitly state partial (as opposed to exact) recovery for sparse graphs with constant degrees as an open problem, which we solve in this paper.

Percolation: Some of the technical ideas in our study of grid graphs (Section 4) are inspired by arguments in percolation, the study of connected clusters in random (often infinite) graphs. For example, our use of "filled-in regions" in Section 4 is reminiscent of arguments in percolation theory (e.g., see p. 286 in [22]). In addition, we can directly adapt results from statistical physics that bound the connectivity constant of square lattices [12, 31] and the number of self-avoiding polygons of a particular length and area [24], to give precise constants for our theoretical results.

3 Preliminaries

We consider the setting of prediction on a graph G = (V, E), where V denotes the set of labels that we want to predict and E the observed pairwise relationships. Let Y ∈ {−1, +1}^N denote the ground truth labels, where N = |V|. The setting is depicted in Figure 1.

The Generative Model and Hamming Error: A random process generates observations for the edges and nodes of G as a function of the ground truth. It has two parameters, an edge noise p ∈ [0, 0.5] and a node noise q ∈ [0, 0.5]. The generative model is as follows. For each edge (u, v) ∈ E, the edge observation X_uv is independently sampled to be Y_u Y_v with probability 1 − p (called a good edge), and −Y_u Y_v with probability p (called a bad edge). Observe that adjacent vertices are likely to have the same (or different) labels if the observation on the edge connecting them is +1 (or −1).
Similarly, for each node v ∈ V, the node observation X_v is independently sampled to be Y_v with probability 1 − q (good nodes), and −Y_v with probability q (bad nodes).

A labeling algorithm is a function A : {−1, +1}^E × {−1, +1}^V → {−1, +1}^V from graphs with labeled edges and nodes (i.e., the noisy observations described above) to a labeling of the nodes V. We measure the performance of A by the expectation of the Hamming error (i.e., the number of mispredicted labels) over the observation distribution induced by Y. By the error of an algorithm, we mean its worst-case (over Y) expected error (over inputs generated by Y). Formally, we denote the error of the algorithm given a value Y = y by e_y(A) and define it as:

    e_y(A) = E_{X | Y=y} [ (1/2) ||A(X) − y||_1 ].    (1)

The overall error is then:

    e(A) = max_y e_y(A).    (2)

MAP and Marginal Estimators: The maximum likelihood (ML) estimator of the ground truth is given by Ŷ ← argmax_Y p(X | Y), where

    p(X | Y) = Π_{uv∈E} (1−p)^{(1 + X_uv Y_u Y_v)/2} p^{(1 − X_uv Y_u Y_v)/2} · Π_{v∈V} (1−q)^{(1 + X_v Y_v)/2} q^{(1 − X_v Y_v)/2}.    (3)

Taking the logarithm and ignoring constants, we see that maximizing p(X | Y) is equivalent to

    max_Y Σ_{uv∈E} (1/2) X_uv Y_u Y_v log((1−p)/p) + Σ_{v∈V} (1/2) X_v Y_v log((1−q)/q),    (4)

or simply max_Y Σ_{uv∈E} X_uv Y_u Y_v + γ Σ_{v∈V} X_v Y_v, where γ = log((1−q)/q) / log((1−p)/p).

Assuming a uniform prior over ground truths Y, MAP inference reduces to maximum likelihood inference, and marginal inference can be performed using p(Y | X) ∝ p(X | Y). Standard arguments prove that the algorithm that performs marginal inference using a uniform distribution over Y achieves the smallest possible error according to Eq. 2; for completeness, we include a proof in Appendix A.
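To make the reduced objective concrete, the following sketch (our own illustration, not from the paper) brute-forces the maximizer of Σ X_uv Y_u Y_v + γ Σ X_v Y_v on a toy 2×2 grid; the graph, noise levels, and ground truth here are arbitrary choices for illustration.

```python
import math
from itertools import product

def ml_estimate(edges, X_edge, X_node, p, q):
    """Brute-force argmax_Y of sum_uv X_uv*Yu*Yv + gamma * sum_v X_v*Yv."""
    n = len(X_node)
    gamma = math.log((1 - q) / q) / math.log((1 - p) / p)
    def score(Y):
        edge_term = sum(X_edge[e] * Y[u] * Y[v] for e, (u, v) in enumerate(edges))
        node_term = sum(X_node[v] * Y[v] for v in range(n))
        return edge_term + gamma * node_term
    return max(product([-1, 1], repeat=n), key=score)

# 2x2 grid: vertices 0 1 / 2 3
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
Y_true = (1, 1, -1, -1)
# Noiseless observations drawn from the model: X_uv = Yu*Yv, X_v = Yv.
X_edge = [Y_true[u] * Y_true[v] for (u, v) in edges]
X_node = list(Y_true)
print(ml_estimate(edges, X_edge, X_node, p=0.1, q=0.1))  # recovers (1, 1, -1, -1)
```

With noiseless observations the true labeling is the unique maximizer, so the estimator recovers it exactly; with p = q = 0.1 the weight γ works out to exactly 1.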
In other words, marginal inference using a uniform prior minimizes the worst-case expected error (i.e., it is minimax optimal).

Approximate Recovery: The interesting regime for structured prediction is when the node noise q is large. In this regime there is no correlation decay, and correctly predicting a label requires a more global consideration of the node and edge observations. The intriguing question — and the question that reveals the importance of the structure of the graph G — is whether or not there are algorithms with small error when the edge noise p is a small constant. Precisely, we make the following definition.

Definition 3.1 (Approximate Recovery) For a family of graphs 𝒢, we say that approximate recovery is possible if there is a function f : [0, 1] → [0, 1] with lim_{p↓0} f(p) = 0 such that, for every sufficiently small p and all N at least a sufficiently large constant N_0(p), the minimum-possible error of an algorithm on a graph G ∈ 𝒢 with N vertices is at most f(p) · N.

A Non-Example: Some graph families admit approximate recovery, whereas others do not. To illustrate this and impart some intuition about our model, consider the family of path graphs. Assume that the node noise q is extremely close to 0.5, so that node labels provide no information about the ground truth, while the edge noise p is an arbitrarily small positive constant. If G is a path graph on N nodes with N sufficiently large then, with high probability, for most pairs of nodes, the unique path between them contains a bad edge. This implies that approximate recovery is not possible. A bit more formally, imagine that an adversary generates the ground truth Y by picking i uniformly at random from {1, 2, . . . , N}, giving the first i nodes the label −1 and the last N − i nodes the label +1.
With high probability a constant fraction of the input's edges are "−1" edges — one good edge consistent with the ground truth and the rest bad edges inconsistent with the ground truth. Intuitively, no algorithm can guess which is which, which means that every algorithm has expected error Ω(N) with respect to the distribution over Y, and hence error Ω(N) with respect to a worst-case choice of Y. Thus, path graphs do not allow approximate recovery.¹

4 Optimal Recovery in Grid Graphs

This section studies grid graphs. We devote a lengthy treatment to them for several reasons. First, grid graphs are central in applications such as machine vision. Second, the grid is a relatively poor expander, and for this reason poses a number of interesting technical challenges. Third, our algorithm for the grid and other planar graphs is computationally efficient. Our grid analysis yields matching upper and lower bounds of Θ(p²N) on the information-theoretically optimal error.

4.1 The Algorithm

We study the algorithm Ā, which has two stages. The first stage ignores the node observations and computes a labeling Ŷ that maximizes the agreement with respect to edge observations only, i.e.,

    Ŷ ← argmax_Y Σ_{uv∈E} X_uv Y_u Y_v.    (5)

Note that Ŷ and −Ŷ agree with precisely the same set of edge observations, and thus both maximize Eq. 5. The second stage of algorithm Ā outputs Ŷ or −Ŷ, according to a "majority vote" by the node observations. Precisely, it outputs −Ŷ if Σ_{v∈V} X_v Ŷ_v < 0, and Ŷ otherwise.

Algorithm 1: Ā(X)
    Require: edge and node observations X
    Ensure: node predictions Ŷ
    Ŷ ← argmax_Y Σ_{uv∈E} X_uv Y_u Y_v
    if Σ_{v∈V} X_v Ŷ_v < 0 then
        Ŷ ← −Ŷ
    end if
    return Ŷ

When the graph G is a 2D grid, or more generally a planar graph, this algorithm can be implemented in polynomial time by a reduction to the maximum-weight matching problem (see [20, 8]).
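A concrete but inefficient sketch of the two-stage procedure (our own illustration; brute-force search stands in for the polynomial-time planar matching reduction, and the toy instance is an arbitrary 2×2 grid):

```python
from itertools import product

def two_stage(edges, X_edge, X_node):
    """Stage 1: maximize edge agreement (brute force here; poly-time via
    planar matching in the paper). Stage 2: majority vote by node observations."""
    n = len(X_node)
    def edge_score(Y):
        return sum(X_edge[e] * Y[u] * Y[v] for e, (u, v) in enumerate(edges))
    Y_hat = max(product([-1, 1], repeat=n), key=edge_score)
    # Y_hat and -Y_hat tie on edge agreement; node observations break the tie.
    if sum(X_node[v] * Y_hat[v] for v in range(n)) < 0:
        Y_hat = tuple(-y for y in Y_hat)
    return Y_hat

edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # 2x2 grid
Y_true = (1, 1, -1, -1)
X_edge = [Y_true[u] * Y_true[v] for (u, v) in edges]  # all good edges
X_node = [1, 1, -1, 1]                     # one bad node observation
print(two_stage(edges, X_edge, X_node))    # (1, 1, -1, -1)
```

Note that the first stage may return the globally flipped labeling (here it does, since brute force finds −Y_true first), and the node majority vote in the second stage corrects the sign despite one bad node observation.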
By contrast, it is NP-hard to maximize the full expression in (4) [8].

¹ It is not difficult to make this argument rigorous. See Section 4.3 for a rigorous, and more interesting, version of this lower bound argument.

4.2 An Upper Bound on the Error

Our goal is to prove the following theorem, which shows that approximate recovery on grids is possible.

Theorem 4.1 If p < 1/39, then the algorithm Ā achieves error e(Ā) = O(p²N).

Analysis of First Stage: We analyze the two stages of algorithm Ā in order. We first show that after the first stage, the expected error of the better of Ŷ, −Ŷ is O(p²N). We then extend this error bound to the output of the second stage of the algorithm.

We begin by highlighting a simple but key lemma that characterizes a structural property of the maximizing assignment in Eq. 5. We use δ(S) to denote the boundary of S ⊆ V, i.e., the set of edges with exactly one endpoint in S.

Lemma 4.2 (Flipping Lemma) Let S denote a maximal connected subgraph of G with every node of S incorrectly labelled by Ŷ or −Ŷ. Then at least half the edges of δ(S) are bad.

Proof: The computed labeling Ŷ (or −Ŷ) agrees with the edge observations on at least half the edges of δ(S) — otherwise, flipping the labels of all nodes in S would yield a new labeling with agreement strictly higher than Ŷ (or −Ŷ). On the other hand, since S is maximal, for every edge e ∈ δ(S), exactly one endpoint of e is correctly labeled. Thus every edge of δ(S) is inconsistent with the ground truth. These two statements are compatible only if at least half the edges of δ(S) are bad. □

Call a set S bad if at least half its boundary δ(S) is bad. The Flipping Lemma motivates bounding the probability that a given set is bad, and then enumerating over sets S.
This approach can be made to work only if the collection of sets S is chosen carefully — otherwise, there are far too many sets and this approach fails to yield a non-trivial error bound.

To begin the analysis, let H denote the error of our algorithm on a random input. H seems difficult to analyze directly, so we introduce a simpler-to-analyze upper bound. This requires some definitions. Let C denote the subsets S of V such that the induced subgraph G[S] is connected. We classify the subsets S of C into 6 categories (see Figure 2):

1. S contains no vertices on the perimeter of G;
2. S contains vertices from exactly one side of the perimeter of G;
3. S contains vertices from exactly two sides of the perimeter of G, and these two sides are adjacent;
4. S contains vertices from exactly two sides of the perimeter of G, and these two sides are opposite;
5. S contains vertices from exactly three sides of the perimeter of G;
6. S contains vertices from all four sides of the perimeter of G.

Figure 2: Examples of type 1, 2, 3, and 6 regions, left-to-right.

Let C_{<6} denote the set of all S ⊂ V from one of the first 5 categories. For a set S ∈ C_{<6}, we define a corresponding filled-in set F(S). Consider the connected components C_1, . . . , C_k of G[V \ S] for such a subset S. Call such a connected component 3-sided if it includes vertices from at least three sides of the grid G. For every S ∈ C_{<6} there is at least one 3-sided component; it is unique if S has type 1, 2, 3, or 5. We define F(S) as the union of S with all the connected components of G[V \ S] except for a single 3-sided one. Appendix B illustrates the filling-in procedure. F(S) is not defined for type-6 components S. Observe that F(S) ⊇ S. Let ℱ = {F(S) : S ∈ C_{<6}} denote the set of all such filled-in components.
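A filled-in set F has both G[F] and G[V \ F] connected, so its boundary δ(F) is a minimal cut (this characterization is used in the counting argument later). The following sketch (our own illustration on a 3×3 grid, not from the paper) enumerates all such subsets by brute force:

```python
from itertools import combinations

def grid_graph(n):
    V = [(r, c) for r in range(n) for c in range(n)]
    E = [((r, c), (r, c + 1)) for r in range(n) for c in range(n - 1)]
    E += [((r, c), (r + 1, c)) for r in range(n - 1) for c in range(n)]
    return V, E

def is_connected(nodes, E):
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {v: [] for v in nodes}
    for u, v in E:
        if u in nodes and v in nodes:
            adj[u].append(v)
            adj[v].append(u)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v])
    return seen == nodes

V, E = grid_graph(3)
# Subsets S with both G[S] and G[V\S] connected; delta(S) is then a minimal cut.
filled = [frozenset(S) for k in range(1, len(V))
          for S in combinations(V, k)
          if is_connected(S, E) and is_connected(set(V) - set(S), E)]

def boundary_size(S):
    return sum(1 for u, v in E if (u in S) != (v in S))

print(boundary_size(frozenset({(1, 1)})))  # center singleton: 4 boundary edges
```

The enumeration is symmetric under complementation, as expected: S qualifies exactly when V \ S does.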
Lemma 4.3 If S_1, S_2 are disjoint and not type 6, then F(S_1), F(S_2) are distinct and not type 6.

Proof: If a set S is not type 6, then every 3-sided component of G[V \ S] contains one entire side of the grid perimeter. Since F(S) excludes a 3-sided component, it cannot be type 6. Also, for a set S that is not type 6, the boundary of F(S) is a non-empty subset of that of S. Thus, the non-empty set of endpoints of δ(F(S)) that lie in F(S) also lie in S. This implies that if F(S_1) = F(S_2), then S_1 ∩ S_2 ≠ ∅. □

The following error upper bound applies to whichever of Ŷ, −Ŷ does not incorrectly classify a type-6 set (there is at most one type-6 set, so at least one of them has this property). Let B denote the mislabeled vertices of such a labeling, and let B_1, . . . , B_k denote the connected components (of types 1–5) of G[B]. The next lemma extends the Flipping Lemma.

Lemma 4.4 For every set B_i, the filled-in set F(B_i) is bad.

Proof: We first claim that Ŷ agrees with the data on at least half the edges of δ(F(B_i)); the same is true of −Ŷ. The reason is that flipping the label of every vertex of F(B_i) increases the agreement with the data by the number of disagreeing edges of δ(F(B_i)) minus the number of agreeing edges of δ(F(B_i)), and this difference is non-positive by the optimality of Ŷ. On the other hand, since B_i is maximal, every neighbor of B_i is correctly labeled in Ŷ. Since the neighborhood of F(B_i) is a subset of the neighborhood of B_i, this also holds for F(B_i). Thus, Ŷ disagrees with Y on every edge of δ(F(B_i)). □

A crucial point is that Lemmas 4.3 and 4.4 imply that the random variable

    T = Σ_{F∈ℱ} |F| · 1[F is bad]    (6)

is an upper bound on the error H with probability 1. We now upper bound the easier-to-analyze quantity T.
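The first step in bounding T is a tail bound on the number of bad boundary edges (Lemma 4.5 below). The following numeric check (ours, not part of the paper) verifies the chain Pr[Bin(i, p) ≥ i/2] ≤ (i choose ⌈i/2⌉) p^⌈i/2⌉ ≤ (3√p)^i for small edge-noise values:

```python
import math

def tail_at_least_half(i, p):
    """Exact Pr[Binomial(i, p) >= i/2]."""
    k0 = math.ceil(i / 2)
    return sum(math.comb(i, k) * p**k * (1 - p)**(i - k) for k in range(k0, i + 1))

for p in (0.01, 0.05, 0.1):
    for i in range(2, 21):
        exact = tail_at_least_half(i, p)
        union = math.comb(i, math.ceil(i / 2)) * p**math.ceil(i / 2)
        # union bound over size-ceil(i/2) subsets of bad edges, then (3*sqrt(p))^i
        assert exact <= union <= (3 * math.sqrt(p))**i
print("tail bound chain verified for p in {0.01, 0.05, 0.1}, i <= 20")
```

The middle term is the union bound over all subsets of ⌈i/2⌉ bad edges; the last inequality uses (i choose ⌈i/2⌉) ≤ 2^i ≤ 3^i together with p^⌈i/2⌉ ≤ p^{i/2}.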
The first lemma provides an upper bound on the probability that a set S is bad, as a function of its boundary size |δ(S)|.

Lemma 4.5 For every set S with |δ(S)| = i, Pr[S is bad] ≤ (3√p)^i.

Proof: By the definition of a bad set, Pr[S is bad] equals the probability that at least half of δ(S) are bad edges. Since |δ(S)| = i, this is the probability that at least i/2 edges are bad. Since these events are IID, we can bound it via:

    Pr[ Σ_j Z_j ≥ i/2 ] < (i choose i/2) p^{i/2} ≤ (2e)^{i/2} p^{i/2} ≤ (3√p)^i,    (7)

where Z_j is the indicator of the j-th edge being bad. □

The probability bound in Lemma 4.5 is naturally parameterized by the number of boundary edges. Because of this, we face two tasks in upper bounding T. First, T counts the number of nodes of bad filled-in sets F ∈ ℱ, not boundary sizes. The next lemma states that the number of nodes of such a set cannot be more than the square of its boundary size.

Lemma 4.6 For F ∈ ℱ: (1) |F| ≤ |δ(F)|²; (2) if F is a type-1 region, then |F| ≤ (1/16)|δ(F)|².

Proof: If F is a type 4 or 5 set, then |δ(F)| ≥ √N and the bound is trivial. If F is a type 1 set, let U be the smallest rectangle in the dual graph that contains F. Let k, m denote the side lengths of U. Then |F| ≤ km ≤ (1/16)(2k + 2m)² ≤ (1/16)|δ(F)|². Similarly, for type 2 sets we have |F| ≤ km ≤ min{(2k + m)², (k + 2m)²} ≤ |δ(F)|². Finally, for type 3 sets we have |F| ≤ km ≤ (k + m)² ≤ |δ(F)|². □

The second task in upper bounding T is to count the number of filled-in sets F ∈ ℱ that have a given boundary size. We do this by counting simple cycles in the dual graph.

Lemma 4.7 Let i be a positive integer.
(a) If i is odd or 2, then there are no type-1 sets F ∈ ℱ with |δ(F)| = i; (b) if i is even and at least 4, then there are at most N · 4 · 3^{i−2} / (2i) = N · 2 · 3^{i−2} / i type-1 sets F ∈ ℱ with |δ(F)| = i; (c) if i is at least 2, then there are at most 2√N · 3^{i−2} type 2–5 sets F ∈ ℱ with |δ(F)| = i.

Proof: Recall that, by construction, a filled-in set F ∈ ℱ is such that both G[F] and G[V \ F] are connected. This is equivalent to the property that δ(F) is a minimal cut of G — there is no subset S such that δ(S) is a strict subset of δ(F). In a planar graph such as G, this is equivalent to the property that the dual of δ(F) is a simple cycle in the dual graph G_d of G (e.g., see Section 4.6 of [17]).

Note that the dual graph G_d is just an (n − 1) × (n − 1) grid — with one vertex per "grid cell" of G — plus an extra vertex z of degree 4(√N − 1) that corresponds to the outer face of G. The type-1 sets of ℱ are in dual correspondence with the simple cycles of G_d that do not include z; the other sets of ℱ are in dual correspondence with the simple cycles of G_d that do include z. The cardinality of the boundary |δ(F)| equals the length of the corresponding dual cycle.

Part (a) follows from the fact that G_d \ {z} is a bipartite graph, with only even cycles, and with no 2-cycles. For part (b), we count simple cycles of G_d of length i that do not include z. There are at most N choices for a starting point. There are at most 4 choices for the first edge, at most 3 choices for the next (i − 2) edges, and at most one choice at the final step to return to the starting point. Each simple cycle of G_d \ {z} is counted 2i times in this way, once for each choice of the starting point and the orientation. For part (c), we count simple cycles of G_d of length i that include z. We start the cycle at z, and there are at most 4√N choices for the first node.
There are at most 3 choices for the next i − 2 edges, and at most one choice for the final edge. This counts each cycle twice, once in each orientation. □

Let ℱ_1 ⊆ ℱ denote the type-1 sets of ℱ. The computation below shows that

    E[T] ≤ cp²N + O(p√N)    (8)

for a constant c > 0 that is independent of p and N, which completes the analysis of the first stage of the algorithm Ā. The intuition for why this computation works out is that Lemma 4.7 implies that there is only an exponential number of relevant regions to sum over; Lemma 4.6 implies that the Hamming error is quadratically related to the (bad) boundary size; and Lemma 4.5 implies that the probability of a bad boundary is decreasing exponentially in i (with base 3√p). Provided p is at most a sufficiently small constant (independent of N), the probability term dominates and so the expected error is small. Formally, we have

    E[T] = Σ_{F∈ℱ} |F| · Pr[F is bad]    (9)
         = Σ_{i=2}^∞ Σ_{F∈ℱ_1 : |δ(F)|=2i} |F| · Pr[F is bad] + Σ_{j=2}^∞ Σ_{F∈ℱ\ℱ_1 : |δ(F)|=j} |F| · Pr[F is bad]
         ≤ Σ_{i=2}^∞ Σ_{F∈ℱ_1 : |δ(F)|=2i} (i²/4) · (3√p)^{2i} + Σ_{j=2}^∞ Σ_{F∈ℱ\ℱ_1 : |δ(F)|=j} j² · (3√p)^j    (10)
         ≤ Σ_{i=2}^∞ N · (2 · 3^{2i−2} / (2i)) · (i²/4) · (3√p)^{2i} + Σ_{j=2}^∞ 2√N · 3^{j−2} · j² · (3√p)^j    (11)
         ≤ N Σ_{i=2}^∞ (i/16)(81p)^i + √N Σ_{j=2}^∞ (2j²/9)(9√p)^j = N(cp²) + O(p√N),    (12)

for a constant c > 0 that is independent of p and N. In the derivation, (9) follows from the definition of T and linearity of expectation, (10) follows from Lemmas 4.5 and 4.6, and (11) follows from Lemma 4.7. In the final line, we are assuming that p < 1/81.

Remark 4.8 There are several ways to optimize the computation above. The requirement that p < 1/81 was needed for the infinite series to converge. To improve this, we can use the tighter upper bound of (2ep)^{i/2} for the probability that a region of boundary size i is bad (see Lemma 4.5).
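As a quick numerical sanity check (ours, not from the paper), the two series in (12) can be summed by truncation for a value of p below the 1/81 threshold, confirming that the N-linear term behaves like a constant times p² and the √N term stays bounded:

```python
def series_sums(p, terms=500):
    """Truncated sums of the two convergent series in the E[T] bound (12)."""
    s1 = sum(i / 16 * (81 * p)**i for i in range(2, terms))             # N-linear term
    s2 = sum(2 * j * j / 9 * (9 * p**0.5)**j for j in range(2, terms))  # sqrt(N) term
    return s1, s2

p = 0.001  # well below 1/81 ~ 0.0123, so both geometric-like series converge
s1, s2 = series_sums(p)
print(s1 / p**2, s2)  # effective constant c in N*c*p^2, and the sqrt(N) coefficient
```

For p this small the first series is dominated by its i = 2 term, so the effective constant c is essentially fixed, matching the claim that the N-linear contribution is O(p²N).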
We can then replace the upper bound on the number of regions of each type in Lemma 4.7 with tighter results from statistical physics. In particular, the number of type-1 sets with boundary size i can be upper bounded by Nμ^i (Eq. 3.2.5 of [31]), where μ is the so-called connective constant of square lattices and is upper bounded by 2.65 [12]. The number of type 2–5 sets with boundary length i can similarly be upper bounded by 4√N μ^i e^{κ√i} for the same value of μ and for some fixed constant κ > 0 [23]. Putting these together, we obtain that the infinite series for all region types is at most a constant when p < 1/39.

To compute an upper bound on the constant c in the term in (12) that is linear in N, recall that this term can be attributed to the type-1 regions. We expand the sum in (9) over type-1 regions into two terms: one term that explicitly enumerates over type-1 regions whose corresponding simple cycle in G_d is of length i = 2 to 100, and a remainder term. The sum in the first term can be computed exactly as follows. For each value of i, the probability that the region is bad is simply Σ_{k=i/2}^{i} (i choose k) p^k (1−p)^{i−k}. We can then use the bound Σ_{F∈ℱ_1 : |δ(F)|=i} |F| ≤ N Σ_{a=1}^{i²/16} a · c_{a,i}, where c_{a,i} is the number of distinct cycles in an infinite grid of length i and area a (up to translation). These cycles also go by the name of self-avoiding polygons in statistical physics, and the numbers c_{a,i} have been exhaustively computed up to i = 100 [24]. Finally, the infinite sum in the remainder can be shown to be upper bounded by 51² b^{51} / (1 − b)³ for b = 2ep(2.65)². The resulting function can then be shown to be upper bounded by 8Np² for p ≤ 0.017.

Analyzing the Second Stage: Our analysis so far shows that the better of Ŷ, −Ŷ has small error with respect to the ground truth Y.
In the second phase, we use the node labels to choose between them via a "majority vote." We next show that, provided q is slightly below 1/2, the better of Ŷ, −Ŷ is chosen in the second stage with high probability. This completes the proof of Theorem 4.1.

Our starting point for the second-stage analysis is the inequality $\mathbf{E}[H_0] \le N\cdot cp^2$, where $H_0$ is the Hamming error of the better of Ŷ, −Ŷ. Markov's inequality implies that

$$\Pr\Big[H_0 \ge \tfrac{1}{kp^2}\cdot Ncp^2\Big] \le kp^2,$$

where k is a free parameter. For the second stage, let $B_0$ be the set of wrong node observations. Chernoff bounds imply that, for every constant δ > 0 and sufficiently large N, $\Pr[|B_0| \ge (1+\delta)Nq] \le 1/N^2$. Observe that if the sum of the number of bad node observations and the number of misclassified nodes for the better of Ŷ, −Ŷ is less than N/2, then the two-stage algorithm Ā chooses the better of Ŷ, −Ŷ. Hence, with probability $1 - kp^2 - 1/N^2$, the algorithm chooses the better of Ŷ, −Ŷ provided

$$\frac{1}{kp^2}\cdot Ncp^2 + (1+\delta)Nq < \frac{N}{2}, \qquad\text{or equivalently,}\qquad \frac{c}{k} + (1+\delta)q < \frac12.$$

This inequality is satisfied for small δ provided $k > \frac{c}{1/2 - (1+\delta)q}$. Thus,

$$\mathbf{E}[H] \le 1\cdot Ncp^2 + \Big(kp^2 + \frac{1}{N^2}\Big)\cdot N \le N\cdot\big((c+1)p^2 + kp^2\big) \le N\cdot Cp^2$$

for $N > N_0(p, q)$, where H is the error of the two-step algorithm. (In the second inequality we use that N > 1/p.)

4.3 Lower Bound

In this section, we prove that every algorithm suffers worst-case (over the ground truth) expected error $\Omega(p^2 N)$ on 2D grid graphs, matching the upper bound for the two-step algorithm Ā that we proved in Theorem 4.1. We use the fact that marginal inference is minimax optimal for Eq. 2 (see Appendix A). The expected error of marginal inference is independent of the ground truth (by symmetry), so we can lower bound its expected error for the all-0 ground truth.
Also, its error only decreases if it is given part of the ground truth. Let G = (V, E) denote an n × n grid with N = n² vertices. Let Y : V → {−1, +1} denote the ground truth. We consider the case where Y is chosen at random from the following distribution. Color the nodes of G black and white like a chessboard. White nodes are assigned binary values uniformly and independently. Black nodes are assigned the label +1. Given Y, the input is generated using the random process described in Section 3.

Consider an arbitrary function from inputs to labellings of V. We claim that the expected error of the output of this function, where the expectation is over the choice of ground truth Y and the subsequent random input, is $\Omega(p^2 N)$. This implies that, for every function, there exists a choice of ground truth Y such that the expected error of the function (over the random input) is $\Omega(p^2 N)$.

Given Y, call a white node ambiguous if exactly two of the edges incident to it are labeled "+1" in the input. A white node is ambiguous with probability $6p^2(1-p)^2 \ge 5.1p^2$ for p ≤ 0.078. Since there are N/2 white nodes, and the events corresponding to ambiguous white nodes are independent, Chernoff bounds imply that there are at least $\frac{5p^2}{2}N$ ambiguous white nodes with very high probability. Let L denote the error contributed by ambiguous white nodes. Since the true labels of different white nodes are conditionally independent (given that all black nodes are known to have value +1), the function that minimizes E[L] predicts each white node separately. The algorithm that minimizes the expected value of L simply predicts that each ambiguous white node has true label equal to its input label. This prediction is wrong with constant probability, so $\mathbf{E}[L] = \Omega(p^2 N)$ for every algorithm. Since L is a lower bound on the Hamming error, the result follows.
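The probability bound for ambiguous white nodes is elementary to verify numerically; the following check (our own, using only the expression stated above) confirms that $6p^2(1-p)^2 \ge 5.1p^2$ for all p up to 0.078:

```python
from math import comb

def ambiguous_prob(p):
    # A white node has 4 incident edges; per the text, it is ambiguous
    # when exactly two of them carry the label "+1", which happens with
    # probability C(4,2) * p^2 * (1-p)^2 = 6 p^2 (1-p)^2.
    return comb(4, 2) * p**2 * (1 - p) ** 2

# The text's claim: 6 p^2 (1-p)^2 >= 5.1 p^2 whenever p <= 0.078.
for k in range(1, 79):
    p = k / 1000
    assert ambiguous_prob(p) >= 5.1 * p**2
```

The threshold is tight: $6(1-p)^2 \ge 5.1$ exactly when $p \le 1 - \sqrt{0.85} \approx 0.0780$.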
5 Extensions

This section sketches several extensions of our model and results, to planar graphs beyond grids (Section 5.1), to expander graphs (Section 5.2), to graphs with a large minimum cut (Section 5.3), and to semi-random models (Section 5.4).

5.1 Approximate Recovery in Other Planar Graphs

Section 4 gives a polynomial-time algorithm for essentially information-theoretically optimal approximate recovery in grid graphs. While the analysis does use properties of grids beyond planarity, it is robust in that it applies to all planar graphs that share two key features with grids. The path graph (see Section 3) shows that approximate recovery is not possible for all planar graphs; additional conditions are needed.

The first property, which fails in "thin" planar graphs like a path but holds in many planar graphs of interest, is the following weak expansion property:

(P1) (Weak expansion.) For some constants $c_1, c_2 > 0$, every filled-in set F ∈ F satisfies $|F| \le c_1|\delta(F)|^{c_2}$. (Filled-in sets can be defined analogously to the grid case.)

The second key property is that the number of filled-in sets with a given boundary size i should be at most exponential in i. As in Lemma 4.7, a sufficient (but not necessary) condition for this property is that the dual graph has bounded degree (except possibly for the vertex corresponding to the outer face, which can have arbitrary degree).

(P2) (Bounded dual degree.) Every face of G, except possibly for the outer face, contains at most a constant $c_3$ number of edges.

Our proof of computationally efficient approximate recovery (Theorem 4.1) extends to show that approximate recovery is possible in every planar graph that satisfies properties (P1) and (P2); the precise bound on the function f(p) depends on the constants $c_1, c_2, c_3$.
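As a concrete illustration of (P1), and not a claim from the paper, one can check the weak-expansion inequality for the simplest grid-like filled-in sets: axis-aligned a × b rectangles, whose edge boundary away from the grid's outer boundary has size 2(a + b). With the illustrative constants $c_1 = 1/16$ and $c_2 = 2$, the inequality $|F| \le c_1|\delta(F)|^{c_2}$ reduces to the AM-GM inequality $4ab \le (a+b)^2$:

```python
# Weak expansion (P1) for axis-aligned rectangles in a large grid:
# an a x b rectangle F has |F| = a*b and edge boundary |delta(F)| = 2(a+b)
# (assuming F sits away from the grid's outer boundary).
# With the illustrative constants c1 = 1/16, c2 = 2, the bound
# |F| <= c1 |delta(F)|^c2 is exactly a*b <= (a+b)^2 / 4 (AM-GM).
for a in range(1, 50):
    for b in range(1, 50):
        size = a * b
        boundary = 2 * (a + b)
        assert size <= (1 / 16) * boundary ** 2
print("(P1) holds with c1=1/16, c2=2 for all rectangles checked")
```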
5.2 Approximate Recovery in Expander Graphs

Structured prediction on expander graphs is often applied to relational classification (e.g., predicting protein-protein interactions or web-page classification). This section proves that every family G of d-regular expanders admits approximate recovery. Recall the definition of such a family: for some constant c > 0, for every G ∈ G with N vertices and every set S ⊆ V with |S| ≤ N/2, $|\delta(S)| \ge c\cdot d\cdot|S|$, where the boundary δ(S) is the set of edges with exactly one endpoint in S.

We claim that G allows approximate recovery with f(p) = 3p/c, and proceed to the proof. The algorithm is the same as in Section 4; it is not computationally efficient for expanders. As in Section 4, analyzing the two-stage algorithm reduces to analyzing the better of the two solutions produced by the first stage. We therefore assume that the output Ŷ of the first stage has error H at most N/2.

Fix a noise parameter p ∈ (0, 1/2), a graph G ∈ G with N sufficiently large, and a ground truth. Let B denote the set of bad edges. Chernoff bounds imply that, for all sufficiently large N, the probability that $|B| \ge 2p|E| = pdN$ is at most $1/N^2$. When |B| > pdN, we can trivially bound the error H by N/2. When |B| ≤ pdN, we bound H from above as follows. Let S denote the nodes of V correctly classified by the first stage Ŷ, and $C_1, \ldots, C_k$ the connected components of the (misclassified) nodes of the induced subgraph G[V \ S]. Since H ≤ N/2, $|C_i| \le N/2$ for every i. We have

$$H = \sum_{i=1}^{k} |C_i| \;\le\; \frac{1}{cd}\sum_{i=1}^{k} |\delta(C_i)| \;\le\; \frac{2}{cd}\sum_{i=1}^{k} |\delta(C_i)\cap B| \;\le\; \frac{2}{cd}\cdot|B|,$$

where the first inequality follows from the expansion condition, the second from the Flipping Lemma (Lemma 4.2), and the third from the fact that the $\delta(C_i)$'s are disjoint (since the $C_i$'s are maximal). Thus, when |B| ≤ pdN, $H \le \frac{2p}{c}N$.
Overall, we have

$$\mathbf{E}[H] \le 1\cdot\frac{2p}{c}N + \frac{1}{N^2}\cdot\frac{N}{2} \le \frac{3p}{c}N$$

for N sufficiently large, as claimed.

5.3 Graphs with a Large Minimum Cut

Approximate recovery is also possible in every graph family G for which the global minimum cut c* is bounded below by c log N for a sufficiently large constant c. This class of graphs is incomparable to the expanders considered in Section 5.2.

To see why a large minimum cut is sufficient, we modify the first-stage analysis in the proof of Theorem 4.1 as follows. Define C as the subsets S of V such that |S| ≤ N/2 and G[S] is connected, and $C_i$ as the subset of C corresponding to sets S with |δ(S)| = i. Recall that, for every α ≥ 1, the number of α-approximate minimum cuts of an undirected graph is at most $N^{2\alpha}$ (e.g., see [27]). Thus, $|\mathcal{C}_i| \le N^{2i/c^*}$, which is at most $2^{2i/c}$ when $c^* \ge c\log_2 N$. That is, there can only be an exponential number of connected subgraphs with a given boundary size (cf. property (P1) in Section 5.1). A calculation along the lines of the proof of Theorem 4.1 then implies that approximate recovery is possible, provided the constant c is sufficiently large.

5.4 Semi-Random Models

All of our positive results make minimal use of the properties of the random process that generates inputs given the ground truth. Our proofs only need the fact that the probability that a boundary δ(S) consists of at least half bad edges decays exponentially in the boundary size |δ(S)| (Lemma 4.5). As such, our positive results are robust to many variations in the random model. For example, the fact that every edge has the same noise parameter p is not important: our algorithms continue to have the exact same guarantees, with the same proofs, with the function f(p) replaced by f(p_max), where p_max is the maximum noise parameter of any edge.
If bad edges are negatively correlated instead of independent, then the relevant Chernoff bounds (and hence Lemma 4.5) continue to hold (see e.g. [18]), and our results remain unchanged.

Most interestingly, our positive results can accommodate the following semi-random adversary (cf. [19]). Given a graph G and ground truth, as before nature independently designates each edge as good or bad with probability 1 − p and p, and similarly for nodes (with probability 1 − q and q). Good nodes and edges are labelled according to the ground truth. An adversary, who knows what algorithm will be used on the input, selects arbitrary labels for the bad nodes and edges. Our basic model corresponds to the special case in which the adversary labels every bad node and edge to be inconsistent with the ground truth. Such semi-random adversaries can often foil algorithms that work well in a purely random model, especially algorithms that are overly reliant on the details of the input distribution or that are "local" in nature. In all of our proofs of our positive results, we effectively assume that every relevant set S whose boundary δ(S) has at least half bad edges contributes |S| to our algorithm's error. Thus, an adversary maximizes our error upper bound by maximizing the number of bad nodes and edges. In other words, from the standpoint of our error bounds, a semi-random adversary is no worse than a random one.

6 Empirical Study

Our theoretical analysis suggests that statistical recovery on 2D grid graphs can attain an error that scales with p². Furthermore, we show that this error is achieved using the two-step algorithm of Section 4. Here we describe a synthetic experiment that compares the two-step algorithm to other recovery procedures. We consider a 20 × 20 grid, with high node noise of 0.4 and variable edge noise levels.
In addition to the two-step algorithm we consider the following:²

• Marginal inference: predicting according to $p(Y_i \mid X)$. As mentioned in Section 3, this is the optimal recovery procedure. Although it is generally hard to calculate, for the graph size we use it can be done in 20 minutes per model.

• Local LP relaxation: instead of calculating $p(Y_i \mid X)$, one can resort to approximation. One possibility is to calculate the mode of $p(Y \mid X)$ (also known as the MAP problem). However, since this is also hard, we consider LP relaxations of the MAP problem. The simplest such relaxation assumes locally consistent pseudo-marginals.

• Cycle LP relaxation: a tighter version of the LPs above uses cycle constraints instead of pairwise ones. In fact, for planar graphs with no external field (as in the first step of our two-step algorithm) this relaxation is tight. It is thus of interest to study it in our context.

For both the cycle and local relaxations we use the code from [34].

²We also experimented with a greedy hill-climbing procedure, but results were poor and are not shown.

Fig. 3 shows the expected error for the different algorithms, as a function of edge noise. The two-step procedure almost matches the accuracy of the optimal marginal algorithm for low noise levels; as the noise increases, the gap grows. Another interesting observation is that the local relaxation performs significantly worse than the other baselines, whereas the cycle relaxation is close to optimal. The latter observation is likely due to the fact that with high node noise and low edge noise, the MAP problem is "close" to the no-node-noise case, where the cycle relaxation is exact. However, an analysis of the Hamming error in this case remains an open problem.
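For completeness, the input-generation step underlying this experiment can be sketched as follows. The function and representation are our own; the noise process simply follows the model of Section 3 (each edge observation flipped with probability p, each node observation flipped with probability q):

```python
import random

def generate_input(n, p, q, seed=0):
    """Sample noisy node/edge observations on an n x n grid for the
    all-(+1) ground truth, following the Section 3 noise model."""
    rng = random.Random(seed)
    Y = {(r, c): 1 for r in range(n) for c in range(n)}  # ground truth
    # Node observations: the correct label, flipped with probability q.
    X_nodes = {v: (y if rng.random() > q else -y) for v, y in Y.items()}
    # Edge observations: the product Y_u * Y_v, flipped with probability p.
    X_edges = {}
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                u, v = (r, c), (r + dr, c + dc)
                if v in Y:
                    true = Y[u] * Y[v]
                    X_edges[(u, v)] = true if rng.random() > p else -true
    return X_nodes, X_edges

X_nodes, X_edges = generate_input(20, p=0.05, q=0.4)
```

The two-step algorithm, the LP relaxations, and marginal inference would then all be run on the same sampled (X_nodes, X_edges) input.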
Figure 3: Average Hamming error for different recovery algorithms. Data is generated from a 20 × 20 grid with node noise q = 0.4 and variable edge noise p. The true Y is the all-zeros word. Results are averaged over 100 repetitions.

7 Discussion

Structured prediction underlies many empirically successful systems in machine vision and NLP. In most of these (e.g., [29, 26]) the inference problems are intractable and approximate inference is used instead. However, there is little theoretical understanding of when structured prediction is expected to perform well, how its performance is related to the structure of the score function, which approximation algorithms are expected to work in which setting, etc. In this work we present a first step in this direction, by analyzing the error of structured prediction for 2D grid models. One key finding is that a two-step algorithm attains the information-theoretically optimal error in a certain regime of parameters. What makes this setting particularly interesting from a theoretical perspective is that exact inference (marginals and MAP) is intractable due to the intractability of planar models with external fields. Thus, it is rather surprising that a tractable algorithm achieves optimal performance.

Our work opens the door to a number of new directions, with both theoretical and practical implications. In the context of grid models, we have not studied the effect of the node noise q, but rather assumed it may be arbitrary (less than 0.5). Our two-step procedure uses both node and edge evidence, but it is clear that for small q, improved procedures are available. In particular, the experiments in Section 6 show that decoding with cycle LP relaxations results in empirical performance that is close to optimal.
More generally, we would like to understand the statistical and computational properties of structured prediction for complex tasks such as dependency parsing [29] and non-binary variables (as in semantic segmentation). In these cases, it would be interesting to understand how the structure of the score function affects both the optimal expected accuracy and the algorithms that achieve it.

References

[1] Emmanuel Abbe, Afonso S. Bandeira, Annina Bracher, and Amit Singer. Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery. CoRR, abs/1404.4749, 2014.

[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, 2003.

[3] Anima Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade. A tensor spectral approach to learning mixed membership community models. In COLT, 2013.

[4] Sanjeev Arora, Constantinos Daskalakis, and David Steurer. Message passing algorithms and improved LP decoding. In STOC, pages 3–12, 2009.

[5] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Clustering under approximation stability. J. ACM, 60(2), 2013. Article 8.

[6] Maria-Florina Balcan and Mark Braverman. Finding low error clusterings. In The 22nd Conference on Learning Theory, 2009.

[7] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.

[8] Francisco Barahona. On the computational complexity of Ising spin glass models. J. Phys. A, 15(10):3241, 1982.

[9] Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability & Computing, 21(5):643–660, 2012.

[10] Mark Braverman and Elchanan Mossel. Noisy sorting without resampling. In SODA, pages 268–276, 2008.

[11] Yuxin Chen and Andrea J. Goldsmith. Information recovery from pairwise measurements. CoRR, abs/1404.7105, 2014.

[12] Nathan Clisby and Iwan Jensen.
A new transfer-matrix algorithm for exact enumerations: self-avoiding polygons on the square lattice. J. Phys. A, 45(11):115202, 2012.

[13] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.

[14] Anne Condon and Richard M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116–140, 2001.

[15] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Optimal phylogenetic reconstruction. In STOC, pages 159–168, 2006.

[16] Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Mach. Learn., 75(3):297–325, June 2009.

[17] R. Diestel. Graph Theory. Springer-Verlag, 1997.

[18] Devdatt P. Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.

[19] Uriel Feige and Joe Kilian. Heuristics for semirandom graph problems. Journal of Computer and System Sciences, 63(4):639–671, 2001.

[20] Michael E. Fisher. On the dimer solution of planar Ising models. J. of Mathematical Phys., 7:1776, 1966.

[21] Ioannis Giotis and Venkatesan Guruswami. Correlation clustering with a fixed number of clusters. Theory of Computing, 2(1):249–266, 2006.

[22] Geoffrey Grimmett. Percolation. Springer, 1999.

[23] J. M. Hammersley and D. J. A. Welsh. Further results on the rate of convergence to the connective constant of the hypercubical lattice. Quart. J. Math. Oxford Ser. (2), 13:108–110, 1962.

[24] Iwan Jensen. Size and area of square lattice polygons. J. Phys. A, 33(18):3533–3543, 2000.

[25] Thorsten Joachims and John E. Hopcroft. Error bounds for correlation clustering. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), pages 385–392, 2005.

[26] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D.
Batra, Sungwoong Kim, B. X. Kausler, J. Lellmann, N. Komodakis, and C. Rother. A comparative study of modern inference techniques for discrete energy minimization problems. In CVPR, pages 1328–1335, June 2013.

[27] David R. Karger. Global min-cuts in RNC, and other ramifications of a simple min-cut algorithm. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 21–30. Society for Industrial and Applied Mathematics, 1993.

[28] Vladimir Kolmogorov and Carsten Rother. Minimizing nonsubmodular functions with graph cuts: a review. IEEE Trans. Pattern Anal. Mach. Intell., 29(7):1274–1279, July 2007.

[29] Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. Dual decomposition for parsing with non-projective head automata. In EMNLP, pages 1288–1298, 2010.

[30] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.

[31] Neal Madras and Gordon Slade. The Self-Avoiding Walk. Probability and its Applications. Birkhäuser Boston Inc., Boston, MA, 1993.

[32] Claire Mathieu and Warren Schudy. Correlation clustering with noisy input. In SODA, pages 712–728, 2010.

[33] Frank McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.

[34] David Sontag, Do Kook Choe, and Yitao Li. Efficiently searching for frustrated cycles in MAP inference. In UAI, pages 795–804, 2012.

[35] David Sontag, Talya Meltzer, Amir Globerson, Yair Weiss, and Tommi Jaakkola. Tightening LP relaxations for MAP using message-passing. In UAI, pages 503–510, 2008.

[36] Min Sun, Murali Telaprolu, Honglak Lee, and Silvio Savarese. An efficient branch-and-bound algorithm for optimal human pose estimation. In CVPR, 2012.

[37] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[38] Sara Vicente, Vladimir Kolmogorov, and Carsten Rother. Graph cut based image segmentation with connectivity priors. In CVPR, pages 1–8, 2008.

A Marginal Inference is the Minimax Optimal Algorithm

In this section, we prove that marginal inference under the uniform prior, which we denote by $A_1$, is the minimax optimal algorithm (i.e., it minimizes $e(A) = \max_y e_y(A)$). The marginal inference algorithm predicts each node separately by $\hat{Y}_i \leftarrow \arg\max_{Y_i} p(Y_i \mid X)$, using the uniform prior over Y.

Assume for contradiction that there is an algorithm $A_0$ that yields strictly smaller error than marginal inference. Hence, by the definition of a minimax optimal algorithm, there exist ground truth assignments $y_0$ and $y_1$ such that $\max_y e_y(A_0) = e_{y_0}(A_0) < e_{y_1}(A_1)$. By symmetry, the marginal inference algorithm has equal error for every ground truth. Hence, $e_y(A_0) < e_y(A_1)$ for every ground truth assignment y.

On the other hand, marginal inference minimizes the expected Hamming error when the prior distribution over ground truth assignments is uniform. To see why, let $\hat{Y}_i(X)$ be an estimator of the i-th node. The expected Hamming error of $\hat{Y}_i$ under the uniform prior on $Y_i$ is

$$\Pr\big[\hat{Y}_i(X) \ne Y_i\big] = \sum_{X} \mathbf{1}_{\hat{Y}_i(X)=1}\cdot\tfrac12\,\mu_i^-(X) \;+\; \sum_{X} \mathbf{1}_{\hat{Y}_i(X)=-1}\cdot\tfrac12\,\mu_i^+(X), \qquad (13)$$

where $\mu_i^+$ and $\mu_i^-$ are the distributions of X conditioned on $Y_i = +1$ and $Y_i = -1$, respectively. Since $\mathbf{E}_y[e_y(A)]$ is the sum of the expected errors at individual nodes, marginal inference using the uniform prior minimizes it. The optimality of marginal inference for the uniform prior contradicts the fact that $A_0$ performs better than marginal inference on all ground truth assignments.

Notice that this proof also works for the subset of ground truths considered in the proof of the lower bound for grids (Section 4.3).
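The Bayes-optimality argument above can be checked by brute force on a tiny instance. The following sketch is our own construction (a hypothetical 3-node graph with explicit node and edge noise, not the paper's experimental setup); it enumerates all inputs, forms the posterior under the uniform prior, and verifies that the marginal predictor's expected Hamming error is no larger than that of reading off the noisy node observations directly:

```python
from itertools import product

def exact_marginal_error(edges, n, p, q):
    """Brute-force the expected Hamming error of marginal inference
    on a tiny n-node graph under the uniform prior, and compare it
    with predicting each node directly from its node observation."""
    configs = list(product([-1, 1], repeat=n))  # all ground truths Y

    def likelihood(X_nodes, X_edges, Y):
        # p(X | Y): node labels flip w.p. q, edge labels flip w.p. p
        L = 1.0
        for i in range(n):
            L *= (1 - q) if X_nodes[i] == Y[i] else q
        for k, (u, v) in enumerate(edges):
            L *= (1 - p) if X_edges[k] == Y[u] * Y[v] else p
        return L

    marg_err = direct_err = 0.0
    for X_nodes in configs:
        for X_edges in product([-1, 1], repeat=len(edges)):
            post = [likelihood(X_nodes, X_edges, Y) for Y in configs]
            total = sum(post)
            for i in range(n):
                # marginal predictor: argmax_y p(Y_i = y | X)
                mass_plus = sum(w for w, Y in zip(post, configs) if Y[i] == 1)
                pred = 1 if 2 * mass_plus >= total else -1
                # accumulate P(Y_i != pred, X) under the uniform prior
                wrong = sum(w for w, Y in zip(post, configs) if Y[i] != pred)
                marg_err += wrong / len(configs)
                wrong_direct = sum(
                    w for w, Y in zip(post, configs) if Y[i] != X_nodes[i])
                direct_err += wrong_direct / len(configs)
    return marg_err, direct_err

# Triangle graph: with informative edges (small p) and noisy node
# observations (large q), marginals do at least as well as the raw
# node observations, whose expected error is exactly n*q.
marg, direct = exact_marginal_error([(0, 1), (1, 2), (0, 2)], 3, p=0.05, q=0.4)
assert marg <= direct + 1e-12
```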
B Illustration of Filled-In Sets

Recall that for every subset S we defined a corresponding filled-in set F(S). Figures 4–6 illustrate the transformation from a subset S to the corresponding filled-in set F(S).

Figure 4: An example of a type-1 set (left) and a type-2 set (right) and the corresponding filled-in sets.

Figure 5: An example of a type-3 set (left) and a type-4 set (right) and the corresponding filled-in sets.

Figure 6: An example of a type-5 set and the corresponding filled-in set (left) and an example of a type-6 set.
