Recursive Decomposition for Nonconvex Optimization


Authors: Abram L. Friesen, Pedro Domingos

Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195, USA
{afriesen,pedrod}@cs.washington.edu

Abstract

Continuous optimization is an important problem in many areas of AI, including vision, robotics, probabilistic inference, and machine learning. Unfortunately, most real-world optimization problems are nonconvex, causing standard convex techniques to find only local optima, even with extensions like random restarts and simulated annealing. We observe that, in many cases, the local modes of the objective function have combinatorial structure, and thus ideas from combinatorial optimization can be brought to bear. Based on this, we propose a problem-decomposition approach to nonconvex optimization. Similarly to DPLL-style SAT solvers and recursive conditioning in probabilistic inference, our algorithm, RDIS, recursively sets variables so as to simplify and decompose the objective function into approximately independent sub-functions, until the remaining functions are simple enough to be optimized by standard techniques like gradient descent. The variables to set are chosen by graph partitioning, ensuring decomposition whenever possible. We show analytically that RDIS can solve a broad class of nonconvex optimization problems exponentially faster than gradient descent with random restarts. Experimentally, RDIS outperforms standard techniques on problems like structure from motion and protein folding.

1 Introduction

AI systems that interact with the real world often have to solve continuous optimization problems. For convex problems, which have no local optima, many sophisticated algorithms exist. However, most continuous optimization problems in AI and related fields are nonconvex, and often have an exponential number of local optima.
For these problems, the standard solution is to apply convex optimizers with multi-start and other randomization techniques [Schoen, 1991], but in problems with an exponential number of optima these typically fail to find the global optimum in a reasonable amount of time. Branch and bound methods can also be used, but scale poorly due to the curse of dimensionality [Neumaier et al., 2005].

In this paper we propose that such problems can instead be approached using problem decomposition techniques, which have a long and successful history in AI for solving discrete problems (e.g., [Davis et al., 1962; Darwiche, 2001; Bayardo Jr. and Pehoushek, 2000; Sang et al., 2004; Sang et al., 2005; Bacchus et al., 2009]). By repeatedly decomposing a problem into independently solvable subproblems, these algorithms can often solve in polynomial time problems that would otherwise take exponential time. The main difficulty in nonconvex optimization is the combinatorial structure of the modes, which convex optimization and randomization are ill-equipped to deal with, but which problem decomposition techniques are well suited to. We thus propose a novel nonconvex optimization algorithm that uses recursive decomposition to handle the hard combinatorial core of the problem, leaving a set of simpler subproblems that can be solved using standard continuous optimizers.

The main challenges in applying problem decomposition to continuous problems are extending it to handle continuous values and defining an appropriate notion of local structure. We do the former by embedding continuous optimizers within the problem decomposition search, in a manner reminiscent of satisfiability modulo theories solvers [De Moura and Bjørner, 2011], but for continuous optimization rather than decision problems.
We do the latter by observing that many continuous objective functions are approximately locally decomposable, in the sense that setting a subset of the variables causes the rest to break up into subsets that can be optimized nearly independently. This is particularly true when the objective function is a sum of terms over subsets of the variables, as is typically the case. A number of continuous optimization techniques employ a static, global decomposition (e.g., block coordinate descent [Nocedal and Wright, 2006] and partially separable methods [Griewank and Toint, 1981]), but many problems only decompose locally and dynamically, which our algorithm accomplishes.

For example, consider protein folding [Anfinsen, 1973; Baker, 2000], the process by which a protein, consisting of a chain of amino acids, assumes its functional shape. The computational problem is to predict this final conformation by minimizing a highly nonconvex energy function consisting mainly of a sum of pairwise distance-based terms representing chemical bonds, electrostatic forces, etc. Physically, in any conformation, an atom can only be near a small number of other atoms and must be far from the rest; thus, many terms are negligible in any specific conformation, but each term is non-negligible in some conformation. This suggests that sections of the protein could be optimized independently if the terms connecting them were negligible but that, at a global level, this is never true. However, if the positions of a few key atoms are set appropriately, then certain amino acids will never interact, making it possible to decompose the problem into multiple independent subproblems and solve each separately. A local recursive decomposition algorithm for continuous problems can do exactly this.
We first define local structure and then present our algorithm, RDIS, which (asymptotically) finds the global optimum of a nonconvex function by (R)ecursively (D)ecomposing the function into locally (I)ndependent (S)ubspaces. In our analysis, we show that RDIS achieves an exponential speedup versus traditional techniques for nonconvex optimization such as gradient descent with restarts and grid search (although the complexity remains exponential, in general). This result is supported empirically, as RDIS significantly outperforms standard nonconvex optimization algorithms on three challenging domains: structure from motion, highly multimodal test functions, and protein folding.

2 Recursive Decomposition for Continuous Optimization

This section presents our nonconvex optimization algorithm, RDIS. We first present our notation and then define local structure and a method for realizing it. We then describe RDIS and provide pseudocode.

In unconstrained optimization, the goal is to minimize an objective function f(x) over the variables x ∈ R^n. We focus on functions f : R^n → R that are continuously differentiable and have a nonempty optimal set x* with optimal value f* = f(x*) > −∞. Let I = {1, ..., n} be the indices of x, let C ⊆ I, let x_C ∈ R^|C| be the restriction of x to the indices in C, and let ρ_C ∈ domain(x_C) be a partial assignment in which only the variables corresponding to the indices in C are assigned values. We define x|ρ_C ∈ R^(n−|C|) to be the subspace where the variables with indices in C are set to the values in ρ_C (i.e., for some ρ_C and for all i ∈ C, we have (x|ρ_C)_i = ρ_{C,i}). Given U = I \ C and partial assignment ρ_C, then, with a slight abuse of notation, we define the restriction of the function to the domain x|ρ_C as f|ρ_C(x_U). In the following, we directly partition x instead of discussing the partition of I that induces it.
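To make the restriction notation concrete, the following small sketch (our own illustration, not from the paper) builds f|ρ_C from f, the number of variables n, and a partial assignment ρ_C given as a dict from indices to values:

```python
# Illustrative sketch: restricting a function to the subspace x|rho_C by
# fixing the variables indexed by C to the values rho_C.
def restrict(f, n, rho_C):
    """Return f|rho_C, a function of the remaining (unassigned) variables.

    rho_C maps a subset C of indices {0, ..., n-1} to fixed values."""
    U = [i for i in range(n) if i not in rho_C]  # U = I \ C

    def f_restricted(x_U):
        x = [0.0] * n
        for i, v in rho_C.items():      # assigned variables take rho_C
            x[i] = v
        for i, v in zip(U, x_U):        # free variables take x_U
            x[i] = v
        return f(x)

    return f_restricted

# Example: f(x) = x0^2 + x1*x2, with x1 fixed to 3.
f = lambda x: x[0]**2 + x[1]*x[2]
g = restrict(f, 3, {1: 3.0})
print(g([2.0, 5.0]))  # f(2, 3, 5) = 4 + 15 = 19
```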
2.1 Local structure

A function is fully decomposable (separable) if it can be expressed as f(x) = Σ_{i=1}^n g_i(x_i). Such functions are easy to optimize, since they decompose with respect to minimization; i.e., min_x f(x) = Σ_{i=1}^n min_{x_i} g_i(x_i). Conversely, optimizing a decomposable nonconvex function without first decomposing it requires exponentially more exploration to find the global optimum. For example, let M_f be the set of modes of f and let M_i be the modes of each g_i. Knowing that f is decomposable allows us to optimize each g_i independently, requiring only Σ_{i=1}^n |M_i| mode explorations. However, if we instead optimized f directly, we would have to explore Π_{i=1}^n |M_i| modes, which is exponential in n. Unfortunately, fully decomposable functions like f are rare, as variables generally appear in multiple terms with many different variables, and thus the minimization does not trivially distribute. However, decomposition can still be achieved if the function exhibits global or local structure, which we define here.

Definition 1. (a) f(x) is globally decomposable if there exists a partition {x_C, x_U1, x_U2} of x such that, for every partial assignment ρ_C, f|ρ_C(x_U1, x_U2) = f_1|ρ_C(x_U1) + f_2|ρ_C(x_U2). (b) f(x) is locally decomposable in the subspace x|ρ_C if there exists a partition {x_C, x_U1, x_U2} of x and a partial assignment ρ_C such that f|ρ_C(x_U1, x_U2) = f_1|ρ_C(x_U1) + f_2|ρ_C(x_U2). (c) f(x) is approximately locally decomposable in a neighbourhood of the subspace x|ρ_C if there exist a partition {x_C, x_U1, x_U2} of x, partial assignments ρ_C, σ_C, and δ, ε ≥ 0 such that if ||σ_C − ρ_C|| ≤ δ then |f|σ_C(x_U1, x_U2) − [f_1|σ_C(x_U1) + f_2|σ_C(x_U2)]| ≤ ε.
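As a quick numerical check of the distributive property above (our own illustration, with two arbitrary toy terms g1 and g2), minimizing each term independently over a grid gives the same value as minimizing the joint sum, at a fraction of the cost:

```python
# Illustrative check that minimization distributes over a separable sum:
# min_{x,y} [g1(x) + g2(y)] = min_x g1(x) + min_y g2(y).
import math

def g1(x): return (x - 1.0) ** 2
def g2(y): return math.cos(3.0 * y) + 0.2 * y * y    # multimodal 1-D term

grid = [i * 0.05 - 5.0 for i in range(201)]          # coarse grid on [-5, 5]
joint = min(g1(x) + g2(y) for x in grid for y in grid)       # 201^2 evals
indep = min(g1(x) for x in grid) + min(g2(y) for y in grid)  # 2 * 201 evals
print(abs(joint - indep) < 1e-9)  # the two minima agree
```

The joint search costs 201^2 evaluations while the decomposed search costs 2 x 201, mirroring the Σ|M_i| versus Π|M_i| gap described above.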
Global structure (Definition 1a), while the easiest to exploit, is also the least prevalent. Local structure, which may initially appear limited, subsumes global structure while also allowing different decompositions throughout the space, making it strictly more general. Similarly, approximate local structure subsumes local structure. In protein folding, for example, two amino acids may be pushed either close together or far apart for different configurations of the other amino acids. In the latter case, they can be optimized independently because the terms connecting them are negligible. Thus, for different partial configurations of the protein, different approximate decompositions are possible. The independent subspaces that result from local decomposition can themselves exhibit local structure, allowing them to be decomposed in turn. If an algorithm exploits local structure effectively, it never has to perform the full combinatorial optimization. Local structure does not need to exist everywhere in the space, just in the regions being explored. For convenience, we refer only to local structure below, unless the distinction between global and (approximate) local decomposition is relevant.

One method for achieving local decomposition is via (local) simplification. We say that a term f_i(x_C, x_U) is (approximately locally) simplifiable in the subspace x|ρ_C defined by partial assignment ρ_C if, for a given ε ≥ 0, ub(f_i|ρ_C(x_U)) − lb(f_i|ρ_C(x_U)) ≤ 2ε, where ub(h(x)) and lb(h(x)) denote upper and lower bounds on h(x), respectively. Similarly, f(x) is (approximately locally) simplified in the subspace x|ρ_C defined by partial assignment ρ_C if, for a given ε ≥ 0, all simplifiable terms f_i|ρ_C(x_U) are replaced by the constant k_i = (1/2)[ub(f_i|ρ_C(x_U)) + lb(f_i|ρ_C(x_U))].
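The simplification rule can be sketched as follows. This is a hedged illustration of our own: the paper computes the bounds with interval arithmetic, whereas here each term's (lower, upper) bounds are given directly.

```python
# Sketch of the simplification rule: a term whose bounds are within 2*eps of
# each other is dropped and replaced by its midpoint constant k_i.
def simplify_terms(term_bounds, eps):
    """term_bounds: list of (lower, upper) interval bounds, one per term.
    Returns (kept_indices, constant), where constant sums the midpoints
    of all simplified terms."""
    kept, constant = [], 0.0
    for i, (lo, hi) in enumerate(term_bounds):
        if hi - lo <= 2 * eps:           # term is simplifiable
            constant += 0.5 * (lo + hi)  # replace with midpoint k_i
        else:
            kept.append(i)               # term survives into the recursion
    return kept, constant

kept, c = simplify_terms([(0.0, 0.1), (1.0, 5.0), (2.0, 2.05)], eps=0.1)
print(kept, c)  # [1] 2.075 -- terms 0 and 2 collapse to constants
```

Only the kept terms (and their variables) are passed down to the recursive calls, which is what makes evaluation and gradient computation cheaper at larger ε.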
For a function that is a sum of terms, local decomposition occurs when some of these terms simplify in such a way that the minimization can distribute over independent groups of terms and variables (like component decomposition in Relsat [Bayardo Jr. and Pehoushek, 2000], or as in the protein folding example above).

Given that there are m terms in the function, the maximum possible error of the simplified function versus the true function is m·ε. However, this would require all terms to be simplified and their true values to be at one of their bounds, which is extremely unlikely; rather, errors in different terms often cancel, and the simplified function tends to remain accurate. Note that ε induces a tradeoff between acceptable error in the function evaluation and the computational cost of optimization, since a simplified function has fewer terms and thus evaluating it and computing its gradient are both cheaper. While the above definition is for sums of terms, the same mechanism applies to functions that are products of (non-negative) factors, although error grows multiplicatively there.

2.2 The RDIS Algorithm

RDIS is an optimization method that explicitly finds and exploits local decomposition. Pseudocode is shown in Algorithm 1, with subroutines explained in the text. At each level of recursion, RDIS chooses a subset of the variables x_C ⊆ x (inducing a partition {x_C, x_U} of x) and assigns them values ρ_C such that the simplified objective function f|ρ_C(x_U) decomposes into multiple (approximately) independent sub-functions f_i|ρ_C(x_Ui), where {x_U1, ..., x_Uk} is a partition of x_U and 1 ≤ k ≤ n. RDIS then recurses on each sub-function, globally optimizing it conditioned on the assignment x_C = ρ_C. When the recursion completes, RDIS uses the returned optimal values (conditioned on ρ_C) of x_U to choose new values for x_C and then simplifies, decomposes, and optimizes the function again.
This repeats until a heuristic stopping criterion is satisfied.

Algorithm 1 Recursive Decomposition into locally Independent Subspaces (RDIS).
Input: Function f, variables x, initial state x0, subspace optimizer S, and approximation error ε.
Output: (Approximate) global minimum f* at state x*.
1:  function RDIS(f, x, x0, S, ε)
2:    x_C ← CHOOSEVARS(x)  // variable selection
3:    x_U ← x \ x_C; f* ← ∞; x* ← x0
4:    repeat
5:      partition x* into {σ*_C, σ*_U}
6:      ρ_C ← S(f|σ*_U(x_C), σ*_C)  // value selection
7:      f̂|ρ_C(x_U) ← SIMPLIFY(f|ρ_C(x_U), ε)
8:      {f̂_i(x_Ui)} ← DECOMPOSE(f̂|ρ_C(x_U))
9:      for i = 1, ..., k do  // recurse on the components
10:       ⟨f*_i, ρ_Ui⟩ ← RDIS(f̂_i, x_Ui, σ*_Ui, S, ε)
11:     f*_ρ ← Σ_{i=1}^k f*_i; ρ_U ← ∪_{i=1}^k ρ_Ui
12:     if f*_ρ < f* then  // record new minimum
13:       f* ← f*_ρ; x* ← ρ_C ∪ ρ_U
14:    until stopping criterion is satisfied
15:    return ⟨f*, x*⟩

RDIS selects variables (line 2) heuristically, with the goal of choosing a set of variables that enables the largest amount of decomposition, as this provides the largest computational gains. Specifically, RDIS uses a hypergraph partitioning algorithm to determine a small cutset that will decompose the graph; this cutset becomes the selected variables, x_C. Values for x_C are determined (line 6) by calling a nonconvex subspace optimizer with the remaining variables (x_U) fixed to their current values.

Figure 1: Visualization of RDIS decomposing the objective function. Vertices (circles) represent variables and edges connect each pair of variables in a term. Left: RDIS selects x_C (shaded oval). Middle: the function during simplification; thick edges indicate simplifiable terms; assigned variables are constant and have been removed. Right: the function after decomposition.
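The recursion in Algorithm 1 can be sketched in runnable form. The toy Python version below is our own illustration, not the paper's implementation: it replaces hypergraph partitioning with a fixed choice of the first d variables, omits simplification and the repeat loop, and uses exhaustive grid search as the subspace optimizer S.

```python
# Toy RDIS sketch: choose x_C, grid-search its values, decompose the
# remaining terms into independent components, and recurse on each.
import itertools

def rdis_toy(terms, variables, assign, grid, d=1):
    """terms: list of (var_indices, f), where f maps a {var: value} dict to
    a float. Returns (min_value, assignment for `variables`)."""
    if len(variables) <= d:                      # base case: optimize directly
        best_v, best_a = float("inf"), {}
        for vals in itertools.product(grid, repeat=len(variables)):
            a = {**assign, **dict(zip(variables, vals))}
            v = sum(f(a) for _, f in terms)
            if v < best_v:
                best_v, best_a = v, dict(zip(variables, vals))
        return best_v, best_a
    x_C, x_U = variables[:d], variables[d:]      # stand-in for CHOOSEVARS
    best_v, best_a = float("inf"), {}
    for vals in itertools.product(grid, repeat=d):   # stand-in value selection
        a = {**assign, **dict(zip(x_C, vals))}
        comps, const = [], 0.0                   # DECOMPOSE: group terms that
        for vs, f in terms:                      # share free variables
            fv = set(vs) & set(x_U)
            if not fv:
                const += f(a)                    # fully assigned term
                continue
            hit = [c for c in comps if c[0] & fv]
            new_terms = [(vs, f)]
            for c in hit:                        # merge overlapping components
                fv |= c[0]
                new_terms.extend(c[1])
            comps = [c for c in comps if c not in hit] + [(fv, new_terms)]
        total, sub_a = const, {}
        for cvars, cterms in comps:              # recurse on each component
            v, a_i = rdis_toy(cterms, sorted(cvars), a, grid, d)
            total += v
            sub_a.update(a_i)
        if total < best_v:                       # record new minimum
            best_v, best_a = total, {**dict(zip(x_C, vals)), **sub_a}
    return best_v, best_a

# Chain + independent pair: assigning x0 decomposes {1,2} from {3,4}.
terms = [((0, 1), lambda a: (a[0] - a[1]) ** 2),
         ((1, 2), lambda a: (a[1] + a[2]) ** 2),
         ((3, 4), lambda a: (a[3] - 1) ** 2 + a[4] ** 2)]
v, a = rdis_toy(terms, [0, 1, 2, 3, 4], {}, grid=[-1, 0, 1], d=1)
print(v, a)  # minimum value 0
```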
The subspace optimizer S is specified by the user and is customizable to the problem being solved. In our experiments we used multi-start versions of conjugate gradient descent and Levenberg-Marquardt [Nocedal and Wright, 2006]. Restarts occur within line 6: if S converges without making progress, it restarts to a new point in x_C and runs until it reaches a local minimum.

To simplify the objective function (line 7), RDIS determines which terms are simplifiable (i.e., have sufficiently small bounds) and then simplifies (approximates) these by replacing them with a constant. These terms are not passed to the recursive calls. After variables have been assigned and the function simplified, RDIS locally decomposes (line 8) the simplified function into independent sub-functions (components) that have no overlapping terms or variables and thus can be optimized independently, which is done by recursively calling RDIS on each. See Figure 1 for a visualization of this process. The recursion halts when CHOOSEVARS selects all of x (i.e., x_C = x and x_U = ∅), which occurs when x is small enough that the subspace optimizer can optimize f(x) directly. At this point, RDIS repeatedly calls the subspace optimizer until the stopping criterion is met, which (ideally) finds the global optimum of f|σ*_U(x_C) = f(x), since S is a nonconvex optimizer. The stopping criterion is user-specified and depends on the subspace optimizer. If a multi-start descent method is used, termination occurs after a specified number of restarts, corresponding to a certain probability that the global optimum has been found. If the subspace optimizer is grid search, then the loop terminates after all values of x_C have been assigned. More subroutine details are provided in Section 4.

3 Analysis

We now present analytical results demonstrating the benefits of RDIS versus standard algorithms for nonconvex optimization.
Formally, we show that RDIS explores the state space in exponentially less time than the same subspace optimizer for a class of functions that are locally decomposable, and that it will (asymptotically) converge to the global optimum. Due to space limitations, proofs are presented in Appendix A. Let the number of variables be n and the number of values assigned by RDIS to x_C be ξ(d), where d is the size of x_C. The form of ξ(d) depends on the subspace optimizer, but it can be roughly interpreted as the number of modes of the sub-function f|σ*_U(x_C) to explore, times a constant factor.

Proposition 1. If, at each level, RDIS chooses x_C ⊆ x of size |x_C| = d such that, for each selected value ρ_C, the simplified function f̂|ρ_C(x_U) locally decomposes into k > 1 independent sub-functions {f̂_i(x_Ui)} with equal-sized domains x_Ui, then the time complexity of RDIS is O((n/d) ξ(d)^(log_k(n/d))).

Note that since RDIS uses hypergraph partitioning to choose variables, it will always decompose the remaining variables x_U. This is also supported by our experimental results; if there were no decomposition, RDIS would not perform any better than the baselines.

From Proposition 1, we can compute the time complexity of RDIS for different subspace optimizers. Let the subspace optimizer be grid search (GS) over a bounded domain of width w with spacing δ in each dimension. Then the complexity of grid search is simply O((w/δ)^n) = O(s^n).

Proposition 2. If the subspace optimizer is grid search, then ξ(d) = (w/δ)^d = s^d, and the complexity of RDIS_GS is O((n/d) s^(d·log_k(n/d))).

Rewriting the complexity of grid search as O(s^n) = O(s^(d(n/d))), we see that it is exponentially worse than the complexity of RDIS_GS when decomposition occurs. Now consider a descent method with random restarts (DR) as the subspace optimizer.
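As a rough numeric illustration of Proposition 2 (our own, with assumed values for n, d, k, and s), the exponent drops from n = d·(n/d) to d·log_k(n/d):

```python
# Comparing grid-search cost s**n against RDIS_GS cost (n/d)*s**(d*log_k(n/d))
# for assumed values n=64, d=8, k=2, s=10.
import math

n, d, k, s = 64, 8, 2, 10
gs = s ** n                                          # plain grid search: 10^64
rdis_gs = (n // d) * s ** (d * math.log(n / d, k))   # 8 * 10^(8*log2(8)) = 8*10^24
print(gs > rdis_gs)  # RDIS_GS is exponentially cheaper
```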
Let the volume of the basin of attraction of the global minimum (the global basin) be l^n and the volume of the space be L^n. Then the probability of randomly restarting in the global basin is (l/L)^n = p^n. Since the restart behavior of DR is a Bernoulli process, the expected number of restarts to reach the global basin is r = p^(−n), from the shifted geometric distribution. If the number of iterations needed to reach the stationary point of the current basin is τ, then the expected complexity of DR is O(τ p^(−n)). If DR is used within RDIS, then we obtain the following result.

Proposition 3. If the subspace optimizer is DR, then the expected value of ξ(d) is τ p^(−d), and the expected complexity of RDIS_DR is O((n/d)(τ p^(−d))^(log_k(n/d))).

Rewriting the expected complexity of DR as O(τ (p^(−d))^(n/d)) shows that RDIS_DR is exponentially more efficient than DR.

Regarding convergence, RDIS with ε = 0 converges to the global minimum given certain conditions on the subspace optimizer. For grid search, RDIS_GS returns the global minimum if the grid is finite and has sufficiently fine spacing. For gradient descent with restarts, RDIS_DR will converge to stationary points of f(x) as long as steps by the subspace optimizer satisfy two technical conditions. The first is an Armijo rule guaranteeing a sufficient decrease in f, and the second guarantees a sufficient decrease in the norm of the gradient (see (C1) and (C2) in Appendix A.2). These conditions are necessary to show that RDIS_DR behaves like an inexact Gauss-Seidel method [Bonettini, 2011], and thus each limit point of the generated sequence is a stationary point of f(x). Given this, we can state the probability with which RDIS_DR will converge to the global minimum.

Proposition 4.
If the non-restart steps of RDIS satisfy (C1) and (C2), ε = 0, the number of variables is n, the volume of the global basin is v = l^n, and the volume of the entire space is V = L^n, then RDIS_DR returns the global minimum after t restarts with probability 1 − (1 − v/V)^t.

For ε > 0, we do not yet have a proof of convergence, even in the convex case, since preliminary analysis indicates that there are rare corner cases in which the alternating aspect of RDIS, combined with the simplification error, can potentially result in a non-converging sequence of values; however, we have not experienced this in practice. Furthermore, our experiments clearly show ε > 0 to be extremely beneficial, especially for large, highly connected problems.

Beyond its discrete counterparts, RDIS is related to many well-known continuous optimization algorithms. If all variables are chosen at the top level of recursion, then RDIS simply reduces to executing the subspace optimizer. If one level of recursion occurs, then RDIS behaves similarly to alternating minimization algorithms (which also have global convergence results [Grippo and Sciandrone, 1999]). For multiple levels of recursion, RDIS has similarities to block coordinate (gradient) descent algorithms (see Tseng and Yun [2009] and references therein). However, what sets RDIS apart is that decomposition in RDIS is determined locally, dynamically, and recursively. Our analysis and experiments show that exploiting this can lead to substantial performance improvements.

4 RDIS Subroutines

In this section, we present the specific choices we have made for the subroutines in RDIS, but we note that others are possible and we intend to investigate them in future work.

Variable Selection. Many possible methods exist for choosing variables. For example, heuristics from satisfiability may be applicable (e.g., VSIDS [Moskewicz et al., 2001]).
However, RDIS uses hypergraph partitioning in order to ensure decomposition whenever possible. Hypergraph partitioning splits a graph into k components of approximately equal size while minimizing the number of hyperedges cut. To maximize decomposition, RDIS should choose the smallest block of variables that, when assigned, decomposes the remaining variables. This corresponds exactly to the set of edges cut by hypergraph partitioning on a hypergraph that has a vertex for each term and a hyperedge for each variable, connecting the terms that variable appears in (note that this is the inverse of Figure 1). RDIS maintains such a hypergraph and uses the PaToH hypergraph partitioning library [Çatalyürek and Aykanat, 2011] to quickly find good, approximate partitions. A similar idea was used by Darwiche and Hopkins [2001] to construct d-trees for recursive conditioning; however, they only apply hypergraph partitioning once at the beginning, whereas RDIS performs it at each level of the recursion.

While variable selection could be placed inside the loop, it would repeatedly choose the same variables because hypergraph partitioning is based on the graph structure. However, RDIS still exploits local decomposition because the variables and terms at each level of recursion vary based on local structure. In addition, edge and vertex weights could be set based on current bounds or other local information.

Value Selection. RDIS can use any nonconvex optimization subroutine to choose values, allowing the user to pick an optimizer appropriate to their domain. In our experiments, we use multi-start versions of both conjugate gradient descent and Levenberg-Marquardt, but other possibilities include Monte Carlo search, quasi-Newton methods, and simulated annealing. We experimented with both grid search and branch and bound, but found them practical only for easy problems.
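PaToH itself is an external library, but the effect a cutset is chosen for can be checked directly. As a toy stand-in (our own, not the paper's method), the following union-find sketch counts the components of the variable co-occurrence graph that remain after removing a candidate x_C:

```python
# Counting the independent components induced by a candidate cutset: variables
# co-occurring in a term are connected; removing x_C should split the graph.
def components(terms, removed):
    """Number of connected components over the variables in `terms`,
    ignoring variables in `removed`."""
    parent = {}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    for t in terms:
        live = [v for v in t if v not in removed]
        for v in live:
            parent.setdefault(v, v)
        for a, b in zip(live, live[1:]):    # union all variables in the term
            parent[find(a)] = find(b)
    return len({find(v) for v in parent})

# Chain-structured function: terms over (0,1), (1,2), (2,3), (3,4).
terms = [{0, 1}, {1, 2}, {2, 3}, {3, 4}]
print(components(terms, removed=set()))  # 1: no decomposition
print(components(terms, removed={2}))    # 2: removing x2 splits the chain
```

Hypergraph partitioning finds such small cutsets directly; this sketch only verifies what a given cutset achieves.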
In our experiments, we have found it helpful to stop the subspace optimizer early, because values are likely to change again in the next iteration, making quick, approximate improvement more effective than slow, exact improvement.

Simplification and Decomposition. Simplification is performed by checking whether each term (or factor) is simplifiable and, if it is, setting it to a constant and removing it from the function. RDIS knows the analytical form of the function and uses interval arithmetic [Hansen and Walster, 2003] as a general method for computing and maintaining bounds on terms to determine simplifiability. RDIS maintains the connected components of a dynamic graph [Holm et al., 2001] over the variables and terms (equivalent in structure to a factor or co-occurrence graph). Components in RDIS correspond exactly to the connected components in this graph. Assigned variables and simplified terms are removed from this graph, potentially inducing local decomposition.

Caching and Branch & Bound. RDIS's similarity to model counting algorithms suggests the use of component caching and branch and bound (BnB). We experimented with these and found them effective when used with grid search; however, they were not beneficial when used with descent-based subspace optimizers, which dominate grid-search-based RDIS on non-trivial problems. For caching, this is because components are almost never seen again, since variable values are essentially never re-encountered, even approximately. For BnB, interval arithmetic bounds tended to be overly loose and no bounding occurred. Our experience suggests that this is because the descent-based optimizer effectively focuses exploration on the minima of the space, which are typically close in value to the current optimum. However, we believe that future work on caching and better bounds would be beneficial.
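A minimal sketch of the interval arithmetic used to bound terms (our own illustration with only addition and multiplication; the paper uses a full interval library):

```python
# Bounding a term over a box via interval arithmetic: each operation maps
# input intervals to an interval guaranteed to contain all possible results.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # All four endpoint products bound the product of two intervals.
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval(min(ps), max(ps))

x = Interval(-1.0, 1.0)             # current box for x
y = Interval(0.9, 1.1)              # current box for y
term = x * y + Interval(2.0, 2.0)   # bounds on the term x*y + 2
print(term.lo, term.hi)  # approximately [0.9, 3.1]
```

With these bounds, the term would be declared simplifiable whenever term.hi − term.lo ≤ 2ε and replaced by the midpoint constant.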
5 Experimental Results

We evaluated RDIS on three difficult nonconvex optimization problems with hundreds to thousands of variables: structure from motion, a high-dimensional sinusoid, and protein folding. Structure from motion is an important problem in vision, while protein folding is a core problem in computational biology. We ran RDIS with a fixed number of restarts at each level, thus not guaranteeing that we found the global minimum. For structure from motion, we compared RDIS to that domain's standard technique of Levenberg-Marquardt (LM) [Nocedal and Wright, 2006] using the levmar library [Lourakis, 2004], as well as to a block-coordinate descent version (BCD-LM). In protein folding, gradient-based methods are commonly used to determine the lowest energy configuration of a protein, so we compared RDIS to conjugate gradient descent (CGD) and a block-coordinate descent version (BCD-CGD). CGD and BCD-CGD were also used for the high-dimensional sinusoid. Blocks were formed by grouping contextually relevant variables together (e.g., in protein folding, we never split up an amino acid). We also compared to ablated versions of RDIS: RDIS-RND uses a random variable selection heuristic, and RDIS-NRR does not use any internal random restarts (i.e., it functions as a convex optimizer) but does have top-level restarts. In each domain, the optimizer we compare to was also used as the subspace optimizer in RDIS. All experiments were run on the same cluster. Each computer in the cluster was identical, with two 2.33GHz quad-core Intel Xeon E5345 processors and 16GB of RAM. Each algorithm was limited to a single thread. Further details can be found in Appendix C.

Figure 2: Minimum value found in one hour for increasing sizes of the bundle adjustment problem (y-axis is log scale). Compared: LM, BCD-LM, RDIS-RND, RDIS.

Structure from Motion.
Structure from motion is the problem of reconstructing the geometry of a 3-D scene from a set of 2-D images of that scene. It consists of first determining an initial estimate of the parameters and then performing nonlinear optimization to minimize the squared error between a set of 2-D image points and a projection of the 3-D points onto camera models [Triggs et al., 2000]. The latter, known as bundle adjustment, is the task we focus on here. Global structure exists, since cameras interact explicitly with points, creating a bipartite graph structure that RDIS can decompose, but (nontrivial) local structure does not exist because the bounds on each term are too wide and tend to include ∞. The dataset used is the 49-camera, 7776-point data file from the Ladybug dataset [Agarwal et al., 2010].

Figure 2 shows performance on bundle adjustment as a function of the size of the problem, with a log-scale y-axis. Each point is the minimum error found after running each algorithm for 5 hours. Each algorithm is given the same set of restart states, but algorithms that converge faster may use more of these. Since no local structure is exploited, Figure 2 effectively demonstrates the benefits of using recursive decomposition with intelligent variable selection for nonconvex optimization. Decomposing the optimization across independent subspaces allows the subspace optimizer to move faster, further, and more consistently, allowing RDIS to dominate the other algorithms. Missing points are due to algorithms not returning any results in the allotted time.

High-dimensional Sinusoid.
The second domain is a highly multimodal test function defined as a multidimensional sinusoid placed in the basin of a quadratic, with a small slope to make the global minimum unique. The arity of this function (i.e., the number of variables contained in each term) is controlled parametrically. Functions with larger arities contain more terms and dependencies, and thus are more challenging. A small amount of local structure exists in this problem.

Figure 3: A comparison of the best minima found as a function of time for three different arities (4, 8, and 12) of the high-dimensional sinusoid. Compared: CGD, BCD-CGD, RDIS-NRR, RDIS.

Figure 4: Minimum value (energy) found on 21 different proteins, ordered by increasing number of terms. Compared: CGD, BCD-CGD, RDIS(ε=1), RDIS(ε=2).

In Figure 3, we show the current best value found versus time. Each datapoint is from a single run of an algorithm using the same set of top-level restarts, although, again, algorithms that converge faster use more of these. RDIS outperforms all other algorithms, including RDIS-NRR. This is due to the nested restart behavior afforded by recursive decomposition, which allows RDIS to effectively explore each subspace and escape local minima. The poor initial performance of RDIS for arities 8 and 12 is due to it being trapped in a local minimum for an early variable assignment while performing optimizations lower in the recursion. However, once the low-level recursions finish, it escapes and finds the best minimum without ever performing a top-level restart (Figure 7 in Appendix C.2 contains the full trajectories).

Protein Folding. The final domain is sidechain placement for protein folding with continuous angles between atoms.
Amino acids are composed of a backbone segment and a sidechain. Sidechain placement requires setting the sidechain angles with the backbone atoms fixed. It is equivalent to finding the MAP assignment of a continuous pairwise Markov random field (cf. Yanover et al. [2006]). Significant local structure is present in this domain. Test proteins were selected from the Protein Data Bank [Berman et al., 2000] with sequence length 300-600 such that the sequences of any two did not overlap by more than 30%.

Figure 4 shows the results of combining all aspects of RDIS, including recursive decomposition, intelligent variable selection, internal restarts, and local structure, on a difficult problem with significant local structure. Each algorithm is run for 48 hours on each of 21 proteins of varying sizes. RDIS is run with both ε = 1.0 and ε = 2.0 and both results are shown on the figure. RDIS outperforms CGD and BCD-CGD on all proteins, often by a very large amount.

Figure 5: RDIS-NRR's minimum energy found and total time taken versus ε, on protein 3EEQ (ID 9). The x-axis is log scale.

Figure 5 demonstrates the effect of ε on RDIS for protein folding. It shows the performance of RDIS-NRR as a function of ε, where performance is measured both by minimum energy found and time taken. RDIS-NRR is used in order to remove the randomness associated with the internal restarts of RDIS, resulting in a more accurate comparison across multiple runs. Each point on the energy curve is the minimum energy found over the same 20 restarts. Each point on the time curve is the total time taken for all 20 restarts. As ε increases, time decreases because more local structure is being exploited. In addition, minimum energy actually decreases initially.
We attribute this to the smoothing caused by increased simplification, allowing RDIS to avoid minor local minima in the objective function.

6 Conclusion

This paper proposed a new approach to solving hard nonconvex optimization problems based on recursive decomposition. RDIS decomposes the function into approximately locally independent sub-functions and then optimizes these separately by recursing on them. This results in an exponential reduction in the time required to find the global optimum. In our experiments, we show that problem decomposition enables RDIS to systematically outperform comparable methods.

Directions for future research include applying RDIS to a wide variety of nonconvex optimization problems, further analyzing its theoretical properties, developing new variable and value selection methods, extending RDIS to handle hard constraints, incorporating discrete variables, and using similar ideas for high-dimensional integration.

Acknowledgments

This research was partly funded by ARO grant W911NF-08-1-0242, ONR grants N00014-13-1-0720 and N00014-12-1-0312, and AFRL contract FA8750-13-2-0019. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, ONR, AFRL, or the United States Government.

References

[Agarwal et al., 2010] Sameer Agarwal, Noah Snavely, Steven M. Seitz, and Richard Szeliski. Bundle adjustment in the large. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, volume 6312 of Lecture Notes in Computer Science, pages 29–42. Springer Berlin Heidelberg, 2010.

[Anfinsen, 1973] Christian B. Anfinsen. Principles that govern the folding of protein chains. Science, 181(4096):223–230, 1973.

[Bacchus et al., 2009] Fahiem Bacchus, Shannon Dalmao, and Toniann Pitassi.
Solving #SAT and Bayesian inference with backtracking search. Journal of Artificial Intelligence Research, 34:391–442, 2009.

[Baker, 2000] David Baker. A surprising simplicity to protein folding. Nature, 405:39–42, 2000.

[Bayardo Jr. and Pehoushek, 2000] Roberto J. Bayardo Jr. and Joseph Daniel Pehoushek. Counting models using connected components. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 157–162, 2000.

[Berman et al., 2000] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.

[Bonettini, 2011] Silvia Bonettini. Inexact block coordinate descent methods with application to non-negative matrix factorization. IMA Journal of Numerical Analysis, 31(4):1431–1452, 2011.

[Cassioli et al., 2013] A. Cassioli, D. Di Lorenzo, and M. Sciandrone. On the convergence of inexact block coordinate descent methods for constrained optimization. European Journal of Operational Research, 231(2):274–281, 2013.

[Çatalyürek and Aykanat, 2011] Ümit Çatalyürek and Cevdet Aykanat. PaToH (partitioning tool for hypergraphs). In David Padua, editor, Encyclopedia of Parallel Computing, pages 1479–1487. Springer US, 2011.

[Darwiche and Hopkins, 2001] Adnan Darwiche and Mark Hopkins. Using recursive decomposition to construct elimination orders, jointrees, and dtrees. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty, pages 180–191. Springer, 2001.

[Darwiche, 2001] Adnan Darwiche. Recursive conditioning. Artificial Intelligence, 126(1-2):5–41, 2001.

[Davis et al., 1962] Martin Davis, George Logemann, and Donald Loveland. A machine program for theorem-proving. Communications of the ACM, 5(7):394–397, 1962.

[De Moura and Bjørner, 2011] Leonardo De Moura and Nikolaj Bjørner.
Satisfiability modulo theories: introduction and applications. Communications of the ACM, 54(9):69–77, September 2011.

[Griewank and Toint, 1981] Andreas Griewank and Philippe L. Toint. On the unconstrained optimization of partially separable functions. Nonlinear Optimization, 1982:247–265, 1981.

[Grippo and Sciandrone, 1999] Luigi Grippo and Marco Sciandrone. Globally convergent block-coordinate techniques for unconstrained optimization. Optimization Methods and Software, 10(4):587–637, 1999.

[Hansen and Walster, 2003] Eldon Hansen and G. William Walster. Global Optimization Using Interval Analysis: Revised and Expanded, volume 264. CRC Press, 2003.

[Holm et al., 2001] Jacob Holm, Kristian De Lichtenberg, and Mikkel Thorup. Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. Journal of the ACM, 48(4):723–760, 2001.

[Leaver-Fay et al., 2011] Andrew Leaver-Fay, Michael Tyka, Steven M. Lewis, Oliver F. Lange, James Thompson, Ron Jacak, Kristian Kaufman, P. Douglas Renfrew, Colin A. Smith, Will Sheffler, et al. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods in Enzymology, 487:545–574, 2011.

[Lourakis, 2004] M. I. A. Lourakis. levmar: Levenberg-Marquardt nonlinear least squares algorithms in C/C++. http://www.ics.forth.gr/~lourakis/levmar/, 2004.

[Moskewicz et al., 2001] Matthew W. Moskewicz, Conor F. Madigan, Ying Zhao, Lintao Zhang, and Sharad Malik. Chaff: engineering an efficient SAT solver. In Proceedings of the 38th Annual Design Automation Conference, pages 530–535. ACM, 2001.

[Neumaier et al., 2005] Arnold Neumaier, Oleg Shcherbina, Waltraud Huyer, and Tamás Vinkó. A comparison of complete global optimization solvers. Mathematical Programming, 103(2):335–356, 2005.

[Nocedal and Wright, 2006] Jorge Nocedal and Stephen J. Wright.
Numerical Optimization. Springer, 2006.

[Sang et al., 2004] Tian Sang, Fahiem Bacchus, Paul Beame, Henry A. Kautz, and Toniann Pitassi. Combining component caching and clause learning for effective model counting. In Seventh International Conference on Theory and Applications of Satisfiability Testing, 2004.

[Sang et al., 2005] Tian Sang, Paul Beame, and Henry Kautz. Performing Bayesian inference by weighted model counting. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), volume 1, pages 475–482, 2005.

[Schoen, 1991] Fabio Schoen. Stochastic techniques for global optimization: a survey of recent advances. Journal of Global Optimization, 1(3):207–228, 1991.

[Trifunović, 2006] Aleksandar Trifunović. Parallel Algorithms for Hypergraph Partitioning. Ph.D. thesis, University of London, February 2006.

[Triggs et al., 2000] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment – a modern synthesis. In Vision Algorithms: Theory and Practice, pages 298–372. Springer, 2000.

[Tseng and Yun, 2009] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1-2):387–423, 2009.

[Yanover et al., 2006] Chen Yanover, Talya Meltzer, and Yair Weiss. Linear programming relaxations and belief propagation – an empirical study. The Journal of Machine Learning Research, 7:1887–1907, 2006.

Appendix A: Analysis Details

A.1 Complexity

RDIS begins by choosing a block of variables, x_C. Assuming that this choice is made heuristically using the PaToH library for hypergraph partitioning, which is a multi-level technique, the complexity of choosing variables is linear (see Trifunović 2006, p. 81). Within the loop, RDIS chooses values for x_C, simplifies and decomposes the function, and finally recurses.
Let the complexity of choosing values using the subspace optimizer be g(d), where |x_C| = d, and let one call to the subspace optimizer be cheap relative to n (e.g., computing the gradient of f with respect to x_C or taking a step on a grid). Simplification requires iterating through the set of terms and computing bounds, so it is linear in the number of terms, m. The connected components are maintained by a dynamic graph algorithm [Holm et al., 2001] which has an amortized complexity of O(log²(|V|)) per operation, where |V| is the number of vertices in the graph. Finally, let the number of iterations of the loop be a function of the dimension, ξ(d), since more dimensions generally require more restarts.

Proposition 5. If, at each level, RDIS chooses x_C ⊆ x of size |x_C| = d such that, for each selected value ρ_C, the simplified function f̂|ρ_C(x_U) locally decomposes into k > 1 independent sub-functions {f̂_i(x_{U_i})} with equal-sized domains x_{U_i}, then the time complexity of RDIS is O((n/d) ξ(d)^(log_k(n/d))).

Proof. Assuming that m is of the same order as n, the recurrence relation for RDIS is

    T(n) = O(n) + ξ(d) [ g(d) + O(m) + O(n) + O(d log²(n)) + k T((n − d)/k) ],

which can be simplified to

    T(n) = ξ(d) [ k T(n/k) + O(n) ] + O(n).

Noting that the recursion halts at T(d), the solution to the above recurrence relation is

    T(n) = c₁ (k ξ(d))^(log_k(n/d)) + c₂ n Σ_{r=0}^{log_k(n/d)−1} ξ(d)^r,

which is O((k ξ(d))^(log_k(n/d))) = O((n/d) ξ(d)^(log_k(n/d))).

A.2 Convergence

In the following, we refer to the basin of attraction of the global minimum as the global basin. Formally, we define a basin of attraction as follows.

Definition 2. The basin of attraction of a stationary point c is the set of points B ⊆ R^n for which the sequence generated by DR, initialized at x₀ ∈ B, converges to c.
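For intuition, here is a hypothetical 1-D example of Definition 2 (not from the paper): under plain gradient descent, standing in for DR, on f(x) = (x² − 1)², every start with x > 0 converges to the stationary point +1 and every start with x < 0 converges to −1, so the two open half-lines are the two basins of attraction.

```python
def grad_descent(x, step=0.01, iters=5000):
    """Plain gradient descent (a stand-in for DR) on f(x) = (x^2 - 1)^2,
    whose stationary points are -1, 0, and +1."""
    for _ in range(iters):
        x -= step * 4 * x * (x * x - 1)  # f'(x) = 4x(x^2 - 1)
    return x

# Starts in (0, inf) converge to +1; starts in (-inf, 0) converge to -1.
print(round(grad_descent(0.5), 3), round(grad_descent(-0.5), 3))
```

The global basin's volume relative to the whole search space is exactly the v/V ratio used in Proposition 6 below.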
Intuitively, at each level of recursion, RDIS with ε = 0 partitions x into {x_C, x_U}, sets values ρ_C for x_C using the subspace optimizer, globally optimizes f|ρ_C(x_U) by recursively calling RDIS, and repeats. When the non-restart steps of the subspace optimizer satisfy two practical conditions (below) of sufficient decrease in (1) the objective function (a standard Armijo condition) and (2) the gradient norm over two successive partial updates (i.e., conditions (3.1) and (3.3) of Bonettini [2011]), then this process is equivalent to the 2-block inexact Gauss-Seidel method (2B-IGS) described in Bonettini [2011] (cf. Grippo and Sciandrone [1999]; Cassioli et al. [2013]), and each limit point of the sequence generated by RDIS is a stationary point of f(x), of which the global minimum is one, and reachable through restarts.

Formally, let superscript r indicate the recursion level, with 0 ≤ r ≤ d, with r = 0 the top, and recall that x_U^(r) = {x_C^(r+1), x_U^(r+1)} if there is no decomposition. The following proofs focus on the no-decomposition case for clarity; however, the extension to the decomposable case is trivial since each sub-function of the decomposition is independent. We denote applying the subspace optimizer to f(x) until the stopping criterion is reached as S*(f, x) and a single call to the subspace optimizer as S¹(f, x), and note that S*(f, x), by definition, returns the global minimum x* and that repeatedly calling S¹(f, x) is equivalent to calling S*(f, x).

For convenience, we restate conditions (3.1) and (3.3) from Bonettini [2011] (without constraints) on the sequence {x^(k)} generated by an iterative algorithm on blocks x_i for i = 1, ..., m, respectively as

    f(x₁^(k+1), ..., x_i^(k+1), ..., x_m^(k)) ≤ f(x₁^(k+1), ..., x_i^(k) + λ_i^(k) d_i^(k), ..., x_m^(k)),    (C1)

where λ_i^(k) is computed using Armijo line search and d_i^(k) is a feasible descent direction, and

    ‖∇_i f(x₁^(k+1), ..., x_i^(k+1), ..., x_m^(k))‖ ≤ η ‖∇_{i−1} f(x₁^(k+1), ..., x_{i−1}^(k+1), ..., x_m^(k))‖,    i = 2, ..., m,
    ‖∇_1 f(x₁^(k+2), x₂^(k+1), ..., x_m^(k+1))‖ ≤ η^(1−m) ‖∇_m f(x₁^(k+1), ..., x_m^(k+1))‖,    (C2)

where η ∈ [0, 1) is a forcing parameter. See Bonettini [2011] for further details. The inexact Gauss-Seidel method is defined as every method that generates a sequence such that these conditions hold, and is guaranteed to converge to a critical point of f(x) when m = 2. Let RDIS_DR refer to RDIS(f, x, x₀, S = DR, ε = 0).

Proposition 6. If the non-restart steps of RDIS satisfy (C1) and (C2), ε = 0, the number of variables is n, the volume of the global basin is v = l^n, and the volume of the entire space is V = L^n, then RDIS_DR returns the global minimum after t restarts with probability 1 − (1 − v/V)^t.

Proof. Step 1. Given a finite number of restarts, one of which starts in the global basin, then RDIS_DR, with no recursion, returns the global minimum and satisfies (C1) and (C2). This can be seen as follows. At r = 0, RDIS_DR chooses x_C^(0) = x and x_U^(0) = ∅ and repeatedly calls S¹(f^(0), x_C^(0)). This is equivalent to calling S*(f^(0), x_C^(0)) = S*(f, x), which returns the global minimum x*. Thus, RDIS_DR returns the global minimum. Returning the global minimum corresponds to a step in the exact Gauss-Seidel algorithm, which is a special case of the IGS algorithm and, by definition, satisfies (C1) and (C2).

Step 2.
Now, if the non-restart steps of S¹(f, x) satisfy (C1) and (C2), then RDIS_DR returns the global minimum. We show this by induction on the levels of recursion.

Base case. From Step 1, we have that RDIS_DR(f^(d), x^(d)) returns the global minimum and satisfies (C1) and (C2), since RDIS_DR does not recurse beyond this level.

Induction step. Assume that RDIS_DR(f^(r+1), x^(r+1)) returns the global minimum. We now show that RDIS_DR(f^(r), x^(r)) returns the global minimum. RDIS_DR(f^(r), x^(r)) first partitions x^(r) into the two blocks x_C^(r) and x_U^(r) and then iteratively takes the following two steps: ρ_C^(r) ← S¹(f|σ*_U^(r)(x_C^(r))) and ρ_U^(r) ← RDIS_DR(f|ρ_C^(r)(x_U)). The first simply calls the subspace optimizer on x_C^(r). The second is a recursive call equivalent to RDIS_DR(f^(r+1), x^(r+1)), which, from our inductive assumption, returns the global minimum ρ_U^(r) = x_U^(r)* of f|ρ_C^(r)(x_U) and satisfies (C1) and (C2). For S¹(f|σ*_U^(r)(x_C^(r))), RDIS_DR will never restart the subspace optimizer unless the sequence it is generating converges. Thus, for each restart, since there are only two blocks and both the non-restart steps of S¹(f|σ*_U^(r)(x_C^(r))) and the RDIS_DR(f|ρ_C^(r)(x_U)) steps satisfy (C1) and (C2), RDIS_DR is a 2B-IGS method and the generated sequence converges to the stationary point of the current basin. At each level, after converging, RDIS_DR will restart, iterate until convergence, and repeat for a finite number of restarts, one of which will start in the global basin and thus converge to the global minimum, which is then returned.

Step 3. Finally, since the probability of RDIS_DR starting in the global basin is v/V, the probability of it not starting in the global basin after t restarts is (1 − v/V)^t.
From above, we have that RDIS_DR will return the global minimum if it starts in the global basin; thus RDIS_DR will return the global minimum after t restarts with probability 1 − (1 − v/V)^t.

Appendix B: RDIS Subroutine Details

B.1 Variable Selection

In hypergraph partitioning, the goal is to split the graph into k components of approximately equal size while minimizing the number of hyperedges cut. Similarly, in order to maximize decomposition, RDIS should choose the smallest block of variables that, when assigned, decomposes the remaining variables. Accordingly, RDIS constructs a hypergraph H = (V, E) with a vertex for each term, {n_i ∈ V : f_i ∈ f}, and a hyperedge for each variable, {e_j ∈ E : x_j ∈ x}, where each hyperedge e_j connects to all vertices n_i for which the corresponding term f_i contains the variable x_j. When H is partitioned, the resulting cutset is the smallest set of variables that need to be removed in order to decompose the hypergraph. And since assigning a variable to a constant effectively removes it from the optimization (and the hypergraph), the cutset is exactly the set that RDIS chooses on line 2.

B.2 Execution Time

Variable selection typically occupies only a tiny fraction of the runtime of RDIS, with the vast majority of RDIS' execution time spent computing gradients for the subspace optimizer. A small but non-negligible amount of time is spent maintaining the component graph, but this is much more efficient than recomputing the connected components each time, and the exponential gains from decomposition are well worth the small upkeep costs.

Appendix C: Experimental Details

All experiments were run on the same compute cluster. Each computer in the cluster was identical, with two 2.33GHz quad-core Intel Xeon E5345 processors and 16GB of RAM. Each algorithm was limited to a single thread.
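The decomposition that the cutset of B.1 produces can be sketched directly: once the cut variables are fixed to constants, the remaining variables split into independent connected components. The toy terms and variable names below are invented for illustration; in RDIS the cutset comes from a hypergraph partitioner (PaToH) and the components are maintained incrementally rather than recomputed.

```python
# Hypothetical objective: each term is the set of variables it contains.
terms = [{"x1", "x2"}, {"x2", "x3"}, {"x3", "x4"}, {"x4", "x5"}, {"x5", "x6"}]

def components_after_assigning(terms, assigned):
    """Connected components of the variable-interaction graph once the
    variables in `assigned` are fixed to constants (i.e., removed)."""
    parent = {}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    free = {v for t in terms for v in t} - assigned
    for v in free:
        parent[v] = v
    for t in terms:
        rest = [v for v in t if v not in assigned]
        for a, b in zip(rest, rest[1:]):  # variables sharing a term interact
            union(a, b)
    comps = {}
    for v in free:
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

# Fixing the cutset {x3, x4} splits the rest into two independent
# sub-problems, {x1, x2} and {x5, x6}, which can be optimized separately.
print(components_after_assigning(terms, {"x3", "x4"}))
```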
C.1 Structure from Motion

In the structure from motion task (bundle adjustment [Triggs et al., 2000]), the goal is to minimize the error between a dataset of points in a 2-D image and a projection of fitted 3-D points representing a scene's geometry onto fitted camera models. The variables are the parameters of the cameras and the positions of the points. This problem is highly structured in a global sense: cameras only interact explicitly with points, creating a bipartite graph structure that RDIS is able to exploit. The dataset used is the 49-camera, 7776-point data file from the Ladybug dataset [Agarwal et al., 2010], where the number of points is scaled proportionally to the number of cameras used (i.e., if half the cameras were used, half of the points were included). There are 9 variables per camera and 3 variables per point.

C.2 Highly Multimodal Test Function

The test function is defined as follows. Given a height h, a branching factor k, and a maximum arity a, we define a complete k-ary tree of variables of the specified height, with x₀ as the root. For all paths p_j ∈ P in the tree of length l_j ≤ a, with l_j even, we define a term t_{p_j} = Π_{x_i ∈ p_j} sin(x_i). The test function is

    f_{h,k,a}(x₀, ..., x_n) = Σ_{i=1}^n (c₀ x_i + c₁ x_i²) + c₂ Σ_{p_j ∈ P} t_{p_j}.

The resulting function is a multidimensional sinusoid placed in the basin of a quadratic function parameterized by c₁, with a linear slope defined by c₀. The constant c₂ controls the amplitude of the sinusoids. For our tests, we used c₀ = 0.6, c₁ = 0.1, and c₂ = 12. A 2-D example of this function is shown in Figure 6. We used a tree height of h = 11, with branching factor k = 2, resulting in a function of 4095 variables. We evaluated each of the algorithms on functions with terms of arity a ∈ {4, 8, 12}, where a larger arity defines more complex dependencies between variables as well as more terms in the function.
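The construction above can be sketched in a few lines. Two points are assumptions on our part, since the paper leaves them implicit: we take the paths P to be the upward paths from each node toward the root, and we include every variable in the linear/quadratic part of the sum. A small height is used here for illustration.

```python
import math

def make_test_function(h=3, k=2, a=4, c0=0.6, c1=0.1, c2=12.0):
    """Sketch of the multimodal test function on a complete k-ary tree
    of height h (k > 1). Assumed: paths run upward from each node toward
    the root; path 'length' counts the variables in the term (the arity)."""
    n = (k ** (h + 1) - 1) // (k - 1)           # nodes in the complete tree
    parent = [(i - 1) // k for i in range(n)]   # parent index; root is 0
    paths = []
    for i in range(n):
        path, j = [], i
        while True:
            path.append(j)
            if 2 <= len(path) <= a and len(path) % 2 == 0:
                paths.append(tuple(path))       # even-length path, arity <= a
            if j == 0:
                break
            j = parent[j]

    def f(x):
        val = sum(c0 * xi + c1 * xi * xi for xi in x)       # quadratic basin + slope
        val += c2 * sum(math.prod(math.sin(x[i]) for i in p) for p in paths)
        return val

    return n, f

n, f = make_test_function()
print(n, f([0.0] * n))  # every sin(0) factor vanishes at the origin
```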
The functions for the three different arity levels had 16372, 24404, and 30036 terms, respectively.

Figure 7 shows the value of the current state for each algorithm over its entire execution. These are the trajectories for Figure 3 of the main paper.

Figure 6: A 2-D example of the highly multimodal test function.

Figure 7: Trajectories on the test function for the data in Figure 3 in the main paper. Sharp rises show restarts. Notably, RDIS-NRR restarts much more often than the other algorithms because decomposition allows it to move through the space much more efficiently. Without internal restarting it gets stuck at the same local minima as BCD-CGD and CGD. For arity 12, RDIS never performs a full restart and still finds the best minimum, despite using the same initial point as the other algorithms.

C.3 Protein Folding

Problem details. Protein folding [Anfinsen, 1973; Baker, 2000] is the process by which a protein, consisting of a long chain of amino acids, assumes its functional shape. The computational problem is to predict this final conformation given a known sequence of amino acids. This requires minimizing an energy function consisting mainly of a sum of pairwise distance-based terms representing chemical bonds, hydrophobic interactions, electrostatic forces, etc., where, in the simplest case, the variables are the relative angles between the atoms. The optimal state is typically quite compact, with the amino acids and their atoms bonded tightly to one another and the volume of the protein minimized.
Each amino acid is composed of a backbone segment and a sidechain, where the backbone segment of each amino acid connects to its neighbors in the chain, and the sidechains branch off the backbone segment and form bonds with distant neighbors. The sidechain placement task is to predict the conformation of the sidechains when the backbone atoms are fixed in place.

Energies between amino acids are defined by the Lennard-Jones potential function, as specified in the Rosetta protein folding library [Leaver-Fay et al., 2011]. The basic form of this function is E_LJ(r) = A/r¹² − B/r⁶, where r is the distance between two atoms and A and B are constants that vary for different types of atoms. The Lennard-Jones potential in Rosetta is modified slightly so that it behaves better when r is very large or very small. The full energy function is E(φ) = Σ_{j,k} E_{jk}(R_j(χ_j), R_k(χ_k)), where R_j is an amino acid (also called a residue) in the protein, φ is the set of all torsion angles, and χ_j ⊆ φ are the angles for R_j. Each residue has between zero and four torsion angles that define the conformation of its sidechain, depending on the type of amino acid. The terms E_{jk} compute the energy between pairs of residues as E_{jk} = Σ_{a_j} Σ_{a_k} E_LJ(r(a_j(χ_j), a_k(χ_k))), where a_j and a_k refer to the positions of the atoms in residues j and k, respectively, and r(a_j, a_k) is the distance between the two atoms. The torsion angles define the positions of the atoms through a series of kinematic relations, which we do not detail here.

The smallest (with respect to the number of terms) protein (ID 1) has 131 residues, 2282 terms, and 257 variables, while the largest (ID 21) has 440 residues, 9380 terms, and 943 variables. The average number of residues, terms, and variables is 334, 7110, and 682, respectively.
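The energy terms above reduce to evaluating E_LJ at atom-atom distances and summing over atom pairs. A minimal sketch, with A = B = 1 purely for illustration (Rosetta's modified potential and the torsion-angle kinematics that produce the atom positions are omitted):

```python
def lennard_jones(r, A=1.0, B=1.0):
    """Basic Lennard-Jones potential E_LJ(r) = A/r^12 - B/r^6."""
    return A / r**12 - B / r**6

def pair_energy(atoms_j, atoms_k):
    """E_jk: sum of E_LJ over all atom pairs across two residues.
    Atoms are given directly as 3-D coordinate tuples, standing in for
    the kinematic map from torsion angles to positions."""
    return sum(
        lennard_jones(sum((p - q) ** 2 for p, q in zip(aj, ak)) ** 0.5)
        for aj in atoms_j
        for ak in atoms_k
    )

# With A = B = 1 the potential is minimized at r = 2**(1/6), where
# E_LJ = 1/4 - 1/2 = -1/4: strongly repulsive closer in, weakly
# attractive farther out, which is what drives the compact optimum.
```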
The proteins with their IDs from the paper are as follows: (1) 4JPB, (2) 4IYR, (3) 4M66, (4) 3WI4, (5) 4LN9, (6) 4INO, (7) 4J6U, (8) 4OAF, (9) 3EEQ, (10) 4MYL, (11) 4IMH, (12) 4K7K, (13) 3ZPJ, (14) 4LLI, (15) 4N08, (16) 2RSV, (17) 4J7A, (18) 4C2E, (19) 4M64, (20) 4N4A, (21) 4KMA.
