Tradeoffs between Convergence Speed and Reconstruction Accuracy in Inverse Problems


Authors: Raja Giryes, Yonina C. Eldar, Alex M. Bronstein, and Guillermo Sapiro

Raja Giryes¹, Yonina C. Eldar², Alex M. Bronstein³, and Guillermo Sapiro⁴
¹ School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel 69978, raja@tauex.tau.ac.il
² Electrical Engineering Department, Technion - IIT, Haifa, Israel, 32000, yonina@ee.technion.ac.il
³ Computer Science Department, Technion - IIT, Haifa, Israel, 32000, bron@cs.technion.ac.il
⁴ Electrical and Computer Engineering Department, Duke University, Durham, NC, 27708, guillermo.sapiro@duke.edu

Abstract—Solving inverse problems with iterative algorithms is popular, especially for large data. Due to time constraints, the number of possible iterations is usually limited, potentially affecting the achievable accuracy. Given an error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain faster convergence to a minimizer achieving the allowed error without increasing the computational cost of each iteration considerably. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set may lead to faster convergence at the cost of an additional reconstruction error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing, and deep learning. In particular, it may provide a possible explanation for the successful approximation of the ℓ1-minimization solution by neural networks with layers representing iterations, as practiced in the learned iterative shrinkage-thresholding algorithm (LISTA).

I. INTRODUCTION

Consider the setting in which we want to recover a vector x ∈ ℝ^d from linear measurements

    y = Mx + e,    (1)

where M ∈ ℝ^{m×d} is the measurement matrix and e ∈ ℝ^m is additive noise. This setup appears in many fields including statistics (e.g., regression), image processing (e.g., deblurring and super-resolution), and medical imaging (e.g., CT and MRI), to name just a few.

Often the recovery of x from y is an ill-posed problem, for example, when M has fewer rows than columns (m < d), rendering (1) an underdetermined linear system of equations. In this case, it is impossible to recover x without introducing additional assumptions on its structure. A popular strategy is to assume that x resides in a low-dimensional set K, e.g., sparse vectors [15], [18], [27], [29] or a Gaussian Mixture Model (GMM) [67]. The natural by-product minimization problem then becomes

    min_x ‖y − Mx‖₂²  s.t.  x ∈ K.    (2)

This can be reformulated in an unconstrained form as

    min_x ‖y − Mx‖₂² + λ f(x),    (3)

where λ is a regularization parameter and f(·) is a cost function related to the set K. For example, if K = {x ∈ ℝ^d : ‖x‖₀ ≤ k} is the set of k-sparse vectors, then a natural choice is f(·) = ‖·‖₀ or its convex relaxation f(·) = ‖·‖₁.

A popular technique for solving (2) and (3) is using iterative programs such as proximal methods [13], [22], which include the iterative shrinkage-thresholding algorithm (ISTA) [7], [24], [28] and the alternating direction method of multipliers (ADMM) [14], [34]. This strategy is particularly useful for large dimensions d.

Many applications impose time constraints, which limit the number of computations that can be performed to recover x from the measurements. One way to minimize time and computations is to reduce the number of iterations without increasing the computational cost of each iteration.
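To make the setup concrete, the following NumPy sketch builds the measurement model (1) and evaluates the unconstrained objective (3) with f = ℓ1; all dimensions and parameter values are illustrative and not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 128, 40, 5          # signal dimension, measurements, sparsity

# k-sparse ground truth x and a Gaussian measurement matrix M, as in (1)
x = np.zeros(d)
x[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
M = rng.standard_normal((m, d)) / np.sqrt(m)
e = 0.01 * rng.standard_normal(m)   # additive noise
y = M @ x + e

def objective(z, lam=0.1):
    """Unconstrained form (3) with f the l1 norm."""
    return np.sum((y - M @ z) ** 2) + lam * np.sum(np.abs(z))

# the true x should score far better than the zero vector
print(objective(np.zeros(d)), objective(x))
```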
A different approach is to use momentum methods [64] or random projections [48]–[50], [60] to accelerate convergence. Another alternative is to keep the number of iterations fixed while reducing the cost of each iteration. For example, since the complexity of iterative methods relies, among other things, on m, a common technique to save computations is to sub-sample the measurements y, removing "redundant information," to an amount that still allows reconstruction of x.

A series of recent works [12], [16], [19], [23], [47], [54] suggests that by obtaining more measurements one can benefit from simple efficient methods that cannot be applied with a smaller number of measurements. In [12] the generalization properties of large-scale learning systems have been studied, showing a tradeoff between the number of measurements and the target approximation. The work in [54] showed how it is possible to make the run-time of SVM optimization decrease as the size of the training data increases. In [23], it is shown that the problem of supervised learning of halfspaces over 3-sparse vectors with ternary values {−1, 1, 0} may be solved with efficient algorithms only if the number of training examples exceeds a certain limit. Similar phenomena are encountered in the context of sparse recovery, where efficient algorithms are guaranteed to reconstruct the sparsest vector only if the number of samples is larger than a certain quantity [27], [30], [32]. In [19] it was shown that by having a larger number of training examples it is possible to design more efficient optimization problems by projecting onto simpler sets. This idea is further studied in [16] by changing the amount of smoothing applied in convex optimization. In [47] the authors show that more measurements may allow increasing the step size in the projected gradient descent algorithm (PGD) and thus accelerating its convergence.
While these works studied a tradeoff between convergence speed and the number of available measurements, this paper takes a different route. Consider the case in which, due to time constraints, we need to stop the iterations before we achieve the desired reconstruction accuracy. For the original algorithm, this can result in the recovery being very far from the optimum. An important question is whether we can modify the original iterations (e.g., those dictated by the shrinkage or ADMM techniques), such that the method converges to an improved solution with fewer iterations without adding complexity to them. This introduces a tradeoff between the recovery error we are willing to tolerate and the computational cost. As we demonstrate, this goes beyond the trivial relationship between the approximation error and the number of iterations that exists for various iterative methods [7].

Such a tradeoff is experimentally demonstrated by the success of learned ISTA (LISTA) [38] for sparse recovery with f(·) = ‖·‖₁. This technique learns a neural network with only several layers, where each layer is a modified version of the ISTA iteration.¹ It achieves virtually the same accuracy as the original ISTA using one to two orders of magnitude fewer iterations. The acceleration of iterative algorithms with neural networks is not unique to the sparse recovery problem and f(·) = ‖·‖₁. This behavior was demonstrated for other models such as the analysis cosparse and low-rank matrix models [55], Poisson noise [52], acceleration of Eulerian fluid simulation [58], and feature learning [2]. However, a proper theoretical justification for this phenomenon is still lacking.

Contribution. In this work, we provide theoretical foundations elucidating the tradeoff between the allowed minimization error and the number of simple iterations used for solving inverse problems.
We formally show that if we allow a certain reconstruction error in the solution, then it is possible to change iterative methods by modifying the linear operations applied in them such that each iteration has the same complexity as before, but the number of steps required to attain a certain error is reduced. Such a tradeoff seems natural when working with real data, where both the data and the assumed models are noisy or approximate; searching for the exact solution of an optimization problem where all the variables are affected by measurement or model noise may be an unnecessary use of valuable computational resources. We formally prove this relation for iterative projection algorithms. Interestingly, a related tradeoff exists also in the context of sampling theory, where by allowing some error in the reconstruction we may use fewer samples and/or quantization levels [42].

We argue that the tradeoff we analyze may explain the smaller number of iterations required in LISTA compared to ISTA.

¹ ISTA and its variants are among the most powerful optimization techniques for sparse coding.

Parallel efforts to our work also provide justification for the success of LISTA. In [45], the fast convergence of LISTA is justified by connecting the convergence speed to the factorization of the Gram matrix of M. In [65], the convergence speed of ISTA and LISTA is analyzed using the restricted isometry property (RIP) [18], showing that LISTA may reduce the RIP constant, which leads to faster convergence. A relation between LISTA and approximate message passing (AMP) strategies is drawn in [11].
Our paper differs from previous contributions in three main points: (i) it goes beyond the case of standard LISTA with sparse signals and considers variants that apply to general low-dimensional models; (ii) our theory relies on the concept of inexact projections and their relation to the tradeoff between convergence speed and recovery accuracy, which differs significantly from other attempts to explain the success of LISTA; and (iii) besides exploring LISTA, we provide acceleration strategies for other programs such as model-based compressed sensing [30] and sparse recovery with side information.

Organization. This paper is organized as follows. In Section II we present preliminary notation and definitions, and describe the ISTA, LISTA and PGD techniques. Section III introduces a new theory for PGD for non-convex cones. Section IV shows how it is possible to trade off between convergence speed and reconstruction accuracy by introducing the inexact projected gradient descent (IPGD) method, using spectral compressed sensing [25] as a motivating example. The reconstruction error of IPGD is analyzed as a function of the iterations in Section V. Section VI discusses the relation between our theory and model-based compressed sensing [4] and sparse recovery with side information [33], [41], [61], [62]. Section VII proposes a LISTA version of IPGD, the learned IPGD (LIPGD), and demonstrates its usage in the task of image super-resolution. Section VIII relates the approximation of minimization problems studied here with neural networks and deep learning, providing a theoretical foundation for the success of LISTA and suggesting a "mixture-model" extension of this technique. Section IX concludes the paper.

II. PRELIMINARIES AND BACKGROUND

Throughout this paper, we use the following notation.
We write ‖·‖ for the Euclidean norm for vectors and the spectral norm for matrices, ‖·‖₁ for the ℓ1 norm that sums the absolute values of a vector, and ‖·‖₀ for the ℓ0 pseudo-norm, which counts the number of non-zero elements in a vector. The conjugate transpose of M is denoted by M* and the orthogonal projection onto the set K by P_K. The original unknown vector is denoted by x, the given measurements by y, the measurement matrix by M and the system noise by e. The i-th entry of a vector v is denoted by v[i]. The sign function sgn(·) equals 1, −1 or 0 if its input is positive, negative or zero, respectively. The d-dimensional ℓ2-ball of radius r is denoted by B^d_r. For balls of radius 1, we omit the subscript and just write B^d.

Fig. 1. The LISTA scheme.

A. Iterative shrinkage-thresholding algorithm (ISTA)

A popular iterative technique for minimizing (3) is ISTA. Each of its iterations is composed of a gradient step with step size µ, obeying 1/µ ≥ ‖M‖² to ensure convergence [7], followed by a proximal mapping S_{f,λ}(·) of the function f, defined as

    S_{f,λ}(v) = argmin_z ½‖z − v‖² + λ f(z),    (4)

where λ is a parameter of the mapping. The resulting ISTA iteration can be written as

    z_{t+1} = S_{f,µλ}(z_t + µM*(y − Mz_t)),    (5)

where z_t is the estimate of x at iteration t. Note that the step size µ multiplies the parameter of the proximal mapping.

The proximal mapping has a simple form for many functions f. For example, when f(·) = ‖·‖₁, it is an element-wise shrinkage function,

    S_{ℓ1,λ}(v)[i] = sgn(v[i]) · max(0, |v[i]| − λ).    (6)

Therefore, the advantage of ISTA is that its iterations require only the application of matrix multiplications followed by a simple non-linear function. Nonetheless, the main drawback of ISTA is the large number of iterations that is typically required for convergence.
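A minimal NumPy sketch of (4)–(6): the ℓ1 proximal mapping and the ISTA iteration (5). The problem sizes, λ, and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(v, lam):
    """Element-wise shrinkage (6): sgn(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(M, y, lam=0.1, n_iter=200):
    """ISTA iterations (5), with step size mu satisfying 1/mu >= ||M||^2."""
    mu = 1.0 / np.linalg.norm(M, 2) ** 2
    z = np.zeros(M.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z + mu * M.T @ (y - M @ z), mu * lam)
    return z

# illustrative noiseless run: recover a 3-sparse x from 40 measurements
rng = np.random.default_rng(1)
M = rng.standard_normal((40, 128)) / np.sqrt(40)
x = np.zeros(128)
x[[3, 50, 90]] = [1.0, -2.0, 1.5]
z = ista(M, M @ x)
```

Note that the shrinkage introduces a small λ-dependent bias, so the ISTA estimate approaches x only up to that bias in the noiseless case.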
Many acceleration techniques have been proposed to speed up the convergence of ISTA (see [3], [6], [7], [28], [37], [43], [46], [56], [59], [63] for a partial list of such works). A prominent strategy is LISTA, which has the same structure as ISTA but with different linear operations in (5). Empirically, it is observed that it is able to attain a solution very close to that of ISTA with a significantly smaller fixed number of iterations T. The LISTA iterations are given by²

    z_{t+1} = S_{f,λ}(Ay + Uz_t),    (7)

where A, U and λ are learned from a set of training examples by back-propagation, with the objective being the ℓ2-distance between the final ISTA solution and the LISTA one (after T iterations) [38]. Other minimization objectives may be used, e.g., training LISTA to minimize (3) directly [55]. Notice that LISTA has the structure of a recurrent neural network, as can be seen in Fig. 1.

While other acceleration techniques for ISTA have been proposed together with a thorough theoretical analysis, the powerful LISTA method has been introduced without mathematical justification for its success. In this work, we focus on the PGD algorithm, whose iterations are almost identical to the ones of ISTA but with an orthogonal projection instead of a proximal mapping. We propose an acceleration technique for it, which is very similar to the one of LISTA, accompanied by a theoretical analysis.

² We present the more general version [55] that can be used for any signal model and not only for sparsity.

B. Projected gradient descent (PGD)

The PGD iteration is given by

    z_{t+1} = P_K(z_t + µM*(y − Mz_t)),    (8)

where P_K is an orthogonal projection onto a given set K. For example, if K is the ℓ1-ball then P_K is simply soft thresholding with a value that varies depending on the projected vector [26].
Note the similarity to the proximal mapping in ISTA with f the ℓ1-norm, which is also the soft thresholding operation but with a fixed threshold (6). This similarity is not unique to the ℓ1-norm case but happens also for other types of f such as the ℓ0 pseudo-norm and the nuclear norm. The step size µ is assumed to be constant for the sake of simplicity, as in (5). In both methods it may vary between iterations.

PGD is a generalization of the iterative hard thresholding (IHT) algorithm, which was developed for K being the set of sparse vectors [10]. This important method has been analyzed in various works: for example, for standard sparsity in [10], for sparsity patterns that belong to a certain model in [4], for a general union of subspaces in [9], for nonlinear measurements in [5], and more recently in [47] for a set of the form K = {z ∈ ℝ^d : f(z) ≤ R}. The formulation (8) generalizes the special cases above. For example, if f(·) = ‖·‖₀ and R is the sparsity level then we have the IHT method from [10]; when f counts the number of non-zeros of only certain sparsity patterns, which are bounded by R, we have the model-based IHT of [4]. PGD may also be applied to non-linear inverse problems [47], [66].

Theorem 2.5 below provides convergence guarantees on PGD (it is the noiseless version of Theorem 1.2 in [47]). Before presenting the result, we introduce several properties of the set K and some basic lemmas.

Definition 2.1 (Descent set and tangent cone): The descent set of the function f at a point x is defined as

    D_f(x) = {h ∈ ℝ^d : f(x + h) ≤ f(x)}.    (9)

The tangent cone C_f(x) at a point x is the conic hull of D_f(x), i.e., the smallest closed cone C_f(x) satisfying D_f(x) ⊆ C_f(x). For concise writing, below we denote D_f(x) and C_f(x) as D and C, respectively.
Lemma 2.2 (Lemma 6.2 in [47]): Let v ∈ ℝ^d and C ⊂ ℝ^d be a closed cone. Then

    ‖P_C(v)‖ = sup_{u ∈ C ∩ B^d} u*v.    (10)

Lemma 2.3: If for x ∈ ℝ^d, K = {z ∈ ℝ^d : f(z) ≤ f(x)} ⊂ ℝ^d is a closed set, then for all v ∈ ℝ^d,

    P_K(x + v) − x = P_{K−{x}}(v) = P_D(v).    (11)

Proof: From the definition of the descent set we have

    D = {h ∈ ℝ^d : f(h + x) ≤ f(x)} = {z − x : f(z) ≤ f(x)} = {z − x : z ∈ K} = K − {x},

where the second equality follows from a simple change of variables, and the last ones from the definitions of the set K and the Minkowski difference. Therefore, projecting onto D is equivalent to a projection onto K − {x}. □

Lemma 2.4 (Lemma 6.4 in [47]): Let D and C be a nonempty and closed set and cone, respectively, such that 0 ∈ D and D ⊆ C. Then for all v ∈ ℝ^d

    ‖P_D(v)‖ ≤ κ_f ‖P_C(v)‖,    (12)

where κ_f = 1 if D is a convex set and κ_f = 2 otherwise.

We now introduce the convergence rate provided in [47] for PGD. For brevity, we present only its noiseless version.

Theorem 2.5 (Noiseless version of Theorem 1.2 in [47]): Let x ∈ ℝ^d, f : ℝ^d → ℝ be a proper function, K = {z ∈ ℝ^d : f(z) ≤ f(x)}, C = C_f(x) the tangent cone of the function f at point x, M ∈ ℝ^{m×d} and y = Mx a vector containing m linear measurements. Assume we are using PGD with K to recover x from y. Then the estimate z_t at the t-th iteration (initialized with z_0 = 0) obeys

    ‖z_t − x‖ ≤ (κ_f ρ(C))^t ‖x‖,    (13)

where κ_f is defined in Lemma 2.4, and

    ρ(C) = ρ(µ, M, f, x) = sup_{u,v ∈ C ∩ B^d} u*(I − µM*M)v    (14)

is the convergence rate of PGD.

C. Gaussian mean width

When M is a random matrix with i.i.d. Gaussian distributed entries N(0, 1), it has been shown in [47] that the convergence rate ρ(C) is tightly related to the dimensionality of the set (model) in which x resides.
A very useful expression for measuring the "intrinsic dimensionality" of sets is the (Gaussian) mean width.

Definition 2.6 (Gaussian mean width): The Gaussian mean width of a set Υ is defined as

    ω(Υ) = E[sup_{v ∈ Υ ∩ B^d} ⟨g, v⟩],  g ∼ N(0, I).    (15)

Two variants of this measure are generally used. The cone Gaussian mean width, ω_C = ω(C), measures the dimensionality of the tangent cone C = C_f(x); the set Gaussian mean width, ω_K = ω(K − K), is related directly to the set K through its Minkowski difference K − K = {z − v : z, v ∈ K}. The cone Gaussian mean width relies on both the set K (through f) and a specific target point x, while the set Gaussian mean width considers only K. On the other hand, the dependence of ω_C on K is indirect, via the descent set at the point x.

There is a series of works which developed convergence and reconstruction guarantees for various methods based on ω_C [1], [20], [47], and others that rely on ω_K [51], [57]. The first (ω_C) is mainly employed in the case of convex functions f, which are used to relax the non-convex set in which x resides. In this setting, often D is convex and x ∈ K.

As an example of ω_C, consider the case in which K is the ℓ1-ball and x is a k-sparse vector. Then ω_C ≃ √(2k log(d/k)). If we add constraints on x, such as having a tree structure, i.e., belonging to the set

    K̂ = {z ∈ ℝ^d : ‖z‖₀ ≤ ‖x‖₀ and z obeys a tree structure},    (16)

where an entry may be non-zero only if its parent node is non-zero, then the value of ω_C does not change. Although the definition of ω_K is very similar to that of ω_C, it yields different results. For K the set of k-sparse vectors, ω_K = O(√(k log(d/k))), while for (16), ω_K̂ = O(√k).
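The mean width (15) can be estimated numerically. For K the set of s-sparse vectors, the supremum over the s-sparse vectors in B^d is attained by the unit vector supported on the s largest-magnitude entries of g, so the supremum equals the ℓ2 norm of those entries. A Monte Carlo sketch (NumPy assumed; trial counts and sizes are illustrative):

```python
import numpy as np

def mean_width_sparse(d, s, n_trials=2000, seed=0):
    """Monte Carlo estimate of omega({s-sparse vectors} ∩ B^d), per (15):
    the supremum is the l2 norm of the s largest |g| entries."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        g = rng.standard_normal(d)
        vals.append(np.linalg.norm(np.sort(np.abs(g))[-s:]))
    return float(np.mean(vals))

d, k = 1024, 10
# omega_K = omega(K - K): differences of k-sparse vectors are 2k-sparse
w = mean_width_sparse(d, 2 * k)
print(w, np.sqrt(2 * k * np.log(d / k)))  # scales like sqrt(k log(d/k))
```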
The first result is similar to the expression of ω_C for the ℓ1-ball with a k-sparse vector x; yet, the second provides a better measure of the set K̂ in (16).

D. PGD convergence rate and the cone Gaussian mean width

In [47], it has been shown that the smaller ω_C, the faster the convergence. More specifically, if √m is very close to ω_C, then we may apply PGD with a step size µ = 1/(√d + √m)² ≃ 1/d and have a convergence rate of (Theorem 2.4 in [47])

    ρ(C) = 1 − O((√m − ω_C)/(m + d)).    (17)

If ω_C is smaller than √m by a certain constant factor, then we may apply PGD with a larger step size µ ≃ 1/m, which leads to improved convergence (Theorem 2.2 in [47])

    ρ(C) = O(ω_C/√m).    (18)

These relationships rely on the fact that with larger m the eigenvalues of I − µM*M (after projection onto C, see (14)) are better positioned, such that it is possible to improve convergence by increasing µ. Both (17) and (18) set a limit on the minimal value of m for which PGD iterations converge to x, namely m = O(ω_C²). This implies that m ≳ 2k log(d/k) (bigger than approximately 2k log(d/k)) for K the ℓ1-ball and a k-sparse vector x. This is known to be a tight condition. See more examples of this relationship between ω_C and m in [20].

The connections in (17) and (18) between ρ(C) and ω_C are not unique to the case that M is a random Gaussian matrix. Similar relationships hold for many other types of matrices [47].

III. PGD THEORY BASED ON THE PROJECTION SET

While Theorem 2.5 covers many sets K, there are interesting examples that are not included in it, such as the set of k-sparse vectors corresponding to f being the ℓ0 pseudo-norm, which is not a proper function. Even if we ignore this condition and try to use the result of Theorem 2.5 in the case that M is a random Gaussian matrix, we face a problem.
Using the relationship between ρ(C) and the Gaussian mean width ω_C in (17) and (18), and the fact that in this case ω_C = √d, we get the condition m > d. This demand on m is inferior to existing theory, which in this scenario guarantees convergence with m = O(k log(d/k)) [10].

One way to overcome this problem is by considering the convex hull of the set of k-sparse vectors with bounded ℓ2 norm. In this case ω_C = O(√(k log(d/k))) [51]. However, as mentioned above, guarantees for PGD exist for the k-sparse case without a bound on the ℓ2-norm. A similar phenomenon also occurs with the set of sparse vectors with a tree structure K̂ (see (16)), where again ω_C = √d, implying m = O(d). Yet, from the work in [4], we know that in this setting it is sufficient to choose m = O(k). Note that for K̂, the set Gaussian mean width is ω_K̂ = O(√k). If we had relied on it instead of on ω_C in the bound for the required size of m, it would have coincided with [4].

In order to address these deficiencies in the convergence rate, we provide a variant of Theorem 2.5 that relies on the set K directly, through the Minkowski difference K − K, in lieu of C. For simplicity we present only the noiseless case, but the extension to the noisy setting can be performed using the strategy in [47].

Theorem 3.1: Let x ∈ K, K ⊂ ℝ^d be a closed cone, M ∈ ℝ^{m×d} and y = Mx a vector containing m linear measurements. Assume we are using PGD with K to recover x from y. Then the estimate z_t at the t-th iteration (initialized with z_0 = 0) obeys

    ‖z_t − x‖ ≤ (κ_K ρ(K))^t ‖x‖,    (19)

where κ_K = 1 if K is convex and κ_K = 2 otherwise, and

    ρ(K) = ρ(µ, M, K) = sup_{u,v ∈ (K−K) ∩ B^d} u*(I − µM*M)v    (20)

is the convergence rate of PGD.

Proof: We repeat similar steps to the ones in the proof of Theorem 1.2 in [47].
We start by noting that the PGD error at iteration t + 1 is

    ‖z_{t+1} − x‖ = ‖P_K(z_t + µM*(y − Mz_t)) − x‖ = ‖P_D((I − µM*M)(z_t − x))‖,    (21)

where the last equality is due to Lemma 2.3 and the fact that y = Mx. Since K is a closed cone, the Minkowski difference K − K is also a closed cone. Moreover, D ⊂ K − K as x ∈ K. Thus, following Lemma 2.4 we have

    ‖z_{t+1} − x‖ ≤ κ_K ‖P_{K−K}((I − µM*M)(z_t − x))‖
                 ≤ sup_{v ∈ ℝ^d : ‖P_K(v) − x‖ ≤ ‖z_t − x‖} κ_K ‖P_{K−K}((I − µM*M)(P_K(v) − x))‖
                 ≤ sup_{v ∈ ℝ^d : ‖P_D(v − x)‖ ≤ ‖z_t − x‖} κ_K ‖P_{K−K}((I − µM*M)P_D(v − x))‖,    (22)

where the second inequality is due to the fact that z_t is of the form P_K(v) for some vector v, and the last inequality follows from Lemma 2.3. Noticing that the constraint ‖P_D(v − x)‖ ≤ ‖z_t − x‖ is equivalent to v − x ∈ D ∩ B^d_{‖z_t − x‖} (where B^d_{‖z_t − x‖} is the ℓ2-ball of radius ‖z_t − x‖) and using the relation D ⊂ K − K leads to

    ‖z_{t+1} − x‖ ≤ sup_{v ∈ (K−K) ∩ B^d} κ_K ‖P_{K−K}((I − µM*M)v)‖ · ‖z_t − x‖
                 ≤ sup_{u,v ∈ (K−K) ∩ B^d} κ_K |u*(I − µM*M)v| · ‖z_t − x‖,    (23)

where the last inequality follows from Lemma 2.2. Using the definition of ρ(K) and applying the inequality in (23) recursively leads to the desired result. □

When M is a random Gaussian matrix, the relationships in (17) and (18) hold with ρ(K) and ω_K replacing ρ(C) and ω_C, respectively. This implies that we need m = O(ω_K²) for convergence. This result is in line with the conditions on m that appear in previous works for k-sparse vectors [10], for which ω_K = O(√(k log(d/k))), and for sparse vectors with tree structure [4], where ω_K = O(√k). As discussed in Section II-C, the measure ω_K is related directly to the set K (which may be non-convex) in which x resides.
Thus, it provides a better measure for the complexity of K when it is unbounded or has some specific structure, as is the case for sparsity with a tree structure [57]. In such settings, Theorem 3.1 should be favored over Theorem 2.5. Notice that if x ∈ K, then we have D ⊂ K − K. Thus, in the settings where D is convex and x ∈ K, we have C = D ⊂ K − K, implying that ω_C ≤ ω_K; when M is random Gaussian, this also implies ρ(C) ≤ ρ(K). Therefore, in this scenario Theorem 2.5 has an advantage over Theorem 3.1.

IV. INEXACT PROJECTED GRADIENT DESCENT (IPGD)

It may happen that the function f or the set K are too loose for describing x. Instead, we may select a set K̂ that better characterizes x and therefore leads to a smaller ω, resulting in faster convergence. This improvement can be very significant; a smaller ω both improves the convergence rate and allows using a larger step size (see Section II-D). For example, consider the case of a k-sparse vector x whose sparsity pattern obeys a tree structure. If we ignore the structure in x and choose f as the ℓ0 or ℓ1 norms, then the mean widths are ω_K = O(√(k log(d/k))) [51] and ω_C ≃ √(2k log(d/k)) [20], respectively. However, if we take this structure into account and use the set of k-sparse vectors with tree structure (see (16)), then ω_K̂ = O(√k) [57]. As mentioned above, this improvement may be significant especially when √m is very close to ω_K.

Such an approach was taken in the context of model-based compressed sensing [4], where it is shown that faster convergence is achieved by projecting onto the set of k-sparse vectors with tree structure instead of the standard k-sparse set. A related study [67] showed that it is enough to use a small number of Gaussians to represent all the patches in natural images instead of using a dictionary that spans a much larger union of subspaces.
This work relied on Gaussian Mixture Models (GMM), whose mean width scales proportionally to the number of Gaussians used, which is significantly smaller than the mean width of the sparse model.

A. Inexact projection

A difficulty often encountered is that the projection onto K̂, which may even be unknown, is more complex to implement than the projection onto K. The latter can be easier to project onto but provides a lower convergence rate. Thus, in this work we introduce a technique that compromises between the reconstruction error and convergence speed by using PGD with an inexact "projection" that projects onto a set that is approximately as small as K̂ and yet is as computationally efficient as the projection onto K. In this way, the computational complexity of each projected gradient descent iteration remains the same while the convergence rate becomes closer to that of the more complex PGD with a projection onto K̂.

The "projection" we propose is composed of a simple operator p (e.g., a linear or an element-wise function) and the projection onto K, P_K, such that it introduces only a slight distortion into x. In particular, we require the following:

1) The projection condition for convex K: If K is convex, then we require

    ‖x − P_K(p(x))‖ ≤ ε‖x‖.    (24)

Due to Lemma 2.3, this is equivalent to

    ‖P_D(x − p(x))‖ ≤ ε‖x‖.    (25)

From the fact that ‖P_D(x − p(x))‖ ≤ ‖x − p(x)‖, it is sufficient that

    ‖x − p(x)‖ ≤ ε‖x‖    (26)

to ensure (24). Examples of projections that satisfy condition (24) are given hereafter in Sections IV-B and VI-B.

2) The projection condition for non-convex K: In the case that K is non-convex, we require

    ‖P_K(p(v) − p(x)) − P_K(p(v) − x)‖ ≤ ε‖x‖,  ∀v ∈ ℝ^d.    (27)

Due to Lemma 2.3 and a simple change of variables, (27) is equivalent to

    ‖P_D(p(v) − x + p(x)) − P_D(p(v))‖ ≤ ε‖x‖,  ∀v ∈ ℝ^d,    (28)

which by another simple change of variables is the same as

    ‖P_D(p(v) − x) − P_D(p(v) − p(x))‖ ≤ ε‖x‖,  ∀v ∈ ℝ^d.    (29)

An example of a projection that satisfies condition (27) is provided in Section VI-A.

B. Inexact PGD

Plugging the inexact projection into the PGD step results in the proposed inexact PGD (IPGD) iteration (compare to (8))

    z_{t+1} = P_K(p(z_t) + µ p(M*(y − Mz_t))).    (30)

To motivate this algorithm, consider the problem of spectral compressed sensing [25], in which one wants to recover a sparse representation in a dictionary that has high local coherence. It has been shown that if the non-zeros in the representation are far from each other, then it is easier to obtain a good recovery [17]. Let M be a two-times redundant DCT dictionary and x̃ be a k-sparse vector, with sparsity k = 2, of dimension d = 128, such that the minimal distance (with respect to the location in the vector) between non-zero neighboring coefficients in it is greater than 5 (indices) and the value in each non-zero coefficient is generated from the normal distribution. We construct the vector x by adding to x̃ random Gaussian values with zero mean and variance σ² = 0.05 at the neighboring coefficients of each non-zero entry in x̃ with (location) distance 1 or 4 (two different experiments). As mentioned above, a better reconstruction is achieved by estimating x̃ from Mx̃ than by estimating x from Mx, due to the highly correlated columns in M. A common practice to improve the recovery in such a case is to force the recovery algorithm to select a solution with separated coefficients.
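A sketch of such a separation operator p and the resulting IPGD iteration (30), assuming NumPy. For simplicity, P_K here is the k-sparse projection rather than the ℓ1-ball projection used in the experiment, the interpretation of "neighborhood of size 5" as a minimal index separation of 5 is illustrative, and the identity "dictionary" in the usage run is purely a toy stand-in for the redundant DCT dictionary.

```python
import numpy as np

def p_separate(v, radius=5):
    """Operator p: keep at most one dominant entry (in absolute value)
    in every neighborhood of `radius` indices, zeroing the others."""
    out = v.copy()
    for i in np.argsort(-np.abs(v)):       # visit entries by decreasing |v|
        if out[i] != 0:
            keep = out[i]
            out[max(0, i - radius + 1):i + radius] = 0.0
            out[i] = keep
    return out

def project_sparse(v, k):
    """Stand-in P_K: projection onto the k-sparse set."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ipgd(M, y, k, mu, n_iter=10, radius=5):
    """IPGD iteration (30): z <- P_K(p(z) + mu * p(M*(y - Mz)))."""
    z = np.zeros(M.shape[1])
    for _ in range(n_iter):
        z = project_sparse(p_separate(z, radius) +
                           mu * p_separate(M.T @ (y - M @ z), radius), k)
    return z

# toy run: identity "dictionary", well-separated 2-sparse signal
x = np.zeros(16)
x[2], x[9] = 1.0, -1.0
z = ipgd(np.eye(16), x, k=2, mu=1.0)
```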
In our context, this simply amounts to using IPGD with a projection onto the $\ell_1$ ball and a $p(\cdot)$ that keeps at most one dominant entry (in absolute value) in every neighborhood of size 5 of a given representation, zeroing out the other values. The operator $p$ causes an error in the model (with $\epsilon \simeq 0.05\sqrt{2} \simeq 0.1$) and therefore reaches a slightly higher final error than PGD with a projection onto the $\ell_1$ ball. Compared to PGD, IPGD projects onto a simpler set with a smaller Gaussian mean width, thus attaining faster convergence in the first iterations, where the approximation error is still significantly larger than $\epsilon$, as can be seen in Fig. 2(a). When the coherence is larger (as in the case of added coefficients at distance 1), the advantage of IPGD over PGD is more significant.

In some cases IPGD may even attain a lower final recovery error than PGD. For example, consider the case of $M$ being a four-times redundant DCT dictionary and $x$ generated as above but with $k = 4$. Due to the larger redundancy of the dictionary, the coherence is larger in this case; thus, the recovery of $x$ is harder. Here, PGD with a projection onto the $\ell_1$ ball converges slower and reaches a large error due to the high correlations between the atoms. Using IPGD with the $\ell_1$ ball and a projection $p(\cdot)$ that keeps at most one dominant entry (in absolute value) in every neighborhood of size 5 leads both to faster convergence and to better final accuracy, as can be seen in Fig. 2(b).

V. IPGD CONVERGENCE ANALYSIS

We turn to analyze the performance of IPGD. For simplicity of the discussion, we analyze the convergence of this technique only for a linear operator $p$ and the noiseless setting, i.e., $e = 0$.
The extension to other types of operators and to the noisy case is straightforward by arguments similar to those used in [47] for treating the noise term and other classes of matrices. We present two theorems on the convergence of IPGD. The first result provides a bound in terms of $\rho(C)$ (i.e., it depends on $\omega_C$ if $M$ is a random Gaussian matrix) for the case that $D$ is convex, corresponding to $\kappa_f = 1$ in Theorem 2.5; the second provides a bound in terms of $\rho(K)$ (i.e., it depends on $\omega_K$ if $M$ is a random Gaussian matrix) when $K$ is a closed cone but not necessarily convex. The proofs of both theorems are deferred to Appendices A and B.

Theorem 5.1: Let $x \in \mathbb{R}^d$, $f : \mathbb{R}^d \to \mathbb{R}$ be a proper function, $K = \{z \in \mathbb{R}^d : f(z) \le f(x)\}$, $D = D_f(x)$ and $C = C_f(x)$ the descent set and the tangent cone of the function $f$ at the point $x$, respectively, $p(\cdot)$ a linear operator satisfying (25), $M \in \mathbb{R}^{m \times d}$, and $y = Mx$ a vector containing $m$ linear measurements. Assume we are using IPGD with $K$ and $p$ to recover $x$ from $y$ and that $D$ is convex. Then the estimate $z_t$ at the $t$-th iteration

[Fig. 2 plots, x-axis "t (Iteration Number)", curves PGD Dist 1, IPGD Dist 1, PGD Dist 4, IPGD Dist 4: (a) x2 redundant DCT dictionary with sparsity $k = 2$; (b) x4 redundant DCT dictionary with sparsity $k = 4$.]
Fig. 2. Reconstruction error as a function of the iterations for sparse recovery with a dictionary with high coherence between neighboring atoms. The sparse representation in the dictionary is generated such that there are three correlated neighboring atoms close to each other with location distance 1 or 4. PGD is applied with $K$ being the $\ell_1$ ball.
IPGD is used with the same $K$ and with $p$ being a non-linear function that, for a given vector, keeps at most one dominant entry in every neighborhood of a fixed size (zeroing the smaller values). This shows that IPGD may accelerate convergence compared to PGD and in some cases (right figure) even achieve a lower recovery error.

(initialized with $z_0 = 0$) obeys
$$\|z_t - x\| \le \left( (\rho_p(C))^t + \frac{1 - (\rho_p(C))^t}{1 - \rho_p(C)}\,(2 + \rho_p(C))\,\epsilon \right) \|x\|,$$
where
$$\rho_p(C) = \rho(\mu, M, f, p, x) = \sup_{u, v \in C \cap B^d} p(u)^* (I - \mu M^* M)\, p(v)$$
is the "effective convergence rate" of IPGD for small $\epsilon$.

Theorem 5.2: Let $x \in K$, $K \subset \mathbb{R}^d$ be a closed cone, $p(\cdot)$ a linear operator satisfying (27), $M \in \mathbb{R}^{m \times d}$, and $y = Mx$ a vector containing $m$ linear measurements. Assume we are using IPGD with $K$ and $p$ to recover $x$ from $y$. Then the estimate $z_t$ at the $t$-th iteration (initialized with $z_0 = 0$) obeys
$$\|z_t - x\| \le \left( (\kappa_K \rho_p(K))^t + \frac{1 - (\kappa_K \rho_p(K))^t}{1 - \kappa_K \rho_p(K)}\, \gamma \right) \|x\|, \quad (31)$$
where $\kappa_K$ and $\rho(K)$ are defined in Theorem 3.1,
$$\gamma \triangleq (2\rho(K)\kappa_K + \rho_p(K)\kappa_K + 1)\,\epsilon, \quad (32)$$
and
$$\rho_p(K) = \rho(\mu, M, K, p) = \sup_{u, v \in (K - K) \cap B^d} p(u)^* (I - \mu M^* M)\, p(v) \quad (33)$$
is the "effective convergence rate" of IPGD for small $\epsilon$.

Theorems 5.1 and 5.2 imply that if $\epsilon$ is small enough (compared to $\rho_p^t$, where $t$ is the iteration number and $\rho_p$ is defined in (33)), then IPGD has an effective convergence rate of $\rho_p = \rho_p(C)$ when $D$ is convex, and $\rho_p = \kappa_K \rho_p(K)$ in the case that $K$ is a closed cone but not necessarily convex. Note that if $p = I$ then $\epsilon = 0$ and our results coincide with Theorems 2.5 and 3.1. As we shall see hereafter, for some operators $p$ the rate $\rho_p$ may be significantly smaller than $\rho(C)$ and $\rho(K)$. The smaller the set that $p$ maps to, the smaller $\rho_p$ becomes.
At the same time, when $p$ maps to smaller sets it usually provides a "coarser estimate", and thus the approximation error $\epsilon$ in (25) and (27) increases. Thus, IPGD allows us to trade off the approximation error $\epsilon$ against the improved convergence rate $\rho_p$.

The error term in Theorems 5.1 and 5.2 at iteration $t$ is comprised of two components. The first goes to zero as $t$ increases, while the second increases with the iterations and is on the order of $\epsilon$. The fewer iterations we perform, the larger the $\epsilon$ we may allow. An alternative perspective is that the larger the reconstruction error we can tolerate, the larger $\epsilon$ may be, and thus we require fewer iterations. Therefore, the projection $p$ introduces a tradeoff. On the one hand, it leads to an increase in the reconstruction error. On the other hand, it simplifies the projected set, which leads to faster convergence (to a solution with a larger error).

The works in [35], [36], [39] use a similar concept of near-optimal projection (compared to [4], which assumes only exact projections). The main difference between these contributions and ours is that these papers focus on specific models, while we present a general framework that is not specific to a certain low-dimensional prior. In addition, in these papers the projection is performed to make it possible to recover a vector from a certain low-dimensional set, while in this work the main purpose of our inexact projections is to accelerate the convergence within a limited number of iterations. For a larger number of iterations these projections may not lead to a good reconstruction error.
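The two-component error bound of Theorem 5.1 can be evaluated numerically to make the tradeoff concrete. A minimal sketch follows; the rate and $\epsilon$ values are illustrative assumptions, not taken from the paper's experiments.

```python
def ipgd_bound(rho, eps, t):
    """Theorem 5.1 bound on ||z_t - x|| / ||x||: a geometrically decaying
    term plus a plateau of order eps that approaches
    (2 + rho) * eps / (1 - rho) as t grows."""
    return rho ** t + (1.0 - rho ** t) / (1.0 - rho) * (2.0 + rho) * eps

# A fast-but-coarse inexact projection (small rho, nonzero eps) versus
# an exact but slower projection (larger rho, eps = 0).
coarse = [ipgd_bound(0.80, 1e-3, t) for t in range(101)]
exact = [ipgd_bound(0.95, 0.0, t) for t in range(101)]

# Early on the coarse projection is ahead; eventually the exact one wins.
print(coarse[10] < exact[10], coarse[100] > exact[100])  # prints: True True
```

This mirrors the discussion above: with a small iteration budget the coarse projection's faster rate dominates, while for many iterations its $\epsilon$-plateau makes it worse than the exact projection.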
[Fig. 3 plots, with curves PGD K, PGD tree, IPGD 1-4 tree levels, and IPGD changing levels: (a) recovery error as a function of the iteration number; (b) recovery error as a function of running time (in sec); (c), (d) zoomed versions of (a) and (b).]
Fig. 3. Reconstruction error as a function of the iterations (left) and the running time (right) for recovering a sparse vector with a tree structure. Since we initialize all algorithms with the zero vector, the error at iteration/time zero is $\|x\|$. Zoomed versions of the first 10 iterations and the first 1 ms appear in the bottom row. This figure demonstrates the convergence rate of PGD with projections onto the sparse set and the sparse tree set, compared to IPGD with $p$ that projects onto a certain number of levels of the tree and IPGD with a changing $p$ that projects onto an increasing number of levels as the iterations proceed.
Note that while PGD with a projection onto a tree structure converges faster than IPGD as a function of the number of iterations (left figure), it converges slower than IPGD if we take into account the actual run time of each iteration, as shown in the right figure, due to the higher complexity of the PGD projections.

VI. EXAMPLES

This section presents examples of IPGD with an operator $p$ that accelerates the convergence of PGD for a given set $K$.

A. Sparse recovery with tree structure

To demonstrate our theory, we consider a variant of the $k$-sparse set with tree structure in (16) that has smaller weights in the lower nodes of the tree. We generate a $k$-sparse vector $x \in \mathbb{R}^{127}$ with $k = 13$ and a sparsity pattern that obeys a tree structure. Moreover, we generate the non-zero entries in $x$ independently from a Gaussian distribution with zero mean and variance $\sigma^2 = 1$ if they are at the first two levels of the tree, and $\sigma^2 = 0.2^2$ for the rest. The best way to recover $x$ is by using a projection onto the set $\hat{K}$ in (16), which is the strategy proposed in the context of model-based compressed sensing [4]. Yet, this projection requires some additional computations at each iteration [4]. Our technique suggests approximating it by a linear projection onto the first levels of the tree (a simple operation) followed by a projection onto $K = \{z : \|z\|_0 \le k\}$.

The more levels we add in the projection $p$, the smaller the approximation error $\epsilon$ turns out to be. More specifically, it is easy to show that $\epsilon$ in (27) is bounded by two times the energy of the entries eliminated from $x$ divided by the total energy of $x$, i.e., by $2\frac{\|p(x) - x\|}{\|x\|}$. Clearly, the more layers we add, the smaller $\epsilon$ becomes. Yet, assuming that all nodes in each layer are selected with equal probability, the probability of selecting a node at layer $l$ is equal to $\prod_{i=1}^{l} 0.5^{\,i-1}$, where we take into account the fact that a node can be selected only if all its forefathers have been chosen. Thus, the upper layers have a more significant impact on the value of $\epsilon$. On the other hand, the convergence rate $\rho_p(K)$ for a projection with $l$ layers is equivalent to the convergence rate for the set of vectors of size $2^l$ (denoted by $K_l$). Thus, we get that $\rho_p(K) = \rho(K_l)$, which depends on the Gaussian mean width $\omega_{K_l}$ that scales as $\max(kl, k\log(2^l/k))$. Clearly, when we take all the layers, $l = \log(d)$ and we have $\omega_{K_l} = \omega_K = O(k\log(d/k))$.

Figure 3(a) presents the signal reconstruction error ($\|x - z_t\|_2$) as a function of the number of iterations for PGD with the sets $K$ (IHT [10]) and $\hat{K}$ (model-based IHT [4])^3 and for the proposed IPGD with $p$ that projects onto a different number of levels (1-5) of the tree. All algorithms use the step size $\mu = \frac{1}{(\sqrt{d} + \sqrt{m})^2}$. It is interesting to note that if $p$ projects only onto the first layer, then the algorithm does not converge, as the resulting approximation error $\epsilon$ is too large. However, starting from the second layer, we get faster convergence in the first iterations with a $p$ that projects onto a smaller set, which yields a smaller $\rho$. As the number of iterations increases, the more accurate projections achieve a lower reconstruction error, where the plateau attained is proportional to the approximation error of $p$, as predicted by our theory.

This tradeoff can be used to further accelerate the convergence by changing the projection in IPGD over the iterations. Thus, in the first iterations we enjoy the fast convergence of the coarser projections, and in the later ones we use more accurate projections that allow achieving a lower plateau. The last line in Fig.
3 demonstrates this strategy, where in the first iterations $p$ is set to be a projection onto the first two levels, and then every four iterations another tree level is added to the projection until it becomes a projection onto all the tree levels (in this case IPGD coincides with PGD). Note that IPGD converges faster than PGD even when its projection becomes a projection onto all the tree levels. This can be explained by the fact that typically the convergence of non-linear optimization techniques depends on the initialization point [8]. While here we arbitrarily chose to add another level every fixed number of iterations, in general, a control set can be used for setting the number of iterations to be performed at each training level. We demonstrate this strategy in Section VII.

Since PGD with $\hat{K}$ does not introduce an error in its projection and projects onto a precise set, it achieves the smallest recovery error throughout all the iterations. Yet, as its projection is computationally demanding, it converges slower than IPGD if we take into account the run time of each iteration, as can be seen in Fig. 3(b). This clearly demonstrates the advantage of using simple projections with IPGD compared to accurate but more complex projections with PGD.

B. Sparse recovery with side information

Another possible strategy to improve reconstruction that relates to our framework is using side information about the

^3 For demonstration purposes we plot only the cases where model-based IHT converges to zero.

[Fig. 4 panels: (a) house image; (b) patch from the house image; (c) patch representation magnitudes with DCT; (d) patch representation magnitudes with Haar.] Fig. 4.
House image (top left) and a random patch selected from it (top right), with the sorted magnitudes (in log scale) of the representation of this patch in the DCT (bottom left) and Haar (bottom right) bases.

recovered signal, e.g., from estimates of similar signals. This approach was applied to improve the quality of MRI [31], [40], [62] and CT scans [21], and also in the general context of sparse recovery [33], [41], [61], [62]. We demonstrate this approach, in combination with our proposed framework, for the recovery of a sparse vector under the discrete cosine transform (DCT), given information on its representation under the Haar transform. Our sampling matrix is $M = AD^*$, where $A \in \mathbb{R}^{700 \times 1024}$ is a random matrix with i.i.d. normally distributed entries, $D^*$ is the (unitary) DCT transform (that is applied to the signal before multiplying the DCT coefficients by the random matrix $A$), and $D$ is the DCT dictionary. We use random patches of size $32 \times 32$, normalized to have unit $\ell_2$ norm, from the standard house image. Note that such a patch is not exactly sparse in either the Haar or the DCT domain. See Fig. 4 for an example of one patch of the house image.

Without considering the side information of the Haar transform, one may recover $x$ (a representation of a patch in the DCT basis) by using PGD with the set $K = \{z : \|z\|_1 \le \|x\|_1\}$. Given $x$, we may recover the patch $Dx$. Assume that someone gives us oracle side information on the set of Haar columns corresponding to the largest coefficients that contain 95% of the energy in a patch $Dx$. While there are many ways to incorporate the side information in the recovery, we show here how IPGD can be used for this purpose. Denoting by $P^{\mathrm{oracle}}_{x,95\%}$ the linear projection onto this set of columns, one may apply IPGD with $p = D P^{\mathrm{oracle}}_{x,95\%} D^*$ and $K$.
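The structure of this side-information operator can be sketched with small, explicit basis matrices. In the sketch below, the orthonormal DCT and Haar constructions are standard ones (not taken from the paper's code), and the ordering of the synthesis/analysis steps in $p = D P D^*$ is our reading of the paper's notation: synthesize the patch from its DCT representation, project it onto the selected Haar vectors, and re-analyze.

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal DCT-II basis as a d x d matrix (rows are basis vectors)."""
    k = np.arange(d)
    D = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2 * d)) * np.sqrt(2.0 / d)
    D[0] /= np.sqrt(2.0)
    return D

def haar_matrix(d):
    """Orthonormal Haar basis for d a power of two (recursive construction)."""
    if d == 1:
        return np.ones((1, 1))
    H = haar_matrix(d // 2)
    top = np.kron(H, np.array([1.0, 1.0])) / np.sqrt(2.0)
    bot = np.kron(np.eye(d // 2), np.array([1.0, -1.0])) / np.sqrt(2.0)
    return np.vstack([top, bot])

def side_info_p(D, H, support):
    """Matrix realizing p on DCT representations: D (H_S^T H_S) D^T,
    where H_S holds the Haar basis vectors given as side information."""
    Hs = H[support]          # selected Haar basis vectors (rows)
    P = Hs.T @ Hs            # orthogonal projection onto their span
    return D @ P @ D.T       # act on representation vectors
```

With an energy-based support (e.g., the Haar indices capturing 95% of $\|Dx\|^2$), `side_info_p` plays the role of $p$ inside the IPGD iteration (30).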
Since $\|Dx\| = \|x\|$ (as $D$ is unitary), we have that

[Fig. 5 plot: reconstruction error vs. t (Iteration Number), curves PGD, IPGD oracle, IPGD oracle changing, IPGD, IPGD changing.]
Fig. 5. Reconstruction error as a function of the iterations for sparse recovery with side information. This demonstrates the convergence rate of (i) PGD with a projection onto the $\ell_1$ ball, compared to (ii) IPGD with oracle side information on the columns of the representation of $x$ in the Haar basis; (iii) IPGD with oracle side information that projects onto an increasing number of columns from the Haar basis, ordered according to their significance in representing $x$; (iv) IPGD with a projection onto the first 512 columns of the Haar basis; and (v) IPGD with a changing $p$ that projects onto an increasing number of columns from the Haar basis.

$\epsilon = 0.05$ in (24). Figure 5 compares PGD with $K$ and IPGD with $p$ and $K$. We average over 100 different randomly selected sensing matrices and patches. The number of columns in Haar that contain 95% of the energy is roughly $d/2$. Thus, the Gaussian mean width $\omega_{p(C)}$ in this case is roughly the width of the tangent cone of the $\ell_1$ norm at a $k$-sparse vector in the space of dimension $d/2$, which is smaller than $\omega(C_{\ell_1}(x))$. Thus, $\rho_p(C)$ is smaller than $\rho(C)$. Clearly, for less energy preserved in $x$ (i.e., a bigger $\epsilon$) we need fewer columns from Haar, which implies a smaller Gaussian mean width and a faster convergence rate $\rho_p(K)$. We have here a tradeoff between the approximation error $\epsilon$ we may allow and the convergence rate $\rho_p(K)$, which improves as $\epsilon$ increases.

Since projections onto smaller sets lead to faster convergence, we suggest, as in the previous example, to apply PGD with an oracle projection that uses fewer columns from the Haar basis in the first iterations (i.e., has a larger $\epsilon$) and then adds columns gradually throughout the iterations.
The third (red) line in Fig. 5 demonstrates this option, where the first iterations use a projection onto the columns that contain 50% of the energy of $x$, and then every 5 iterations the next 50 columns corresponding to the coefficients with the largest energy are added. We continue until the columns span 95% of the energy of the signal. Thus, IPGD with changing projections converges faster than IPGD with a constant $p = D P^{\mathrm{oracle}}_{x,95\%} D^*$ but reaches the same plateau.

Typically, oracle information on the coefficients of $x$ in the Haar basis is not accessible. Even so, it is still possible to use common statistics of the data to accelerate convergence. For example, in our case it is known that most of the energy of the signal is concentrated in the low-resolution Haar filters.

TABLE I
PSNR OF SUPER-RESOLUTION BY BICUBIC INTERPOLATION AND A PAIR OF DICTIONARIES WITH VARIOUS SPARSE CODING METHODS.

Image      | Bicubic | OMP  | IHT  | LIPGD
-----------|---------|------|------|------
baboon     | 23.2    | 23.5 | 23.4 | 23.6
bridge     | 24.4    | 25.0 | 24.8 | 25.1
coastguard | 26.6    | 27.1 | 26.9 | 27.2
comic      | 23.1    | 24.0 | 23.8 | 24.2
face       | 32.8    | 33.5 | 33.2 | 33.6
flowers    | 27.2    | 28.4 | 28.1 | 28.7
foreman    | 31.2    | 33.2 | 32.3 | 33.5
lenna      | 31.7    | 33.0 | 32.6 | 33.2
man        | 27.0    | 27.9 | 27.7 | 28.1
monarch    | 29.4    | 31.1 | 30.9 | 31.6
pepper     | 32.4    | 34.0 | 33.6 | 34.4
ppt3       | 23.7    | 25.2 | 24.6 | 25.5
zebra      | 26.6    | 28.5 | 28.0 | 28.9

[Fig. 6 plot: $\ell_1$ loss vs. t (iteration number), curves ISTA, LISTA, LISTA-MM.]
Fig. 6. The $\ell_1$ loss as a function of the iterations of ISTA, LISTA and LISTA-MM applied to patches from the house image. This demonstrates the faster convergence of the proposed LISTA-MM compared to LISTA, and the fast convergence of LISTA compared to ISTA.

Therefore, we propose to use IPGD with a projection $p$ that projects onto the first 512 columns of the Haar basis.
As before, it is possible to accelerate convergence by projecting first onto a smaller number of columns and then increasing the number as the iterations proceed (in this case we add columns until IPGD coincides with PGD). These two options are presented in the fourth and fifth lines of Fig. 5, respectively. Both of these options provide faster convergence, where IPGD with a fixed projection $p$ incurs a higher error, as it uses less accurate projections in the last iterations compared to PGD and IPGD with changing projections. The plateau of the latter is the same as that of the regular PGD (which is not attained in the graph due to its early stop) but is achieved with a much smaller number of iterations.

VII. LEARNING THE PROJECTION - LEARNED IPGD (LIPGD)

In many scenarios, we may not know what type of simple operator $p$ causes $P_K(p(\cdot))$ to approximate $\hat{K}$ in the best possible way. Therefore, a useful strategy is to learn $p$ for a given dataset. Assuming a linear $p$, we may rewrite (30) as
$$z_{t+1} = P_K\big(p(\mu M^* y) + p((I - \mu M^* M) z_t)\big). \quad (34)$$
Instead of learning $p$ directly, we may learn two matrices $A$ and $U$, where the first replaces $p\,\mu M^*$ and the second $p(I - \mu M^* M)$. This results in the iterations
$$z_{t+1} = P_K(Ay + U z_t), \quad (35)$$
which are very similar to those of LISTA in (7). The only difference between (35) and LISTA is the non-linear part, which is an orthogonal projection in the first and a proximal mapping in the second.

We apply this method to replace the sparse coding step in the super-resolution algorithm proposed in [68], where a pair of low- and high-resolution dictionaries is used to reconstruct the patches of the high-resolution image from the low-resolution one. In the code provided by the authors of [68], orthogonal matching pursuit (OMP) [44] with sparsity 3 is used.
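As a sketch, iteration (35) with a hard-threshold projection $P_K$ onto the $k$-sparse set can be written as below. The untrained initialization $A = \mu M^T$, $U = I - \mu M^T M$, which reduces (35) to plain (non-learned) IHT, is shown only as a starting point; in LIPGD these matrices are learned from data.

```python
import numpy as np

def hard_threshold(v, k):
    """P_K for K = {z : ||z||_0 <= k}: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def lipgd_forward(A, U, y, k, n_layers=3):
    """Unrolled iteration (35), z <- P_K(A y + U z), run for a fixed
    number of layers.  In LIPGD the matrices A and U are trained;
    passing A = mu * M^T and U = I - mu * M^T M recovers plain IHT."""
    z = np.zeros(U.shape[0])
    for _ in range(n_layers):
        z = hard_threshold(A @ y + U @ z, k)
    return z
```

Because each layer is one matrix-vector product per input plus a threshold, the inference cost of a 3-layer network matches that of 3 IHT iterations, which is the budget discussed next.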
The complexity of this strategy corresponds to IHT with 3 iterations. The target sparsity we use with IHT is higher ($k = 40$), as it was observed to provide better reconstruction results. Note that in IHT, unlike OMP, the number of iterations may differ from the sparsity level. For optimal hyperparameter selection (such as choosing the target sparsity level), we use the training set used for the training of the dictionary in [68], which contains 91 images.

Since IHT does not converge with only 3 iterations, we apply LIPGD to accelerate convergence. We use the same dictionary dimensions as in [68] ($30 \times 1000$ and $81 \times 1000$ for the low- and high-resolution dictionaries, respectively), and train an LIPGD network to infer the sparse code of the image patches in the low-resolution dictionary. Training of the weights is performed by stochastic gradient descent with batch size 1000 and Nesterov momentum [46] for adaptively setting the learning rate. We train the network using only the first 85 images in the training set, keeping the last 6 as a validation set. We reduce the training rate by a factor of 2 if the validation error stops decreasing. The initial learning rate is set to 0.001 and the Nesterov parameter to 0.9. We use the sparse representations of the training data calculated by IHT or LIPGD to generate the high-resolution dictionary as in [68].

Table I summarizes the reconstruction results of regular bicubic interpolation, the OMP-based super-resolution technique of [68] (with 3 iterations), and its versions with IHT and LIPGD (replacing OMP). It can be clearly seen that IHT leads to inferior results compared to OMP, since it does not converge in 3 iterations. LIPGD improves over both IHT and OMP, as the training of the network allows it to provide a good sparse approximation with only 3 iterations.
This demonstrates the efficiency of the proposed LIPGD technique, which has the same computational complexity as both OMP and IHT.

VIII. LEARNING THE PROJECTION - LISTA MIXTURE MODEL

Though the theory in this paper applies directly only to (35) (with some constraints on $A$ and $U$ that stem from the constraints on $p$), the fast convergence of LISTA may be explained by the resemblance of the two methods. The success of LISTA may be interpreted as learning to approximate the set $\hat{K}$ in an indirect way by learning the linear operators $A$ and $U$. In other words, it can be viewed as a method for learning a linear operator that, together with the proximal mapping $S_{f,\lambda}$, approximates a more accurate proximal mapping of a true unknown function $\hat{f}$ that leads to much faster convergence.

With this understanding, we argue that using multiple inexact projections may lead to faster convergence, as each can approximate more accurately a different part of the set $\hat{K}$. To show this, we propose a LISTA mixture model (MM), similar to the Gaussian mixture model proposed in [67], in which we train several LISTA networks, one for each part of the dataset. Then, once we get a new vector, we apply all the networks on it in parallel (and therefore with negligible impact on the latency, which is very important in many applications) and choose the one that attains the smallest value in the objective of the minimization problem (3).

We test this strategy on the house image by extracting from it patches of size $5 \times 5$, adding random Gaussian noise with variance $\sigma^2 = 25$ to each of them, and then removing the DC and normalizing each. We take 7/9 of the patches for training and 1/9 for validation and testing. We train LISTA to minimize directly the objective (3), as in [55], and stop the optimization after the error on the validation set increases.
For the LISTA-MM we use 6 LISTA networks, such that we train the first one on the whole data. We then remove the 1/6 of the data whose objective value in (3) is the closest to the one ISTA attains after 1000 iterations. We use this LISTA network as the initialization of the next one, which is trained on the rest of the data. We repeat this process by removing in the same way the part of the data with the smallest relative error and then training the next network. After training 6 networks, we cluster the data points by selecting for each patch the network that leads to the smallest objective error for it in (3), and fine-tune each network on its corresponding group of patches. We repeat this process 5 times. The objective error of (3) as a function of the number of iterations/depth of the networks is presented in Fig. 6. Indeed, it can be seen that partitioning the data, which leads to a better approximation, accelerates convergence.

Our proposed LISTA-MM strategy bears some resemblance to the recently proposed rapid and accurate image super resolution (RAISR) algorithm [53]. In this method, different filters are trained for different types of patches in natural images. This leads to improved quality in the attained up-scaled images with only minor overhead in the computational cost, leading to a very efficient super-resolution technique.

IX. CONCLUSION

In this work we suggested an approach to trade off approximation error against convergence speed. This is accomplished by approximating complicated projections by inexact ones that are computationally efficient. We provided theory for the convergence of an iterative algorithm that uses such an approximate projection and showed that, at the cost of an error in the projection, one may achieve faster convergence in the first iterations.
The larger the error, the smaller the number of iterations that enjoy fast convergence. This suggests that if we have a budget of only a small number of iterations (with a given complexity), then it may be worthwhile to use inexact projections, which can result in a worse solution in the long term but make better use of the given computational constraints. Moreover, we showed that even when we can afford a larger number of iterations, it may be worthwhile to use inexact projections in the first iterations and then change to more accurate ones at later stages.

Our theory offers an explanation of the recent success of neural networks in approximating the solution of certain minimization problems. These networks achieve accuracy similar to iterative techniques developed for such problems (e.g., ISTA for $\ell_1$ minimization) but with a much smaller computational cost. We demonstrated the use of this method for the problem of image super-resolution. In addition, our analysis provides a technique for estimating the solution of these minimization problems by using multiple networks but with fewer layers in each of them.

APPENDIX A
PROOF OF THEOREM 5.1

The proof of Theorem 5.1 relies on the following lemma.

Lemma A.1: Under the same conditions as Theorem 5.1,
$$\|P_C(p(I - \mu M^* M)(z_t - x))\| \le \rho_p(C)\,\|z_t - x\| + \epsilon\,(1 + \rho_p(C))\,\|x\|. \quad (36)$$

Proof: Since $z_t = P_K(pv)$ for a certain vector $v$, we have
$$\|P_C(p(I - \mu M^* M)(z_t - x))\| \le \sup_{v \in \mathbb{R}^d \text{ s.t. } \|P_K(pv) - x\| \le \|z_t - x\|} \|P_C(p(I - \mu M^* M)(P_K(pv) - x))\| \quad (37)$$
$$= \sup_{v \in \mathbb{R}^d \text{ s.t. } \|P_D(pv - x)\| \le \|z_t - x\|} \|P_C(p(I - \mu M^* M) P_D(pv - x))\|,$$
where the last equality follows from Lemma 2.3. Using the triangle inequality with (37) leads to
$$\|P_C(p(I - \mu M^* M)(z_t - x))\| \le \sup_{v \in \mathbb{R}^d \text{ s.t. } \|P_D(pv - x)\| \le \|z_t - x\|} \|P_C(p(I - \mu M^* M) P_D(pv - px))\| + \left\| P_C\!\left( p(I - \mu M^* M) \frac{P_D(x - px)}{\|P_D(x - px)\|} \right) \right\| \|P_D(px - x)\|. \quad (38)$$

We turn now to bound the first and second terms on the right-hand side (rhs) of (38). For the second term, note that since $P_D(x - px) \in D$, we have
$$\left\| P_C\!\left( p(I - \mu M^* M) \frac{P_D(x - px)}{\|P_D(x - px)\|} \right) \right\| \le \sup_{v \in D \cap B^d} \|P_C(p(I - \mu M^* M)v)\| \le \sup_{u, v \in C \cap B^d} \|u^* p(I - \mu M^* M)v\| \le \rho(K), \quad (39)$$
where the second inequality follows from Lemma 2.2 and the fact that $D \subset C$. In the last inequality we replace $u^* p = (p^* u)^*$ by $\tilde{u}$ and take the supremum over it instead of over $u$. In addition, $\|P_D(px - x)\| \le \epsilon\|x\|$ from (25). For the first term on the rhs of (38) we use the inverse triangle inequality $\|P_D(pv - px)\| - \|P_D(px - x)\| \le \|P_D(pv - x)\|$ and (25). Combining the results leads to
$$\|P_C(p(I - \mu M^* M)(z_t - x))\| \le \sup_{v \in \mathbb{R}^d \text{ s.t. } \|P_D(p(v - x))\| \le \|z_t - x\| + \epsilon\|x\|} \|P_C(p(I - \mu M^* M) P_D(p(v - x)))\| + \epsilon\rho(K)\|x\| \quad (40)$$
$$= \sup_{u \in C \cap B^d,\; v \in \mathbb{R}^d \text{ s.t. } \|P_D(pv)\| \le 1} \|u^* p(I - \mu M^* M) P_D(pv)\| \,(\|z_t - x\| + \epsilon\|x\|) + \epsilon\rho(K)\|x\|$$
$$= \sup_{u \in C \cap B^d,\; pv \in D \cap B^d} \|(pu)^* (I - \mu M^* M) pv\| \,(\|z_t - x\| + \epsilon\|x\|) + \epsilon\rho(K)\|x\|$$
$$\le \sup_{u, v \in C \cap B^d} \|(pu)^* (I - \mu M^* M) pv\| \,(\|z_t - x\| + \epsilon\|x\|) + \epsilon\rho(K)\|x\|,$$
where the first equality follows from Lemma 2.2 and the second from the definition of a projection onto a set. The last inequality follows from the fact that $\{pv \in D \cap B^d\} \subset D \cap B^d \subset C \cap B^d$. Reordering the terms and using the definition of $\rho_p(C)$ leads to the desired result. □

We turn now to the proof of Theorem 5.1.
Proof: The IPGD error at iteration $t+1$ is
$$\|z_{t+1} - x\| = \|P_K(p(z_t + \mu M^*(y - M z_t))) - x\|.$$
( 4 1) Using Lemm a 2.3 an d the fact that y = Mx we h av e k z t +1 − x k = k P D ( p ( I − µ M ∗ M ) ( z t − x ) − x + p x ) k (4 2 ) ( a ) ≤ kP D ( p ( I − µ M ∗ M ) ( z t − x )) k + kP D ( x − p x ) k ( b ) ≤ kP C ( p ( I − µ M ∗ M ) ( z t − x )) k + ǫ k x k , where ( a ) follows fro m the con vexity of D and th e tr iangle in- equality; and ( b ) from (25) and Lemma 2 .4. U sin g Lemma A.1 with (42) leads to k z t +1 − x k ≤ ρ p ( K ) k z t − x k + ǫ (2 + ρ p ( K )) k x k . (4 3) Applying the in equality in ( 43) recur si vely provides the de- sired result.  A P P E N D I X B P R O O F O F T H E O R E M 5 . 2 The pro of Theor e m 5 .2 relies on the following lemma. Lemma B.1: Under the same conditions of Th eorem 5.2, kP K−K ( p ( I − µ M ∗ M ) ( z t − x )) k (44) ≤ ρ p ( K ) k z t − x k + ǫ (2 ρ ( K ) + ρ p ( K )) k x k . 13 Pr oof: Using Lemma 2 . 3 an d the fact that z t = P K ( p v ) for a certain vector v ∈ R d and the n the triangle inequ ality , leads to kP K−K ( p ( I − µ M ∗ M ) ( z t − x )) k (45) ≤ s up v ∈ R d s.t. kP D ( p v − x ) k≤k z t − x k kP K−K ( p ( I − µ M ∗ M ) P D ( p v − x )) k ≤ s up v ∈ R d s.t. kP D ( p v − x ) k≤k z t − x k kP K−K ( p ( I − µ M ∗ M ) P D ( p v − p x )) k +     P K−K  p ( I − µ M ∗ M ) P D ( p v − x ) − P D ( p v − p x ) kP D ( p v − x ) − P D ( p v − p x ) k      · kP D ( p v − x ) − P D ( p v − p x ) k . Using (29) an d th e sam e step s of the pro of of Theorem 3 .1, w e may boun d th e second term in the rhs of (45) by 2 ǫ ρ ( K ) k x k . This lead s to kP K−K ( p ( I − µ M ∗ M ) ( z t − x )) k (46) ≤ s up v ∈ R d s.t. kP D ( p v − x ) k≤k z t − x k kP K−K ( p ( I − µ M ∗ M ) P D ( p v − p x )) k +2 ǫρ ( K ) k x k . From the in verse trian gle in equality together with (29), we have that kP D ( p v − x ) k ≥ kP D ( p v − p x ) k − ǫ k x k . Thus, kP K−K ( p ( I − µ M ∗ M ) ( z t − x )) k (47 ) ≤ sup v ∈ R d s.t. 
kP D ( p ( v − x )) k≤k z t − x k + ǫ k x k kP K−K ( p ( I − µ M ∗ M ) P D ( p ( v − x ))) k +2 ǫρ ( K ) k x k ≤ ρ p ( K ) ( k z t − x k + ǫ k x k ) + 2 ǫρ ( K ) k x k , where the last in equality f ollows from th e same lin e o f argument used for der iving ( 40) in Lemma A.1 (with K − K instead of C ).  W e now turn to the proof of Theorem 5 . 2. Pr oof: De noting ˜ v = ( I − µ M ∗ M ) ( z t − x ) , the IPGD erro r at iteratio n t + 1 o b eys k z t +1 − x k = kP D ( p ˜ v − x + p x ) k (48) ( c ) ≤ kP D ( p ˜ v ) k + kP D ( p ˜ v − x + p x ) − P D ( p ˜ v ) k ( d ) ≤ κ K kP K − K ( p ˜ v ) k + ǫ k x k , where ( c ) follows fro m the triangle inequality; and ( d ) from (28) and Lemma 2.4. Usin g Lemma B.1 with (48), we g et k z t +1 − x k (49) ≤ ρ p ( K ) κ K k z t − x k + ǫ (2 ρ ( K ) κ K + ρ p ( K ) κ K + 1) k x k . Applying (4 9) recursively leads to the desire d r e sult.  A C K N OW L E D G M E N T S RG is partially suppo rted by GI F grant no. I - 2432- 406.1 0/2016 and ERC-StG grant no . 757 497 (SP ADE) . YE is partially suppo rted by th e ERC g r ant no. 6468 04-ERC-COG- BNYQ. AB is partially supported by ERC-StG RAPID. GS is partially suppo rted by ONR, NSF , NG A , and ARO. W e thank Dr . Pablo Sprechma n n fo r early work and in sights into this line of researc h , and Prof. Ron Kimmel and Prof. Gilles Blanchard for insigh tful commen ts. R E F E R E N C E S [1] D. Am elunx en, M. Lotz, M. B. McCoy , and J. A. Tro pp. L i ving on the edge: phase tra nsitions in conv ex programs with random data. Informatio n and Infer ence , 3(3):224–294, 2014. [2] M. Andrycho wicz, M.and Denil, S. G ´ omez, M. W . Hoffman, D. Pfa u, T . Schaul, and N. de Frei tas. Learning to learn by grad ient descent by gradien t descen t. In Advance s in Neura l Information Proce ssing Systems (NIPS) , pages 3981–3989 . 2016. [3] J.-F . Aujol and C. Dossal. Stability of over -relaxation s for th e forward- backw ard algor ithm, appl ication to FIST A. 
SIAM Journa l on Optimiza- tion , 25(4):2408–2433, 2015. [4] R. G. Baraniuk , V . Cevh er, M. F . Duarte, and C. Hegde. Model-based compressi ve sensing. IEEE T rans. Inf. T heory , 56(4):1982–200 1, April 2010. [5] A. Beck and Y . C. Eldar . Sparsity constrained nonlinea r opti mization: Optimali ty condit ions and algori thms. SIAM Optimizatio n , 23(3):1480– 1509, Oct. 2013. [6] A. Beck and M. T eboulle. Fast gradient-base d algorit hms for constrained total v ariation image denoising and deblurring problems. IEEE T rans- actions on Imag e Pr ocessing , 18(11 ):2419–2434, Nov 2009. [7] A. Beck and M. T eboul le. A fast iterati ve shrinkage- thresholding algorit hm for linear in verse problems. SIAM J . Img . Sci. , 2(1):183 –202, Mar . 2009. [8] D. P . Be rtsekas. Nonlinear pr ogr amming . Athena Scien tific, 1999. [9] T . Blumensat h. Sampling and rec onstructing signa ls from a union of linea r subspaces. IEEE Tr ans. Inf. Theory , 57(7):4660–4671, 2011. [10] T . Blumensath and M. E . Davie s. Iterati ve hard threshol ding for compressed sensing. Appl. Comput. Harmon. Anal , 27(3):265 – 274, 2009. [11] M. Bor gerding and P . Schniter . Onsager-c orrected deep learni ng for sparse linear in verse problems. arXiv:1607.0596 6 , 2016. [12] L. Bottou and O. Bousquet. T he tradeof fs of large scale learning. In Advances in Neural Information Proc essing Systems (NIPS) , pages 161– 168, 2008. [13] L. Bottou, F . E . Curtis, and J. Nocedal. Optimiz ation methods for large- scale machine lea rning. arXiv:1606 .04838 , 2016. [14] S. Boyd, N. Parikh, E. Chu, B. Pele ato, and J. E ckstein. Distributed optimiza tion and statistica l learni ng via the alterna ting dire ction method of multipliers. F ound. T rends Mach. Learn. , 3(1): 1–122, January 2011 . [15] A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions of systems of eq uations to sparse model ing of signa ls and images. SIAM Revie w , 51(1):34–81 , 2009. [16] J. J. Bruer , J. A. Tropp, V . 
Ce vher , and S. R. Becker . Designing statistica l estimato rs that balance sample size, risk, and computational cost. IEEE J ournal of Selected T opics in Signal Pr ocessing , 9(4):612–624, June 2015. [17] E. J. Cand ` es and C. Fernandez-Gran da. T ow ards a m athemat ical theor y of super-resolu tion. Communicati ons on Pure and A pplied Mathemati cs , 67(6):906– 956, 2014. [18] E.J. Cand ` es and T . T ao. Decoding by linear programming. IE EE T rans. Inf. Theory , 51(12 ):4203 – 4215, dec . 2005. [19] V . Chandrasekara n and M. I. Jordan. Computational and statistical tradeof fs via con vex rela xation. Pro ceedings of the National Academy of Sciences (PN AS) , 110(13): E 1181–E1190, 2013. [20] V . Chandrasekara n, B. Recht, P . A. Parril o, and A. S. Wil lsky . T he con- ve x geometry of linear inv erse problems. F oundations of Computat ional Mathemat ics , 12(6):805–849, 2012. [21] G.-H. Chen, J. T ang, and S. Leng. Pri or image constrained compressed sensing (PICCS): A method to accurately reconstruct dynamic CT images from highly undersampled project ion data sets. Medical Physics , 35(2):660– 663, Mar . 2008. [22] P . L. Combettes and J.-C. Pesquet. Proximal splitting methods in s ignal processing. In H. H. Bauschke, S. R. Burach ik, L. P . Combett es, V . Elser , R. D. Luke, and H. W olk owicz, editors, F ixed-P oint A lgorithms for In verse Pr oblems in Science and Engineering , pages 185–212. Springer Ne w Y ork, 2011. [23] A. Dani ely , N. Linial, and S. Sha lev-Shw artz. More data spee ds up traini ng time in learning halfspa ces over sparse vecto rs. In International Confer ence on Neur al Information Pr ocessing Systems (NIPS) , pag es 145–153, 2013. [24] I. Daubechi es, M. Defrise, and C. De Mol. An iterat ive thresholding algorit hm for linea r in verse proble ms with a sparsity constra int. Commu- nicati ons on Pur e and Applied Mathematics , 57(11):14 13–1457, 2004. [25] M. F . Dua rte and R. G. Baraniuk. Spectral compressiv e sensing. Appl. 
Comput. Harmon. Anal. , 35(1):111 – 129, 2013. 14 [26] J. Duchi, S . Shale v-Shwartz , Y . Singer , and T . Chandra. Ef ficient project ions onto the ℓ 1 -ball for learning in high dimensions. In Pr oceedings of the 25th Annual Internat ional Confer ence on Machine Learning (ICML) , 2008. [27] M. Elad. Sparse and Redundant Representa tions: F r om Theory to Applicat ions in Signal and Imag e Proc essing . Springer Publishing Compan y , Incorporat ed, 1st edition, 2010. [28] M. Elad, B. Matalon, and M. Zibule vsky . Coordinate and subspace optimiza tion methods for linear least squares with non-quad ratic regu- lariza tion. Appl. Comp ut. Harmon. Anal. , 23(3): 346 – 367, 2007. [29] Y . C. Eldar . Sampli ng Theory: Bey ond B andlimit ed Systems . Cambri dge Uni versity Press, 2015. [30] Y . C. Eldar and G. Kutyni ok. Compr essed Sensing: Theory and Applicat ions . Cambridge Uni versi ty Press, 2012. [31] J. A. Fessler , N. H. Clinthor ne, and W . L . Rogers. Regula rized emission image reconstructi on using imperfect side information. IEEE T rans. Nucl. Sc i. , 39(5):14 64–1471, Oct. 1992. [32] S. Foucart and H. Rauhut . A Mathematical Intr oduction to Compr essive Sensing . Springer Publishin g Compan y , Incorporat ed, 1st edition, 2013. [33] M. P . Friedlande r, H. Mansour , R. Saab, and O. Y ilmaz. Recove ring compressi vely sample d signals using pa rtial support informatio n. IEEE T rans. Inf. Theory , 58(2):1 122–1134, Feb 2012. [34] D. Gabay and B. Mercier . A dual algori thm for the solution of nonlinear v ariational problems via finite ele ment approximati on. Computer s and Mathemat ics with Applications , 2(1):1 7–40, 1976. [35] R. Giryes, S . Nam, M. Elad, R. Gribon v al, and M.E. Davies. Greedy- lik e algorithms for the cosparse analy sis model. Linear Alg ebra and its Applicatio ns , 441(0):22 – 60, 2014. Special Issue on Sparse Approximate Solution of Linear Syste ms. [36] R. Giryes and D. Needell. 
Greedy signal s pace methods for incoherenc e and beyond. Appl. Comput. Harmon. Anal. , 39(1) :1 – 20, 2015. [37] T . Goldstein and S. Os her . The split bregman method for l1-re gularized problems. SIAM Jou rnal on Ima ging Scienc es , 2(2):323–343, 2009. [38] K. Gregor and Y . LeCun. Learning fast approximati ons of sparse coding. In Proc eedings of the Int ernational Confe rence on M achine L earning (ICML) , 2010. [39] C. He gde, P . Indyk, and L. Schmidt. Approximation -tolerant model- based compressi ve sensing. In ACM Symposium on Discre te Algorithms (SOD A ) , 2014. [40] A. O. Her o, R. Piramuthu, J. A. Fessler , and S. R. Titus. Minimax emission computed to m ography using high-resolution anat omical side informati on and b-spline models. IEEE T rans. Inf . Theory , 45(3):9 20– 938, A pr . 1999. [41] M. A. Khajehne jad, W . Xu, A. S. A vesti mehr, and B. Hassibi. W eighted ℓ 1 minimizat ion for sparse recov ery with prior information. In IEEE Internati onal Sympo sium on Information Theory (ISIT) , pages 483–487, June 2009. [42] A. Kipnis, Y . C. Eldar , and A. J. Goldsmith. Sampling stationa ry signals subject to bitrat e constraint s. , 2016 . [43] P . Machart, S. Anth oine, and L. Baldassarre. Optimal comput ational trade-o ff of inexa ct proximal methods. In Mul ti-T rade-of fs in Machin e Learning (NIPS workshop) , 2012. [44] S. Mallat and Z. Zhang. Matching pursuits with time-freq uency dicti onaries. IEEE T rans. Signal Proc ess. , 41:3 397–3415, 1993. [45] T . Moreau and J. Bruna. Understand ing neural sparse coding with m atrix fac torization. ICLR , 2017. [46] Y . Nestero v . A method of solving a con vex programming problem with con ver gence rate o (1 /k 2 ) . Sov iet Mathemati cs Doklady , 27(2):372 – 376, 1983. [47] S. Oymak, B. Recht, a nd M. Soltanolko tabi. Sharp ti me–data tradeof fs for linear in verse problems. to appear in IEEE T rans. Inf . Theory , 2018. [48] M. Pilanci and M. J. W ainwright. 
Ne wton sketch: A linear- time optimization algo rithm with li near-quadra tic con vergenc e. arXiv:1505.02250 , 2015. [49] M. Pilanc i and M. J. W ainwrigh t. Randomized sketch es of con ve x programs with sharp guarantees. IEEE T rans. Inf. Theory , 61(9), 2015. [50] M. Pil anci and M. J. W ainwrig ht. Iterati ve hessian sk etch: Fast and accura te s olutio n approximation for constraine d lea st-squares. Jou rnal of Machine Learni ng Researc h (JMLR) , (17), 2016. [51] Y . Plan and R. V ershynin. Robust 1-bit compressed sensing and sparse logisti c regression: A con vex progra mm ing app roach. IEEE T rans. Inf . Theory , 59(1):482–494, Jan. 2013. [52] T . Remez, O. Lita ny , a nd A. M. Bronstein. A picture i s w orth a bil lion bits: Rea l-time image reconstru ction from dense binary pixels. In 18th Internati onal Confere nce on Computati onal Photograp hy (ICCP) , 2015. [53] Y . Romano , J. Isido ro, and P . Mil anfar . RAISR: Rapid and accurat e image super resolution. IEEE T rans. on Computati onal Imaging , 3(1):110–1 25, Mar . 2017. [54] S. Shale v-Shwart z and N. Srebro. Svm optimizatio n: Inv erse dependence on training set siz e. In Int ernational Confer ence on Mac hine Learning (ICML) , pages 928–93 5, 2008. [55] P . Sprechmann, A. M. Bronstein, and G. Sapiro. Learning ef ficient sparse and low rank models. IEEE T rans. P attern Analysis and Machin e Intell igence , 37(9):1821–1833, Sept. 2015. [56] W . S u, S. Boyd, and E . J. Cand ` es. A dif ferential equation for modeling nestero vs accel erated gradient m ethod: Theory and insights. In Advance s in Neural Inf ormation Pr ocessing Syste ms (NIPS) , 2014. [57] T . Tirer and R. G iryes. Generalizin g cosamp to signals from a union of lo w dimensiona l linear subspaces. arX iv abs/1703.01920 , 2017. [58] J. T ompson, K. Schl achter , P . Sprechmann, and K. Perlin. Accele rating Eulerian fluid simulation with con voluti onal net works. 
Pr oceedings of the International Confe rence on Machi ne Learning (ICM L) , 2017. [59] E. T reister and I. Y avneh . A multile vel iterat ed-shrinkage approach to ℓ 1 penali zed least-squares minimization. IEE E T rans. Signal Pr ocess. , 60(12):631 9–6329, Dec. 2012. [60] S. Tu, S. V enkataraman, and M. I. Jordan B. Recht A. C. Wil- son, A. Gittens. Breakin g localit y accelera tes block Gauss-Seidel. arXiv:1701.03863 , 2017. [61] N. V aswa ni and W . Lu. Modified-CS: Modifying compressi ve sensing for problems with partial ly known support . IEEE T rans. Sig. Pr oc. , 58(9):4595 –4607, Sept 2010. [62] L. W eizman, Y . C. E ldar , and D. Ben Bashat. Compressed sensing for longitudina l MRI: An adapti ve -weighted approac h. Medical Physics journal , 42(9):5195–5208 , Aug. 2015. [63] A. Wibisono , A. C. W ilson, and M. I. Jordan. A va riational perspecti ve on acceler ated methods in opt imization. National Acade my of Scienc es (PNAS) , 230(8):E7351 –E7358, 2016. [64] A. C. W ilson, B. Recht , and M. I. Jordan. A lyapuno v analysis of momentum methods in optimizat ion. , 2016 . [65] B. Xin, Y . W ang, W . Gao, and D. Wipf . Maximal sparsity with deep netw orks? In NIPS , 2016. [66] Z. Y ang, Z. W ang, H. Liu, Y . C. Eldar , and T . Z hang. Sparse nonline ar regression: Parameter estimation and asymptotic inferenc e under noncon ve xity . In International Confe rence on Mac hine Learning (ICML) , 2016. [67] G. Y u, G. Sapiro, and S. Mallat. Solving in verse problems with piece wise linea r estimators: From G aussian m ixture models to structure d sparsity . IE EE T rans. on Ima ge Proce ssing , 21(5):2481 –2499, may 2012. [68] R. Ze yde, M. E lad, and M. Protter . On s ingle ima ge scale-up usin g sparse-repre sentations. In P r oceedings of the 7th international confer- ence on Curves and Surfaces , pages 711–730, Berlin , Heidelber g, 2012. Springer -V erlag.
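The recursions in (43) and (49) share the form $e_{t+1} \le \rho e_t + c$ with $\rho < 1$, and unrolling them $t$ times gives $e_t \le \rho^t e_0 + c(1 - \rho^t)/(1 - \rho)$: linear convergence down to an error floor $c/(1 - \rho)$ proportional to $\epsilon\|x\|$. The following minimal sketch checks this unrolled form numerically; the values of `rho` and `c` are hypothetical stand-ins for $\rho_p(K)$ and $\epsilon(2 + \rho_p(K))\|x\|$, not quantities taken from the paper.

```python
import numpy as np

# Hypothetical contraction factor and additive term: rho stands in for
# rho_p(K) in (43) and c for eps * (2 + rho_p(K)) * ||x||.
rho, c, e0 = 0.6, 0.05, 10.0

errors = [e0]
for t in range(50):
    errors.append(rho * errors[-1] + c)   # one application of the bound (43)

# Closed form obtained by unrolling the recursion t times.
closed_form = [rho**t * e0 + c * (1 - rho**t) / (1 - rho) for t in range(51)]
assert np.allclose(errors, closed_form)

print(round(errors[-1], 4), round(c / (1 - rho), 4))  # prints "0.125 0.125"
```

After 50 iterations the geometric term $\rho^t e_0$ is negligible and only the floor $c/(1-\rho) = 0.125$ remains; the same template applies to (49) with $\rho$ replaced by $\rho_p(K)\kappa_K$.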
