Iterative Hard Thresholding for Compressed Sensing
Authors: Thomas Blumensath, Mike E. Davies (Institute for Digital Communications and the Joint Research Institute for Signal and Image Processing, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh EH9 3JL, UK; e-mail: thomas.blumensath@ed.ac.uk, mike.davies@ed.ac.uk)
Abstract

Compressed sensing is a technique to sample compressible signals below the Nyquist rate, whilst still allowing near optimal reconstruction of the signal. In this paper we present a theoretical analysis of the iterative hard thresholding algorithm when applied to the compressed sensing recovery problem. We show that the algorithm has the following properties (made more precise in the main text of the paper):

• It gives near-optimal error guarantees.
• It is robust to observation noise.
• It succeeds with a minimum number of observations.
• It can be used with any sampling operator for which the operator and its adjoint can be computed.
• The memory requirement is linear in the problem size.
• Its computational complexity per iteration is of the same order as the application of the measurement operator or its adjoint.
• It requires a fixed number of iterations depending only on the logarithm of a form of signal to noise ratio of the signal.
• Its performance guarantees are uniform in that they only depend on properties of the sampling operator and signal sparsity.

I. INTRODUCTION

For more than fifty years, the Nyquist-Shannon sampling theorem [1], [2] was generally used as the foundation of signal acquisition systems. Under this theory, it was commonly held that signals have to be sampled at twice the signal bandwidth. Whilst this is true for general band-limited signals, the theory does not account for additional signal structure that might be known a priori. The recently emerging field of compressed sensing [3], [4], [5], [6], [7] and the related theory of signals with a finite rate of innovation [8] start from another premise. In compressed sensing, signals are assumed to be sparse in some transform domain. This sparsity constraint significantly reduces the size of the set of possible signals compared to the signal space dimension. This is easily visualised if one imagines the set of all images (say of dimension 256 × 256) with pixel values in the range from zero to one. To get an idea of how scant the set of true or interesting images is in this space, randomly draw images from this set, that is, draw the pixels from uniform distributions. One could spend the rest of one's life doing this without encountering anything remotely resembling an image. In other words, the set of images generally of interest (such as your recent holiday snapshots) occupies a minute subset of all possible images.

This implies that sparse signals have an information content much smaller than that implied by the Nyquist rate. Instead of requiring 65536 numbers to represent a 256 × 256 image, if the image has a sparse representation, it is possible to represent the same image using many fewer bits of information. This property of many signals is exploited in compressed sensing. Instead of taking samples at the Nyquist rate, compressed sensing uses linear sampling operators that map the signal into a small (compared to the Nyquist rate) dimensional space. This process is a combination of sampling and compression, hence the name compressed sensing or compressive sampling. Contrary to the Nyquist-Shannon theory, in compressed sensing, reconstruction of the signal is non-linear.
One of the important contributions of the seminal work by Candès, Romberg and Tao [4], [5], [6] and Donoho [7] was to show that, under certain conditions on the sampling operator, linear programming algorithms can be used to reconstruct the original signal with high accuracy. Another set of algorithms which can be shown to efficiently reconstruct signals from compressed sensing observations are greedy methods. A by now traditional approach is Orthogonal Matching Pursuit [9], which was analysed as a reconstruction algorithm for compressed sensing in [10]. Better theoretical properties were recently proven for a regularised Orthogonal Matching Pursuit algorithm [11], [12]. Even more recently, the Subspace Pursuit [13] and the nearly identical Compressive Sampling Matching Pursuit (CoSaMP) [14] algorithms were introduced and analysed for compressed sensing signal reconstruction. Of all of these methods, CoSaMP currently offers the most comprehensive set of theoretical performance guarantees. It works for general sampling operators, is robust against noise, and its performance is uniform in that it only depends on a property of the sampling operator and the sparsity of the signal, but not on the size of the non-zero signal coefficients. Furthermore, it requires minimal storage and computations, and works with (up to a constant) a minimal number of observations.

In a previous paper [15], iterative hard thresholding algorithms were studied. In particular, their convergence to fixed points of ℓ0 regularised (or constrained) cost functions could be proven. In this paper, it is shown that one of these algorithms (termed from now on IHT_s) has performance guarantees similar to those of CoSaMP.

A. Paper Overview

Section II starts out with a definition of sparse signal models and a statement of the compressed sensing problem. In Section III, we then discuss an iterative hard thresholding algorithm. The rest of this paper shows that this algorithm is able to recover, with high accuracy, signals from compressed sensing observations. This result is formally stated in the theorem and corollary in the first subsection of Section IV. The rest of Section IV is devoted to the proof of this theorem and its corollary. The next section, Section V, takes a closer look at a stopping criterion for the algorithm, which guarantees a certain estimation accuracy. The results of this paper are similar to those for the CoSaMP algorithm of [14], and a more detailed comparison is given in Section VI.

B. Notation

The following notation will be used in this paper. x is an M-dimensional real or complex vector. y is an N-dimensional real or complex vector. Φ will denote an M × N real or complex matrix, whose transpose (Hermitian transpose) will be denoted by Φ^T. Many of the arguments in this paper will use sub-matrices and sub-vectors. The letters Γ, B and Λ will denote sets of indices that enumerate the columns in Φ and the elements in the vectors y.
Using these sets as subscripts, e.g., Φ_Γ or y_Γ, we mean the matrices (or vectors) formed by removing from the matrix (vector) all columns (elements) other than those in the set. We also have occasion to refer to quantities in a given iteration. Iterations are counted using n or k. For sets, we use the simplified notation Γ^n, whilst for vectors, the iteration count is given in square brackets, y[n]. For the definition of the restricted isometry property used in the research literature, we use the Greek letter δ_s to denote the restricted isometry constant. However, in this paper, we use a slightly modified definition of the property. The associated restricted isometry constant will be labelled by β_s throughout. The following norms are used repeatedly: ‖·‖_2 is the Euclidean vector norm or, for matrices, the operator norm from ℓ2 to ℓ2. We will also need the vector ℓ1 norm ‖·‖_1. The notation ‖·‖_0 will denote the number of non-zero elements in a vector. For a general vector y, we use y_s to denote any of the best s-term approximations to y. The difference between the two will be y_r = y − y_s. The support, that is, the index set labelling the non-zero elements in y_s, is defined as Γ* = supp{y_s} and similarly Γ^n = supp{y[n]}, where y[n] is the estimate of y in iteration n. Finally, the set B^n = Γ* ∪ Γ^n is a superset of the support of the error r[n] = y_s − y[n]. Set difference is denoted using · \ ·.

II. SPARSITY AND COMPRESSED SENSING

A. Sparse Signal Models

In this section we formalise the notion of sparsity and sparse signal models. A vector will be called s-sparse if no more than s of its elements have non-zero values. We will talk of a best s-sparse approximation to a vector y to mean any one of the s-sparse vectors y_s that minimise ‖y_s − y‖_2.

Most signals in the real world are not exactly sparse. Instead, they are often well approximated by an s-sparse signal. A good example are images, whose wavelet transforms have most of their energy concentrated in relatively few coefficients. To model such behaviour, we use the following notion. Assume the elements in a vector y are reordered such that |y_i| ≥ |y_{i+1}|. A signal is called p-compressible if, for some constant R, the coefficients satisfy

    |y_i| ≤ R i^{-1/p},    (1)

that is, the reordered coefficients decay with a power law. The importance of these p-compressible signals is that they can be well approximated using exactly sparse signals. Let y_s be any best s-term approximation of a p-compressible signal y; then it is easy to show that

    ‖y − y_s‖_2 ≤ c R s^{1/2 − 1/p}    (2)

and

    ‖y − y_s‖_1 ≤ c R s^{1 − 1/p},    (3)

where the constants c are different in the two bounds and depend only on p, but not on s. Note that the above two terms bound the error in terms of the ℓ1 as well as the ℓ2 norm. These two norms will appear in the main result derived in this paper and can therefore be used to derive performance guarantees for the recovery of p-compressible signals.
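The approximation bounds (2) and (3) are easy to probe numerically. The following minimal NumPy sketch (not from the paper; the parameter values and the helper name best_s_term are illustrative assumptions) builds a p-compressible vector, keeps its s largest-magnitude entries as a best s-term approximation, and compares the ℓ2 and ℓ1 tails with R s^{1/2−1/p} and R s^{1−1/p}, which should agree up to the constants c.

```python
# Hedged numerical sketch (not from the paper): check the decay bounds (2) and (3)
# for a synthetic p-compressible signal. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, s, p, R = 1024, 20, 0.5, 1.0

# coefficients obeying |y_(i)| <= R * i^(-1/p) once sorted by magnitude
magnitudes = R * np.arange(1, N + 1) ** (-1.0 / p)
y = rng.permutation(magnitudes) * rng.choice([-1.0, 1.0], size=N)

def best_s_term(y, s):
    """Best s-term approximation: keep the s largest-magnitude entries, zero the rest."""
    ys = np.zeros_like(y)
    keep = np.argsort(np.abs(y))[-s:]
    ys[keep] = y[keep]
    return ys

ys = best_s_term(y, s)
tail = y - ys
# bounds (2) and (3) hold up to constants c depending only on p
print("l2 tail:", np.linalg.norm(tail), "  R*s^(1/2-1/p):", R * s ** (0.5 - 1.0 / p))
print("l1 tail:", np.linalg.norm(tail, 1), "  R*s^(1-1/p):", R * s ** (1.0 - 1.0 / p))
```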
B. Compressed Sensing

We assume signals y to be representable as real or complex vectors of finite length. Compressed sensing samples such signals using a linear mapping Φ into an M-dimensional real or complex observation space. In matrix notation, the observed samples x are

    x = Φ y + e.    (4)

Here, Φ is the linear sampling operator and e models possible observation noise due to, for example, sensor noise or quantisation errors in digital systems. In compressed sensing, two main problems have to be addressed. The first problem, not studied in detail in this paper, is to design measurement systems Φ that possess certain desirable properties (such as the restricted isometry property of the next subsection), which allow for an efficient estimation of y. The second problem, which is the focus of this paper, is the study of concrete algorithms for the efficient estimation of y, given only x and Φ.

C. The Restricted Isometry Property

The analysis of algorithms for compressed sensing relies heavily on the following property of the observation matrix Φ. A matrix Φ̂ satisfies the Restricted Isometry Property (RIP) [4] if

    (1 − δ_s) ‖y‖_2^2 ≤ ‖Φ̂ y‖_2^2 ≤ (1 + δ_s) ‖y‖_2^2    (5)

for all s-sparse y. The restricted isometry constant δ_s is defined as the smallest constant for which this property holds for all s-sparse vectors y.

Instead of using the above restricted isometry property, we will use the re-scaled matrix Φ = Φ̂ / √(1 + δ_s), which satisfies the following non-symmetric isometry property, equivalent to the RIP defined above:

    (1 − β_s) ‖y‖_2^2 ≤ ‖Φ y‖_2^2 ≤ ‖y‖_2^2    (6)

for all s-sparse y, where β_s = 1 − (1 − δ_s)/(1 + δ_s). We will say that for a matrix Φ the RIP holds for sparsity s if β_s < 1. Throughout this paper, when referring to the RIP, we mean in general this slightly modified version, for which we always use the letter β. If we need to refer to the standard RIP, for the non-normalised matrix Φ̂, we use the letter δ. This is only required in order to compare our results to others in the literature.

We also need the following properties of the RIP, derived for example in [14]. Note that the RIP gives an upper bound on the largest and a lower bound on the smallest singular values of all sub-matrices of Φ with s columns. Therefore, for all index sets Γ and all Φ for which the RIP holds with s = |Γ|, the following inequalities hold trivially:

    ‖Φ_Γ^T x‖_2 ≤ ‖x‖_2,    (7)

    (1 − β_|Γ|) ‖y_Γ‖_2 ≤ ‖Φ_Γ^T Φ_Γ y_Γ‖_2 ≤ ‖y_Γ‖_2.    (8)

The following two properties are at the heart of the proof of the main result of this paper, and we therefore derive them here.

Lemma 1: For all index sets Γ and all Φ for which the RIP holds with s = |Γ|,

    ‖(I − Φ_Γ^T Φ_Γ) y_Γ‖_2 ≤ β_s ‖y_Γ‖_2.    (9)

Proof: The RIP guarantees that the eigenvalues of the matrix Φ_Γ^T Φ_Γ fall within the interval [1 − β_s, 1]. This implies that the matrix I − Φ_Γ^T Φ_Γ has eigenvalues in the interval [0, β_s], which proves the lemma.

The next inequality is known and can be found, for example, in [14].

Lemma 2: For two disjoint sets Γ and Λ (i.e. Γ ∩ Λ = ∅) and all Φ for which the RIP holds with s = |Γ ∪ Λ|,

    ‖Φ_Γ^T Φ_Λ y_Λ‖_2 ≤ β_s ‖y_Λ‖_2.    (10)

For completeness, we here repeat a proof similar to the one given in [14].

Proof: Let Ω = Γ ∪ Λ. Because the sets Γ and Λ are disjoint, the matrix −Φ_Γ^T Φ_Λ is a sub-matrix of I − Φ_Ω^T Φ_Ω. By [16, Theorem 7.3.9], the largest singular value of a sub-matrix is bounded by the largest singular value of the full matrix. In other words, ‖Φ_Γ^T Φ_Λ‖_2 ≤ ‖I − Φ_Ω^T Φ_Ω‖_2. To bound the operator norm on the right, note that by the RIP, Φ_Ω^T Φ_Ω has eigenvalues in the interval [1 − β_s, 1]. Therefore I − Φ_Ω^T Φ_Ω has eigenvalues in the interval [0, β_s], which proves the lemma.
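Since the constants δ_s and β_s cannot be computed exactly for a given matrix (see Section II-D below), the following hedged Monte-Carlo sketch only illustrates the definitions and Lemma 1: it samples random column subsets of a Gaussian matrix, records the extreme eigenvalues of Φ_Γ^T Φ_Γ, uses the worst observed deviation as a lower bound on δ_s, and then checks the eigenvalue argument behind Lemma 1 on one of the sampled supports. All variable names (delta_lb, beta_lb and so on) and parameter values are our own choices, not quantities from the paper.

```python
# Hedged Monte-Carlo sketch: sampling supports only gives a LOWER bound on the
# true restricted isometry constants, so this is illustrative, not a certificate.
import numpy as np

rng = np.random.default_rng(1)
M, N, s, trials = 128, 256, 8, 500

# Gaussian matrix with entry variance 1/M, as in Section II-D below
Phi_hat = rng.normal(scale=1.0 / np.sqrt(M), size=(M, N))

# sample random supports and record the eigenvalues of Phi_Gamma^T Phi_Gamma
supports, eigs = [], []
for _ in range(trials):
    Gamma = rng.choice(N, size=s, replace=False)
    supports.append(Gamma)
    eigs.append(np.linalg.eigvalsh(Phi_hat[:, Gamma].T @ Phi_hat[:, Gamma]))
eigs = np.concatenate(eigs)
delta_lb = max(1.0 - eigs.min(), eigs.max() - 1.0)   # Monte-Carlo lower bound on delta_s

# re-scale as in the text: Phi = Phi_hat / sqrt(1 + delta_s), beta_s = 1 - (1-delta_s)/(1+delta_s)
Phi = Phi_hat / np.sqrt(1.0 + delta_lb)
beta_lb = 1.0 - (1.0 - delta_lb) / (1.0 + delta_lb)
print("estimated delta_s >=", round(delta_lb, 3), "  beta_s >=", round(beta_lb, 3))

# Lemma 1 on one of the sampled supports: ||(I - Phi_G^T Phi_G) y_G||_2 <= beta_s ||y_G||_2
Gamma = supports[0]
yG = rng.normal(size=s)
lhs = np.linalg.norm((np.eye(s) - Phi[:, Gamma].T @ Phi[:, Gamma]) @ yG)
print("Lemma 1 check:", round(lhs, 3), "<=", round(beta_lb * np.linalg.norm(yG), 3))
```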
D. Designing Measuring Systems

The RIP is a condition on the singular values of sub-matrices of a matrix Φ. To use results based on the RIP in practice, one would need either to be able to calculate the RIP constant for a given matrix, or to be able to design a matrix with a pre-specified RIP constant. Unfortunately, neither is possible in general. Instead, the only known approach to generate a matrix Φ satisfying the RIP with high probability is to use random constructions. For example, if the entries in Φ are drawn independently from certain probability distributions (such as a Gaussian distribution or a Bernoulli distribution) with appropriate variance, then Φ will satisfy the RIP with β_s ≤ ε with probability (1 − e^{-cM}) if M ≥ C s log(N/s)/ε², where c and C are constants depending on the distribution of the elements in Φ. Another random construction, which is often more useful in applications, is to use a sub-matrix of an orthogonal matrix. For example, let Φ be a (suitably normalised) sub-matrix of the matrix associated with the discrete Fourier transform, generated by drawing rows of the matrix at random (without replacement); then Φ will satisfy the RIP with β_s ≤ ε with probability (1 − e^{-cM}) if M ≥ C s log^5(N) log(ε^{-1})/ε², where c and C are constants.

III. ITERATIVE HARD THRESHOLDING

A. Definition of the Algorithm

In [15], we introduced the following Iterative Hard Thresholding algorithm (IHT_s). Let y[0] = 0 and use the iteration

    y[n+1] = H_s( y[n] + Φ^T (x − Φ y[n]) ),    (11)

where H_s(a) is the non-linear operator that sets all but the largest (in magnitude) s elements of a to zero. If there is no unique such set, a set can be selected either randomly or based on a predefined ordering of the elements. The convergence of this algorithm was proven in [15] under the condition that the operator norm ‖Φ‖_2 is smaller than one. In fact, the same argument can be used to show that the algorithm converges if our version of the restricted isometry property holds. The argument follows the same lines as that in [15] and we omit the details here.
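The iteration (11) translates almost line for line into code. The sketch below is a minimal NumPy rendering under our own naming (hard_threshold, iht) and with Φ stored as a dense array; it is not the authors' reference implementation, and in practice Φ and Φ^T would be applied through fast structured operators. A data-driven stopping rule, instead of the fixed iteration count used here, is discussed in Section V.

```python
# Minimal sketch of the IHT_s iteration (11), assuming a dense NumPy matrix Phi.
import numpy as np

def hard_threshold(a, s):
    """H_s: keep the s largest-magnitude entries of a and set the rest to zero.
    Ties are broken by the ordering argsort returns, which the text permits."""
    out = np.zeros_like(a)
    keep = np.argsort(np.abs(a))[-s:]
    out[keep] = a[keep]
    return out

def iht(x, Phi, s, n_iters=100):
    """Run y[n+1] = H_s( y[n] + Phi^T (x - Phi y[n]) ), starting from y[0] = 0."""
    y = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = hard_threshold(y + Phi.T @ (x - Phi @ y), s)
    return y
```

A typical call would be y_hat = iht(x, Phi, s) once the observations x, the sampling matrix Phi and the sparsity level s are available.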
B. Computational Complexity per Iteration

The iterative hard thresholding algorithm is very simple. It involves the application of the operators Φ and Φ^T once in each iteration, as well as two vector additions. The operator H_s involves a partial ordering of the elements of a[n] = y[n] + Φ^T(x − Φ y[n]) in magnitude. The storage requirements are small. Apart from storage of x, we only require the storage of the vector a, which is of length N. Storage of y[n], which has only s non-zero elements, requires 2s numbers to be stored. The computational bottleneck, both in terms of storage and computation time, is due to the operators Φ and Φ^T. If these are general matrices, the computational complexity and memory requirement are O(MN). For large problems, it is common to use structured operators, based for example on fast Fourier transforms or wavelet transforms, which require substantially less memory and can often be applied with O(N log M) or even O(N) operations. In this case, the above algorithm has minimal computational requirements per iteration. If L is the complexity of applying the operators Φ and Φ^T, then the computational complexity of the algorithm is O(k* L), where k* is the total number of iterations.

IV. ITERATIVE HARD THRESHOLDING FOR COMPRESSED SENSING

In this section we derive the main result of this paper. We show that if β_3s < 1/8, then the iterative hard thresholding algorithm reduces the estimation error in each iteration and is guaranteed to come within a constant factor of the best attainable estimation error. In fact, the algorithm needs a fixed number of iterations, depending only on the logarithm of a form of signal to noise ratio.

A. Digest: The Main Result

The main result of this paper can be formally stated in the following theorem and corollary.

Theorem 1: Given a noisy observation x = Φ y + e, where y is an arbitrary vector, let y_s be an approximation to y with no more than s non-zero elements for which ‖y − y_s‖_2 is minimal. If Φ satisfies the restricted isometry property with β_3s < 1/8, then, at iteration k, IHT_s will recover an approximation y[k] satisfying

    ‖y − y[k]‖_2 ≤ 2^{-k} ‖y_s‖_2 + 5 ε̃_s,    (12)

where

    ε̃_s = ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2.    (13)

Furthermore, after at most

    k* = ⌈ log2( ‖y_s‖_2 / ε̃_s ) ⌉    (14)

iterations, IHT_s estimates y with accuracy

    ‖y − y[k*]‖_2 ≤ 6 ( ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2 ).    (15)

For exactly sparse signals, we have a slightly better corollary.

Corollary 1: Given a noisy observation x = Φ y_s + e, where y_s is s-sparse, if Φ satisfies the restricted isometry property with β_3s < 1/8, then, at iteration k, IHT_s will recover an approximation y[k] satisfying

    ‖y_s − y[k]‖_2 ≤ 2^{-k} ‖y_s‖_2 + 4 ‖e‖_2.    (16)

Furthermore, after at most

    k* = ⌈ log2( ‖y_s‖_2 / ‖e‖_2 ) ⌉    (17)

iterations, IHT_s estimates y with accuracy

    ‖y_s − y[k*]‖_2 ≤ 5 ‖e‖_2.    (18)
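To see what the guarantee promises in concrete numbers, the following hedged sketch evaluates ε̃_s from (13) and the iteration bound k* from (14) for a synthetic p-compressible signal. The signal parameters and the assumed noise level are illustrative, and nothing here verifies that a particular Φ actually satisfies β_3s < 1/8; it only evaluates the right-hand side of the guarantee.

```python
# Hedged arithmetic sketch: evaluate eps_tilde_s (13) and k* (14) for a toy signal.
import numpy as np

rng = np.random.default_rng(2)
N, s, p, R = 1024, 20, 0.5, 1.0
e_norm = 1e-3                        # assumed observation-noise level ||e||_2

# p-compressible test signal and its best s-term approximation
y = rng.permutation(R * np.arange(1, N + 1) ** (-1.0 / p)) * rng.choice([-1.0, 1.0], size=N)
ys = np.zeros_like(y)
keep = np.argsort(np.abs(y))[-s:]
ys[keep] = y[keep]
tail = y - ys

eps_tilde = np.linalg.norm(tail) + np.linalg.norm(tail, 1) / np.sqrt(s) + e_norm
k_star = int(np.ceil(np.log2(np.linalg.norm(ys) / eps_tilde)))
print("eps_tilde_s =", eps_tilde)
print("k* =", k_star, "  guaranteed accuracy 6*eps_tilde_s =", 6 * eps_tilde)
```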
B. Discussion of the Main Results

The main theorem states that the algorithm will find an approximation that comes close to the true vector. However, there is a limit to this. Asymptotically, we are only guaranteed to get as close as a multiple of

    ε̃_s = ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2.    (19)

The quantity ε̃_s can be understood as an error term. This error term is composed of two components, the observation error e and the difference between the signal y and its best s-term approximation y_s. This makes intuitive sense. Assume the observation error is zero and y is exactly s-sparse. In this case, the algorithm is guaranteed (under the conditions of the theorem) to find y exactly. For exactly sparse signals with noisy observations, our success in recovering y is naturally limited by the size of the error. Assuming that y is not s-sparse, there will be an error between any s-term approximation and y. The closest we can get to y with any s-sparse approximation is therefore limited by how well y can be approximated with s-sparse signals.

To show that this result is in fact optimal up to a constant, we can use the following argument. Theoretical considerations due to Kashin [17] and Garnaev and Gluskin [18] show that any matrix Φ ∈ R^{M×N} must have at least M ≥ c s log(N/s) rows, for some constant c, in order for the observation x = Φ y to allow a reconstruction ŷ with ‖y − ŷ‖_2 ≤ (C/√s) ‖y‖_1. As discussed above, a random construction of Φ can achieve the RIP required in the theorem with high probability if M ≥ C s log(N/s). This shows that the dependence of the error guarantee of IHT_s on (1/√s)‖y − y_s‖_1 is optimal up to a constant factor. Clearly, the dependence on ‖y − y_s‖_2 is also unavoidable, as we cannot find any s-term approximation that beats the best s-term approximation, even if y were known exactly. For a general bounded error e, the dependence on ‖e‖_2 is also unavoidable: even if the support of y_s were known, the error e, projected onto this subspace, can induce an error depending on the size of e. In summary, the error must depend on ‖y − y_s‖_2, as there is no s-sparse approximation that could beat this. Furthermore, when projecting a signal into an M-dimensional observation space, the maximal information loss must lead to an error of the order (1/√s)‖y − y_s‖_1. Finally, the worst case estimation error must also depend on the size of the observation error e.

The overall number of iterations required to achieve a desired accuracy depends on the logarithm of ‖y_s‖_2 / ε̃_s. We can think of the quantity ‖y_s‖_2 / ε̃_s as a signal to noise ratio appropriate for sparse signal estimates. This term is large whenever the observation noise is small and the signal y is well approximated by an s-sparse vector. Bounds on this term can be derived for a particular signal model, such as, for example, the p-compressible signal model for which we have presented the required bounds above.

C. Derivation of the Error Bound

We now turn to the derivation of the main result. We start by proving the error bound in Corollary 1. Let us recall some notation and introduce some abbreviations. We have
1) x = Φ y_s + e,
2) r[n] = y_s − y[n],
3) a[n+1] = y[n] + Φ^T (x − Φ y[n]) = y[n] + Φ^T (Φ y_s + e − Φ y[n]),
4) y[n+1] = H_s(a[n+1]), where H_s is the hard thresholding operator that keeps the largest s (in magnitude) elements and sets the other elements to zero,
5) Γ* = supp{y_s},
6) Γ^n = supp{y[n]},
7) B^{n+1} = Γ* ∪ Γ^{n+1}.

Proof (of the error bound in Corollary 1): Consider the error

    ‖y_s − y[n+1]‖_2 ≤ ‖(y_s)_{B^{n+1}} − a[n+1]_{B^{n+1}}‖_2 + ‖y[n+1]_{B^{n+1}} − a[n+1]_{B^{n+1}}‖_2.    (20)

Note that y_s − y[n+1] is supported on the set B^{n+1} = Γ* ∪ Γ^{n+1}. By the thresholding operation, y[n+1] is the best s-term approximation to a[n+1]_{B^{n+1}}. In particular, it is a better approximation than y_s. This implies that ‖y[n+1] − a[n+1]_{B^{n+1}}‖_2 ≤ ‖y_s − a[n+1]_{B^{n+1}}‖_2, and we have

    ‖y_s − y[n+1]‖_2 ≤ 2 ‖(y_s)_{B^{n+1}} − a[n+1]_{B^{n+1}}‖_2.    (21)

We now expand

    a[n+1]_{B^{n+1}} = y[n]_{B^{n+1}} + Φ_{B^{n+1}}^T Φ r[n] + Φ_{B^{n+1}}^T e.    (22)

We then have

    ‖y_s − y[n+1]‖_2 ≤ 2 ‖(y_s)_{B^{n+1}} − y[n]_{B^{n+1}} − Φ_{B^{n+1}}^T Φ r[n] − Φ_{B^{n+1}}^T e‖_2
                    ≤ 2 ‖r[n]_{B^{n+1}} − Φ_{B^{n+1}}^T Φ r[n]‖_2 + 2 ‖Φ_{B^{n+1}}^T e‖_2
                    = 2 ‖(I − Φ_{B^{n+1}}^T Φ_{B^{n+1}}) r[n]_{B^{n+1}} − Φ_{B^{n+1}}^T Φ_{B^n \ B^{n+1}} r[n]_{B^n \ B^{n+1}}‖_2 + 2 ‖Φ_{B^{n+1}}^T e‖_2
                    ≤ 2 ‖(I − Φ_{B^{n+1}}^T Φ_{B^{n+1}}) r[n]_{B^{n+1}}‖_2 + 2 ‖Φ_{B^{n+1}}^T Φ_{B^n \ B^{n+1}} r[n]_{B^n \ B^{n+1}}‖_2 + 2 ‖Φ_{B^{n+1}}^T e‖_2.

Now B^n \ B^{n+1} is disjoint from B^{n+1} and |B^n ∪ B^{n+1}| ≤ 3s.
We can therefore use the bounds in (7), (10) and (9), together with the fact that β_2s ≤ β_3s, to obtain

    ‖r[n+1]‖_2 ≤ 2 β_2s ‖r[n]‖_2 + 2 β_3s ‖r[n]‖_2 + 2 ‖e‖_2 ≤ 4 β_3s ‖r[n]‖_2 + 2 ‖e‖_2.    (23)

If β_3s < 1/8, then

    ‖r[n+1]‖_2 < 0.5 ‖r[n]‖_2 + 2 ‖e‖_2.    (24)

Iterating this relationship, and realising that the resulting geometric series satisfies 2(1 + 0.5 + 0.25 + ···) ≤ 4 and that y[0] = 0, we get

    ‖r[k]‖_2 < 2^{-k} ‖y_s‖_2 + 4 ‖e‖_2.    (25)

This proves Corollary 1.

To prove the main theorem, we use the following two lemmas from [14]. The first lemma is stated here without proof, which follows that presented in [14], with the required correction for our definition of the RIP.

Lemma 3 (Needell and Tropp, Proposition 3.5 in [14]): Suppose the matrix Φ satisfies the RIP ‖Φ y_s‖_2 ≤ ‖y_s‖_2 for all y_s with ‖y_s‖_0 ≤ s. Then, for all vectors y, the following bound holds:

    ‖Φ y‖_2 ≤ ‖y‖_2 + (1/√s) ‖y‖_1.    (26)

Lemma 4 (Needell and Tropp, Lemma 6.1 in [14]): For any y, let y_s be (the/any) best s-term approximation to y. Let y_r = y − y_s and let

    x = Φ y + e = Φ y_s + Φ y_r + e = Φ y_s + ẽ.    (27)

If the RIP holds for sparsity s, then the norm of the error ẽ can be bounded by

    ‖ẽ‖_2 ≤ ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2.    (28)

Proof:

    ‖ẽ‖_2 = ‖Φ y_r + e‖_2 = ‖Φ(y − y_s) + e‖_2 ≤ ‖Φ(y − y_s)‖_2 + ‖e‖_2 ≤ ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2,

where the last inequality follows from Lemma 3.

Proof (of the error bound in Theorem 1): To bound the error ‖y − y[k]‖_2, we note that

    ‖y − y[k]‖_2 ≤ ‖r[k]‖_2 + ‖y − y_s‖_2 ≤ ‖r[k]‖_2 + ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2.

The proof of the main theorem then follows by bounding ‖r[k]‖_2 using Corollary 1, with ẽ in place of e, and Lemma 4 to bound ‖ẽ‖_2, that is,

    ‖r[k]‖_2 ≤ 2^{-k} ‖y_s‖_2 + 4 ( ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2 ).    (29)

D. Derivation of the Iteration Count

Proof (of the second part of Theorem 1): The first part of Theorem 1 shows that

    ‖y − y[k]‖_2 ≤ 2^{-k} ‖y_s‖_2 + 5 ε̃_s,    (30)

where ε̃_s = ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2. We are therefore guaranteed to reduce the error below any multiple c of ε̃_s, as long as c > 5. For example, assume we want to recover y with an error of less than 6 ε̃_s. This implies that we require that

    2^{-k} ‖y_s‖_2 ≤ ε̃_s,    (31)

i.e. that

    2^k ≥ ‖y_s‖_2 / ε̃_s,    (32)

which in turn implies the second part of the theorem. The proof of the corresponding result in Corollary 1 follows the same argument.
V. WHEN TO STOP

So far, we have given guarantees on the achievable error and a bound on the total number of iterations needed to achieve this bound. However, in practice, it is necessary to monitor quantities of the algorithm and decide to stop the iterations at some point. From the main result, it is clear that in general we cannot do better than to find an estimate with an error of 5 ε̃_s. However, how do we know that we are getting close to this value? A possible stopping criterion is ‖x − Φ y[n]‖_2 ≤ ε. For this criterion, it is possible to use the same arguments as in Appendix A of [14] to derive the following result.

Lemma 5: Assume that Φ satisfies the RIP with β_3s < 1/8. If, at any iteration of IHT_s, the condition ‖x − Φ y[n]‖_2 ≤ ε holds, then

    ‖y − y[n]‖_2 ≤ 1.07 (ε + 2 ε̃_s),    (33)

where

    ε̃_s = ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2.    (34)

Conversely, if at any iteration of IHT_s the condition

    ‖y − y[n]‖_2 ≤ ε − (1/√s) ‖y − y[n]‖_1 − ‖e‖_2    (35)

holds, then ‖x − Φ y[n]‖_2 ≤ ε.

This lemma can be used to derive a stopping criterion for IHT_s. For example, if we want to estimate y with accuracy c ε̃_s, we know that we are done as soon as ‖x − Φ y[n]‖_2 ≤ (c/1.07 − 2) ε̃_s. Note that in general, IHT_s is only guaranteed to work for c > 5; however, as soon as we observe that ‖x − Φ y[n]‖_2 ≤ (c/1.07 − 2) ε̃_s, we know that the estimation error must be below c ε̃_s.

Proof: To prove the first part, note that the stopping criterion implies that

    ε ≥ ‖Φ(y − y[n]) + e‖_2 = ‖Φ(y_s − y[n]) + ẽ‖_2 ≥ √(1 − β_2s) ‖y_s − y[n]‖_2 − ‖ẽ‖_2,

so that

    ‖y_s − y[n]‖_2 ≤ (ε + ‖ẽ‖_2) / √(1 − β_2s).    (36)

Furthermore,

    ‖y − y[n]‖_2 ≤ ‖y_s − y[n]‖_2 + ‖y − y_s‖_2,    (37)

so that (using the bound on ‖ẽ‖_2 from Lemma 4)

    ‖y − y[n]‖_2 ≤ (ε + ‖ẽ‖_2) / √(1 − β_2s) + ‖y − y_s‖_2
                 ≤ ( ε + 2 ‖y − y_s‖_2 + (1/√s) ‖y − y_s‖_1 + ‖e‖_2 ) / √(1 − β_2s)
                 ≤ (ε + 2 ε̃_s) / √(1 − β_2s).

This proves the first part of the lemma, using β_2s ≤ β_3s < 1/8, so that 1/√(1 − β_2s) ≤ 1/√(7/8) ≈ 1.07.

To prove the second part, note that if

    ‖y − y[n]‖_2 ≤ ε − (1/√s) ‖y − y[n]‖_1 − ‖e‖_2    (38)

holds, then

    ε ≥ ‖y − y[n]‖_2 + (1/√s) ‖y − y[n]‖_1 + ‖e‖_2 ≥ ‖Φ(y − y[n])‖_2 + ‖e‖_2 ≥ ‖Φ(y − y[n]) + e‖_2 = ‖x − Φ y[n]‖_2.

We have here used Lemma 3 to bound ‖Φ(y − y[n])‖_2 ≤ ‖y − y[n]‖_2 + (1/√s) ‖y − y[n]‖_1, where s is such that the RIP holds for this s.
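In code, the stopping criterion simply replaces the fixed iteration count of the earlier sketch with a residual test. The sketch below assumes the hard_threshold helper from Section III; the function name, the max_iters safeguard and the particular choice of ε (for example (c/1.07 − 2) ε̃_s with c > 5, as discussed above) are our own assumptions, not part of the paper's algorithm specification.

```python
# Hedged sketch: IHT_s iteration (11) with the residual-based stopping rule of Section V.
import numpy as np

def iht_with_stop(x, Phi, s, epsilon, max_iters=1000):
    """Iterate until ||x - Phi y[n]||_2 <= epsilon (or max_iters is reached).
    hard_threshold() is the helper sketched in Section III."""
    y = np.zeros(Phi.shape[1])
    for _ in range(max_iters):
        residual = x - Phi @ y
        if np.linalg.norm(residual) <= epsilon:   # Lemma 5 then bounds the estimation error
            break
        y = hard_threshold(y + Phi.T @ residual, s)
    return y
```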
VI. COMPARISON TO COSAMP

In [14] the authors introduced a subspace pursuit algorithm called CoSaMP, which offers guarantees similar to the iterative hard thresholding approach of this paper. The result for CoSaMP is as follows.

Theorem 2 (Needell and Tropp [14]): If Φ has the (standard) restricted isometry property with δ_4s ≤ 0.1, then, at iteration k, CoSaMP will recover an approximation y[k] satisfying

    ‖y_s − y[k]‖_2 ≤ 2^{-k} ‖y_s‖_2 + 15 ‖e‖_2    (39)

if x = Φ y_s + e for y_s s-sparse, and

    ‖y − y[k]‖_2 ≤ 2^{-k} ‖y‖_2 + 20 ε̃_s    (40)

if x = Φ y + e for all y.

Two remarks are in order. Firstly, for IHT_s, we require β_3s ≤ 0.125, whilst for CoSaMP, δ_4s ≤ 0.1 is required. Now, β and δ are related as follows:

    δ = β / (2 − β),    (41)

therefore, IHT_s requires δ_3s ≤ 0.0667. To compare this to the requirement for CoSaMP, we use Corollary 3.4 from [14], which states that for integers a and s, δ_as ≤ a δ_2s. Therefore, if δ_2s ≤ 0.025, the condition for CoSaMP is satisfied, whilst for IHT_s, we have a comparable condition requiring that δ_2s ≤ 0.0222. Whilst the requirement on δ_2s is marginally weaker for CoSaMP than for IHT_s, IHT_s is guaranteed to reach a roughly four times lower approximation error. For example, in the exactly sparse case, IHT_s is guaranteed to reach an error approaching 4 ‖e‖_2, which should be compared to the guarantee of 15 ‖e‖_2 for CoSaMP. For general signals, the guarantees are 5 ε̃_s for IHT_s and 20 ε̃_s for CoSaMP.

The number of iterations required by IHT_s is logarithmic in the signal to noise ratio. This means that for noiseless observations, IHT_s would require an infinite number of iterations to reduce the error to zero. This is a well known property of algorithms that use updates of the form y[n] + Φ^T(x − Φ y[n]). CoSaMP, on the other hand, is guaranteed to estimate y with precision 20 ε̃_s in at most 6(s + 1) iterations; however, to achieve this, CoSaMP requires the solution to an inverse problem in each iteration, which is costly. IHT_s does not require the exact solution to an inverse problem. If CoSaMP is implemented using fast partial solutions to the inverse problems, the iteration count guarantees become similar to the ones derived here for IHT_s.

VII. WHAT'S IN A THEOREM

A word of caution is in order. We have here shown that the Iterative Hard Thresholding algorithm has theoretical properties comparable to those of other state of the art algorithms such as CoSaMP, and has recovery guarantees of the same order as ℓ1 based approaches. At first glance, this seems to be at odds with previously reported numerical results [15], which have shown that IHT_s does not perform as well as other methods such as Orthogonal Matching Pursuit, for which there are currently no comparable performance guarantees. What is more, IHT_s also performs less well in numerical studies than CoSaMP or ℓ1 based approaches. To understand this disparity, it has to be realised that the uniform performance guarantees derived here for IHT_s, and elsewhere for CoSaMP and ℓ1 based methods, are worst case bounds; that is, they guarantee the performance of the algorithms in the worst possible scenario. Numerical studies, on the other hand, cannot in general test this worst case behaviour. This is because we do not in general know which particular signal would be the most difficult to recover. Numerical experiments therefore analyse the average behaviour of the methods, that is, they study the recovery of typical signals. Whilst uniform guarantees can be derived for the IHT_s algorithm, the difference in numerically observed performance between this method and other algorithms with similar uniform recovery guarantees indicates that uniform guarantees are not necessarily a good measure of good average performance. This is further confirmed when looking at the average case analysis of algorithms such as Orthogonal Matching Pursuit [9], for which the uniform guarantees currently available are markedly different from those of IHT_s, CoSaMP and ℓ1 methods [10], whilst in numerical studies OMP can, under certain conditions, outperform some of these approaches.

VIII. CONCLUSION

The abstract made eight claims regarding the performance of the iterative hard thresholding algorithm. Let us here summarise these in somewhat more detail.

• Error guarantee: we have shown that an estimation error of 6 ‖ẽ‖_2 can be achieved within a finite number of iterations.
• Robustness to noise: we have shown that the achievable estimation error depends linearly on the size of the observation error. Performance therefore degrades linearly if the noise is increased.
• Minimum number of observations: the requirement on the isometry constant dictates that the number of observations grows linearly with s and logarithmically with N. Up to a constant, this relation is known to be the best attainable.
• Sampling operator: the algorithm is simple and requires only the application of Φ and Φ^T.
• Memory requirement: we have shown this to be linear in the problem size, if we can ignore the storage requirements for Φ and Φ^T.
• Computational complexity: we have shown that the computational complexity per iteration is of the same order as the application of the measurement operator or its adjoint. The total number of iterations is bounded due to the linear convergence of the algorithm and depends logarithmically on the signal to noise ratio ‖y_s‖_2 / ‖ẽ‖_2.
• Number of iterations: we have shown that after at most ⌈ log2( ‖y_s‖_2 / ε̃_s ) ⌉ iterations, the estimation error is smaller than 6 ε̃_s.
• Uniform performance guarantees: the results presented here only depend on β_3s and do not depend on the size and distribution of the largest s elements in y.

This is an impressive list of properties for such a relatively simple algorithm. To our knowledge, only the CoSaMP algorithm shares similar guarantees. However, as discussed in Section VII, uniform guarantees are not the only consideration, and in practice marked differences in the average performance of different methods are apparent. For many small problems, the restricted isometry constants of random matrices are often too large to explain the behaviour of the different methods in these studies. Furthermore, it has long been observed that the distribution of the magnitudes of the non-zero coefficients also has an important influence on the performance of different methods. Whilst the theoretical guarantees derived in this and similar papers are important to understand the behaviour of an algorithm, it is also clear that other factors have to be taken into account in order to predict the typical performance of algorithms in many practical situations.

ACKNOWLEDGMENTS

This research was supported by EPSRC grant D000246/1. MED acknowledges support of his position from the Scottish Funding Council and their support of the Joint Research Institute with Heriot-Watt University as a component part of the Edinburgh Research Partnership.
REFERENCES

[1] H. Nyquist, "Certain topics in telegraph transmission theory," Transactions of the A.I.E.E., pp. 617–644, Feb. 1928.
[2] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. University of Illinois Press, 1949.
[3] I. F. Gorodnitsky, J. S. George, and B. D. Rao, "Neuromagnetic source imaging with FOCUSS: a recursive weighted minimum norm algorithm," Neurophysiology, vol. 95, no. 4, pp. 231–251, 1995.
[4] E. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, pp. 489–509, Feb. 2006.
[5] E. Candès, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Comm. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, 2006.
[6] E. Candès and J. Romberg, "Quantitative robust uncertainty principles and optimally sparse decompositions," Foundations of Comput. Math., vol. 6, no. 2, pp. 227–254, 2006.
[7] D. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[8] M. Vetterli, P. Marziliano, and T. Blu, "Sampling signals with finite rate of innovation," IEEE Transactions on Signal Processing, vol. 50, no. 6, pp. 1417–1428, 2002.
[9] S. Mallat, G. Davis, and Z. Zhang, "Adaptive time-frequency decompositions," SPIE Journal of Optical Engineering, vol. 33, pp. 2183–2191, July 1994.
[10] J. A. Tropp and A. C. Gilbert, "Signal recovery from partial information via orthogonal matching pursuit," submitted for publication, 2006.
[11] D. Needell and R. Vershynin, "Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit," submitted, 2007.
[12] D. Needell and R. Vershynin, "Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit," submitted, 2008.
[13] W. Dai and O. Milenkovic, "Subspace pursuit for compressed sensing: Closing the gap between performance and complexity," submitted, 2008.
[14] D. Needell and J. Tropp, "CoSaMP: Iterative signal recovery from incomplete and inaccurate samples," submitted, 2008.
[15] T. Blumensath and M. Davies, "Iterative thresholding for sparse approximations," to appear in Journal of Fourier Analysis and Applications, special issue on sparsity, 2008.
[16] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[17] B. Kashin, "The width of certain finite dimensional sets and classes of smooth functions," Izvestia, vol. 41, pp. 334–351.
[18] A. Garnaev and E. Gluskin, "On width of the Euclidean ball," Sov. Math. Dokl., 1984.