Phase transitions and sample complexity in Bayes-optimal matrix factorization

Yoshiyuki Kabashima^1, Florent Krzakala^2, Marc Mézard^3, Ayaka Sakata^{1,4}, and Lenka Zdeborová^{5,*}

^1 Department of Computational Intelligence & Systems Science, Tokyo Institute of Technology, Yokohama 226-8502, Japan.
^2 Laboratoire de Physique Statistique, CNRS UMR 8550, École Normale Supérieure, PSL Research University & Université Pierre et Marie Curie, Sorbonne Universités.
^3 Département de Physique, École Normale Supérieure, PSL Research University, Paris.
^4 The Institute of Statistical Mathematics, Tachikawa 190-8562, Japan.
^5 Institut de Physique Théorique, CEA Saclay and CNRS, Gif-sur-Yvette, France.
^* To whom correspondence shall be sent: lenka.zdeborova@cea.fr
(Dated: March 22, 2016)

We analyse the matrix factorization problem. Given a noisy measurement of a product of two matrices, the problem is to estimate back the original matrices. It arises in many applications such as dictionary learning, blind matrix calibration, sparse principal component analysis, blind source separation, low-rank matrix completion, robust principal component analysis or factor analysis. It is also important in machine learning: unsupervised representation learning can often be studied through matrix factorization. We use the tools of statistical mechanics (the cavity and replica methods) to analyse the achievability and computational tractability of these inference problems in the setting of Bayes-optimal inference, which amounts to assuming that the two matrices have random independent elements generated from some known distribution, and that this information is available to the inference algorithm. In this setting, we compute the minimal mean-squared error achievable in principle in any computational time, and the error that can be achieved by an efficient approximate message passing algorithm. The computation is based on the asymptotic state-evolution analysis of the algorithm. The performance that our analysis predicts, both in terms of the achieved mean-squared error and in terms of sample complexity, is extremely promising and motivates further development of the algorithm.^a

^a Part of the results discussed in this paper were presented at the 2013 IEEE International Symposium on Information Theory in Istanbul.

CONTENTS

I. Introduction
   A. Statement of the problem
   B. Bayes-optimal inference
   C. Statement of the main result
   D. Applications of matrix factorization
   E. Related work and positioning of our contribution

II. Elements of the background from statistical physics
   A. The Nishimori identities
   B. Easy/hard and possible/impossible phase transitions
   C. Why are our results not rigorous? Why do we conjecture that they are exact?
      1. The statistical-physics strategy for solving the problem
      2. What are the assumptions?
      3. Cavity and replica method for other problems
   D. Absence of replica symmetry breaking in Bayes-optimal inference
   E. Learning of hyper-parameters

III. Approximate message passing for matrix factorization
   A. Approximate belief propagation for matrix factorization
   B. GAMP for matrix factorization
   C. Simplifications due to the Nishimori identities
   D. Simplification for matrices with random entries

IV. The Bethe free entropy
   A. Fixed-point generating Bethe free entropy
   B. The variational Bethe free entropy
      1. Generic output channel
      2. The AWGN output channel

V. Asymptotic analysis
   A. State evolution
      1. State evolution of the Bayes-optimal inference
      2. The input Nishimori identities
      3. The output Nishimori identities
   B. Replica method
      1. Moment assessment for n ∈ N
      2. Replica symmetric free entropy
      3. Simplification in the Bayes-optimal setting

VI. Examples of asymptotic phase diagrams
   A. Dictionary learning, blind matrix calibration, sparse PCA and blind source separation
      1. Input and output functions
      2. Phase diagram for blind matrix calibration and dictionary learning
   B. Low-rank matrix completion
   C. Robust PCA
   D. Factor analysis

VII. Discussion and Conclusions

Acknowledgments

References

I. INTRODUCTION

We study in this paper a variety of questions which all deal with the general problem of matrix factorization. Generically, this problem is stated as follows: given an $M \times P$ dimensional matrix $Y$, obtained from noisy element-wise measurements of a matrix $Z$, one seeks a factorization $Z = FX$, where the $M \times N$ dimensional matrix $F$ and the $N \times P$ dimensional matrix $X$ must satisfy some specific requirements such as sparsity, low rank or non-negativity.

From a machine learning point of view, matrix factorization can be applied to unsupervised learning of data representations [1]. The success of machine learning, including recent progress such as deep learning [2], depends largely on data representations. Explicit approaches to efficient data representation, such as matrix factorization, are hence of wide relevance. Other applications that can be formulated as matrix factorization include dictionary learning or sparse coding [3-5], sparse principal component analysis [6], blind source separation [7], low-rank matrix completion [8, 9] and robust principal component analysis [10], all of which will be described below.

Theoretical limits on when matrix factorization is possible and computationally tractable are still rather poorly understood. In this work we make a step towards this understanding by predicting the limits of matrix factorization and its algorithmic tractability when $Z$ is created using randomly generated matrices $F$ and $X$, and measured element-wise via a known noisy output channel $P_{\rm out}(Y|Z)$. Our results are derived in the limit where $N, M, P \to \infty$ with fixed ratios $M/N = \alpha$, $P/N = \pi$. We predict the existence of sharp phase transitions in this limit and provide the explicit formalism to locate them.

We use two types of methods in this paper. The first one is based on a generalization of approximate message passing (AMP) [11] to the matrix factorization problem, and on its asymptotic analysis, which is known in statistical physics as the cavity method [12, 13] and has been called state evolution in the context of compressed sensing [11]. The second method that we use in the following is the replica method. These two methods are widely believed to be exact in the context of theoretical statistical physics, but most of the results that we shall obtain in the present work are not rigorously established. Our predictions have thus the status of conjectures. A first cross-check of the correctness of these conjectures is the fact that the two methods give identical results; this was first understood in the context of spin glasses [12]. This work builds upon some previous steps that we described in earlier reports [14, 15].
The message passing algorithm related to our analysis was first presented in [15] and is very closely related to the Big-AMP algorithm developed and tested in [16-18]; relations and differences with Big-AMP will be mentioned in several places throughout the paper. Our main focus here, besides the detailed derivation of the algorithm, is the asymptotic analysis and the phase diagrams, which were not studied in [16-18]. We also discuss several variants of the AMP algorithm that are interesting for theoretical reasons. Several of these variants, however, have convergence problems when implemented straightforwardly. For a robust implementation of the algorithm that can be used on practical benchmarks we refer to the works [17, 18].

Our general method provides a unifying framework for the study of computational tractability and identifiability of various matrix factorization problems. The first step for this synergy is the formulation of the problem via a graphical model (see Fig. 1) that is amenable to analysis using the present methods. The phase diagrams that we shall derive establish, for each problem, two types of thresholds in the $\alpha$-$\pi$ plane: the threshold where the problem of matrix factorization ceases to be solvable in principle, and the threshold where AMP ceases to find the best solution. In most existing works the computation of phase transitions was treated separately for each of the various problems. For instance, redundant dictionaries for sparse representations and low-rankness are usually thought of as two different kinds of dimensional reduction. Interestingly, in our work a wide class of problems is treated within one unified formalism; this is theoretically interesting in the context of recent developments [19].

A. Statement of the problem

In a general matrix factorization problem one measures some information about the matrix elements of the product of two unknown matrices $F \in \mathbb{R}^{M \times N}$ and $X \in \mathbb{R}^{N \times P}$, whose matrix elements will be denoted $F_{\mu i}$ and $X_{il}$. Let us denote the product $Z = FX \in \mathbb{R}^{M \times P}$, with elements
\[ z_{\mu l} = \sum_{i=1}^{N} F_{\mu i} X_{il} \,. \tag{1} \]
The element-wise measurement $y_{\mu l}$ of $z_{\mu l}$ is then specified by some known probability distribution function $P_{\rm out}(y_{\mu l}|z_{\mu l})$, so that
\[ P_{\rm out}(Y|Z) = \prod_{\mu,l} P_{\rm out}^{\mu l}(y_{\mu l}|z_{\mu l}) \,. \tag{2} \]
The goal of matrix factorization is to estimate both matrices $F$ and $X$ from the measurements $Y$. In this paper we treat this problem in the framework of Bayesian inference. In particular, we assume that the matrices $F$ and $X$ were both generated from known separable probability distributions
\[ P_F(F) = \prod_{\mu=1}^{M} \prod_{i=1}^{N} P_F^{\mu i}(F_{\mu i}) \,, \tag{3} \]
\[ P_X(X) = \prod_{i=1}^{N} \prod_{l=1}^{P} P_X^{il}(X_{il}) \,. \tag{4} \]
Although we restrict ourselves to separable prior probability distributions $P_F(F)$, $P_X(X)$, it turns out that these priors can encode a broad range of constraints such as, for instance, sparsity. This is the reason why so many different problems can be studied within our scheme. The output channel $P_{\rm out}(y|z)$ can be of a rather generic nature, thus including various kinds of additive and multiplicative noise; in the spirit of activation functions from neural networks, it can degrade the information included in the measurement by, e.g., keeping only the sign of the elements of $z$.
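Throughout the paper it is useful to keep in mind how a problem instance is generated. The following minimal Python sketch (our own illustration; the dimensions, priors and noise level are arbitrary choices, not the only ones covered by the theory) draws a teacher instance of the model (1)-(4) with Gaussian factors and an additive white Gaussian noise (AWGN) output channel, using the scaling conventions adopted below: elements of $F$ of variance $O(1/N)$, elements of $X$, $Z$ and $Y$ of order one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: N is the "rank", alpha = M/N, pi = P/N (notation of the paper).
N, alpha, pi = 100, 0.5, 2.0
M, P = int(alpha * N), int(pi * N)

# Teacher factors: F has iid entries of variance 1/N, X has iid entries of
# order one, so that the elements of Z = F X are of order O(1).
F0 = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))
X0 = rng.normal(0.0, 1.0, size=(N, P))
Z0 = F0 @ X0

# Element-wise output channel, here AWGN with variance Delta.
Delta = 0.01
Y = Z0 + rng.normal(0.0, np.sqrt(Delta), size=(M, P))
```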
In the following we shall mostly study the case where the distributions $P_F^{\mu i}$ are all identical (there is a single distribution $P_F(F_{\mu i})$), and the distributions $P_X^{il}$ for the various $il$ (as well as $P_{\rm out}^{\mu l}$ for the various $\mu l$) are also all identical. Our approach can be generalized to the case where $P_F^{\mu i}$, $P_X^{il}$, $P_{\rm out}^{\mu l}$ depend in a known way on the indices $\mu i$, $il$, and $\mu l$, provided that this dependence is through parameters that are themselves taken from separable probability distributions. Examples of such dependence include blind matrix calibration and factor analysis. On the other hand, our theory does not cover the case of an arbitrary deterministic matrix $\tilde F_{\mu i}$, for which we would have $P_F^{\mu i} = \delta(F_{\mu i} - \tilde F_{\mu i})$.

The posterior distribution of $F$ and $X$ given the measurements $Y$ is written as
\[ P(F,X|Y) = \frac{1}{Z(Y)} P_F(F) P_X(X) P_{\rm out}(Y|FX) = \frac{1}{Z(Y)} \prod_{\mu,i} P_F(F_{\mu i}) \prod_{i,l} P_X(X_{il}) \prod_{\mu,l} P_{\rm out}\Big(y_{\mu l} \,\Big|\, \sum_i F_{\mu i} X_{il}\Big) \,, \tag{5} \]
where $Z(Y)$ is the normalization constant, known as the partition function in statistical physics. Notice that, while the original problem of finding $F$ and $X$ given the measurements $Y$ is not well determined (because, from a given solution, one can obtain an infinity of other solutions through the transformation $F \to FU^{-1}$ and $X \to UX$, where $U$ is any $N \times N$ nonsingular matrix), the use of well-defined priors $P_F^{\mu i}$ and $P_X^{il}$ actually lifts this degeneracy: the problem of finding the most probable $F$, $X$ given the measurements and the priors is well defined. In case the priors $P_F^{\mu i}$ and $P_X^{il}$ do not depend on the indices $\mu i$ and $il$, we are left with a permutational symmetry between the $N$ columns of $F$ and the $N$ rows of $X$. Both in the algorithm and in the asymptotic analysis this symmetry is broken and one of the equivalent solutions is chosen at random.

Typically, in most applications, the distributions $P_{\rm out}$, $P_F$ and $P_X$ will depend on a set of parameters (such as the mean, variance, sparsity, noise strength, etc.) that we will usually not write explicitly in the general case, in order to simplify the notation. Prior knowledge of these parameters is not necessarily required in our approach: these parameters can be learned via an expectation-maximization-like algorithm that we discuss briefly in section II E. Note also that Eq. (1) can be multiplied by an arbitrary constant: with a corresponding change in the output function the problem is not modified. In the derivations of this paper we choose this constant in such a way that the elements of the matrices $X$, $Y$, and $Z$ are of order $O(1)$, whereas the elements of $F$ scale in a consistent way, meaning that the mean of each $F_{\mu i}$ is of order $O(1/N)$ and its variance is also of order $O(1/N)$.

B. Bayes-optimal inference

Our paper deals with the general case of incomplete information. This happens when the reconstruction assumes that the matrices $X$, $F$ and $Y$ were generated with some distributions $P_X$, $P_F$ and $P_{\rm out}(Y|Z)$, whereas in reality the matrices were generated using some other distributions $P_{X^0}$, $P_{F^0}$ and $P^0_{\rm out}(Y|Z)$. The message passing algorithm and its asymptotic evolution will be derived in this general case. However, our most important results concern the Bayes-optimal setting, i.e. the case where we assume that
\[ P_{X^0} = P_X \,, \qquad P_{F^0} = P_F \,, \qquad P^0_{\rm out}(Y|Z) = P_{\rm out}(Y|Z) \,. \tag{6} \]
In this case, an estimator $X^\star$ that minimizes the mean-squared error (MSE) with respect to the original signal $X^0$, defined as
\[ {\rm MSE}(X|Y) = \int {\rm d}F^0 \, {\rm d}X^0 \left[ \frac{1}{PN} \sum_{il} (X_{il} - X^0_{il})^2 \right] P(F^0,X^0|Y) \,, \tag{7} \]
is obtained from the marginals of $X_{il}$ with respect to the posterior probability measure $P(F,X|Y)$, i.e.
\[ X^\star_{il} = \int {\rm d}X_{il} \, X_{il} \, \nu_{il}(X_{il}) \,, \qquad \text{where} \qquad \nu_{il}(X_{il}) \equiv \int \prod_{\mu j} {\rm d}F_{\mu j} \prod_{jn \neq il} {\rm d}X_{jn} \; P(F,X|Y) \tag{8} \]
is the marginal probability distribution of the variable $il$. The mean-squared error achieved by this optimal estimator is called the minimum mean-squared error (MMSE) in this paper. A similar result holds for the estimator of $F^0$ that minimizes the mean-squared error
\[ {\rm MSE}(F|Y) = \int {\rm d}F^0 \, {\rm d}X^0 \left[ \frac{1}{M} \sum_{\mu i} (F_{\mu i} - F^0_{\mu i})^2 \right] P(F^0,X^0|Y) \,, \tag{9} \]
which is obtained from the mean of $F_{\mu i}$ with respect to the posterior probability measure $P(F,X|Y)$. In the remainder of this article we will be using these estimators.

C. Statement of the main result

The main results of this paper are explicit formulas for the MMSE achievable in the Bayes-optimal setting (as defined above) for the matrix factorization problem in the "thermodynamic limit", i.e. when $N, M, P \to \infty$ with fixed ratios $M/N = \alpha$, $P/N = \pi$. When sparsity is involved, we consider that a finite fraction of the matrix elements are non-zero. Similarly, when we treat matrices of low rank, we consider the rank to be a finite fraction of the total dimension. We also derive the AMP-MSE, i.e. the mean-squared error achievable by the approximate message passing algorithm as derived in this paper.

So far we have characterized the output channel by the conditional distribution $P^0_{\rm out}(y|z^0)$ or $P_{\rm out}(y|z)$. It will be useful to think of the output as a deterministic function $h$ of $z$ and of random variables $w$, i.e. $y_{\mu l} = h(z_{\mu l}, w_{\mu l}) = h^0(z^0_{\mu l}, w^0_{\mu l})$. The random ("noise") variables $w$ and $w^0$ are specified by their probability distributions $P(w)$ and $P^0(w^0)$. We can relate $P_{\rm out}$ to $h$ as follows:
\[ P_{\rm out}(y|z) = \int P(w) \, {\rm d}w \, \delta[y - h(z,w)] \,, \tag{10} \]
\[ P^0_{\rm out}(y|z^0) = \int P^0(w^0) \, {\rm d}w^0 \, \delta[y - h^0(z^0,w^0)] \,. \tag{11} \]

To compute the MMSE and the AMP-MSE we need to analyze the fixed points of the following iterative equations:
\[ m_X^{t+1} = \int {\rm d}X \, P_X(X) \int {\cal D}\xi \; f_X^2\!\left[ \frac{1}{\alpha m_F^t \hat m^t} \,,\; \frac{\alpha m_F^t \hat m^t \, X + \xi \sqrt{\alpha m_F^t \hat m^t}}{\alpha m_F^t \hat m^t} \right] \,, \tag{12} \]
\[ m_F^{t+1} = \int {\rm d}F \, P_F(F) \int {\cal D}\xi \; f_F^2\!\left[ \frac{1}{\pi m_X^t \hat m^t} \,,\; \frac{\pi m_X^t \hat m^t \, \sqrt{N} F + \xi \sqrt{\pi m_X^t \hat m^t}}{\pi m_X^t \hat m^t} \right] \,, \tag{13} \]
\[ \hat m^t = -\int {\rm d}w \, P(w) \int {\rm d}p \, {\rm d}z \; \frac{e^{-\frac{p^2}{2 m_F^t m_X^t}} \, e^{-\frac{(z-p)^2}{2[\langle (z^0)^2\rangle - m_F^t m_X^t]}}}{2\pi \sqrt{m_F^t m_X^t \left( \langle (z^0)^2\rangle - m_F^t m_X^t \right)}} \; \partial_p g_{\rm out}\big(p, h(z,w), \langle (z^0)^2\rangle - m_F^t m_X^t\big) \,, \tag{14} \]
where ${\cal D}\xi$ is a notation for the Gaussian probability measure ${\rm d}\xi \, e^{-\xi^2/2}/\sqrt{2\pi}$, and we denote $\langle (z^0)^2\rangle = N \, \mathbb{E}[(F^0)^2] \, \mathbb{E}[(X^0)^2]$. Here $f_X$ and $f_F$ are the so-called input functions; they are defined using the prior distributions $P_X$ and $P_F$ as
\[ f_X(\Sigma, T) \equiv \frac{\int {\rm d}X \, X \, P_X(X) \, e^{-\frac{(X-T)^2}{2\Sigma}}}{\int {\rm d}X \, P_X(X) \, e^{-\frac{(X-T)^2}{2\Sigma}}} \,, \tag{15} \]
\[ f_F(Z, W) \equiv \frac{\sqrt{N} \int {\rm d}F \, F \, P_F(F) \, e^{-\frac{(\sqrt{N}F - W)^2}{2Z}}}{\int {\rm d}F \, P_F(F) \, e^{-\frac{(\sqrt{N}F - W)^2}{2Z}}} \,. \tag{16} \]
The output function $g_{\rm out}$ is defined using the output probability $P_{\rm out}$ as
\[ g_{\rm out}(\omega, y, V) \equiv \frac{\int {\rm d}z \, P_{\rm out}(y|z) \, (z-\omega) \, e^{-\frac{(z-\omega)^2}{2V}}}{V \int {\rm d}z \, P_{\rm out}(y|z) \, e^{-\frac{(z-\omega)^2}{2V}}} \,. \tag{17} \]
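To make the input and output functions (15)-(17) concrete, here is a sketch of both for the case used repeatedly below: a Gauss-Bernoulli prior $P_X(X) = (1-\rho)\,\delta(X) + \rho\,{\cal N}(\bar x, \sigma)$ and an AWGN channel $P_{\rm out}(y|z) = {\cal N}(y; z, \Delta)$. For these choices the integrals in (15) and (17) have closed forms; for the AWGN channel, in particular, (17) reduces to $g_{\rm out}(\omega, y, V) = (y-\omega)/(\Delta+V)$. The function names and parametrization below are ours, written to match the notation of the equations.

```python
import numpy as np

def f_X(Sigma, T, rho=0.2, xbar=0.0, sigma=1.0):
    """Input function (15) for the Gauss-Bernoulli prior
    P_X(X) = (1-rho) delta(X) + rho N(xbar, sigma): posterior mean of X
    given the effective Gaussian observation T of variance Sigma."""
    def npdf(t, m, v):
        return np.exp(-(t - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    w_spike = (1 - rho) * npdf(T, 0.0, Sigma)        # evidence of the zero component
    w_slab = rho * npdf(T, xbar, sigma + Sigma)      # evidence of the Gaussian component
    m_slab = (T * sigma + xbar * Sigma) / (sigma + Sigma)  # slab posterior mean
    return w_slab * m_slab / (w_spike + w_slab)

def g_out(omega, y, V, Delta=0.01):
    """Output function (17) for the AWGN channel P_out(y|z) = N(y; z, Delta);
    the Gaussian integrals in (17) collapse to (y - omega) / (Delta + V)."""
    return (y - omega) / (Delta + V)
```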
The MSE is computed as ${\rm MSE}_F = \mathbb{E}[(F^0)^2] - m_F$ and ${\rm MSE}_X = \mathbb{E}[(X^0)^2] - m_X$, with $m_F$ and $m_X$ being fixed points of (12)-(14).

The AMP-MSE is obtained from a fixed point reached with a so-called uninformative initialization. The uninformative initialization does not use any information about the sought matrices $F^0$, $X^0$; it uses only the prior distributions, and is defined as
\[ m_F^{t=0} = N \, \mathbb{E}[F]^2 \,, \qquad m_X^{t=0} = \mathbb{E}[X]^2 \,. \tag{18} \]
In case the prior distribution depends on other random variables, e.g. in the case of matrix calibration, we take an additional average with respect to those variables. If the above initialization gives $m_F^{t=0} = 0$ and $m_X^{t=0} = 0$, then this is a fixed point of the above iterative equations. This is due to the permutational symmetry between the columns of the matrix $F$ and the rows of the matrix $X$. To obtain a nontrivial fixed point we initialize at $m_F^{t=0} = \eta$ for some very small $\eta$, corresponding to an infinitesimal prior information about the matrix elements of $F$.

To compute the MMSE we need to initialize the iterations in an informative way, using the knowledge of the true $X^0$ and $F^0$. This informative initialization is defined as an infinitesimal perturbation of
\[ m_F^{t=0} = N \, \mathbb{E}[(F^0)^2] \,, \qquad m_X^{t=0} = \mathbb{E}[(X^0)^2] \,. \tag{19} \]
If the resulting fixed point agrees with the AMP-MSE, then this is also the MMSE. In case this informative initialization leads to a different fixed point than the AMP-MSE, then the MMSE is given by the one of the two for which the following free entropy is larger:
\[ \phi(m_F, m_X, \hat m_F = \pi m_X \hat m, \hat m_X = \alpha m_F \hat m) = \tag{20} \]
\[ \alpha\pi \int {\rm d}y \, {\cal D}\xi \, {\cal D}u^0 \; P_{\rm out}\Big(y \,\Big|\, \sqrt{Q^0_F Q^0_X - m_F m_X}\, u^0 + \sqrt{m_F m_X}\, \xi\Big) \log \int {\cal D}u \; P_{\rm out}\Big(y \,\Big|\, \sqrt{Q^0_F Q^0_X - m_F m_X}\, u + \sqrt{m_F m_X}\, \xi\Big) \]
\[ + \; \alpha \left[ -\frac{\hat m_F m_F}{2} + \int {\cal D}\xi \, {\rm d}F^0 \, e^{-\frac{N \hat m_F}{2}(F^0)^2 + \sqrt{N \hat m_F}\,\xi F^0} P_F(F^0) \log \int {\rm d}F \, e^{-\frac{N \hat m_F}{2}F^2 + \sqrt{N \hat m_F}\,\xi F} P_F(F) \right] \]
\[ + \; \pi \left[ -\frac{\hat m_X m_X}{2} + \int {\cal D}\xi \, {\rm d}X^0 \, e^{-\frac{\hat m_X}{2}(X^0)^2 + \sqrt{\hat m_X}\,\xi X^0} P_X(X^0) \log \int {\rm d}X \, e^{-\frac{\hat m_X}{2}X^2 + \sqrt{\hat m_X}\,\xi X} P_X(X) \right] \,, \]
where $Q^0_F$ and $Q^0_X$ denote the self-overlaps of the true factors, $Q^0_F = N\,\mathbb{E}[(F^0)^2]$ and $Q^0_X = \mathbb{E}[(X^0)^2]$.

Sections III, IV and V are devoted to the derivation of these formulas for the MMSE and the AMP-MSE. In section VI we then give a number of examples of phase diagrams for specific applications of the matrix factorization problem. We reiterate at this point that, whereas our results are exact within the theoretical-physics scope of the cavity/replica method, we do not provide proofs, and their mathematical status is that of conjectures. We devote section II C to explaining why this is so. The situation is comparable to the prediction-conjectures that were made for the satisfiability threshold [20], which have recently been partly established rigorously [21], or the prediction-conjectures that were made for the CDMA problem [22, 23] and were later proved in [24, 25].

D. Applications of matrix factorization

Several important problems which have received a lot of attention recently are special cases of the matrix factorization problem as set above. In this paper we will analyse the following ones.

a. Dictionary learning. This is a basic example of finding a representation of data that is advantageous in some way. Representation learning [1] is a concept behind many machine learning applications, including the deep learning framework, which has become very popular in recent times [2]. In the context of representation learning, dictionary learning is sometimes referred to as sparse coding [1].
Dictionary learning relies on the fact that signals of interest are often sparse in some basis; this property is widely used in data compression and, more recently, in compressed sensing. A lot of work has been devoted to analyzing bases in which different data are sparse. The goal of dictionary learning is to infer a basis in which the data are sparse, based purely on a large number of samples of the data. The $M \times P$ matrix $Y$ then represents $P$ samples of $M$-dimensional data. The goal is to decompose $Y = FX + W$ into an $M \times N$ matrix $F$ and an $N \times P$ sparse matrix $X$, where $W$ is the noise.

In this paper we will analyse the following teacher-student scenario of dictionary learning. We generate a random Gaussian matrix $F^0$ with iid elements of zero mean and variance $1/N$, and a random Gauss-Bernoulli matrix $X^0$ with a fraction $0 < \rho < 1$ of non-zero elements. The non-zero elements of $X^0$ are iid Gaussian with mean $\bar X$ and variance $\sigma$. The noise $W^0$, with elements $w^0_{\mu l}$, is also iid Gaussian with zero mean and variance $\Delta$. We hence have
\[ P_F(F_{\mu i}) = P_{F^0}(F_{\mu i}) = \frac{1}{\sqrt{2\pi/N}} \, e^{-\frac{N F_{\mu i}^2}{2}} \,, \tag{21} \]
\[ P_X(X_{il}) = P_{X^0}(X_{il}) = (1-\rho)\, \delta(X_{il}) + \frac{\rho}{\sqrt{2\pi\sigma}} \, e^{-\frac{(X_{il} - \bar X)^2}{2\sigma}} \,, \tag{22} \]
\[ P_{\rm out}(y_{\mu l}|z_{\mu l}) = \frac{1}{\sqrt{2\pi\Delta}} \, e^{-\frac{(y_{\mu l} - z_{\mu l})^2}{2\Delta}} \,, \tag{23} \]
where $\delta(X)$ is the Dirac delta function. The goal is to infer $F^0$ and $X^0$ from the knowledge of $Y$ with the smallest possible number of samples $P$. In the noiseless case, $\Delta = 0$, exact reconstruction can be possible only when the observed information is larger than the information that we want to infer. This provides a simple counting bound on the number of needed samples $P$:
\[ P \ge \frac{\alpha}{\alpha - \rho}\, N \,. \tag{24} \]

Note that the above assumptions on $P_F$, $P_X$ and $P_{\rm out}$ likely do not hold in any realistic problem. However, it is of theoretical interest to analyze the average theoretical performance and computational tractability of such a problem, as it gives a well-defined benchmark. Moreover, we anticipate that an algorithm working well in the above case might also work well in many real applications where the above assumptions are not satisfied, in the same spirit as the approximate message passing algorithm derived for compressed sensing with zero-mean Gaussian measurement matrices [11] also works for other kinds of matrices.

Typically, we would look for an invertible basis $F^0$ with $N = M$; in that case we speak of a square dictionary. However, with the compressed-sensing application in mind, it is also very interesting to consider that $Y$ might be under-sampled measurements of the actual signal, corresponding then to $\alpha = M/N < 1$. Hence we will be interested in the whole range $0 < \alpha \le 1$. The regime $\alpha < 1$ corresponds to an overcomplete dictionary, in which case each of the $P$ measurements $\vec y_l$ is a sparse linear combination of the columns (atoms) of the dictionary. We recall that the $N$ columns of the matrix $F^0$ can always be permuted arbitrarily and multiplied by $\pm 1$; the same is true for the rows of the matrix $X^0$. None of these operations changes $Y$ or the posterior probability distribution. This is hence an intrinsic freedom of the dictionary learning problem that we have to keep in mind. Note that many works consider the dictionary to be column-normalized, which lifts part of this degeneracy in some optimization formulations of the problem. In our setting, the equivalent of column normalization is asymptotically determined by the properties of the prior distribution.
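For this teacher-student setting the state evolution (12)-(14) simplifies considerably. For the AWGN channel (23) one has $\partial_p g_{\rm out} = -1/(\Delta+V)$, so that (14) reduces to $\hat m^t = 1/(\Delta + \langle (z^0)^2\rangle - m_F^t m_X^t)$, and for the Gaussian prior (21) equation (13) reduces to $m_F^{t+1} = \pi m_X^t \hat m^t / (1 + \pi m_X^t \hat m^t)$. The sketch below iterates these updates, evaluating the remaining one-dimensional integral in (12) by Monte Carlo with the Gauss-Bernoulli denoiser (15) inlined for a zero-mean slab. It is a sketch of the asymptotic analysis under our reading of the equations, not of the AMP algorithm itself; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def npdf(t, m, v):
    return np.exp(-(t - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def f_X(Sigma, T, rho, sigma):
    """Input function (15) for the zero-mean Gauss-Bernoulli prior (22)."""
    w_spike = (1 - rho) * npdf(T, 0.0, Sigma)
    w_slab = rho * npdf(T, 0.0, sigma + Sigma)
    return w_slab * (T * sigma / (sigma + Sigma)) / (w_spike + w_slab)

def se_dictionary_learning(alpha, pi, rho, sigma, Delta, m_F, m_X,
                           n_iter=200, n_mc=200_000):
    """Iterate the state evolution (12)-(14) for the priors (21)-(22)
    (zero-mean slab) and the AWGN channel (23)."""
    z2 = rho * sigma                                   # <(z0)^2>
    X0 = rng.normal(0.0, np.sqrt(sigma), n_mc) * (rng.random(n_mc) < rho)
    xi = rng.normal(size=n_mc)
    for _ in range(n_iter):
        m_hat = 1.0 / (Delta + z2 - m_F * m_X)         # (14) for AWGN
        Sigma = 1.0 / (alpha * m_F * m_hat)
        m_X = np.mean(f_X(Sigma, X0 + np.sqrt(Sigma) * xi, rho, sigma) ** 2)  # (12)
        m_X = min(m_X, z2)      # m_X <= E[(X0)^2]; clip Monte Carlo overshoot
        t = pi * m_X * m_hat
        m_F = t / (1.0 + t)                            # (13) for the prior (21)
    return m_F, m_X

alpha, pi, rho, sigma, Delta = 0.5, 4.0, 0.2, 1.0, 1e-4
mF_u, mX_u = se_dictionary_learning(alpha, pi, rho, sigma, Delta, 1e-6, 1e-6)  # (18)
mF_i, mX_i = se_dictionary_learning(alpha, pi, rho, sigma, Delta,
                                    1.0 - 1e-6, rho * sigma - 1e-6)            # (19)
print("AMP-MSE_X:", rho * sigma - mX_u, "  informative MSE_X:", rho * sigma - mX_i)
```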
b. Blind matrix calibration. In dictionary learning one does not have any specific information about the elements $F^0_{\mu i}$. However, in some applications of compressed sensing one might have an approximate knowledge of the measurement matrix: it is often possible to use known samples of the signal $X$ in order to calibrate the matrix in a supervised manner (i.e. using known training samples of the signal). Sometimes, however, known training samples are not available, and hence the only way to calibrate is to measure a number of unknown samples and perform their reconstruction and the calibration of the matrix at the same time; such a scenario is called blind calibration.

In blind matrix calibration, the properties of the signal and the output function are the same as in dictionary learning, eqs. (22)-(23). As for the matrix elements $F^0_{\mu i}$, one knows a noisy estimate $\hat F_{\mu i}$. In this work we assume that this estimate was obtained from $F^0_{\mu i}$ as
\[ \hat F_{\mu i} = \frac{F^0_{\mu i} + \sqrt{\eta}\, \xi_{\mu i}}{\sqrt{1+\eta}} \,, \tag{25} \]
where $\xi_{\mu i}$ is a Gaussian random variable with zero mean and variance $1/N$. This way, if the matrix elements $F^0_{\mu i}$ have zero mean and variance $1/N$, then the same is true for the elements $\hat F_{\mu i}$. The control parameter $\eta$ quantifies how well one knows the measurement matrix. It provides a way to interpolate between pure compressed sensing, $\eta = 0$, where one knows the measurement matrix $F^0$, and the dictionary learning problem, $\eta \to \infty$. Explicitly, the prior distribution of a given element of the matrix $F$ is
\[ P_F(F_{\mu i}) = {\cal N}\left( \frac{\hat F_{\mu i}}{\sqrt{1+\eta}} \,,\; \frac{\eta}{N(1+\eta)} \right) \,, \tag{26} \]
where ${\cal N}(a,b)$ is a Gaussian distribution with mean $a$ and variance $b$.

c. Low-rank matrix completion. Another special case of matrix factorization that is often studied is low-rank matrix completion. In that case one "knows" only a small (but, in our case, finite as $N \to \infty$) fraction of the elements of the $M \times P$ matrix $Y$. One also knows which elements are known and which are not; let us call ${\cal M}$ the set of indices of the known elements, a set of size $\epsilon MP$. In this case the output function is
\[ P_{\rm out}(y_{\mu l}|z_{\mu l}) = \begin{cases} \dfrac{1}{\sqrt{2\pi\Delta}} \, e^{-\frac{(y_{\mu l} - z_{\mu l})^2}{2\Delta}} & \text{if } \mu l \in {\cal M} \,, \\[6pt] \dfrac{1}{\sqrt{2\pi}} \, e^{-\frac{y_{\mu l}^2}{2}} & \text{if } \mu l \notin {\cal M} \,. \end{cases} \tag{27} \]
The precise choice of the function on the second line is arbitrary as long as it does not depend on $z_{\mu l}$. In what follows we assume that the known elements were chosen uniformly at random.

In low-rank matrix completion $N$ is small compared to $M$ and $P$, hence both $\pi$ and $\alpha$ are relatively large. Note, however, that the limit we analyse in this paper keeps $\pi = O(1)$ and $\alpha = O(1)$ while $N \to \infty$, whereas in many previous works on low-rank matrix completion the rank was considered to be $O(1)$, and hence $\alpha$ and $\pi$ of order $O(N)$. Compared to those works, the analysis here applies to "not-so-low-rank" matrix completion. The question is what fraction of the elements of $Y$ needs to be known in order to be able to reconstruct the two matrices $F^0$ and $X^0$. For negligible measurement noise, $\Delta = 0$, a simple counting bound gives that the fraction of known elements needed for reconstruction is at least
\[ \epsilon \ge \frac{\alpha + \pi}{\alpha\pi} \,. \tag{28} \]
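A minimal sketch of the matrix completion observation model (27): only a uniformly random fraction $\epsilon$ of the entries of $Z$ carries information, and the counting bound (28) can be evaluated alongside. The variable names and parameter values are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

N, alpha, pi = 50, 4.0, 6.0
M, P = int(alpha * N), int(pi * N)
Delta, eps = 1e-4, 0.6

F0 = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N))
X0 = rng.normal(0.0, 1.0, (N, P))
Z0 = F0 @ X0

# Observation mask: the entry (mu, l) belongs to the set M with probability eps.
mask = rng.random((M, P)) < eps
Y = np.where(mask,
             Z0 + rng.normal(0.0, np.sqrt(Delta), (M, P)),   # observed, cf. (27)
             rng.normal(0.0, 1.0, (M, P)))                   # carries no z-information

# Counting bound (28): reconstruction can be possible only if eps >= (alpha+pi)/(alpha*pi).
print("eps =", eps, " counting bound =", (alpha + pi) / (alpha * pi))
```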
Again, we study the teacher-student scenario in which a low-rank $Z$ is generated from $X^0$ and $F^0$ having iid elements distributed according to eqs. (21) and (22) with $\rho = 1$ (no sparsity). To construct $Y$ we keep a random fraction of the elements of $Z$, and the goal is to reconstruct $X^0$ and $F^0$ from that knowledge. We also note that variants of matrix completion where the output channel $P_{\rm out}(y|z)$ keeps only the sign of $z$, called 1-bit matrix completion, were also considered and have some interesting properties reported in [26, 27].

d. Sparse PCA and blind source separation. Principal component analysis (PCA) is a well-known tool for dimensional reduction. One usually considers the singular value decomposition (SVD) of a given matrix and keeps a given number of the largest singular values, thus minimizing the mean-squared error between the original matrix and its low-rank approximation. The SVD is computationally tractable. However, with additional constraints there is no general computationally tractable approach. A variant of PCA that is relevant for a number of practical applications requires that one of the low-rank components be sparse. The goal is then to approximate a matrix $Y$ by a product $FX$, where $F$ is a tall matrix and $X$ a wide sparse matrix.

The teacher-student scenario for sparse PCA that we analyse in this paper uses eqs. (21)-(23), and the matrix dimensions are such that $\pi = P/N$ and $\alpha = M/N$ are both large, but still of order $O(1)$, and comparable to each other. Hence it is only the region of interest for $\alpha$ and $\pi$ that makes this problem different from dictionary learning. Note that, as for matrix completion, many works in the literature consider $N = O(1)$, whereas here we have $N \to \infty$ in such a way that $\pi = P/N = O(1)$ and $\alpha = M/N = O(1)$. Hence we work with low rank, but not as low as in most of the existing literature. For a statistical physics study following the lines of the present paper where the rank is $O(1)$, see [28]. In the zero-measurement-noise case, $\Delta = 0$, the simple counting bound gives that the rank $N$ for which the reconstruction problem may be solvable needs to satisfy
\[ N \le \frac{MP}{M + \rho P} \,. \tag{29} \]

One important application where sparse PCA is relevant is the blind source separation problem. Given $N$ $\rho$-sparse (in some known basis) source signals of dimension $P$, they are mixed in an unknown way via a matrix $F$ into $M$ channel measurements $Y$. In blind source separation, typically both the number of sources $N$ and the number of channels (sensors) $M$ are small compared to $P$. When $N < M$ we obtain an overdetermined problem, which may be solvable even for $\rho = 1$. More interesting is the underdetermined case, with the number of sensors smaller than the number of sources, $M < N$, which is not solvable unless the signal is sparse, $\rho < 1$ (in some basis); in that case the bound (29) applies.

e. Robust PCA. Another variant of PCA that arises often in practice is robust PCA, where the matrix $Y$ is very close to a low-rank matrix $FX$ plus a sparse full-rank matrix. The interpretation is that $Y$ was created as low rank, but then a small fraction of its elements was distorted by a large additive noise. The resulting $Y$ is hence not low rank.
In this paper we analyse a case of robust PCA in which $F$ and $X$ are generated from eqs. (21)-(22) with $\rho = 1$, and the output function is
\[ P_{\rm out}(y_{\mu l}|z_{\mu l}) = \epsilon \, \frac{1}{\sqrt{2\pi\Delta_s}} \, e^{-\frac{(y_{\mu l} - z_{\mu l})^2}{2\Delta_s}} + (1-\epsilon) \, \frac{1}{\sqrt{2\pi\Delta_\ell}} \, e^{-\frac{(y_{\mu l} - z_{\mu l})^2}{2\Delta_\ell}} \,, \tag{30} \]
where $\epsilon$ is the fraction of elements that were not largely distorted, $\Delta_s \ll 1$ is the small measurement noise on the non-distorted elements, and $\Delta_\ell$ is the large measurement noise on the distorted elements. We require $\Delta_\ell \approx \bar X^2 + \sigma$ to be comparable to the variance of $z_{\mu l}$, such that there is no reliable way to tell which elements were distorted simply by looking at the distribution of $y_{\mu l}$. The parameter regime we are interested in here is $\pi$ and $\alpha$ both relatively large and comparable to each other. In the zero-measurement-noise case of robust PCA, $\Delta_s = 0$, the simple counting bound gives that the fraction of non-distorted elements below which reconstruction is not possible satisfies the same bound as for matrix completion, eq. (28). Indeed, this counting bound does not distinguish between the case of matrix completion, where the positions of the known elements of $Y$ are known, and the case of robust PCA, where their positions are unknown.

f. Factor analysis. One major objective of multivariate data analysis is to infer an appropriate mechanism from which observed high-dimensional data are generated. Factor analysis (FA) is a representative methodology for this purpose. Let us suppose that a set of $M$-dimensional vectors $y_1, y_2, \ldots, y_P$ is given and its mean is set to zero by pre-processing. Under such a setting, FA assumes that each observed vector $y_l$ is generated from an $N(\le M)$-dimensional common factor $X_l$ and an $M$-dimensional unique factor $w_l$ as $y_l = F X_l + w_l$, where $F \in \mathbb{R}^{M \times N}$ is termed the loading matrix. The goal is to determine the entire set of $F$, $X = (X_1, X_2, \ldots, X_P)$, and $W = (w_1, w_2, \ldots, w_P)$ from $Y = (y_1, y_2, \ldots, y_P)$ only. Therefore, FA is also expressed as a factorization problem of the form $Y = FX + W$ in matrix terms. The characteristic feature of FA is to take into account the site dependence of the output function as
\[ P_{\rm out}(y_{\mu l}|z_{\mu l}, \psi_\mu) = \frac{1}{\sqrt{2\pi\psi_\mu}} \, e^{-\frac{(y_{\mu l} - z_{\mu l})^2}{2\psi_\mu}} \,, \tag{31} \]
where $\psi_\mu$ denotes the variance of the $\mu$-th component of the unique factor. In addition, it is normally assumed that each $X_l$ ($l = 1, 2, \ldots, P$) independently obeys a zero-mean multivariate Gaussian distribution, the variance of which is set to the identity matrix in the basic case. These assumptions make it possible to express the log-likelihood of $F$, $\Psi$ given $Y$ in a compact manner as
\[ \log P(Y|F,\Psi) = \log \int \prod_{\mu=1,l=1}^{M,P} P_{\rm out}(y_{\mu l}|z_{\mu l}, \psi_\mu) \prod_{i=1,l=1}^{N,P} P_X(X_{il}) \, {\rm d}X = -\frac{P}{2} \log\det\left( FF^{\rm T} + \Psi \right) - \frac{1}{2} \sum_{l=1}^{P} y_l^{\rm T} \left( FF^{\rm T} + \Psi \right)^{-1} y_l + {\rm const} \,, \]
where $\Psi = (\psi_\mu \delta_{\mu\nu}) \in \mathbb{R}^{M \times M}$. In a standard scheme, $F$ and $\Psi$, which parameterize the generative mechanism of the observed data $Y$, are determined by maximizing this log-likelihood function [29]. After obtaining these, the posterior distribution $P(X|Y, F_{\rm ML}, \Psi_{\rm ML})$ is used to estimate the common factor $X$, where $F_{\rm ML}$ and $\Psi_{\rm ML}$ are the maximum likelihood estimators of $F$ and $\Psi$, respectively. Finally, the unique factor $W$ is determined from the relation $Y = FX + W$. Other heuristics, which minimize certain discrepancies between the sample variance-covariance matrix $P^{-1} YY^{\rm T}$ and $FF^{\rm T} + \Psi$, are also conventionally used for determining $F$ and $\Psi$. As an alternative approach, we employ Bayesian inference for FA.
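The FA marginal likelihood above follows from $y_l \sim {\cal N}(0, FF^{\rm T} + \Psi)$ and is straightforward to evaluate numerically; the small sketch below (our own helper, for illustration) computes $\log P(Y|F,\Psi)$ up to the additive constant.

```python
import numpy as np

def fa_log_likelihood(Y, F, psi):
    """Log-likelihood log P(Y|F, Psi) of factor analysis, up to an additive
    constant: each column y_l of Y is distributed as N(0, F F^T + Psi)."""
    M, P = Y.shape
    C = F @ F.T + np.diag(psi)                 # model covariance F F^T + Psi
    sign, logdet = np.linalg.slogdet(C)
    quad = np.sum(Y * np.linalg.solve(C, Y))   # sum_l y_l^T C^{-1} y_l
    return -0.5 * P * logdet - 0.5 * quad

# Toy check on data generated by the FA model itself:
rng = np.random.default_rng(3)
M, N, P = 20, 5, 1000
F = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N))
psi = rng.uniform(0.5, 1.5, M)
Y = F @ rng.normal(size=(N, P)) + rng.normal(size=(M, P)) * np.sqrt(psi)[:, None]
print(fa_log_likelihood(Y, F, psi))
```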
g. Non-negative matrix factorization. In many applications of matrix factorization it is known that both the coefficients of the signal $X$ and the coefficients of the dictionary $F$ must be non-negative. In our setting this can easily be taken into account by imposing that the distributions $P_X$ and $P_F$ have non-negative support. In the teacher-student scenario of this paper we can hence consider the non-zero elements of $X$ to follow $P_X(X) = 0$ for $X < 0$ and $P_X(X) \propto {\cal N}(\bar X, \sigma)$ for $X > 0$, and analogously for $P_F$.

E. Related work and positioning of our contribution

Matrix factorization and its special cases mentioned above are well-studied problems with an extensive theoretical and algorithmic literature that we cannot fully cover here. We will hence only give examples of relevant works, and will try to be exhaustive only concerning papers that are very closely related to ours (i.e. papers using message-passing algorithms or analyzing the same phase transitions in the same scaling limits).

The dictionary learning problem was identified in the work of [3, 4] in the context of image representation in the visual cortex, and the problem has been studied extensively since [5]. Learning overcomplete dictionaries for sparse representations of data has many applications, see e.g. [30]. MOD [31] and K-SVD [32] are two representative algorithms for dictionary learning. Several authors studied the identifiability of the dictionary under various (in general weaker than ours) assumptions, e.g. [33-39]. An interesting view on the place of sparse and redundant representations in today's signal processing is given in [19].

A nice overview of recent progress on low-rank matrix factorization from incomplete observations is given in [27]. The two closely related problems of sparse principal component analysis and blind source separation are also explored in a number of works, see e.g. [40-44]. A short survey on the topic with relevant references can be found in [45]. Matrix completion is another problem that belongs to the class treated in this paper. Again, many important works were devoted to this problem, giving theoretical guarantees, algorithms and applications, see e.g. [8, 9, 46, 47]. Another related problem is robust principal component analysis, which was also studied by many authors; algorithms and theoretical limits were analyzed in [10, 48-50].

Our work differs from the mainstream of the existing literature in its general positioning. Let us mention here some of the main differences with most other works.

• Most existing works concentrate on finding theoretical guarantees and algorithms that work in the worst possible case of matrices $F$ and $X$. In our work we analyze typical cases, where the elements of $F$ and $X$ are generated at random. Arguably, a worst-case analysis is useful for practical applications in that it provides guarantees. On the other hand, in some large-size applications one can in practice be confronted with a situation which is closer to the typical case that we study here. Our typical-case analysis provides results that are much tighter than those usually obtained in the literature, in terms of both achievability and computational tractability. For instance, our results imply much smaller sample complexity, or much smaller mean-squared error for a given signal-to-noise ratio.

• Our main contribution is the asymptotic phase diagram of Bayes-optimal inference in matrix factorization.
Special cases of our result cover important problems in signal processing and machine learning. In the present work we do not concentrate on the validation of the associated approximate message-passing algorithm on practical examples, nor on its comparison with existing algorithms. A contribution in this direction can be found in [15, 18].

• A large part of the existing machine-learning and signal-processing literature provides theorems. Our work is based on statistical physics methods that are conjectured to give exact results. While many results obtained with these methods on a variety of problems have indeed been proven later on, a general proof that these methods are rigorous is not known yet.

• Many existing algorithms are based on convex relaxations of the corresponding problems. In this paper we analyze the Bayes-optimal setting. Not surprisingly, this setting gives a much better performance than convex relaxation. The reason why it is less studied is that it is often considered hopelessly complicated from the algorithmic perspective; however, some recent results using message-passing algorithms, which stand at the core of our analysis, have shown very good performance in the Bayes-optimal setting. Besides, the Bayes-optimal setting offers an optimal benchmark for testing algorithmic methods.

• When treating low-rank matrices we consider the rank to be a finite fraction of the total dimension, whereas most of the existing literature considers the rank to be a finite constant. The AMP algorithm derived in this paper can of course be used as a heuristic also in the constant-rank case, but our results about the MMSE are exact only in the limit where the rank is a finite fraction of the total dimension. Also note that for the special case of rank one the AMP algorithm was derived and analysed rigorously [51, 52].

• When treating sparsity we consider the number of non-zeros to be a finite fraction of the total dimension, whereas the existing literature often considers a constant number of non-zeros. Again, the AMP algorithm derived in this paper can of course be used as a heuristic also in the case of a constant number of non-zeros, but our results about the MMSE are exact only in the limit where the density is a finite fraction of the total dimension.

The paper is organized as follows. First, in Sec. II, we explain the statistical physics background of Bayes-optimal inference. In Sec. III we give a detailed derivation of the approximate message passing algorithm for the matrix factorization problem. This algorithm is equivalent to a maximization of the Bethe free entropy, whose expression is discussed in Sec. IV. These two sections thus state our algorithmic approach to matrix factorization. The asymptotic performance of the AMP algorithm and the Bayes-optimal MMSE are analyzed in Sec. V using two techniques: i) the state evolution of the AMP algorithm and ii) the replica method. As usual, the two methods are found to agree and give the same predictions. We then use these results to study some examples of matrix factorization problems in Sec. VI. In particular, we derive the phase diagram, the MMSE and the sample complexity of dictionary learning, blind matrix calibration, sparse PCA, blind source separation, low-rank matrix completion, robust PCA and factor analysis. Our results are summarized and discussed in the conclusion, Sec. VII.

II. ELEMENTS OF THE BACKGROUND FROM STATISTICAL PHYSICS
A. The Nishimori identities

There are important identities that hold for Bayes-optimal inference and that simplify many of the calculations that follow. In the physics of disordered systems these identities are known as the Nishimori identities [53-55]. Basically, they follow from the fact that the true signal $F^0$, $X^0$ is an equilibrium configuration with respect to the Boltzmann measure $P(F,X|Y)$, eq. (5). Hence many properties of the true signal $F^0$, $X^0$ can be computed by using averages over the distribution $P(F,X|Y)$, even if one does not know $F^0$, $X^0$ precisely.

In order to derive the Nishimori identities, we need to define three types of averages: the thermodynamic average, the double thermodynamic average, and the disorder average.

• Consider a function $A(F,X)$ depending on a "trial" configuration $F$ and $X$. We define the "thermodynamic average" of $A$ given $Y$ as
\[ \langle A(F,X) \rangle_{F,X|Y} \equiv \int {\rm d}X \, {\rm d}F \, A(F,X) \, P(F,X|Y) \,, \tag{32} \]
where $P(F,X|Y)$ is given by Eq. (5). The thermodynamic average $\langle A(F,X) \rangle_{F,X|Y}$ is a function of $Y$.

• Similarly, for a function $A(F^1,X^1,F^2,X^2)$ that depends on two trial configurations $X^1, F^1$ and $X^2, F^2$, we define the "double thermodynamic average" of $A$ given $Y$ as
\[ \langle\langle A(F^1,X^1,F^2,X^2) \rangle\rangle_{F^1,X^1,F^2,X^2|Y} \equiv \int {\rm d}X^1 {\rm d}F^1 {\rm d}X^2 {\rm d}F^2 \, A(F^1,X^1,F^2,X^2) \, P(F^1,X^1|Y) \, P(F^2,X^2|Y) \,; \tag{33} \]
this is again a function of $Y$.

• For a function $B$ that depends on the measurement $Y$ and on the true signal $F^0$, $X^0$, we define the "disorder average" as
\[ [B(F^0,X^0,Y)]_{F^0,X^0,Y} \equiv \int {\rm d}Y \, {\rm d}X^0 \, {\rm d}F^0 \, P_X(X^0) \, P_F(F^0) \, P_{\rm out}(Y|F^0X^0) \, B(F^0,X^0,Y) \,. \tag{34} \]
Note that if the quantity $B$ depends only on $Y$, then we have
\[ [B(Y)]_Y \equiv \int {\rm d}Y \, Z(Y) \, B(Y) \,. \tag{35} \]
This is simply because the partition function $Z(Y)$ is $Z(Y) = \int {\rm d}X^0 \, {\rm d}F^0 \, P_X(X^0) \, P_F(F^0) \, P_{\rm out}(Y|F^0X^0)$.

Let us now derive the Nishimori identities. We consider a function $A(F,X,F^0,X^0)$ that depends on the trial configuration $F$, $X$ and on the true signal $F^0$, $X^0$. Its thermodynamic average $\langle A(F,X,F^0,X^0) \rangle_{F,X|Y}$ is a function of the measurement $Y$ and the true signal $F^0$, $X^0$. The disorder average of this quantity can be written as
\[ [\langle A(F,X,F^0,X^0) \rangle_{F,X|Y}]_{F^0,X^0,Y} = \int {\rm d}F^0 {\rm d}X^0 {\rm d}Y \, P_X(X^0) P_F(F^0) P_{\rm out}(Y|F^0X^0) \int {\rm d}F \, {\rm d}X \, A(F,X,F^0,X^0) \, P(F,X|Y) \]
\[ = \int {\rm d}Y \, Z(Y) \int {\rm d}F^0 {\rm d}X^0 {\rm d}F \, {\rm d}X \, A(F,X,F^0,X^0) \, \frac{P_X(X^0) P_F(F^0) P_{\rm out}(Y|F^0X^0)}{Z(Y)} \, P(F,X|Y) \,. \tag{36} \]
In this last expression, renaming $F, X$ to $F^1, X^1$ and $F^0, X^0$ to $F^2, X^2$, we see that the average over $F, X, F^0, X^0$ is nothing but the double thermodynamic average:
\[ [\langle A(F,X,F^0,X^0) \rangle_{F,X|Y}]_{F^0,X^0,Y} = \int {\rm d}Y \, Z(Y) \int {\rm d}F^1 {\rm d}X^1 {\rm d}F^2 {\rm d}X^2 \, A(F^1,X^1,F^2,X^2) \, P(F^1,X^1|Y) \, P(F^2,X^2|Y) \]
\[ = [\langle\langle A(F^1,X^1,F^2,X^2) \rangle\rangle_{F^1,X^1,F^2,X^2|Y}]_Y \,, \tag{37} \]
where in the last step we have used the form (35) of the disorder average. The identity
\[ [\langle A(F,X,F^0,X^0) \rangle_{F,X|Y}]_{F^0,X^0,Y} = [\langle\langle A(F^1,X^1,F^2,X^2) \rangle\rangle_{F^1,X^1,F^2,X^2|Y}]_Y \tag{38} \]
is the general form of the Nishimori identity. Written in this way, it holds for many inference problems where the model for signal generation is known.
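The identity (38) can be checked numerically on any toy model where posterior sampling is easy. The sketch below (our own construction, for illustration) uses scalar Gaussian denoising, $y = x^0 + \text{noise}$ with a Gaussian prior, where the posterior is Gaussian and both sides of the overlap form of the identity derived below (cf. eq. (39)) can be estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy Bayes-optimal model: x0 ~ N(0,1), y = x0 + noise of variance Delta.
# The posterior mean is <x>_{x|y} = y / (1 + Delta).
Delta, n = 0.5, 10**6
x0 = rng.normal(size=n)
y = x0 + rng.normal(0.0, np.sqrt(Delta), size=n)
post_mean = y / (1 + Delta)

# Nishimori identity, overlap form: E[x0 <x>] should equal E[<x>^2].
print(np.mean(x0 * post_mean), np.mean(post_mean ** 2))  # both ~ 1/(1+Delta)
```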
Self-averaging is a property that is assumed to hold for many quantities in statistical physics. Self-averaging means that in the thermodynamic limit (as introduced in the first paragraph of section I C) the distribution of $\langle A(F,X,F^0,X^0) \rangle_{F,X|Y}$ concentrates with respect to the realization of the disorder $Y, F^0, X^0$. In other words, for large system sizes the quantity $\langle A(F,X,F^0,X^0) \rangle_{F,X|Y}$ converges with probability one to its average over the disorder, $[\langle A(F,X,F^0,X^0) \rangle_{F,X|Y}]_{F^0,X^0,Y}$. Self-averaging implies the existence of the thermodynamic limit and is in general very challenging to prove rigorously.

Self-averaging makes the Nishimori identity very useful in practice. To give one particularly useful example, let us define $m_X = \frac{1}{PN} \sum_{il} X^0_{il} X_{il}$ and $q_X = \frac{1}{PN} \sum_{il} X^1_{il} X^2_{il}$. The Nishimori identity states that $[\langle m_X \rangle_{F,X|Y}]_{F^0,X^0,Y} = [\langle\langle q_X \rangle\rangle_{F^1,X^1,F^2,X^2|Y}]_Y$. Assuming that the quantities $m_X$ and $q_X$ are self-averaging, we obtain in the thermodynamic limit, for almost all $Y$: $\langle m_X \rangle_{F,X|Y} = \langle\langle q_X \rangle\rangle_{F^1,X^1,F^2,X^2|Y}$. Explicitly, this gives
\[ \frac{1}{PN} \sum_{il} X^0_{il} \, \langle X_{il} \rangle_{F,X|Y} = \frac{1}{PN} \sum_{il} \big( \langle X_{il} \rangle_{F,X|Y} \big)^2 + o_N(1) \,, \tag{39} \]
where $o_N(1)$ goes to zero as $N \to \infty$. This is a remarkable identity concerning the mean of $X_{il}$ under the posterior distribution $\nu_{il}(X_{il})$. The left-hand side measures the overlap between this mean and the sought true value $X^0_{il}$. The right-hand side measures the self-overlap of the mean, which can be estimated without any knowledge of the true value $X^0_{il}$, by generating two independent samples from $P(F,X|Y)$. By symmetry, an analogous identity applies to averages and functions of the matrix $F$:
\[ \frac{1}{M} \sum_{\mu i} F^0_{\mu i} \, \langle F_{\mu i} \rangle_{F,X|Y} = \frac{1}{M} \sum_{\mu i} \big( \langle F_{\mu i} \rangle_{F,X|Y} \big)^2 + o_N(1) \,. \tag{40} \]
The validity of relation (39) also follows straightforwardly from the self-averaging property and an elementary relation for conditional expectations, $\mathbb{E}\big(\mathbb{E}(X|Y)\,X\big) = \mathbb{E}\big((\mathbb{E}(X|Y))^2\big)$.

Another example of a Nishimori identity used in our derivation is eq. (108). Note that this one is crucial to keep certain variance-related variables strictly positive, as explained in section III C. Without eq. (108), the AMP algorithm may exhibit pathological numerical behavior, causing some quantities that are by definition strictly positive to become negative. A deeper understanding of this problem in cases where the Nishimori identities cannot be imposed is under investigation. The Nishimori identities also greatly simplify the state evolution equations of section V A 1. In general the state evolution involves nine coupled equations. However, after imposing the Nishimori identities, eqs. (187)-(189), we are left with only three coupled equations.

B. Easy/hard and possible/impossible phase transitions

Fixed points of equations (12)-(14) allow us to evaluate the MMSE of the matrix $F$ and of the matrix $X$. We investigate two fixed points: the one that is reached from the "uninformative" initialization (18), and the fixed point that is reached with the informative (planted) initialization (19). If these two initializations lead to different fixed points, it is the one with the largest value of the Bethe free entropy (20) that corresponds to the Bayes-optimal MMSE.
The reason for this is that the Bethe free entropy is the exponent of the normalization constant of the posterior probability distribution: a larger Bethe free entropy is hence related to a fixed point that corresponds to an exponentially larger probability. Furthermore, in cases where the uninformative initialization does not lead to this Bayes-optimal MMSE, we anticipate that the optimal solution is hard to reach for a large class of algorithms.

Depending on the application in question and on the values of the parameters ($\alpha, \pi, \rho, \epsilon, \ldots$) we can sometimes identify a phase transition, i.e. a sharp change in the behavior of the MMSE. As in statistical physics, it is instrumental to distinguish two kinds of phase transitions.

• Second order phase transition: In this case there are two regions of parameters. In one, the recovery performance is poor (positive MMSE); in the other, the recovery is perfect (zero MMSE). This situation can arise only at zero noise; at positive noise there is a smooth "crossover" and the transition between the phases of good and poor recovery is not sharply defined. Interestingly, as we will see, in all examples where we observed a second order phase transition, its location coincides with the simple counting bounds discussed in Section I D. In the phase with poor MMSE, there is simply not enough information in the data to recover the original signal (independently of the recovery algorithm or its computational complexity). Experience from statistical physics tells us that in problems where such a second order phase transition happens, and more generally in cases where the state evolution (12)-(14) has a unique fixed point, there is no fundamental barrier that would prevent good algorithms from attaining the MMSE. In particular, our analysis suggests that in the limit of large system sizes the AMP algorithm derived in this paper should be able to achieve this Bayes-optimal MMSE.

• First order phase transition: A more subtle phenomenology can be observed in the low-noise regime of dictionary learning, sparse PCA and blind matrix calibration. In some region of parameters, which we call the "spinodal" region, the informative (planted) and uninformative initializations do not lead to the same fixed point of the state evolution equations. The spinodal regime itself may divide into two parts: the one where the uninformative fixed point has the larger free entropy, and the "solvable-but-hard" phase where the informative fixed point has the larger free entropy. The boundary between these two parts is the first order phase transition point. We anticipate that in the solvable-but-hard phase the Bayes-optimal MMSE is not achievable by the AMP algorithm, nor by a large class of other known algorithms. The first order phase transition is associated with a discontinuity in the MMSE. The MSE reached from the uninformative initialization, denoted AMP-MSE, is also discontinuous.

In the case of a first order phase transition it is hence useful to distinguish in our notation between the minimal achievable MSE, which we denote MMSE, and the MSE achievable by AMP-like algorithms, which we denote AMP-MSE. When MMSE = AMP-MSE in the large $N$ limit, we say that AMP is asymptotically optimal. The region where, in the large size limit, MMSE < AMP-MSE is called the spinodal regime/region. A sketch of how these two fixed points can be tracked numerically across a parameter scan is given below.
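As an illustration of how the spinodal region can be located in practice, the following sketch scans the sample ratio $\pi$ and compares the fixed points reached from the uninformative initialization (18) and from the informative initialization (19); a gap between the two MSE curves signals the spinodal region. It assumes the `se_dictionary_learning` helper from the sketch in Sec. I D a is in scope, and all parameter values are illustrative.

```python
import numpy as np

# Assumes the se_dictionary_learning helper from the sketch in Sec. I.D.a.
alpha, rho, sigma, Delta = 0.5, 0.2, 1.0, 1e-4
for pi in np.linspace(1.0, 6.0, 11):
    _, mX_u = se_dictionary_learning(alpha, pi, rho, sigma, Delta, 1e-6, 1e-6)
    _, mX_i = se_dictionary_learning(alpha, pi, rho, sigma, Delta,
                                     1.0 - 1e-6, rho * sigma - 1e-6)
    amp_mse, inf_mse = rho * sigma - mX_u, rho * sigma - mX_i
    # amp_mse > inf_mse indicates the spinodal region; which fixed point gives
    # the MMSE is decided by comparing the free entropy (20) at the two points.
    print(f"pi={pi:.1f}  AMP-MSE_X={amp_mse:.4f}  informative MSE_X={inf_mse:.4f}")
```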
C. Why are our results not rigorous? Why do we conjecture that they are exact?

In this section we aim at summarizing the assumptions that we make, but do not prove, along the derivation of our results presented in sections III, IV and V. In section VI we then give the MMSE and the AMP-MSE as obtained from the numerical evaluation of eqs. (12)-(14).

1. The statistical-physics strategy for solving the problem

Let us first state the general strategy from a physics point of view. We use two complementary approaches, the cavity method and the replica method. Our main goal is to derive the MMSE for the matrix factorization problem in the Bayes-optimal setting. Our main tool is the cavity method, which presents two advantages with respect to the replica method: 1) during the derivation of the MMSE we obtain the AMP algorithm as a side product, and 2) based on the experience of the last three decades, it is likely that a rigorous proof of our conjectures will follow this path very closely.

In the cavity method we first assume that there is a fixed point of the belief propagation (BP) equations that correctly describes the posterior probability distribution, and that the BP equations (initialized randomly or, in the case where a first order phase transition is present, initialized close to the planted configuration) converge to it. We are then interested in the average evolution of the BP iterations. This can be understood through a statistical analysis of the BP equations in which we keep only the leading terms in the large $N$ limit, an approach called the cavity method [12] in the physics literature. The result of this analysis are the state evolution equations, which should be asymptotically exact and no longer depend on the system size $N$. There is one known way in which the assumptions made in this derivation can fail, called replica-symmetry breaking (RSB). Fortunately, in Bayes-optimal inference RSB is excluded, as we explain below.

Our second analysis uses the replica method. In the replica approach one computes the average of the $n$-th power of the normalization of the posterior probability distribution, and then uses the limit $n \to 0$ to obtain the average of its logarithm. Doing this, one needs to evaluate a saddle point of a certain potential, and one assumes that this saddle point is restricted to a certain class called "replica symmetric", eq. (220). Again, this assumption is justified in Bayes-optimal inference, because we know that there is no RSB in this case.

Our two analyses, using the cavity method and the replica method, lead to the same set of closed equations for the MMSE of the problem. This important cross-check, and the very reasonable nature of the assumptions described below (based on our experience in using this kind of approach in various settings), lead us to conjecture that the results described in this paper are exact.

2. What are the assumptions?

Let us start with the cavity method, which is in general much more amenable to a rigorous analysis. Let us hence describe the assumptions made in section III.

• Our main assumption is that, to leading order in $N$, the true marginal probabilities can be obtained from a fixed point of the belief propagation equations (41)-(44) as written for the factor graph in Fig. 1.
For this to be possible, the incoming messages on the right-hand side of eqs. (41)-(44) must be conditionally independent (in the thermodynamic limit, as defined in section I C), in which case we can write the joint probabilities as products over the messages. If the factor graph were a tree, this assumption would be obviously correct. On factor graphs that are locally tree-like, these assumptions can be justified (often also rigorously) by showing that the correlations decay fast enough (see for instance [13]). The factor graph in Fig. 1 is far from a tree. However, from the point of view of conditional independence between messages, this factor graph resembles the one of compressed sensing, for which the corresponding results were proven rigorously [25, 56]. It is hence reasonable to expect that matrix factorization belongs to the same class of problems, where BP is asymptotically exact in the above sense.

• We assume that the iterations of equations (41)-(44) converge, for very large systems ($N \to \infty$), to this fixed point that describes the true marginal probabilities, provided that the iteration is started from the right initial condition (in the presence of a first order phase transition we need to initialize the iteration in the planted configuration in order to reach the right fixed point).

Given these assumptions, the rest of section III is justified by the fact that in the derivation we only neglect terms that do not contribute to the final MSE in the thermodynamic limit.

The main assumptions made in the replica calculation are that "self-averaging" applies (this is a statement on the concentration of the measure, which basically assumes that the averaged properties are the same as the properties of a generic single instance of the problem) and that we can exchange the limits $\lim_{n \to 0}$ and $\lim_{N \to \infty}$. On top of these, it relies on the replica symmetric calculation, which is justified here because we are interested in evaluating the Bayes-optimal MMSE. Unfortunately, in the mathematical literature there is so far very little progress in proving these basic assumptions of the replica method, even in problems that are much easier than matrix factorization. It may be remembered that the original motivation for introducing the cavity method [57] was precisely to provide an alternative to the replica computations, one that could be stated in clear mathematical terms.

In this paper we follow the physics strategy, and we hence provide a conjecture for evaluating the MMSE in matrix factorization. We do anticipate that further work following the path of the cavity approach will eventually provide a proof of our conjecture. Let us mention that rigorous results exist for closely related problems, namely the problem of compressed sensing [25, 56] and rank-one matrix factorization [51, 52]. None of these apply directly to our case, where the rank is extensive and the matrix $F$ is not known. A non-trivial generalization of these works will be needed in order to prove our results.

3. Cavity and replica method for other problems

The cavity and replica methods, as we use them in the present paper, rely on the set of assumptions listed above. It is worth noting that over the years they have been applied successfully to a wide range of problems. There is nowadays a range of results that can be "derived" (and in many of those cases that was the original derivation) using the cavity and/or the replica method and that have been proven fully rigorously.
The list is long, but to cite a couple of interesting examples we can mention the proof of the satisfiability threshold conjecture [21], the performance of low-density parity check codes [58], the optimal solution of the linear assignment problem [59], the analysis of the compressed sensing problem [11, 25], and many more.

D. Absence of replica symmetry breaking in Bayes-optimal inference

The study of Bayes-optimal inference, i.e. the case when the prior probability distribution matches the true distribution with which the data have been generated, is fundamentally simpler than the general case. The fundamental reason for this simplicity is that, in the Bayes-optimal case, a major complication referred to as static replica symmetry breaking (RSB) in the statistical physics literature does not appear [55]. RSB is a source of computational complications in many optimization and constraint satisfaction problems such as random K-satisfiability [20] or models of spin glasses [12].

One common way to define RSB is via the overlap distribution $P(q)$ [12, 13, 55]. We define the overlap between two configurations $X^1$ and $X^2$ as

q_X(X^1, X^2) = \frac{1}{PN} \sum_{il} X^1_{il} X^2_{il} .

The overlap distribution function is then defined (for a given realization of the problem $X^0, F^0, Y$) as the probability distribution of $q_X(X^1, X^2)$ over the posterior measure $P(X, F \mid Y)$. For problems that do not have any permutational symmetry (such as the permutation of the $N$ rows of $X$ and columns of $F$), we say the system is replica symmetric when $\lim_{N\to\infty} P(q) = \delta(q - q_0)$ for almost all choices of $X^0, F^0, Y$, and that there is replica symmetry breaking otherwise. The number of steps of RSB then corresponds to the number of additional delta peaks in the limiting distribution $P(q)$. When the limiting distribution $P(q)$ has a continuous support we speak of full-RSB. For systems with permutational symmetry the corresponding number of peaks simply gets multiplied by the size of the symmetry group.

Let us further define the so-called magnetization of a configuration $X^1$ with respect to some reference configuration $X^0$ as

m_X(X^0, X^1) = \frac{1}{PN} \sum_{il} X^0_{il} X^1_{il} .

Such a magnetization and its variance can be computed from the first and second derivatives of the free entropy density. The derivative is taken with respect to a so-called magnetic field, an auxiliary parameter introduced in the posterior probability for this purpose. In statistical physics we expect (and in some cases can actually prove) some standard properties of the free entropy, such as its self-averaging and its analyticity (except at phase transition points) with respect to such parameters. From this reasoning it follows that whereas the mean of the magnetization is of order $O(1)$, its variance is of order $O(1/N)$. Hence in the limit $N\to\infty$ the distribution (over $P(X,F\mid Y)$) of the magnetization is a delta peak. If the reference configuration $X^0$ is the planted configuration, then self-averaging implies that the distribution of $m_X(X^0, X^1)$ with respect to both its arguments converges to a delta function. From the Nishimori identities it follows that the same has to be true for the distribution of the overlaps $P(q)$. Therefore only a $P(q)$ converging to a delta function in the thermodynamic limit is allowed in the Bayes-optimal inference setting.
On a more technical side, it also follows from the Nishimori identities that the two-point correlations between variables decay sufficiently fast, as proven in [60]; this excludes the existence of static RSB for the posterior measure $P(F, X \mid Y)$.

Let us stress, however, that when there is a mismatch between the prior distributions and the distributions from which the signal was truly generated, replica symmetry breaking can happen. The distribution of overlaps may become non-trivial even if the distribution of magnetizations is not, and in order to provide exact results or conjectures for this situation our analysis would have to be revised accordingly. This might turn out to be very non-trivial (see for instance the case of compressed sensing with $\ell_p$-norm reconstruction for $p < 1$ [61]) and is left for future work.

E. Learning of hyper-parameters

To obtain the main results of this paper we assume that the priors $P_X$ and $P_F$ are known, including their related parameters. In practice, even if one knows a parametric family of probability distributions that approximates well the matrix elements of $F$ and $X$, one often does not have full information about the parameters of this distribution (typically its mean and variance). In the same manner, one might know the nature of the measurement noise, but not its precise amplitude. In this subsection we remark that learning of the parameters can be included rather straightforwardly in the present algorithmic approach. However, its detailed implementation and analysis, as well as the analysis of the deterioration of performance when the prior distributions are not known, is left for future work.

In order to learn the hyper-parameters, a common technique in statistical inference is expectation maximization [62], where one updates the hyper-parameters iteratively in order to maximize the posterior likelihood, i.e. the normalization $Z(Y)$ of the posterior probability measure. In the context of approximate message passing, expectation maximization has been derived and implemented e.g. in [63, 64]. It turns out that the expectation-maximization update of the hyper-parameters is analogous to the update where one imposes the Nishimori identities, as was done for compressed sensing in [63]. In other words, one updates the hyper-parameters in such a way that the Nishimori identities are satisfied after every iteration. For instance, the mean and variance of the distribution $P_X$ are updated to correspond to the empirical mean and variance of the estimators of the elements $X_{il}$. The variance of the noise in a Gaussian output channel is updated to match the squared difference between the observed matrix $Y$ and the estimator of the product $FX$. Generically, for mismatched priors the Nishimori identities are not satisfied, while the various hyper-parameters appear explicitly in these identities. Therefore, one way of learning the parameters is to iteratively set new values of the hyper-parameters so as to satisfy the Nishimori identities. This can be implemented straightforwardly within the GAMP code; a minimal sketch is given below. In compressed sensing, following this strategy is equivalent to an expectation-maximization learning algorithm [63].
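As an illustration, the following is a minimal sketch, under our own assumed naming conventions, of the noise-variance update for a Gaussian output channel. It enforces the moment-matching form of the Nishimori identity, which for an AWGN channel states that the average squared residual $(y-\omega)^2$ should equal $\Delta$ plus the estimated variance $V$ of the product; the function name and the exact form of the update are illustrative assumptions, not prescriptions of the paper.

```python
import numpy as np

def em_update_noise_variance(Y, omega, V):
    """Hedged sketch of a hyper-parameter update for the AWGN channel: choose
    Delta so that the empirical residual matches the Nishimori identity
    E[(y - omega)^2] = Delta + V, with omega and V the current AMP estimates
    of the mean and variance of Z = F X (our assumed names and shapes)."""
    return max(np.mean((Y - omega) ** 2 - V), 1e-12)  # keep the variance positive
```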
As a further remark, since the Bayes-optimal setting has several nice properties in terms of the simplicity of the analytical approach and of the convergence of the message-passing algorithms, bringing the iterations back to the Nishimori line by doing expectation maximization quite generically improves the convergence of the algorithm. In our opinion this is one of the properties observed in [17, 18].

III. APPROXIMATE MESSAGE PASSING FOR MATRIX FACTORIZATION

A. Approximate belief propagation for matrix factorization

Bayes inference amounts to the computation of marginals of the posterior probability (5). In order to make it computationally tractable we have to resort to approximations. In compressed sensing, the Bayesian approach combined with a belief propagation (BP) reconstruction algorithm leads to the so-called approximate message passing (AMP) algorithm. It was first derived in [11] for the minimization of $\ell_1$, and subsequently generalized in [65, 66]. We shall now adapt the same strategy to the case of matrix factorization.

[FIG. 1. Factor graph used for the belief propagation inference, here drawn using N = 3, P = 2 and M = 2. The factor nodes are associated to the probability $P_{out}(y_{\mu l} \mid \sum_i F_{\mu i} X_{il})$; the figure also shows the four types of messages $n_{\mu i\to\mu l}(F_{\mu i})$, $\tilde n_{\mu l\to\mu i}(F_{\mu i})$, $m_{il\to\mu l}(X_{il})$ and $\tilde m_{\mu l\to il}(X_{il})$.]

The factor graph corresponding to the posterior probability (5) is depicted in Fig. 1. The canonical BP iterative equations [67] are written using messages $m_{il\to\mu l}(X_{il})$, $n_{\mu i\to\mu l}(F_{\mu i})$ from variables to factors, and messages $\tilde m_{\mu l\to il}(X_{il})$, $\tilde n_{\mu l\to\mu i}(F_{\mu i})$ from factors to variables. On tree graphical models the messages are defined as marginal probabilities of their arguments, conditioned on the variable/factor to which the message is incoming being absent from the graph. The following BP equations provide the exact values of these conditional marginals on trees:

m_{il\to\mu l}(t+1, X_{il}) = \frac{1}{Z^{il\to\mu l}}\, P_X(X_{il}) \prod_{\nu(\neq\mu)}^{M} \tilde m_{\nu l\to il}(t, X_{il}) ,   (41)

n_{\mu i\to\mu l}(t+1, F_{\mu i}) = \frac{1}{Z^{\mu i\to\mu l}}\, P_F(F_{\mu i}) \prod_{n(\neq l)}^{P} \tilde n_{\mu n\to\mu i}(t, F_{\mu i}) ,   (42)

\tilde m_{\mu l\to il}(t, X_{il}) = \frac{1}{Z^{\mu l\to il}} \int \prod_{j(\neq i)}^{N} dX_{jl} \prod_{k}^{N} dF_{\mu k}\; P_{out}\Big(y_{\mu l}\Big|\sum_{k}^{N} F_{\mu k} X_{kl}\Big) \prod_{k}^{N} n_{\mu k\to\mu l}(t, F_{\mu k}) \prod_{j(\neq i)}^{N} m_{jl\to\mu l}(t, X_{jl}) ,   (43)

\tilde n_{\mu l\to\mu i}(t, F_{\mu i}) = \frac{1}{Z^{\mu l\to\mu i}} \int \prod_{j}^{N} dX_{jl} \prod_{k(\neq i)}^{N} dF_{\mu k}\; P_{out}\Big(y_{\mu l}\Big|\sum_{k}^{N} F_{\mu k} X_{kl}\Big) \prod_{k(\neq i)}^{N} n_{\mu k\to\mu l}(t, F_{\mu k}) \prod_{j}^{N} m_{jl\to\mu l}(t, X_{jl}) ,   (44)

where $Z^{il\to\mu l}$, $Z^{\mu i\to\mu l}$, $Z^{\mu l\to il}$, $Z^{\mu l\to\mu i}$ are normalization constants ensuring that all messages are probability distributions, $t \in \mathbb{N}$ denotes the iteration time-step, and the notation $\prod_{\nu(\neq\mu)}^{M}$ means a product over all integer values of $\nu$ in $\{1, \ldots, M\}$ except the value $\mu$.

Of course, the factor graph of matrix factorization, shown in Fig. 1, is extremely far from a tree. The above BP equations can, however, still be asymptotically exact if the dependence between the incoming messages is negligible to leading order in $N$. This indeed happens in compressed sensing (where the matrix $F$ is a known matrix, generated randomly with zero mean), as follows from the rigorously proven success of approximate message passing [11, 25].
In the present case of matrix factorization we do not have any rigorous proof yet, but based on our experience from studies of mean-field spin glass systems [12, 13], we conjecture that the fixed points of the above belief propagation equations describe asymptotically exactly (in the same sense as for compressed sensing) the performance of Bayes-optimal inference. Hence the analysis of the fixed points of the above equations leads to an understanding of the information-theoretic limitations of matrix factorization. The associated phase transitions describe possible algorithmic barriers. This analysis is the main goal and result of the present paper.

The above BP iterative equations are written for probability distributions over real-valued variables, and the $(2N-1)$-dimensional integrals on the r.h.s. are mathematically intractable in this form. We now define means and variances of the variable-to-factor messages as

\hat x_{il\to\mu l}(t) = \int dX_{il}\; m_{il\to\mu l}(t, X_{il})\, X_{il} ,   (45)

c_{il\to\mu l}(t) + \hat x^2_{il\to\mu l}(t) = \int dX_{il}\; m_{il\to\mu l}(t, X_{il})\, X^2_{il} ,   (46)

\hat f_{\mu i\to\mu l}(t) = \sqrt{N} \int dF_{\mu i}\; n_{\mu i\to\mu l}(t, F_{\mu i})\, F_{\mu i} ,   (47)

s_{\mu i\to\mu l}(t) + \hat f^2_{\mu i\to\mu l}(t) = N \int dF_{\mu i}\; n_{\mu i\to\mu l}(t, F_{\mu i})\, F^2_{\mu i} .   (48)

Notice that the factor $\sqrt{N}$ in the definition of $\hat f$, and $N$ in the definition of $s$, have been introduced in order to ensure that all the messages $\hat x, c, \hat f, s$ are of order $O(1)$ in the thermodynamic limit. Using this scaling, we shall now show that the BP equations can be simplified in the thermodynamic limit, and that they can actually be written as a closed set of equations involving only the messages $\hat x, c, \hat f, s$. Our general aim is to design an algorithm which, in some region of parameters, asymptotically matches the performance of the exact (computationally intractable) Bayes-optimal inference. Belief propagation provides such an algorithm, but in order to make it computationally efficient, writing it in terms of the messages $\hat x, c, \hat f, s$ is crucial, provided one is careful not to lose any terms in the leading-order asymptotic analysis of the thermodynamic limit.

Let us define the Fourier transform of the output function

\hat P_{out}(y, \xi) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} P_{out}(y|z)\, e^{-i\xi z}\, dz ,   (49)

to rewrite the update equation for the message $\tilde m_{\mu l\to il}(t, X_{il})$ as

\tilde m_{\mu l\to il}(t, X_{il}) = \frac{1}{\sqrt{2\pi}\, Z^{\mu l\to il}} \int d\xi\; \hat P_{out}(y_{\mu l}, \xi) \int dF_{\mu i}\; e^{i\xi F_{\mu i} X_{il}}\, n_{\mu i\to\mu l}(t, F_{\mu i}) \prod_{j(\neq i)}^{N} \left[ \int dX_{jl}\, dF_{\mu j}\; e^{i\xi F_{\mu j} X_{jl}}\, n_{\mu j\to\mu l}(t, F_{\mu j})\, m_{jl\to\mu l}(t, X_{jl}) \right] .   (50)

In order to perform the integral in the square bracket we recall that the elements of the matrix $F$ are $O(1/\sqrt{N})$, and hence we can expand the exponential to second order, use the definitions (45-48), and re-exponentiate the result without losing any leading-order terms in $\tilde m_{\mu l\to il}(t, X_{il})$. The whole square bracket then becomes

\exp\left\{ \frac{i\xi}{\sqrt{N}}\, \hat f_{\mu j\to\mu l}(t)\, \hat x_{jl\to\mu l}(t) - \frac{\xi^2}{2N} \left[ s_{\mu j\to\mu l}(t)\, c_{jl\to\mu l}(t) + s_{\mu j\to\mu l}(t)\, \hat x^2_{jl\to\mu l}(t) + \hat f^2_{\mu j\to\mu l}(t)\, c_{jl\to\mu l}(t) \right] \right\} .   (51)

Next we perform the integral over the variable $\xi$, which is simply a Gaussian integral.
This gives for the message

\tilde m_{\mu l\to il}(t, X_{il}) = \frac{1}{Z^{\mu l\to il}} \int dF_{\mu i}\; n_{\mu i\to\mu l}(t, F_{\mu i})\; \frac{1}{\sqrt{2\pi V^t_{\mu il}}} \int dz\; P_{out}(y_{\mu l}|z)\; e^{-\frac{(z - F_{\mu i} X_{il} - \omega^t_{\mu il})^2}{2 V^t_{\mu il}}} ,   (52)

where we introduced auxiliary variables that are both of order $O(1)$,

V^t_{\mu il} \equiv \frac{1}{N} \sum_{j(\neq i)}^{N} \left[ s_{\mu j\to\mu l}(t)\, c_{jl\to\mu l}(t) + s_{\mu j\to\mu l}(t)\, \hat x^2_{jl\to\mu l}(t) + \hat f^2_{\mu j\to\mu l}(t)\, c_{jl\to\mu l}(t) \right] ,   (53)

\omega^t_{\mu il} \equiv \frac{1}{\sqrt{N}} \sum_{j(\neq i)}^{N} \hat f_{\mu j\to\mu l}(t)\, \hat x_{jl\to\mu l}(t) .   (54)

The last integral to be performed in (52) is the one over the matrix element $F_{\mu i}$. Using again the fact that $F_{\mu i} = O(1/\sqrt{N})$, we expand the exponential in which $F_{\mu i}$ appears to second order and perform the integration to obtain

\tilde m_{\mu l\to il}(t, X_{il}) = \frac{1}{Z^{\mu l\to il}}\; \frac{1}{\sqrt{2\pi V^t_{\mu il}}} \int dz\; P_{out}(y_{\mu l}|z)\; e^{-\frac{(z-\omega^t_{\mu il})^2}{2V^t_{\mu il}}} \Bigg\{ 1 + \frac{z-\omega^t_{\mu il}}{\sqrt{N}\, V^t_{\mu il}}\, \hat f_{\mu i\to\mu l}(t)\, X_{il} - \frac{X^2_{il}}{2N} \left[ \frac{1}{V^t_{\mu il}} - \frac{(z-\omega^t_{\mu il})^2}{(V^t_{\mu il})^2} \right] \left[ \hat f^2_{\mu i\to\mu l}(t) + s_{\mu i\to\mu l}(t) \right] \Bigg\} .   (55)

Following the notation of [66] we now define the output function as

g_{out}(\omega, y, V) \equiv \frac{\int dz\; P_{out}(y|z)\, (z-\omega)\, e^{-\frac{(z-\omega)^2}{2V}}}{V \int dz\; P_{out}(y|z)\, e^{-\frac{(z-\omega)^2}{2V}}} .   (56)

The following useful identity holds for the average of $(z-\omega)^2/V^2$ in the above measure:

\frac{\int dz\; P_{out}(y|z)\, (z-\omega)^2\, e^{-\frac{(z-\omega)^2}{2V}}}{V^2 \int dz\; P_{out}(y|z)\, e^{-\frac{(z-\omega)^2}{2V}}} = \frac{1}{V} + \partial_\omega g_{out}(\omega, y, V) + g^2_{out}(\omega, y, V) .   (57)

With the definition (56), and re-exponentiating the $X_{il}$-dependent terms in (55) while keeping all the leading-order terms, we finally obtain that $\tilde m_{\mu l\to il}(t, X_{il})$ is a Gaussian probability distribution

\tilde m_{\mu l\to il}(t, X_{il}) = \sqrt{\frac{A^t_{\mu l\to il}}{2\pi N}}\; e^{-\frac{X^2_{il}}{2N} A^t_{\mu l\to il} + \frac{B^t_{\mu l\to il} X_{il}}{\sqrt{N}} - \frac{(B^t_{\mu l\to il})^2}{2 A^t_{\mu l\to il}}}   (58)

with

B^t_{\mu l\to il} = g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il})\; \hat f_{\mu i\to\mu l}(t) ,   (59)

A^t_{\mu l\to il} = -\partial_\omega g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il}) \left[ \hat f^2_{\mu i\to\mu l}(t) + s_{\mu i\to\mu l}(t) \right] - g^2_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il})\; s_{\mu i\to\mu l}(t) .   (60)

In a completely analogous way we obtain that the message $\tilde n_{\mu l\to\mu i}(t, F_{\mu i})$ is also a Gaussian distribution

\tilde n_{\mu l\to\mu i}(t, F_{\mu i}) = \sqrt{\frac{S^t_{\mu l\to\mu i}}{2\pi}}\; e^{-\frac{F^2_{\mu i}}{2} S^t_{\mu l\to\mu i} + R^t_{\mu l\to\mu i} F_{\mu i} - \frac{(R^t_{\mu l\to\mu i})^2}{2 S^t_{\mu l\to\mu i}}}   (61)

with

R^t_{\mu l\to\mu i} = g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il})\; \hat x_{il\to\mu l}(t) ,   (62)

S^t_{\mu l\to\mu i} = -\partial_\omega g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il}) \left[ c_{il\to\mu l}(t) + \hat x^2_{il\to\mu l}(t) \right] - g^2_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il})\; c_{il\to\mu l}(t) .   (63)

At this point we follow closely the derivation of AMP in [63] and define the probability distributions

M_X(\Sigma, T, X) = \frac{1}{\hat Z^X(\Sigma, T)}\; P_X(X)\; \frac{1}{\sqrt{2\pi\Sigma}}\; e^{-\frac{(X-T)^2}{2\Sigma}} ,   (64)

M_F(Z, W, F) = \frac{1}{\hat Z^F(Z, W)}\; P_F(F)\; \frac{1}{\sqrt{2\pi Z}}\; e^{-\frac{(\sqrt{N} F - W)^2}{2Z}} ,   (65)

where $\hat Z^X(\Sigma, T)$ and $\hat Z^F(Z, W)$ are normalizations. We define the mean and variance of $M_X$ and $M_F$ as

f_X(\Sigma, T) \equiv \int dX\; X\, M_X(\Sigma, T, X) ,\qquad f_c(\Sigma, T) \equiv \int dX\; X^2\, M_X(\Sigma, T, X) - f^2_X(\Sigma, T) ,   (66)

f_F(Z, W) \equiv \sqrt{N} \int dF\; F\, M_F(Z, W, F) ,\qquad f_s(Z, W) \equiv N \int dF\; F^2\, M_F(Z, W, F) - f^2_F(Z, W) .   (67)

These are the input auxiliary functions of [66]. It is instrumental to notice that

f_X(\Sigma, T) = T + \Sigma \frac{d}{dT} \log \hat Z^X(\Sigma, T) ,\qquad f_c(\Sigma, T) = \Sigma \frac{d}{dT} f_X(\Sigma, T) ,   (68)

and analogously for $f_F$ and $f_s$,

f_F(Z, W) = W + Z \frac{d}{dW} \log \hat Z^F(Z, W) ,\qquad f_s(Z, W) = Z \frac{d}{dW} f_F(Z, W) .   (69)
With these definitions we obtain from (41,42), using (58,61), that

\hat x_{il\to\mu l}(t+1) = f_X\left( \frac{N}{\sum_{\nu(\neq\mu)} A^t_{\nu l\to il}},\; \frac{\sqrt{N} \sum_{\nu(\neq\mu)} B^t_{\nu l\to il}}{\sum_{\nu(\neq\mu)} A^t_{\nu l\to il}} \right) ,   (70)

c_{il\to\mu l}(t+1) = f_c\left( \frac{N}{\sum_{\nu(\neq\mu)} A^t_{\nu l\to il}},\; \frac{\sqrt{N} \sum_{\nu(\neq\mu)} B^t_{\nu l\to il}}{\sum_{\nu(\neq\mu)} A^t_{\nu l\to il}} \right) ,   (71)

\hat f_{\mu i\to\mu l}(t+1) = f_F\left( \frac{N}{\sum_{n(\neq l)} S^t_{\mu n\to\mu i}},\; \frac{\sqrt{N} \sum_{n(\neq l)} R^t_{\mu n\to\mu i}}{\sum_{n(\neq l)} S^t_{\mu n\to\mu i}} \right) ,   (72)

s_{\mu i\to\mu l}(t+1) = f_s\left( \frac{N}{\sum_{n(\neq l)} S^t_{\mu n\to\mu i}},\; \frac{\sqrt{N} \sum_{n(\neq l)} R^t_{\mu n\to\mu i}}{\sum_{n(\neq l)} S^t_{\mu n\to\mu i}} \right) .   (73)

It is clear from the above expressions that all the messages $\hat x, c, \hat f, s$ and $A, B, R, S$ scale as $O(1)$ in the thermodynamic limit. For instance, as the $A$ are positive and $O(1)$, the quantity $N/\sum_{\nu(\neq\mu)} A^t_{\nu l\to il}$ is $O(1)$. On the other hand, the message $\hat f_{\mu i\to\mu l}(t)$ is an estimate of $\sqrt{N} F_{\mu i}$; this estimate is $O(1)$, but a sum like $(1/\sqrt{N}) \sum_{\nu(\neq\mu)} \hat f_{\nu i\to\nu l}(t)$ is an estimate of $\sum_\nu F_{\nu i}$. As $P_F$ has mean and variance of order $O(1/N)$, this sum is actually $O(1)$. The same argument suggests that $(1/\sqrt{N}) \sum_{\nu(\neq\mu)} B^t_{\nu l\to il}$ is $O(1)$.

Recalling eqs. (59,60) and (62,63), we have thus derived that in the thermodynamic limit the general belief propagation equations simplify into a closed set of equations for the messages $\hat x, c, \hat f, s$, i.e. the means and variances defined in (45-48). To iterate this message passing we initialize as

\hat x_{il\to\mu l}(0) = \int dX\; X\, P^{il}_X(X) ,   (74)

c_{il\to\mu l}(0) = \int dX\; X^2\, P^{il}_X(X) - \hat x^2_{il\to\mu l}(0) ,   (75)

\hat f_{\mu i\to\mu l}(0) = \sqrt{N} \int dF\; F\, P^{\mu i}_F(F) ,   (76)

s_{\mu i\to\mu l}(0) = N \int dF\; F^2\, P^{\mu i}_F(F) - \hat f^2_{\mu i\to\mu l}(0) ,   (77)

then we compute $V^t_{\mu il}$ and $\omega^t_{\mu il}$ from (53-54), then $B^t, A^t, R^t$ and $S^t$ according to (59-60) and (62-63) using the definition of $g_{out}$ (56). Finally we update the messages according to (70-73) and iterate. Notice, however, that we work with $O(N^3)$ messages, each of which takes $N$ steps to update; the computational complexity of this algorithm is hence relatively high. In the next section we derive a simplification that reduces this complexity.

From the fixed point of the belief propagation equations one can also compute the approximate marginal probabilities of the posterior, defined as

m_{il}(t+1, X_{il}) = \frac{1}{Z^{il}}\, P_X(X_{il}) \prod_{\nu}^{M} \tilde m_{\nu l\to il}(t, X_{il}) ,   (78)

n_{\mu i}(t+1, F_{\mu i}) = \frac{1}{Z^{\mu i}}\, P_F(F_{\mu i}) \prod_{n}^{P} \tilde n_{\mu n\to\mu i}(t, F_{\mu i}) .   (79)

One again defines the means and variances of these two marginals, $\hat x_{il}(t+1)$, $c_{il}(t+1)$, $\hat f_{\mu i}(t+1)$ and $s_{\mu i}(t+1)$, analogously to (45-48):

\hat x_{il}(t+1) = \int dX_{il}\; m_{il}(t+1, X_{il})\, X_{il} ,   (80)

c_{il}(t+1) + \hat x^2_{il}(t+1) = \int dX_{il}\; m_{il}(t+1, X_{il})\, X^2_{il} ,   (81)

\hat f_{\mu i}(t+1) = \sqrt{N} \int dF_{\mu i}\; n_{\mu i}(t+1, F_{\mu i})\, F_{\mu i} ,   (82)

s_{\mu i}(t+1) + \hat f^2_{\mu i}(t+1) = N \int dF_{\mu i}\; n_{\mu i}(t+1, F_{\mu i})\, F^2_{\mu i} .   (83)

These quantities are then expressed as

\hat x_{il}(t+1) = f_X\left( \frac{N}{\sum_\nu A^t_{\nu l\to il}},\; \frac{\sqrt{N} \sum_\nu B^t_{\nu l\to il}}{\sum_\nu A^t_{\nu l\to il}} \right) ,   (84)

c_{il}(t+1) = f_c\left( \frac{N}{\sum_\nu A^t_{\nu l\to il}},\; \frac{\sqrt{N} \sum_\nu B^t_{\nu l\to il}}{\sum_\nu A^t_{\nu l\to il}} \right) ,   (85)

\hat f_{\mu i}(t+1) = f_F\left( \frac{N}{\sum_n S^t_{\mu n\to\mu i}},\; \frac{\sqrt{N} \sum_n R^t_{\mu n\to\mu i}}{\sum_n S^t_{\mu n\to\mu i}} \right) ,   (86)

s_{\mu i}(t+1) = f_s\left( \frac{N}{\sum_n S^t_{\mu n\to\mu i}},\; \frac{\sqrt{N} \sum_n R^t_{\mu n\to\mu i}}{\sum_n S^t_{\mu n\to\mu i}} \right) .   (87)
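To make the roles of the output function $g_{out}$ (56) and of the input functions $f_X, f_c$ (64),(66) concrete, the following is a minimal sketch for one standard choice: the AWGN output channel, for which the Gaussian integrals in (56) are explicit (cf. eq. (154) below), and a Gauss-Bernoulli prior on $X$. The prior $P_X = \rho\, \mathcal N(0,1) + (1-\rho)\,\delta_0$ and all function names are our own illustrative assumptions, not choices fixed by the paper.

```python
import numpy as np

def g_out_awgn(omega, y, V, Delta):
    """Output function (56) for the AWGN channel P_out(y|z) = N(y; z, Delta);
    the Gaussian integrals are explicit and give (y - omega)/(Delta + V)."""
    return (y - omega) / (Delta + V)

def dg_out_awgn(omega, y, V, Delta):
    """Derivative d g_out / d omega for the AWGN channel; independent of omega."""
    return -1.0 / (Delta + V)

def f_X_gauss_bernoulli(Sigma, T, rho):
    """Input functions (66): posterior mean f_X and variance f_c of a scalar X
    with prior rho*N(0,1) + (1-rho)*delta_0, observed through a Gaussian
    channel of mean T and variance Sigma (cf. the measure M_X of eq. (64))."""
    # evidences of the Gaussian 'slab' and of the spike at zero
    slab = rho * np.exp(-T**2 / (2 * (1 + Sigma))) / np.sqrt(1 + Sigma)
    spike = (1 - rho) * np.exp(-T**2 / (2 * Sigma)) / np.sqrt(Sigma)
    r = slab / (slab + spike)          # posterior weight of the slab
    mean_slab = T / (1 + Sigma)        # slab posterior mean
    var_slab = Sigma / (1 + Sigma)     # slab posterior variance
    f_X = r * mean_slab
    f_c = r * (var_slab + mean_slab**2) - f_X**2
    return f_X, f_c
```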
B. GAMP for matrix factorization

The message-passing form of the AMP algorithm for matrix factorization derived in the previous section uses $2NMP$ messages, one between each variable component $(il)$ and $(\mu i)$ and each measurement $(\mu l)$, at each iteration. In fact, exploiting again the simplifications that take place in the thermodynamic limit, always within the assumption that the elements of the matrix $F$ scale as $O(1/\sqrt{N})$, it is possible to rewrite and close the BP equations in terms of only $2N(P+M)$ messages. In statistical physics terms, the resulting equations correspond to the Thouless-Anderson-Palmer (TAP) equations [68] used in the study of spin glasses. In the thermodynamic limit, these are asymptotically equivalent to the BP equations. Going from BP to TAP is, in the compressed sensing literature, the step from the rBP [69] to the AMP [11] algorithm. Let us now show how to take this step for the present problem of matrix factorization.

In the thermodynamic limit, it is clear from (70-73) that the messages $\hat x_{il\to\mu l}, c_{il\to\mu l}$ and $\hat f_{\mu i\to\mu l}, s_{\mu i\to\mu l}$ are nearly independent of $\mu l$. For instance, in the equation giving $\hat x_{il\to\mu l}$, the only dependence on $\mu$ comes from the fact that the sum over $\nu$ avoids the value $\nu = \mu$. This is one term out of $M$, and one might therefore expect it to be negligible. However, one must be careful: when these small terms are summed over, their sum may contribute at the leading order in $N$. Such terms are called, in spin-glass theory, the "Onsager reaction terms". In the following we derive these Onsager terms. Let us define the following variables, all of order $O(1)$, on which we will close the equations:

T^t_{il} = \frac{\sqrt{N} \sum_\nu B^t_{\nu l\to il}}{\sum_\nu A^t_{\nu l\to il}} ,\qquad \Sigma^t_{il} = \frac{N}{\sum_\nu A^t_{\nu l\to il}} ,   (88)

W^t_{\mu i} = \frac{\sqrt{N} \sum_n R^t_{\mu n\to\mu i}}{\sum_n S^t_{\mu n\to\mu i}} ,\qquad Z^t_{\mu i} = \frac{N}{\sum_n S^t_{\mu n\to\mu i}} ,   (89)

V^t_{\mu l} = \frac{1}{N} \sum_j \left[ c_{jl\to\mu l}(t)\, s_{\mu j\to\mu l}(t) + c_{jl\to\mu l}(t)\, \hat f^2_{\mu j\to\mu l}(t) + \hat x^2_{jl\to\mu l}(t)\, s_{\mu j\to\mu l}(t) \right] ,   (90)

\omega^t_{\mu l} = \frac{1}{\sqrt{N}} \sum_j \hat x_{jl\to\mu l}(t)\, \hat f_{\mu j\to\mu l}(t) .   (91)

To keep track of all the Onsager terms that influence the leading order of the final equations, we notice that

\hat x_{il\to\mu l}(t+1) = f_X\left( \frac{N}{\sum_\nu A^t_{\nu l\to il} - A^t_{\mu l\to il}},\; \frac{\sqrt{N}\sum_\nu B^t_{\nu l\to il} - \sqrt{N} B^t_{\mu l\to il}}{\sum_\nu A^t_{\nu l\to il} - A^t_{\mu l\to il}} \right)
= \hat x_{il}(t+1) - \frac{1}{\sqrt{N}}\, B^t_{\mu l\to il}\, \Sigma^t_{il}\, \frac{\partial f_X}{\partial T}\big( \Sigma^t_{il}, T^t_{il} \big) + O(1/N)
= \hat x_{il}(t+1) - \frac{1}{\sqrt{N}}\, g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, \hat f_{\mu i}(t)\, c_{il}(t+1) + O(1/N) .   (92)
Similarly,

\hat f_{\mu i\to\mu l}(t+1) = \hat f_{\mu i}(t+1) - \frac{1}{\sqrt{N}}\, g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, \hat x_{il}(t)\, s_{\mu i}(t+1) + O(1/N) ,   (93)

c_{il\to\mu l}(t+1) = c_{il}(t+1) + O(1/\sqrt{N}) ,\qquad s_{\mu i\to\mu l}(t+1) = s_{\mu i}(t+1) + O(1/\sqrt{N}) ,   (94)

g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il}) = g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) - \frac{1}{\sqrt{N}}\, \hat f_{\mu i}(t)\, \hat x_{il}(t)\, \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) + O(1/N) .   (95)

From these expansions we obtain the GAMP algorithm for matrix factorization:

V^t_{\mu l} = \frac{1}{N} \sum_j \left[ c_{jl}(t)\, s_{\mu j}(t) + c_{jl}(t)\, \hat f^2_{\mu j}(t) + \hat x^2_{jl}(t)\, s_{\mu j}(t) \right] ,   (96)

\omega^t_{\mu l} = \frac{1}{\sqrt{N}} \sum_j \hat x_{jl}(t)\, \hat f_{\mu j}(t) - g_{out}(\omega^{t-1}_{\mu l}, y_{\mu l}, V^{t-1}_{\mu l})\; \frac{1}{N} \sum_j \left[ \hat f_{\mu j}(t) \hat f_{\mu j}(t-1)\, c_{jl}(t) + \hat x_{jl}(t) \hat x_{jl}(t-1)\, s_{\mu j}(t) \right] ,   (97)

(\Sigma^t_{il})^{-1} = \frac{1}{N} \sum_\mu \left\{ -\partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) \left[ \hat f^2_{\mu i}(t) + s_{\mu i}(t) \right] - g^2_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, s_{\mu i}(t) \right\} ,   (98)

T^t_{il} = \Sigma^t_{il} \Big\{ \frac{1}{\sqrt{N}} \sum_\mu g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, \hat f_{\mu i}(t) - \hat x_{il}(t)\, \frac{1}{N} \sum_\mu \hat f^2_{\mu i}(t)\, \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) - \hat x_{il}(t-1)\, \frac{1}{N} \sum_\mu s_{\mu i}(t)\, g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, g_{out}(\omega^{t-1}_{\mu l}, y_{\mu l}, V^{t-1}_{\mu l}) \Big\} ,   (99)

(Z^t_{\mu i})^{-1} = \frac{1}{N} \sum_l \left\{ -\partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) \left[ \hat x^2_{il}(t) + c_{il}(t) \right] - g^2_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, c_{il}(t) \right\} ,   (100)

W^t_{\mu i} = Z^t_{\mu i} \Big\{ \frac{1}{\sqrt{N}} \sum_l g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, \hat x_{il}(t) - \hat f_{\mu i}(t)\, \frac{1}{N} \sum_l \hat x^2_{il}(t)\, \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) - \hat f_{\mu i}(t-1)\, \frac{1}{N} \sum_l c_{il}(t)\, g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, g_{out}(\omega^{t-1}_{\mu l}, y_{\mu l}, V^{t-1}_{\mu l}) \Big\} ,   (101)

\hat x_{il}(t+1) = f_X(\Sigma^t_{il}, T^t_{il}) ,\qquad c_{il}(t+1) = f_c(\Sigma^t_{il}, T^t_{il}) ,   (102)

\hat f_{\mu i}(t+1) = f_F(Z^t_{\mu i}, W^t_{\mu i}) ,\qquad s_{\mu i}(t+1) = f_s(Z^t_{\mu i}, W^t_{\mu i}) .   (103)

The initial conditions for the iterations are

\hat x_{il}(t=0) = \int dX\; X\, P^{il}_X(X) ,\qquad c_{il}(t=0) = \int dX\; X^2\, P^{il}_X(X) - \hat x^2_{il}(t=0) ,   (104)

\hat f_{\mu i}(t=0) = \sqrt{N} \int dF\; F\, P^{\mu i}_F(F) ,\qquad s_{\mu i}(t=0) = N \int dF\; F^2\, P^{\mu i}_F(F) - \hat f^2_{\mu i}(t=0) .   (105)

In order to compute $\omega^{t=0}$, $T^{t=0}$ and $W^{t=0}$ one uses the above equations as if $\hat f_{\mu i}(-1) = 0$ and $\hat x_{il}(-1) = 0$.

The interpretation of the terms in the GAMP for matrix factorization is the following: $\omega^t_{\mu l}$ is the mean of the current estimate of $z_{\mu l} = \sum_i F_{\mu i} X_{il}$ and $V^t_{\mu l}$ is the variance of that estimate; $T_{il}$ and $\Sigma_{il}$ are the mean and variance of the current estimate of $X_{il}$ without taking into account the prior information on $X_{il}$; the parameters $\hat x_{il}$ and $c_{il}$ are then the mean and variance of the current estimate of $X_{il}$ with the prior information taken into account. Analogously, $W_{\mu i}$ and $Z_{\mu i}$ are the mean and variance of the estimate of $F_{\mu i}$ before the prior is taken into account, and $\hat f_{\mu i}$ and $s_{\mu i}$ are the mean and variance once the prior information has been accounted for.

A reader familiar with the AMP and GAMP algorithms for compressed sensing [11, 63, 66] will recognize that the above equations indeed reduce to the compressed sensing GAMP of [66] when one sets $\hat f_{\mu i}(t) = \sqrt{N} F_{\mu i}$ and $s_{\mu i} = 0$. The above algorithm is closely related to the BiG-AMP of [17]. There are, however, three differences between our algorithm and BiG-AMP:

1. We find an $s_{\mu i}$-dependent term in the expression (98) for $\Sigma_{il}$ which is not present in BiG-AMP.
2. Similarly, we find a $c_{il}$-dependent term in the expression (100) for $Z_{\mu i}$ which is not present in BiG-AMP.
3. The time indices are slightly different.
Considering the last point: the fact of having different time indices during the iterations does not influence the fixed points, in which we are mainly interested. However, the use of correct time indices is crucial for the assumptions leading to the density evolution of this algorithm (which we derive in section V A) to hold. As for the terms missing in the BiG-AMP expressions of [17] for $\Sigma_{il}$ and $Z_{\mu i}$, they have a more serious effect, as they can change the fixed point. To the best of our understanding, these terms have been neglected in [17], whereas they should be kept. It seems to us that some of the leading-order terms are missing in eqs. (15-16) of [17].

C. Simplifications due to the Nishimori identities

In the previous section we derived the GAMP algorithm for matrix factorization, eqs. (96-103). This algorithm can in principle be used for any pair of matrices $F$ and $X$. If iterated in the form derived in Section III B it often shows convergence problems. There are ways to slightly improve the convergence of the above algorithm in a wide range of applications by a number of empirical methods suggested in [18].

We will focus on the particular case where the matrices $X$ and $F$ were indeed generated from the separable probability distributions $P_F(F)$ and $P_X(X)$ described in eqs. (3-4), and the output $y$ was generated by the assumed model:

P_X(X) = P_{X_0}(X) ,\qquad P_F(F) = P_{F_0}(F) ,\qquad P^0_{out}(y|z) = P_{out}(y|z) .   (106)

In this case belief propagation is a proxy for the optimal Bayes inference algorithm, and a number of properties described in section I B hold. In analogy with fundamental works on spin glasses [54, 55] we call these properties the Nishimori identities. The setting where conditions (106) hold will be called the Bayes-optimal setting. The Nishimori identities hold, and the system is on the Nishimori line, when one uses the correct priors on $F$ and $X$ and the right output channel in the reconstruction process, i.e. when conditions (106) hold. In the limit $N\to\infty$, and thanks to self-averaging, we then have on the Nishimori line, at every iteration step $t$,

\frac{1}{MP} \sum_{\mu l} \frac{\int dz\; P_{out}(y_{\mu l}|z)\, (z-\omega^t_{\mu l})^2\, e^{-\frac{(z-\omega^t_{\mu l})^2}{2V^t_{\mu l}}}}{\int dz\; P_{out}(y_{\mu l}|z)\, e^{-\frac{(z-\omega^t_{\mu l})^2}{2V^t_{\mu l}}}} = \frac{1}{MP} \sum_{\mu l} V^t_{\mu l} .   (107)

The meaning of this identity is that the mean squared error of the current estimate of $Z = FX$, computed from the current estimates of the variances $V^t_{\mu l}$, equals the mean squared difference between the true $Z$ and its current estimate $\omega^t_{\mu l}$. Using the above expression and eq. (57) we obtain the identity

-\frac{1}{MP} \sum_{\mu l} \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) = \frac{1}{MP} \sum_{\mu l} g^2_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) .   (108)

The above identity also holds if the sum runs only over $\mu$ or only over $l$. Finally, using the conditional independence between the incoming messages assumed in BP, we also get

-\frac{1}{M} \sum_\mu \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, s_{\mu i}(t) = \frac{1}{M} \sum_\mu g^2_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, s_{\mu i}(t) .   (109)

Under this condition we can simplify considerably the expressions for $\Sigma^t_{il}$ and $Z^t_{\mu i}$ and get

(\Sigma^t_{il})^{-1} = -\frac{1}{N} \sum_\mu \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, \hat f^2_{\mu i}(t) ,   (110)

(Z^t_{\mu i})^{-1} = -\frac{1}{N} \sum_l \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l})\, \hat x^2_{il}(t) .   (111)

Note that the r.h.s. of the two above equations is always strictly positive, which is reassuring given that these expressions play the role of the variance of a probability distribution.
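In matrix form, the simplified updates (110)-(111) amount to two rescaled matrix products, as the following minimal numpy sketch illustrates. The array shape conventions ($\hat f$ of shape $M\times N$ estimating $\sqrt{N}F$, $\hat x$ of shape $N\times P$, and the array of $\partial_\omega g_{out}$ of shape $M\times P$) are our own assumptions.

```python
import numpy as np

def nishimori_variances(dg_out, xhat, fhat):
    """Hedged sketch of the simplified variance updates (110)-(111).
    dg_out: M x P array of d_omega g_out(omega_mul, y_mul, V_mul),
    fhat:   M x N array of current estimates of sqrt(N) F,
    xhat:   N x P array of current estimates of X."""
    N = xhat.shape[0]
    Sigma_inv = (fhat**2).T @ (-dg_out) / N   # (Sigma_il)^{-1}, shape N x P
    Z_inv = (-dg_out) @ (xhat**2).T / N       # (Z_mui)^{-1},   shape M x N
    return Sigma_inv, Z_inv
```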
Note also that the BiG-AMP algorithm of [17] uses expressions (110,111) instead of (98,100), however without mentioning the reason.

D. Simplification for matrices with random entries

Relying on the definitions of the order parameters (161-162), and using part of the results of section V A, we can write a version of the GAMP for matrix factorization that is, to leading order in $N$, equivalent to (96-103) for matrices $X$ and $F$ having iid elements. Let us define the analogs of $\hat\chi^t$ and $\hat q^t$ (168-169) as empirical means of the corresponding functions,

\tilde\chi^t \equiv -\frac{1}{MP} \sum_{\mu l} \partial_\omega g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) ,   (112)

\tilde q^t \equiv \frac{1}{MP} \sum_{\mu l} g^2_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t_{\mu l}) .   (113)

Anticipating the reasoning that we shall use later in section V A, we realize that, to leading order, the quantities $V^t_{\mu l}$, $\Sigma^t_{il}$ and $Z^t_{\mu i}$ do not depend on their indices $i, l, \mu$. We have

V^t = Q^t_F Q^t_X - q^t_F q^t_X ,   (114)

(\Sigma^t)^{-1} = \alpha Q^t_F \tilde\chi^t - \alpha (Q^t_F - q^t_F)\, \tilde q^t ,   (115)

(Z^t)^{-1} = \pi Q^t_X \tilde\chi^t - \pi (Q^t_X - q^t_X)\, \tilde q^t ,   (116)

where we define

q^t_X \equiv \frac{1}{NP} \sum_{jl} \hat x^2_{jl}(t) ,\qquad q^t_F \equiv \frac{1}{NM} \sum_{\mu i} \hat f^2_{\mu i}(t) ,   (117)

Q^t_X \equiv q^t_X + \frac{1}{NP} \sum_{jl} c_{jl}(t) ,\qquad Q^t_F \equiv q^t_F + \frac{1}{NM} \sum_{\mu i} s_{\mu i}(t) .   (118)

These three equations can hence replace (96), (98) and (100) in GAMP. Furthermore, if we focus on the fixed point, and hence disregard some of the time indices, eqs. (97), (99) and (101) can be simplified into

\omega^t_{\mu l} = \frac{1}{\sqrt{N}} \sum_j \hat x_{jl}(t)\, \hat f_{\mu j}(t) - g_{out}(\omega^{t-1}_{\mu l}, y_{\mu l}, V^{t-1})\, \big( Q^t_X q^t_F + Q^t_F q^t_X - 2 q^t_F q^t_X \big) ,   (119)

T^t_{il} = \Sigma^t \left\{ \frac{1}{\sqrt{N}} \sum_\mu g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t)\, \hat f_{\mu i}(t) + \alpha\, \hat x_{il}(t)\, \tilde\chi^t q^t_F - \alpha\, \hat x_{il}(t)\, (Q^t_F - q^t_F)\, \tilde q^t \right\} ,   (120)

W^t_{\mu i} = Z^t \left\{ \frac{1}{\sqrt{N}} \sum_l g_{out}(\omega^t_{\mu l}, y_{\mu l}, V^t)\, \hat x_{il}(t) + \pi\, \hat f_{\mu i}(t)\, \tilde\chi^t q^t_X - \pi\, \hat f_{\mu i}(t)\, (Q^t_X - q^t_X)\, \tilde q^t \right\} .   (121)

Within the Bayes-optimal setting (106), we can use the Nishimori identity (108) to show that $\tilde\chi^t = \tilde q^t$. Consequently we can use only one of these parameters, computed either from (112) or from (113). The equations then further simplify, with strictly positive expressions for the variance parameters:

(\Sigma^t)^{-1} = \alpha\, q^t_F\, \tilde q^t ,\qquad (Z^t)^{-1} = \pi\, q^t_X\, \tilde q^t .   (122)

The set of eqs. (102-103), (114), (119-122) was presented for the simple output channel with white noise in [15]. We detail this procedure for a generic output channel in Alg. 1; a Python-style sketch of the resulting iteration is given at the end of this section.

Algorithm 1: Bayes-optimal Approximate Message Passing for Matrix Factorization
Input: $y$
Initialize: $\hat x, \hat f, c, s$, Iter = 1
Initialize: $\{\omega_{\mu l}, V\}$
repeat
  Compute the averages, eqs. (117) and (118).
  AMP update of $\{\omega_{\mu l}, V\}$: eq. (119) and eq. (114).
  AMP update of $\Sigma$ and $Z$: eq. (122).
  AMP update of $\{T_{il}\}$ and $\{W_{\mu i}\}$: eq. (120) and eq. (121).
  AMP update of the estimated quantities $\hat x, \hat f, c, s$: eq. (102) and eq. (103).
  Iter ← Iter + 1
until convergence of $\hat x, \hat f$

We want to stress here that all these simplifications take place for any output channel $P_{out}(y|z)$. In contrast with the "uniform variance" approximation of [17], the above result does not mean that the variances $c_{il}$ and $s_{\mu i}$ are, to leading order, independent of their indices. On the contrary, these variances depend on their indices even in the simplest case of GAMP where the matrix $F_{\mu i}$ is known, i.e. for the compressed sensing problem.
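For concreteness, here is a hedged sketch of the iteration of Alg. 1, written in numpy for the AWGN channel with a Gauss-Bernoulli prior on $X$ and a Gaussian prior on the elements of $\sqrt{N}F$; it reuses the input function f_X_gauss_bernoulli sketched after Section III A. The parameter names, the initialization, and the absence of damping are our own simplifying assumptions, not prescriptions of the paper; in practice damping may be needed, cf. the adaptive damping discussion in Section IV B.

```python
import numpy as np

def amp_matrix_factorization(Y, N, rho, Delta, n_iter=200):
    """Hedged sketch of Algorithm 1: Bayes-optimal AMP in the random-entry
    simplification, eqs. (102)-(103), (113)-(114), (117)-(122), specialized
    to an AWGN channel, a Gauss-Bernoulli prior on X and a standard Gaussian
    prior on sqrt(N) F."""
    M, P = Y.shape
    alpha, pi_ = M / N, P / N
    rng = np.random.default_rng(1)
    # initialization: small random means, prior variances (cf. eqs. (104)-(105))
    xhat = 0.1 * rng.standard_normal((N, P)); c = np.full((N, P), rho)
    fhat = 0.1 * rng.standard_normal((M, N)); s = np.ones((M, N))
    g = np.zeros_like(Y)
    for t in range(n_iter):
        qX, qF = np.mean(xhat**2), np.mean(fhat**2)              # eq. (117)
        QX, QF = qX + np.mean(c), qF + np.mean(s)                # eq. (118)
        V = QF * QX - qF * qX                                    # eq. (114)
        omega = fhat @ xhat / np.sqrt(N) \
                - g * (QX*qF + QF*qX - 2*qF*qX)                  # eq. (119), Onsager term
        g = (Y - omega) / (Delta + V)                            # g_out for AWGN
        qt = np.mean(g**2)                                       # eq. (113)
        Sigma, Z = 1.0/(alpha*qF*qt), 1.0/(pi_*qX*qt)            # eq. (122)
        T = Sigma * (fhat.T @ g / np.sqrt(N)
                     + alpha * xhat * qt * (2*qF - QF))          # eq. (120)
        W = Z * (g @ xhat.T / np.sqrt(N)
                 + pi_ * fhat * qt * (2*qX - QX))                # eq. (121)
        xhat, c = f_X_gauss_bernoulli(Sigma, T, rho)             # eq. (102)
        fhat = W / (1 + Z)                                       # eq. (103), Gaussian prior
        s = np.full((M, N), Z / (1 + Z))
    return xhat, fhat
```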
IV. THE BETHE FREE ENTROPY

The fixed point of the belief propagation equations, or of its AMP version, can be used to estimate the posterior likelihood, i.e. the normalization $Z(Y)$ of the posterior probability (5). The logarithm of this normalization is called the Bethe free entropy in statistical physics [70]. Minus the logarithm of the normalization is called the free energy; in physics there is usually a temperature associated with the free energy. The Bethe free entropy is computed from the fixed point of the BP equations (41-44) as [13, 70]

\Phi_{Bethe} = \sum_{\mu l} \log Z^{\mu l} + \sum_{\mu i} \log Z^{\mu i} + \sum_{il} \log Z^{il} - \sum_{\mu il} \log Z^{il,\mu l} - \sum_{\mu il} \log Z^{\mu i,\mu l} ,   (123)

where the five contributions are

Z^{\mu i} = \int dF_{\mu i}\; P_F(F_{\mu i}) \prod_{l=1}^{P} \tilde n_{\mu l\to\mu i}(F_{\mu i}) ,   (124)

Z^{il} = \int dX_{il}\; P_X(X_{il}) \prod_{\mu=1}^{M} \tilde m_{\mu l\to il}(X_{il}) ,   (125)

Z^{\mu l} = \int \prod_{i=1}^{N} dX_{il} \prod_{k=1}^{N} dF_{\mu k}\; P_{out}\Big(y_{\mu l}\Big|\sum_{k=1}^{N} F_{\mu k} X_{kl}\Big) \prod_{i=1}^{N} m_{il\to\mu l}(X_{il}) \prod_{k=1}^{N} n_{\mu k\to\mu l}(F_{\mu k}) ,   (126)

Z^{il,\mu l} = \int dX_{il}\; m_{il\to\mu l}(X_{il})\, \tilde m_{\mu l\to il}(X_{il}) ,   (127)

Z^{\mu i,\mu l} = \int dF_{\mu i}\; n_{\mu i\to\mu l}(F_{\mu i})\, \tilde n_{\mu l\to\mu i}(F_{\mu i}) .   (128)

The derivatives of this expression for $\Phi_{Bethe}$ with respect to the messages give back the full BP equations (41-44). In this general form, the computation of $\Phi_{Bethe}$ for the present problem is not of practical interest, and it is thus very useful to carry out the same steps as in Section III A in order to obtain a more mathematically tractable form of $\Phi_{Bethe}$, asymptotically equivalent to (123) in the thermodynamic limit, using the set of AMP message-passing equations (53-54), (59-60), (62-63), and (70-73). The result is

\Phi_{Bethe} = \sum_{\mu l} \log Z^{\mu l} + \sum_{\mu i} \log X^{\mu i} + \sum_{il} \log X^{il} + \sum_{\mu il} \log \frac{X^{\mu i\to\mu l}}{X^{\mu i}} + \sum_{\mu il} \log \frac{X^{il\to\mu l}}{X^{il}} ,   (129)

with

Z^{\mu l} = \int dz\; \frac{e^{-\frac{(\omega_{\mu l}-z)^2}{2V_{\mu l}}}}{\sqrt{2\pi V_{\mu l}}}\; P_{out}(y_{\mu l}|z) ,   (130)

X^{\mu i} = \int dF_{\mu i}\; P_F(F_{\mu i})\; e^{-\frac{N F^2_{\mu i}}{2 Z_{\mu i}} + \frac{\sqrt{N} F_{\mu i} W_{\mu i}}{Z_{\mu i}}} ,   (131)

X^{il} = \int dX_{il}\; P_X(X_{il})\; e^{-\frac{X^2_{il}}{2\Sigma_{il}} + \frac{X_{il} T_{il}}{\Sigma_{il}}} ,   (132)

X^{\mu i\to\mu l} = \int dF_{\mu i}\; P_F(F_{\mu i})\; e^{-\frac{F^2_{\mu i}}{2} \sum_{n\neq l} S_{\mu n\to\mu i} + F_{\mu i} \sum_{n\neq l} R_{\mu n\to\mu i}} ,   (133)

X^{il\to\mu l} = \int dX_{il}\; P_X(X_{il})\; e^{-\frac{X^2_{il}}{2N} \sum_{\nu\neq\mu} A_{\nu l\to il} + \frac{X_{il}}{\sqrt{N}} \sum_{\nu\neq\mu} B_{\nu l\to il}} .   (134)

Finally, we may want to express the free entropy using the fixed point of the GAMP eqs. (96-103). In order to do this we need to rewrite the last two terms in (129). Using an expansion in $1/N$ and keeping the leading-order terms we get

\sum_l \log \frac{X^{\mu i\to\mu l}}{X^{\mu i}} = -\frac{W_{\mu i}}{Z_{\mu i}} \hat f_{\mu i} + \frac{1}{2 Z_{\mu i}} \big( s_{\mu i} + \hat f^2_{\mu i} \big) + \frac{1}{2N}\, s_{\mu i} \sum_{l=1}^{P} g^2_{out}(\omega_{\mu l}, y_{\mu l}, V_{\mu l})\, \hat x^2_{il} ,   (135)

\sum_\mu \log \frac{X^{il\to\mu l}}{X^{il}} = -\frac{T_{il}}{\Sigma_{il}} \hat x_{il} + \frac{1}{2\Sigma_{il}} \big( c_{il} + \hat x^2_{il} \big) + \frac{1}{2N}\, c_{il} \sum_{\mu=1}^{M} g^2_{out}(\omega_{\mu l}, y_{\mu l}, V_{\mu l})\, \hat f^2_{\mu i} .   (136)

We recall that the above expressions give the posterior likelihood given a fixed point of the GAMP equations. To write the final formula in a more easily interpretable form, we use the probability distributions $M_F$ and $M_X$ defined in (64-65), with normalizations

\hat X^{\mu i} = \hat Z^F(Z_{\mu i}, W_{\mu i})\, \sqrt{2\pi Z_{\mu i}} = \int dF_{\mu i}\; P_F(F_{\mu i})\; e^{-\frac{(\sqrt{N} F_{\mu i} - W_{\mu i})^2}{2 Z_{\mu i}}} ,   (137)

\hat X^{il} = \hat Z^X(\Sigma_{il}, T_{il})\, \sqrt{2\pi\Sigma_{il}} = \int dX_{il}\; P_X(X_{il})\; e^{-\frac{(X_{il} - T_{il})^2}{2\Sigma_{il}}} .   (138)

Putting all the pieces together we find

\Phi_{Bethe} = \sum_{il} \left[ \log \hat X^{il}(T_{il}, \Sigma_{il}) + \frac{c_{il} + (\hat x_{il} - T_{il})^2}{2\Sigma_{il}} \right] + \sum_{\mu i} \left[ \log \hat X^{\mu i}(W_{\mu i}, Z_{\mu i}) + \frac{s_{\mu i} + (\hat f_{\mu i} - W_{\mu i})^2}{2 Z_{\mu i}} \right] + \sum_{\mu l} \log Z^{\mu l}(\omega_{\mu l}, V_{\mu l}) + \frac{1}{2N} \sum_{\mu il} g^2_{out}(\omega_{\mu l}, y_{\mu l}, V_{\mu l}) \big( s_{\mu i} \hat x^2_{il} + \hat f^2_{\mu i} c_{il} \big) .   (139)

The above expression, evaluated at a fixed point of the AMP algorithm, hence gives the Bethe approximation to the log-likelihood.
It is mainly used to decide which fixed point of AMP is better. Indeed, there are cases where more than one AMP fixed point exists, and it is the one with the largest Bethe free entropy that corresponds asymptotically to the optimal Bayesian inference.

A. Fixed-point generating Bethe free entropy

Since the free entropy has a meaning only at the fixed point, we can transform it by using any of the fixed-point identities verified by the BP messages. A well-known property of the Bethe free entropy and belief propagation [70] is that the BP fixed points are stationary points of the Bethe free entropy. In this section we show that, also for the AMP for matrix factorization, the Bethe free entropy can be written in a form whose stationary points generate the fixed-point BP equations. This is achieved by writing the Bethe free entropy (139) as

\Phi^{Bethe}_{AMP}\big( \{T_{il}\}, \{\Sigma_{il}\}, \{W_{\mu i}\}, \{Z_{\mu i}\}, \{\omega_{\mu l}\}, \{\hat x_{il}\}, \{c_{il}\}, \{\hat f_{\mu i}\}, \{s_{\mu i}\} \big) = \sum_{il} \left[ \log \hat X^{il}(T_{il}, \Sigma_{il}) + \frac{c_{il} + (\hat x_{il} - T_{il})^2}{2\Sigma_{il}} \right] + \sum_{\mu i} \left[ \log \hat X^{\mu i}(W_{\mu i}, Z_{\mu i}) + \frac{s_{\mu i} + (\hat f_{\mu i} - W_{\mu i})^2}{2 Z_{\mu i}} \right] + \sum_{\mu l} \log Z^{\mu l}(\omega_{\mu l}, V_{\mu l}) + \frac{1}{2} \sum_{\mu l} \frac{\big( \omega_{\mu l} - \sum_i \hat f_{\mu i} \hat x_{il}/\sqrt{N} \big)^2}{V_{\mu l} - \sum_i s_{\mu i} c_{il}/N}   (140)

with

V_{\mu l} = \frac{1}{N} \sum_j \big[ c_{jl} s_{\mu j} + c_{jl} \hat f^2_{\mu j} + \hat x^2_{jl} s_{\mu j} \big] .

In order to derive (140) from (139) we have substituted $g^2_{out}$ by its fixed-point expression and imposed the value of the variance $V$. Under the present form, the Bethe free entropy satisfies the following theorem.

Theorem (Bethe/AMP correspondence). The fixed points of the AMP equations (96-103) are the stationary points of the cost function $\Phi^{Bethe}_{AMP}$, eq. (140).

Proof. This can be checked explicitly by setting to zero the derivatives of $\Phi^{Bethe}_{AMP}$. Indeed, the derivatives with respect to $T$, $\Sigma$, $W$ and $Z$ yield

\hat x_{il} = T_{il} + \Sigma_{il}\, \frac{\partial}{\partial T} \log \hat X^{il} = f_X(\Sigma_{il}, T_{il}) ,   (141)

c_{il} = 2\Sigma^2_{il}\, \frac{\partial}{\partial \Sigma_{il}} \log \hat X^{il} - (\hat x_{il} - T_{il})^2 = f_c(\Sigma_{il}, T_{il}) ,   (142)

\hat f_{\mu i} = W_{\mu i} + Z_{\mu i}\, \frac{\partial}{\partial W} \log \hat X^{\mu i} = f_F(Z_{\mu i}, W_{\mu i}) ,   (143)

s_{\mu i} = 2Z^2_{\mu i}\, \frac{\partial}{\partial Z_{\mu i}} \log \hat X^{\mu i} - (\hat f_{\mu i} - W_{\mu i})^2 = f_s(Z_{\mu i}, W_{\mu i}) .   (144)

Then, the stationarity with respect to $\omega$ can be expressed easily by noting that $\partial_{\omega_{\mu l}} \log Z^{\mu l} = g_{out}$ (a consequence of eq. (130)):

g_{out}(\omega_{\mu l}, y_{\mu l}, V_{\mu l}) + \frac{\omega_{\mu l} - \sum_i \hat f_{\mu i} \hat x_{il}/\sqrt{N}}{V_{\mu l} - \sum_i s_{\mu i} c_{il}/N} = 0 ,   (145)

which is nothing but the fixed-point equation for $\omega$. It is convenient to compute the derivative with respect to $V$ (even though this quantity is eventually a function of $\hat f, s, \hat x$ and $c$) using $\partial_{V_{\mu l}} \log Z^{\mu l} = \frac{1}{2}\big( g^2_{out} + \partial_\omega g_{out} \big)$, so that at the fixed point, when eq. (145) is satisfied, we have

\frac{\partial \Phi^{Bethe}_{AMP}}{\partial V_{\mu l}} = \frac{1}{2}\big( g^2_{out} + \partial_\omega g_{out} \big) - \frac{1}{2}\, \frac{\big( \omega_{\mu l} - \sum_i \hat f_{\mu i} \hat x_{il}/\sqrt{N} \big)^2}{\big( V_{\mu l} - \sum_i s_{\mu i} c_{il}/N \big)^2} = \frac{1}{2}\, \partial_\omega g_{out} .   (146)

Using this equation, one can finally check explicitly that differentiating with respect to $\hat x, c, \hat f$ and $s$ yields the remaining AMP equations for $T$, $\Sigma$, $W$ and $Z$. This concludes the proof.

B. The variational Bethe free entropy

We have shown that the fixed points of the approximate message passing equations are extrema of $\Phi^{Bethe}_{AMP}$. However, they are in general saddle points of this function, and it is very useful to derive an alternative "variational" free entropy whose maxima are the fixed points.
In particular, this will allow us to find these fixed points by alternative methods that do not rely on iterating the equations, as was done for compressed sensing in [71]. This variational free entropy can also be used, not only at the maximum but for each possible value of the parameters, as a current estimate of the quality of the reconstruction. Such a property has been used to implement a so-called adaptive damping in compressed sensing [72], and it can hence be anticipated that a similar implementation trick will be useful for matrix factorization as well.

1. Generic output channel

In order to derive the variational Bethe free entropy, we impose the fixed-point conditions and express the free entropy only as a function of the parameters of our trial distributions for the two matrices. Then we simply have

\Phi^{Bethe}_{var}(\{T_{il}\}, \{\Sigma_{il}\}, \{W_{\mu i}\}, \{Z_{\mu i}\}) = \Phi^{Bethe}_{AMP}\big( \{T_{il}\}, \{\Sigma_{il}\}, \{W_{\mu i}\}, \{Z_{\mu i}\}, \{\omega^*_{\mu l}\}, \{\hat x^*_{il}\}, \{c^*_{il}\}, \{\hat f^*_{\mu i}\}, \{s^*_{\mu i}\} \big) ,   (147)

where $\hat x^*, c^*, \hat f^*, s^*$ are given in terms of Eqs. (141-144) by $\hat x^* = f_X(\Sigma_{il}, T_{il})$, $c^* = f_c(\Sigma_{il}, T_{il})$, $\hat f^* = f_F(Z_{\mu i}, W_{\mu i})$, $s^* = f_s(Z_{\mu i}, W_{\mu i})$, and $\omega^*$ is the solution of (97). In order to write this variational expression in a nicer form, let us notice that the Kullback-Leibler divergences between $M_X$, $M_F$ (64-65) and the prior distributions are

-D_{KL}(M_F \| P_F) = \log \hat X^{\mu i} + \frac{f_s(Z_{\mu i}, W_{\mu i}) + \big( f_F(Z_{\mu i}, W_{\mu i}) - W_{\mu i} \big)^2}{2 Z_{\mu i}} ,   (148)

-D_{KL}(M_X \| P_X) = \log \hat X^{il} + \frac{f_c(\Sigma_{il}, T_{il}) + \big( f_X(\Sigma_{il}, T_{il}) - T_{il} \big)^2}{2\Sigma_{il}} .   (149)

Let us define an additional distribution

M_{out}(\omega_{\mu l}, V_{\mu l}, z) = \frac{1}{Z^{\mu l}}\; P_{out}(y_{\mu l}|z)\; \frac{1}{\sqrt{2\pi V_{\mu l}}}\; e^{-\frac{(z-\omega_{\mu l})^2}{2 V_{\mu l}}} ,   (150)

where $Z^{\mu l}$ is given by (130). Then one has

-D_{KL}(M_{out} \| P_{out}) = \log Z^{\mu l} + \frac{1}{2} \log 2\pi V_{\mu l} + \frac{1}{2}\big( 1 + V_{\mu l}\, \partial_\omega g_{out} + V_{\mu l}\, g^2_{out} \big) .   (152)

Starting from (139), we find

\Phi^{Bethe}_{var}(\{T_{il}\}, \{\Sigma_{il}\}, \{W_{\mu i}\}, \{Z_{\mu i}\}) = -\sum_{\mu i} D_{KL}(M_F(Z_{\mu i}, W_{\mu i}) \| P_F) - \sum_{il} D_{KL}(M_X(\Sigma_{il}, T_{il}) \| P_X) + \sum_{\mu l} \left[ \log Z^{\mu l}\big( \omega^*_{\mu l}, V^*_{\mu l} \big) + \frac{g^2_{out}}{2} \Big( V^*_{\mu l} - \sum_j s^*_{\mu j} c^*_{jl}/N \Big) \right]
= -\sum_{\mu i} D_{KL}(M_F(Z_{\mu i}, W_{\mu i}) \| P_F) - \sum_{il} D_{KL}(M_X(\Sigma_{il}, T_{il}) \| P_X) - \sum_{\mu l} \left\{ D_{KL}(M_{out}(\omega^*_{\mu l}, V^*_{\mu l}) \| P_{out}) + \frac{1}{2} \left[ \log\big( 2\pi V^*_{\mu l} \big) + 1 + V^*_{\mu l}\, \partial_\omega g_{out} + \frac{g^2_{out}}{N} \sum_j s^*_{\mu j} c^*_{jl} \right] \right\} ,   (153)

with $V^*$ and $\omega^*$ satisfying eqs. (114) and (119). Note that this expression has the same form as the one used in [73] for the simpler case of GAMP for compressed sensing and the generalized linear problem. Our expression thus generalizes the formula of [73] to the bilinear case.

2. The AWGN output channel

In the case of the additive white Gaussian noise output channel (23), the function $g_{out}$ takes the simple form

g_{out}(\omega_{\mu l}, y_{\mu l}, V_{\mu l}) = \frac{y_{\mu l} - \omega_{\mu l}}{\Delta + V_{\mu l}} ,   (154)

hence $\partial_\omega g_{out}$ does not depend on the variable $\omega_{\mu l}$. The only explicit dependence on $\omega_{\mu l}$ in the free entropy is through eq. (130), which for the AWGN output channel becomes

Z^{\mu l} = \frac{1}{\sqrt{2\pi(\Delta + V_{\mu l})}}\; e^{-\frac{(y_{\mu l} - \omega_{\mu l})^2}{2(\Delta + V_{\mu l})}} .   (155)

The free entropy is defined only at the fixed point of the GAMP equations. Given a fixed point, we can express from (97), for the AWGN channel,

\frac{y_{\mu l} - \omega_{\mu l}}{\Delta + V_{\mu l}} = \frac{y_{\mu l} - \frac{1}{\sqrt{N}} \sum_j \hat x_{jl} \hat f_{\mu j}}{\Delta + \frac{1}{N} \sum_j c_{jl} s_{\mu j}} .   (156)

We plug this last expression into (155) to obtain

Z^{\mu l} = \frac{1}{\sqrt{2\pi(\Delta + V_{\mu l})}}\; \exp\left[ -\frac{\big( y_{\mu l} - \frac{1}{\sqrt{N}} \sum_j \hat x_{jl} \hat f_{\mu j} \big)^2}{2\big( \Delta + \frac{1}{N} \sum_j c_{jl} s_{\mu j} \big)^2}\, \big( \Delta + V_{\mu l} \big) \right] .   (157)
Simplifying the last two terms of eq. (153), we obtain for the AWGN channel

\Phi^{AWGN}_{Bethe}(\{T_{il}\}, \{\Sigma_{il}\}, \{W_{\mu i}\}, \{Z_{\mu i}\}) = -\sum_{\mu i} D_{KL}(M_F(Z_{\mu i}, W_{\mu i}) \| P_F) - \sum_{il} D_{KL}(M_X(\Sigma_{il}, T_{il}) \| P_X) - \sum_{\mu l} \frac{\big( y_{\mu l} - \frac{1}{\sqrt{N}} \sum_j \hat x^*_{jl} \hat f^*_{\mu j} \big)^2}{2\big( \Delta + \frac{1}{N} \sum_j c^*_{jl} s^*_{\mu j} \big)} - \frac{1}{2} \sum_{\mu l} \log\big[ 2\pi (\Delta + V^*_{\mu l}) \big] .   (158)

The first three terms of this free entropy are clearly negative, and the last term cannot be larger than $-MP \log(2\pi\Delta)/2$. Hence the free entropy (158) is bounded from above, which is consistent with its interpretation as a variational expression. The stationary points of (158) are the fixed points of the GAMP algorithm, and hence the fixed points corresponding to the maximum likelihood could also be found by direct maximization of expression (158). This offers an interesting algorithmic alternative to the iterative AMP algorithm, explored for the compressed sensing problem in [71]. Another use of expression (158) is that during the iteration of the GAMP algorithm its value should be increasing; we can hence adaptively choose the step size of the iterations so as to ensure this increase. Such an adaptive damping was implemented in [18] using a different form of the free entropy, which does not correspond to the Bethe free entropy but to the variational mean-field (VMF) free entropy, which reads

\Phi^{AWGN}_{VMF}(\{T_{il}\}, \{\Sigma_{il}\}, \{W_{\mu i}\}, \{Z_{\mu i}\}) = -\sum_{\mu i} D_{KL}(M_F(Z_{\mu i}, W_{\mu i}) \| P_F) - \sum_{il} D_{KL}(M_X(\Sigma_{il}, T_{il}) \| P_X) - \frac{1}{2\Delta} \sum_{\mu l} \left[ \Big( y_{\mu l} - \frac{1}{\sqrt{N}} \sum_i \hat f_{\mu i} \hat x_{il} \Big)^2 + V_{\mu l} \right] - \frac{MP}{2} \log 2\pi\Delta .   (159)

It is easy to check that $\Phi^{AWGN}_{VMF}(T_{il}, \Sigma_{il}, W_{\mu i}, Z_{\mu i}) < \Phi^{AWGN}_{Bethe}(T_{il}, \Sigma_{il}, W_{\mu i}, Z_{\mu i})$, which could be expected, since the Bethe expression, which is asymptotically exact, should be a better approximation than the mean-field approximation.

V. ASYMPTOTIC ANALYSIS

A. State evolution

In this section we derive the asymptotic ($N\to\infty$) evolution of the GAMP iterations for matrix factorization. This asymptotic analysis holds as long as all the elements of the true matrix $F$ are iid random variables generated from a distribution $P_{F_0}$, and all the elements of the true matrix $X$ are iid random variables generated from a distribution $P_{X_0}$. In general we will not assume $P_{F_0} = P_F$ and $P_{X_0} = P_X$; this special case of the Bayes-optimal analysis will be treated in the next section. In the present section we will also distinguish between the true output channel, characterized by the conditional probability distribution $P^0_{out}(y_{\mu l}|z^0_{\mu l})$, and the output channel used in the GAMP algorithm, $P_{out}(y_{\mu l}|z_{\mu l})$. We recall that $z^0_{\mu l} = \sum_{i=1}^N F^0_{\mu i} X^0_{il}$, where $F^0_{\mu i}$ and $X^0_{il}$ are the elements of the actual matrices that we do not know and aim to recover, and $z_{\mu l} = \sum_{i=1}^N F_{\mu i} X_{il}$. Again, the special case $P^0_{out} = P_{out}$ will be treated in the next section. We will assume that at least one of the probability distributions $P_{F_0}$ and $P_{X_0}$ (and also at least one of $P_F$ and $P_X$) has zero mean; otherwise there would be additional terms in this asymptotic analysis, as e.g. in [74].
Let us first recall the definitions of the order parameters; all of them are finite, of order $O(1)$, in the thermodynamic limit:

m^t_X \equiv \frac{1}{NP} \sum_{jl} \hat x_{jl}(t)\, X^0_{jl} ,\qquad m^t_F \equiv \frac{1}{M\sqrt{N}} \sum_{\mu i} \hat f_{\mu i}(t)\, F^0_{\mu i} ,   (160)

q^t_X \equiv \frac{1}{NP} \sum_{jl} \hat x^2_{jl}(t) ,\qquad q^t_F \equiv \frac{1}{NM} \sum_{\mu i} \hat f^2_{\mu i}(t) ,   (161)

Q^t_X \equiv q^t_X + \frac{1}{NP} \sum_{jl} c_{jl}(t) ,\qquad Q^t_F \equiv q^t_F + \frac{1}{NM} \sum_{\mu i} s_{\mu i}(t) .   (162)

Note also that the above sums over a pair of indices could be replaced by sums over only one index (with adjusted normalization) without changing the order parameters to leading order: for instance, we expect that in the thermodynamic limit $\frac{1}{N}\sum_j \hat x_{jl}(t)\, X^0_{jl}$ tends to the same limit as $m^t_X$ defined in (160).

First, let us compute the average over the realizations of $X^0$, $F^0$ and $w^0$ of the quantity $V^t_{\mu l}$ defined in eq. (90). By the assumptions of the belief propagation equations (43-44), the terms in the product in (90) are statistically independent, and we can hence write for the average, to leading order,

V^t = Q^t_F Q^t_X - q^t_F q^t_X .   (163)

Further, we realize that the variance of this quantity (again over the realizations of $X^0$, $F^0$ and $w^0$) is

E\big[ (V^t_{\mu l} - V^t)^2 \big] = E\Big\{ \frac{1}{N} \sum_i \big[ c_{il\to\mu l}(t) - \frac{1}{N} \sum_k c_{kl}(t) \big]\, s_{\mu i\to\mu l}(t) + \ldots \Big\}^2 = O(1/N) .   (164)

In order to derive this result, we expand the square and obtain a double sum over $i$ and $j$. Because of the conditional independence between incoming messages assumed in belief propagation, the terms with $i \neq j$ average exactly to zero, while the terms with $i = j$ add up to a contribution of order $O(1/N)$. From this we conclude that, to leading order in the thermodynamic limit, the quantity $V^t_{\mu l} = V^t$ does not depend on its indices.

Next, we are interested in the average $\Sigma^t$ of the quantity $\Sigma^t_{il}$ over the realizations of $F^0$, $X^0$ and $w^0$. Using the definition of $\Sigma^t_{il}$, eq. (88), and the expression for $A^t_{\mu l\to il}$, eq. (60), we obtain

(\Sigma^t_{il})^{-1} = -\frac{1}{N} \sum_\mu \left\{ \partial_\omega g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il}) \big[ \hat f^2_{\mu i\to\mu l}(t) + s_{\mu i\to\mu l}(t) \big] + g^2_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il})\, s_{\mu i\to\mu l}(t) \right\} .   (165)

We proceed analogously for $Z^t_{\mu i}$. Using again the conditional independence between incoming messages assumed in the BP equations, we obtain

(\Sigma^t)^{-1} = \alpha Q^t_F \hat\chi^t - \alpha (Q^t_F - q^t_F)\, \hat q^t ,   (166)

(Z^t)^{-1} = \pi Q^t_X \hat\chi^t - \pi (Q^t_X - q^t_X)\, \hat q^t ,   (167)

where we introduced the new parameters

\hat\chi^t = -\frac{1}{M}\, E_{F^0, X^0, w^0}\Big[ \sum_\mu \partial_\omega g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il}) \Big] ,   (168)

\hat q^t = \frac{1}{M}\, E_{F^0, X^0, w^0}\Big[ \sum_\mu g^2_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il}) \Big] .   (169)

We use $y_{\mu l} = h(z^0_{\mu l}, w^0)$, and we recall that, to leading order, $V^t_{\mu il} = V^t$. The function $g_{out}$ above hence depends on two correlated fluctuating variables, $\omega^t_{\mu il}$ and $z^0_{\mu l}$, and on $w^0$. Both variables $\omega^t_{\mu il}$ and $z^0_{\mu l}$ are sums over many independent terms (for $z$ this holds by construction, for $\omega$ by the BP assumptions). Hence, by the central limit theorem, they are Gaussian random variables. Their mean is zero when at least one of the distributions $P_X$ and $P_F$ (and one of $P_{X_0}$ and $P_{F_0}$) has zero mean (which we assume in this section).
The covariance matrix between the variables $\omega^t_{\mu il}$ and $z^0_{\mu l}$ is obtained from

E(\omega^2_{\mu il}) = \frac{1}{N} \sum_{j(\neq i)} E\big( \hat f^2_{\mu j\to\mu l}\, \hat x^2_{jl\to\mu l} \big) + \frac{1}{N} \sum_{j(\neq i),\, k(\neq i,j)} E\big( \hat f_{\mu j\to\mu l}\, \hat f_{\mu k\to\mu l}\, \hat x_{jl\to\mu l}\, \hat x_{kl\to\mu l} \big) = q_F q_X ,   (170)

E(\omega_{\mu il}\, z^0_{\mu l}) = \frac{1}{N} \sum_{j(\neq i)} E\big( F^0_{\mu j}\, \hat f_{\mu j\to\mu l}\, X^0_{jl}\, \hat x_{jl\to\mu l} \big) + \frac{1}{N} \sum_{k,\, j(\neq i,k)} E\big( F^0_{\mu k}\, \hat f_{\mu j\to\mu l}\, X^0_{kl}\, \hat x_{jl\to\mu l} \big) = m_F m_X ,   (171)

where we again used the BP assumption of independence between the incoming messages, but also between $F^0_{\mu i}$ and the message $\hat x_{il\to\mu l}$, and between $X^0_{il}$ and $\hat f_{\mu i\to\mu l}$. As for the variance of $z^0$, we denote it by

E\big[ (z^0_{\mu l})^2 \big] = N\, E\big[ (F^0_{\mu i})^2 \big]\, E\big[ (X^0_{il})^2 \big] = \langle (z^0)^2 \rangle .   (172)

Altogether this gives for $\hat\chi$ and $\hat q$

\hat\chi^t = -\int dw\; P_0(w) \int dp\, dz\; \mathcal N\big[ p, z;\, q^t_F q^t_X,\, \langle (z^0)^2\rangle,\, m^t_F m^t_X \big]\; \partial_p g_{out}\big[ p, h(z, w), Q^t_F Q^t_X - q^t_F q^t_X \big] ,   (173)

\hat q^t = \int dw\; P_0(w) \int dp\, dz\; \mathcal N\big[ p, z;\, q^t_F q^t_X,\, \langle (z^0)^2\rangle,\, m^t_F m^t_X \big]\; g^2_{out}\big[ p, h(z, w), Q^t_F Q^t_X - q^t_F q^t_X \big] ,   (174)

where $\mathcal N[p, z;\, \sigma^2_p, \sigma^2_z, E(pz)]$ is a joint Gaussian distribution of the variables $p$ and $z$ with zero means and with variances and correlation given in the argument. From the above analysis it also follows that, to leading order, the quantities $\Sigma^t_{il}$ and $Z^t_{\mu i}$ do not depend on their indices $il$ and $\mu i$.

We now study the asymptotic behavior of $T^t_{il}$, defined by eq. (88):

T^t_{il} / \Sigma^t_{il} = \frac{1}{\sqrt{N}} \sum_\mu B^t_{\mu l\to il} = \frac{1}{\sqrt{N}} \sum_\mu \hat f_{\mu i\to\mu l}(t)\; g_{out}(\omega^t_{\mu il}, y_{\mu l}, V^t_{\mu il}) ,   (175)

where we used the definition of $B^t_{\mu l\to il}$ in eq. (59). The message $\hat f_{\mu i\to\mu l}(t)$ is uncorrelated with all the other incoming messages and also with all the $F^0_{\mu j}$ for $j \neq i$. It is, however, correlated with $F^0_{\mu i}$, and the dependence on $F^0_{\mu i}$ hence has to be treated separately. After an expansion to leading order we obtain

T^t_{il} / \Sigma^t_{il} = X^0_{il}\; \frac{1}{\sqrt{N}} \sum_\mu \hat f_{\mu i\to\mu l}(t)\, F^0_{\mu i}\; \partial_z g_{out}\big( \omega^t_{\mu il}, h(z_{\mu l}, w_{\mu l}), V^t_{\mu il} \big) + \frac{1}{\sqrt{N}} \sum_\mu \hat f_{\mu i\to\mu l}(t)\; g_{out}\Big( \omega^t_{\mu il}, h\big( \textstyle\sum_{j(\neq i)} F^0_{\mu j} X^0_{jl}, w \big), V^t_{\mu il} \Big) \approx \alpha\, X^0_{il}\, m^t_F \hat m^t + \mathcal N(0,1)\, \sqrt{\alpha q^t_F \hat q^t} ,   (176)

where in the first term we defined the new parameter

\hat m^t = \frac{1}{M}\, E_{F^0, X^0, w^0}\Big[ \sum_\mu \partial_z g_{out}\big( \omega^t_{\mu il}, h(z_{\mu l}, w_{\mu l}), V^t_{\mu il} \big) \Big] .   (177)

Using the same kind of analysis as for $\hat q$ and $\hat\chi$, we find that

\hat m^t = \int dw\; P_0(w) \int dp\, dz\; \mathcal N\big[ p, z;\, q^t_F q^t_X,\, \langle (z^0)^2\rangle,\, m^t_F m^t_X \big]\; \partial_z g_{out}\big[ p, h(z, w), Q^t_F Q^t_X - q^t_F q^t_X \big] .   (178)

The second term of (176), when averaged over the realizations of $F^0$, $X^0$ and $w^0$, behaves as a Gaussian random variable. In (176) we moreover assumed that

E_{F^0, X^0, w^0}\Big[ \frac{1}{M} \sum_\mu g_{out}\Big( \omega^t_{\mu il}, h\big( \textstyle\sum_{j(\neq i)} F^0_{\mu j} X^0_{jl}, w \big), V^t_{\mu il} \Big) \Big] = 0 ,   (179)

which is true in all the special cases analysed in this paper, and is also true in general under Bayes-optimal inference, as detailed in the next section. If the zero-mean assumption (179) did not hold, the density evolution equations would contain additional terms (similarly to the case where both $F$ and $X$ have non-zero means); see e.g. the state evolution in compressed sensing for non-zero-mean matrices [74]. Under the zero-mean assumption, the variance of the Gaussian variable is $\alpha q^t_F \hat q^t$, with $\hat q^t$ given by (174). Analogously, we have

W^t_{\mu i} / Z^t_{\mu i} \approx \pi\, \sqrt{N} F^0_{\mu i}\, m^t_X \hat m^t + \mathcal N(0,1)\, \sqrt{\pi q^t_X \hat q^t} .   (180)
With the use of (166-167), (176) and (180), together with the expressions for the messages (102-103), we obtain

Q^{t+1}_X - q^{t+1}_X = \int dX^0 P_{X_0}(X^0) \int \mathcal D\xi\; f_c\left[ \frac{1}{\alpha Q^t_F \hat\chi^t - \alpha(Q^t_F - q^t_F)\hat q^t},\; \frac{\alpha m^t_F \hat m^t X^0 + \xi \sqrt{\alpha q^t_F \hat q^t}}{\alpha Q^t_F \hat\chi^t - \alpha(Q^t_F - q^t_F)\hat q^t} \right] ,   (181)

q^{t+1}_X = \int dX^0 P_{X_0}(X^0) \int \mathcal D\xi\; f^2_X\left[ \frac{1}{\alpha Q^t_F \hat\chi^t - \alpha(Q^t_F - q^t_F)\hat q^t},\; \frac{\alpha m^t_F \hat m^t X^0 + \xi \sqrt{\alpha q^t_F \hat q^t}}{\alpha Q^t_F \hat\chi^t - \alpha(Q^t_F - q^t_F)\hat q^t} \right] ,   (182)

m^{t+1}_X = \int dX^0 P_{X_0}(X^0) \int \mathcal D\xi\; X^0 f_X\left[ \frac{1}{\alpha Q^t_F \hat\chi^t - \alpha(Q^t_F - q^t_F)\hat q^t},\; \frac{\alpha m^t_F \hat m^t X^0 + \xi \sqrt{\alpha q^t_F \hat q^t}}{\alpha Q^t_F \hat\chi^t - \alpha(Q^t_F - q^t_F)\hat q^t} \right] ,   (183)

where $\mathcal D\xi = d\xi\, e^{-\xi^2/2}/\sqrt{2\pi}$ is a Gaussian integration measure. Analogously, for the $F$-related order parameters we have

Q^{t+1}_F - q^{t+1}_F = \int dF^0 P_{F_0}(F^0) \int \mathcal D\xi\; f_s\left[ \frac{1}{\pi Q^t_X \hat\chi^t - \pi(Q^t_X - q^t_X)\hat q^t},\; \frac{\pi m^t_X \hat m^t \sqrt{N} F^0 + \xi \sqrt{\pi q^t_X \hat q^t}}{\pi Q^t_X \hat\chi^t - \pi(Q^t_X - q^t_X)\hat q^t} \right] ,   (184)

q^{t+1}_F = \int dF^0 P_{F_0}(F^0) \int \mathcal D\xi\; f^2_F\left[ \frac{1}{\pi Q^t_X \hat\chi^t - \pi(Q^t_X - q^t_X)\hat q^t},\; \frac{\pi m^t_X \hat m^t \sqrt{N} F^0 + \xi \sqrt{\pi q^t_X \hat q^t}}{\pi Q^t_X \hat\chi^t - \pi(Q^t_X - q^t_X)\hat q^t} \right] ,   (185)

m^{t+1}_F = \int dF^0 P_{F_0}(F^0) \int \mathcal D\xi\; \sqrt{N} F^0 f_F\left[ \frac{1}{\pi Q^t_X \hat\chi^t - \pi(Q^t_X - q^t_X)\hat q^t},\; \frac{\pi m^t_X \hat m^t \sqrt{N} F^0 + \xi \sqrt{\pi q^t_X \hat q^t}}{\pi Q^t_X \hat\chi^t - \pi(Q^t_X - q^t_X)\hat q^t} \right] .   (186)

The six equations (181-186), together with (173-178), are the general form of the density evolution for GAMP in the general case of matrix factorization. We recall that these equations describe the asymptotic evolution of the algorithm in the "thermodynamic" limit of large sizes, as long as the matrices $X$, $X^0$, $F$, $F^0$ were generated with iid elements, at least one of the random variables $X$ and $F$, and at least one of $X^0$ and $F^0$, has zero mean, and the output channel satisfies condition (179). The first of these conditions is absolutely essential for our approach; the restriction to zero means is here for convenience, as non-zero means and a generic form of the output function can be treated with the same formalism that we have used here, at the price of additional terms.

1. State evolution of the Bayes-optimal inference

To satisfy the Nishimori identities we have to suppose that all the prior distributions match the true distributions from which the signal and noise were generated, i.e. that conditions (106) hold. The condition $P^0_{out}(y|z) = P_{out}(y|z)$ is equivalent to $P_0(w) = P(w)$ and $h(z, w) = h_0(z, w)$. In this Bayes-optimal setting (106), the asymptotic analysis simplifies considerably, since we have

q^t_X = m^t_X ,\qquad q^t_F = m^t_F ,   (187)

Q^t_X = E[(X^0)^2] ,\qquad Q^t_F = N\, E[(F^0)^2] ,   (188)

\hat\chi^t = \hat q^t = \hat m^t .   (189)

To justify the above statements we need to prove that if (106) is satisfied and (187-189) hold up to iteration $t$, then (187-189) hold also at iteration $t+1$. This is done in the next two subsections. In the Bayes-optimal setting, the state evolution simplifies into

m^{t+1}_X = \int dX\, P_X(X) \int \mathcal D\xi\; f^2_X\left[ \frac{1}{\alpha m^t_F \hat m^t},\; \frac{\alpha m^t_F \hat m^t X + \xi\sqrt{\alpha m^t_F \hat m^t}}{\alpha m^t_F \hat m^t} \right] ,   (190)

m^{t+1}_F = \int dF\, P_F(F) \int \mathcal D\xi\; f^2_F\left[ \frac{1}{\pi m^t_X \hat m^t},\; \frac{\pi m^t_X \hat m^t \sqrt{N} F + \xi\sqrt{\pi m^t_X \hat m^t}}{\pi m^t_X \hat m^t} \right] ,   (191)

\hat m^t = -\int dw\, P(w) \int dp\, dz\; \frac{e^{-\frac{p^2}{2 m^t_F m^t_X}}\; e^{-\frac{(z-p)^2}{2[\langle (z^0)^2\rangle - m^t_F m^t_X]}}}{2\pi \sqrt{m^t_F m^t_X \big( \langle (z^0)^2\rangle - m^t_F m^t_X \big)}}\; \partial_p g_{out}\big( p, h(z, w), \langle (z^0)^2\rangle - m^t_F m^t_X \big) ,   (192)

where $\mathcal D\xi$ is again the Gaussian measure $d\xi\, e^{-\xi^2/2}/\sqrt{2\pi}$. Here we chose to use the expressions coming from eqs. (182), (185) and (173), but we could have used any of the other expressions that are equivalent on the Nishimori line. A numerical illustration of iterating these equations is sketched below.
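The following is a hedged Monte-Carlo sketch of iterating eqs. (190)-(192) for the AWGN channel, for which $\partial_p g_{out} = -1/(\Delta+V)$ and hence $\hat m = 1/(\Delta + V)$ with $V = Q_F Q_X - m_F m_X$, together with a Gauss-Bernoulli prior on $X$ and a standard Gaussian prior on $\sqrt{N}F$ (for which eq. (191) has the closed form used below). All parameter values are illustrative assumptions; depending on them the iteration may remain at the uninformative fixed point or converge to an informative one, reflecting the first-order phase transition discussed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, Delta = 0.2, 1e-4          # sparsity of X and AWGN variance (assumed values)
alpha, pi_ = 2.0, 0.5           # alpha = M/N, pi = P/N (assumed values)
QX, QF = rho, 1.0               # E[(X^0)^2] and N E[(F^0)^2] for these priors
n = 200_000                     # Monte Carlo samples

def fX2_avg(qt):
    """E_{X0,xi}[ f_X(1/qt, X0 + xi/sqrt(qt))^2 ], the integrand of eq. (190),
    for the Gauss-Bernoulli prior rho*N(0,1) + (1-rho)*delta_0."""
    X0 = rng.standard_normal(n) * (rng.random(n) < rho)
    T = X0 + rng.standard_normal(n) / np.sqrt(qt)
    Sigma = 1.0 / qt
    slab = rho * np.exp(-T**2 / (2*(1+Sigma))) / np.sqrt(1+Sigma)
    spike = (1-rho) * np.exp(-T**2 / (2*Sigma)) / np.sqrt(Sigma)
    f = (slab / (slab + spike)) * T / (1 + Sigma)
    return np.mean(f**2)

mX, mF = 1e-3, 1e-3             # small initialization eta (cf. eq. (193) discussion)
for t in range(200):
    V = QF*QX - mF*mX
    mhat = 1.0 / (Delta + V)                  # eq. (192) for AWGN
    mX = fX2_avg(alpha * mF * mhat)           # eq. (190), Monte Carlo
    mF = (pi_*mX*mhat) / (1 + pi_*mX*mhat)    # eq. (191), Gaussian prior on sqrt(N) F
print("fixed point: m_X =", mX, " m_F =", mF, " MSE_X =", QX - mX)
```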
Here we chose to use the expressions coming from eqs. (182), (185) and (173), but we could have used any of the other expressions that are equivalent on the Nishimori line. The overlaps m_F and m_X are initialized as the squares of the means of the corresponding prior distributions,

m_F^{t=0} = N\Big[\int dF\, F\, P(F)\Big]^2 , \qquad m_X^{t=0} = \Big[\int dX\, X\, P(X)\Big]^2 .   (193)

In case the prior distribution depends on other random variables, e.g. in the case of matrix calibration, we take an additional average with respect to that variable. If the above initialization gives m_F^{t=0} = 0 and m_X^{t=0} = 0, then this is a fixed point of the state evolution. This is due to the permutational symmetry between the columns of the matrix F and the rows of the matrix X. To obtain a nontrivial fixed point we initialize at m_F^{t=0} = η for some very small η, corresponding to infinitesimal prior information about the elements of the matrix F. Note that this is needed only in the state evolution; the algorithm breaks the permutational symmetry spontaneously. The same situation appears in an Ising ferromagnet at low temperature, where zero magnetization is a fixed point of the equilibrium equations, but the physically correct solution to which dynamical procedures converge has large magnetization in absolute value.

Our general strategy in the asymptotic analysis of optimal Bayesian inference and the related phase transitions is that the corresponding fixed points must satisfy the Nishimori identities; hence we will restrict our search for fixed points to parameters lying on the Nishimori line, i.e. satisfying the identities (187-189). When these identities are not imposed, the iterations of the state evolution equations do not always converge to fixed points on the Nishimori line. This is also reflected in problems with the convergence of the GAMP algorithm for matrix factorization. In the algorithm the Nishimori identities are unfortunately not straightforward to impose.

2. The input Nishimori identities

Assume that in the state evolution the Nishimori identities (187-189) hold for all iteration times smaller than or equal to t. Our aim is to show that then m_X^{t+1} and q_X^{t+1} computed from (183) and (182) are equal. Recall that

m_X^{t+1} = \int dX_0\, P_X(X_0) \int D\xi\; X_0\, f_X\Big[\frac{1}{\alpha m_F^t \hat m^t},\; \frac{\alpha m_F^t \hat m^t X_0 + \xi\sqrt{\alpha m_F^t \hat m^t}}{\alpha m_F^t \hat m^t}\Big] ,   (194)

q_X^{t+1} = \int dX_0\, P_X(X_0) \int D\xi\; f_X^2\Big[\frac{1}{\alpha m_F^t \hat m^t},\; \frac{\alpha m_F^t \hat m^t X_0 + \xi\sqrt{\alpha m_F^t \hat m^t}}{\alpha m_F^t \hat m^t}\Big] ,   (195)

and the definition of f_X,

f_X(\Sigma^2, R) = \frac{\int dX\, e^{-\frac{(X-R)^2}{2\Sigma^2}}\, X\, P_X(X)}{\int dX\, e^{-\frac{(X-R)^2}{2\Sigma^2}}\, P_X(X)} = \frac{\int dX\, e^{-\frac{X^2}{2\Sigma^2} + \frac{R X}{\Sigma^2}}\, X\, P_X(X)}{\int dX\, e^{-\frac{X^2}{2\Sigma^2} + \frac{R X}{\Sigma^2}}\, P_X(X)} .   (197)

Denoting q̃ = αm_F^t m̂^t, we have

m_X^{t+1} = \int dX_0\, P_X(X_0) \int D\xi\; X_0\, \frac{\int dX\, e^{-\tilde q X^2/2 + (\tilde q X_0 + \xi\sqrt{\tilde q}) X}\, X\, P_X(X)}{\int dX\, e^{-\tilde q X^2/2 + (\tilde q X_0 + \xi\sqrt{\tilde q}) X}\, P_X(X)} ,   (198)

q_X^{t+1} = \int dX_0\, P_X(X_0) \int D\xi\; \Bigg[\frac{\int dX\, e^{-\tilde q X^2/2 + (\tilde q X_0 + \xi\sqrt{\tilde q}) X}\, X\, P_X(X)}{\int dX\, e^{-\tilde q X^2/2 + (\tilde q X_0 + \xi\sqrt{\tilde q}) X}\, P_X(X)}\Bigg]^2 .   (199)

Performing the change of variables ξ√q̃ + q̃X₀ → ξ√q̃, the Gaussian measure becomes Dξ e^{−q̃X₀²/2 + ξX₀√q̃}, so that

m_X^{t+1} = \int D\xi \int dX_0\, P_X(X_0)\, e^{-\tilde q X_0^2/2 + \xi X_0 \sqrt{\tilde q}}\; X_0\, \frac{\int dX\, e^{-\tilde q X^2/2 + \xi\sqrt{\tilde q} X}\, X\, P_X(X)}{\int dX\, e^{-\tilde q X^2/2 + \xi\sqrt{\tilde q} X}\, P_X(X)} = \int D\xi\; \frac{\Big[\int dX\, e^{-\tilde q X^2/2 + \xi\sqrt{\tilde q} X}\, X\, P_X(X)\Big]^2}{\int dX\, e^{-\tilde q X^2/2 + \xi\sqrt{\tilde q} X}\, P_X(X)} .   (200)
Analogously we obtain

q_X^{t+1} = \int D\xi \int dX_0\, P_X(X_0)\, e^{-\tilde q X_0^2/2 + \xi\sqrt{\tilde q} X_0}\; \Bigg[\frac{\int dX\, e^{-\tilde q X^2/2 + \xi\sqrt{\tilde q} X}\, X\, P_X(X)}{\int dX\, e^{-\tilde q X^2/2 + \xi\sqrt{\tilde q} X}\, P_X(X)}\Bigg]^2 = m_X^{t+1} .   (201)

The proof of m_F^{t+1} = q_F^{t+1} is exactly the same. The next identity we want to prove is Q_X^{t+1} = E[(X₀)²]. From the general state evolution equation (181) we get, under the conditions (187-189),

Q_X^{t+1} - q_X^{t+1} = \int dX_0\, P_X(X_0) \int D\xi\; f_c\Big[\frac{1}{\alpha m_F^t \hat m^t},\; \frac{\alpha m_F^t \hat m^t X_0 + \xi\sqrt{\alpha m_F^t \hat m^t}}{\alpha m_F^t \hat m^t}\Big] .   (202)

Using the definition of the function f_c(Σ, R) (66), the same change of variables, and the resulting cancellations as above, we get

Q_X^{t+1} = \int D\xi \int dX_0\, X_0^2\, P_X(X_0)\, e^{-\tilde q X_0^2/2 + \xi\sqrt{\tilde q} X_0} = E[(X_0)^2] ,   (203)

and analogously Q_F^{t+1} = N E[(F₀)²].

3. The output Nishimori identities

Let us now assume that the input Nishimori identities (187-188) are satisfied; we want to show that (189) and (179) then hold. We depart from the general expressions (173-178). We notice that for q_F^t = m_F^t and q_X^t = m_X^t the joint Gaussian measure for the variables p and z in (170-172) can be written as a product of two Gaussian measures. We have in that case E[ω_µl(ω_µl − z⁰_µl)] = 0, hence one of the Gaussians has zero mean and variance m_F^t m_X^t, and the other one has mean p and variance V^t = ⟨(z⁰)²⟩ − m_F^t m_X^t. Furthermore, performing the integration over the variable z by parts in eq. (178), and then using the relation between h(z, w) and P_out(y|z) from eq. (10) and the definition of g_out from eq. (56), we get

\hat m^t = \int dy\, dp\, dz\; P_{\rm out}(y|z)\, \mathcal N(p,z)\; \frac{z-p}{V^t}\; \frac{\int dz'\, P_{\rm out}(y|z')\,(z'-p)\, e^{-\frac{(z'-p)^2}{2V^t}}}{V^t \int dz''\, P_{\rm out}(y|z'')\, e^{-\frac{(z''-p)^2}{2V^t}}} .   (204)

Doing analogous manipulations, making the Gaussian measure explicit and using eq. (10) and the definition of g_out in equation (174), we obtain

\hat q^t = \hat m^t .   (205)

For χ̂^t we integrate with respect to p by parts and, using the same steps as above, we obtain

\hat\chi^t = -\frac{1}{m_F^t m_X^t}\int dy\, dp\, dz\; P_{\rm out}(y|z)\, \mathcal N(p,z)\; p\; \frac{\int dz'\, P_{\rm out}(y|z')\,(z'-p)\, e^{-\frac{(z'-p)^2}{2V^t}}}{V^t \int dz''\, P_{\rm out}(y|z'')\, e^{-\frac{(z''-p)^2}{2V^t}}} + \hat q^t = \hat q^t .   (206)

Thanks to a cancellation between the integrals over the variables z and z″ we can perform explicitly the integral over y (keeping in mind that P_out(y|z′) is a normalized probability distribution). The remaining Gaussian integral is then zero. In an analogous manner we prove eq. (179), by noticing that the expectation with respect to F⁰, X⁰ and w⁰ is exactly the integral ∫ dy dp dz P_out(y|z) N(p, z).

To conclude, the identities (189) and (179) hold in the limit N → ∞. Moreover, as is common in statistical physics, we can invoke the self-averaging property, under which quantities on almost every large (N → ∞) instance are equal to their averages over the randomness (disorder) of F⁰, X⁰ and w⁰. This self-averaging then, for instance, justifies the use of eq. (108) on large single instances of the matrix factorization problem.
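Before turning to the replica computation, we note that the identity (201) is straightforward to check numerically: sample X₀ from the prior, pass it through the effective scalar channel R = X₀ + ξ/√q̃, and compare the empirical averages of X₀ f_X and f_X². A minimal sketch for the Bernoulli-Gaussian prior of the earlier sketch (the value of q̃ = αm_F^t m̂^t is an arbitrary illustration):

```python
import numpy as np

def gauss(x, var):
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def f_X(Sigma2, R, rho):
    # posterior-mean denoiser for the Bernoulli-Gaussian prior (as above),
    # restated here so that the snippet is self-contained
    w_slab = rho * gauss(R, Sigma2 + 1.0)
    w_spike = (1.0 - rho) * gauss(R, Sigma2)
    return (w_slab / (w_slab + w_spike)) * R / (1.0 + Sigma2)

rng = np.random.default_rng(1)
rho, q_tilde, n = 0.1, 2.0, 10**6   # q_tilde = alpha * m_F^t * m_hat^t (illustrative)

X0 = rng.standard_normal(n) * (rng.random(n) < rho)   # sample the prior
R = X0 + rng.standard_normal(n) / np.sqrt(q_tilde)    # effective scalar channel

est = f_X(1.0 / q_tilde, R, rho)
print("m_X^{t+1} = E[X0 f_X]:", np.mean(X0 * est))
print("q_X^{t+1} = E[f_X^2] :", np.mean(est**2))      # the two agree, eq. (201)
```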
B. Replica method

The replica method is a well-known, non-rigorous approach to evaluating the typical performance in various Bayesian inference problems. We here show how it is employed for the matrix factorization problem. We show, as expected from other problems, that the result of the replica analysis is fully equivalent to the result of the state evolution.

1. Moment assessment for n ∈ ℕ

The expression of the partition function

Z(Y) = \int dF\, dX\; \prod_{\mu,i} P_F(F_{\mu i}) \prod_{i,l} P_X(X_{il}) \prod_{\mu,l} P_{\rm out}\Big(y_{\mu l}\,\Big|\, \sum_i F_{\mu i} X_{il}\Big)   (207)

constitutes the basis of our analysis. In statistical physics one can generally examine the properties of a system via the evaluation of the free entropy log Z(Y), which in the current case statistically fluctuates depending on the realization of Y. However, as N, M, P → ∞, one can expect that the self-averaging property holds and that, therefore, the free entropy density N⁻² log Z(Y) converges to its typical value φ ≡ N⁻² [log Z(Y)]_Y with probability one. This is also expected to hold for other macroscopic quantities relevant to the performance of the matrix factorization. Therefore, the assessment of φ is the central issue in our analysis. This can be systematically carried out by the replica method. For this, we first evaluate the n-th moment of Z(Y), [Zⁿ(Y)]_Y = ∫ dY P₀(Y) Zⁿ(Y), for n ∈ ℕ, utilizing the identity

Z^n(Y) = \int \prod_{a=1}^n \Big\{ dF^a\, dX^a \prod_{\mu,i} P_F(F^a_{\mu i}) \prod_{i,l} P_X(X^a_{il}) \Big\} \times \prod_{\mu,l} \prod_{a=1}^n P_{\rm out}\Big(y_{\mu l}\,\Big|\, \sum_i F^a_{\mu i} X^a_{il}\Big) ,   (208)

with respect to the generative distribution of Y,

P_0(Y) = \int dF^0\, dX^0\; \prod_{\mu,i} P_{F^0}(F^0_{\mu i}) \prod_{i,l} P_{X^0}(X^0_{il}) \prod_{\mu,l} P^0_{\rm out}\Big(y_{\mu l}\,\Big|\, \sum_i F^0_{\mu i} X^0_{il}\Big) ,   (209)

where we assumed, for generality, that the functional forms of P_{F⁰}(F_µi), P_{X⁰}(X_il) and P⁰_out(y_µl | Σ_i F_µi X_il) may differ from those of the assumed model P_F(F_µi), P_X(X_il) and P_out(y_µl | Σ_i F_µi X_il). When they are equal, which corresponds to the Bayes-optimal setting, P₀(Y) = Z(Y) holds.

In performing the integrals over the 2(n + 1) matrices (F⁰, {F^a}_{a=1}^n), (X⁰, {X^a}_{a=1}^n) and Y that come up in evaluating [Zⁿ(Y)]_Y, we insert into the integrand trivial identities with respect to all combinations of replica indices a ≤ b = 0, 1, 2, …, n,

1 = M \int dq^{ab}_F\; \delta\big(F^a \cdot F^b - M q^{ab}_F\big)   (210)

and

1 = NP \int dq^{ab}_X\; \delta\big(X^a \cdot X^b - NP\, q^{ab}_X\big) ,   (211)

where F^a · F^b ≡ Σ_{µ,i} F^a_µi F^b_µi and similarly for X^a · X^b. Let us denote Q_F ≡ (q^{ab}_F) and Q_X ≡ (q^{ab}_X), and introduce the two joint distributions

P_F(\{F^a\}; \mathcal Q_F) = \frac{1}{V_F(\mathcal Q_F)} \prod_{\mu,i} P_{F^0}(F^0_{\mu i}) \Big(\prod_{a=1}^n P_F(F^a_{\mu i})\Big) \prod_{a\le b} \delta\big(F^a\cdot F^b - M q^{ab}_F\big)   (212)

and

P_X(\{X^a\}; \mathcal Q_X) = \frac{1}{V_X(\mathcal Q_X)} \prod_{i,l} P_{X^0}(X^0_{il}) \Big(\prod_{a=1}^n P_X(X^a_{il})\Big) \prod_{a\le b} \delta\big(X^a\cdot X^b - NP\, q^{ab}_X\big) ,   (213)

where V_F(Q_F) and V_X(Q_X) are the normalization constants. These yield the expression

[Z^n(Y)]_Y = \int dY\; d(M\mathcal Q_F)\, d(NP\,\mathcal Q_X)\; V_F(\mathcal Q_F) V_X(\mathcal Q_X) \times \Bigg[\prod_{\mu,l} P^0_{\rm out}\Big(y_{\mu l}\,\Big|\, \sum_i F^0_{\mu i} X^0_{il}\Big) \prod_{a=1}^n P_{\rm out}\Big(y_{\mu l}\,\Big|\, \sum_i F^a_{\mu i} X^a_{il}\Big)\Bigg]_{\mathcal Q_F, \mathcal Q_X} ,   (214)

where [···]_{Q_F,Q_X} denotes the average with respect to (212) and (213). In computing [···]_{Q_F,Q_X} it is noteworthy that {F^a} and {X^a} follow statistically independent distributions, that either of them has zero mean, and that both of them have finite variances by our assumptions. This allows us to handle z^a_µl ≡ Σ_i F^a_µi X^a_il (a = 0, 1, 2, …, n; µ = 1, 2, …, M; l = 1, 2, …, P) as multivariate Gaussian random variables whose distribution is given by

P_Z(\{z^a_{\mu l}\}\,|\,\mathcal Q_F, \mathcal Q_X) = \prod_{\mu,l} \frac{1}{\sqrt{(2\pi)^{n+1}\det T}} \exp\Big(-\frac{1}{2}\sum_{a,b} z^a_{\mu l}\, (T^{-1})_{ab}\, z^b_{\mu l}\Big) ,   (215)

where T = (q^{ab}_F q^{ab}_X) ∈ ℝ^{(n+1)×(n+1)}.
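To make the Gaussian representation (215) concrete, one can fix a small n, build the (n+1) × (n+1) covariance T = (q^{ab}_F q^{ab}_X) for an overlap structure of the replica-symmetric type formalized in (220) below, and check that samples reproduce the prescribed covariance. A minimal numpy sketch; the overlap values are arbitrary illustrations:

```python
import numpy as np

n = 3                              # number of replicas besides the teacher a = 0
Q0, Q, q, m = 1.0, 1.0, 0.4, 0.5   # illustrative products q_F^{ab} * q_X^{ab}

T = np.full((n + 1, n + 1), q)     # overlap between distinct replicas a, b >= 1
T[0, :] = T[:, 0] = m              # overlap of each replica with the teacher
np.fill_diagonal(T, Q)             # self-overlap of the replicas a >= 1
T[0, 0] = Q0                       # self-overlap of the teacher

# for one matrix entry (mu, l), the vector (z^0, ..., z^n) is a zero-mean
# Gaussian with covariance T, eq. (215)
z = np.random.default_rng(2).multivariate_normal(np.zeros(n + 1), T, size=10**5)
print(np.round(z.T @ z / len(z), 2))   # empirical covariance, approximately T
```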
Employing the Gaussian representation (215) and evaluating the integrals over Q_F and Q_X by means of the saddle-point method yields the expression

\frac{1}{N^2}\log [Z^n(Y)]_Y = \mathop{\rm extr}_{\mathcal Q_F, \mathcal Q_X} \big\{ \alpha\pi\, I_{FX}(\mathcal Q_F, \mathcal Q_X) + \alpha\, I_F(\mathcal Q_F) + \pi\, I_X(\mathcal Q_X) \big\} ,   (216)

where

I_{FX}(\mathcal Q_F, \mathcal Q_X) = \log \int \Bigg[\int P_Z(\{z^a\}\,|\,\mathcal Q_F, \mathcal Q_X)\, P^0_{\rm out}(y\,|\,z^0) \Big(\prod_{a=1}^n P_{\rm out}(y\,|\,z^a)\Big) \Big(\prod_{a=0}^n dz^a\Big)\Bigg]\, dy ,   (217)

I_F(\mathcal Q_F) = \frac{1}{NM}\log V_F(\mathcal Q_F) = \mathop{\rm extr}_{\hat{\mathcal Q}_F}\Bigg\{ \frac{1}{2}{\rm Tr}\, \hat{\mathcal Q}_F \mathcal Q_F + \log \int P_{F^0}(F^0) \prod_{a=1}^n P_F(F^a)\, \exp\Big(-\frac{N}{2} F^{\rm T} \hat{\mathcal Q}_F F\Big) \prod_{a=0}^n dF^a \Bigg\} ,   (218)

I_X(\mathcal Q_X) = \frac{1}{NP}\log V_X(\mathcal Q_X) = \mathop{\rm extr}_{\hat{\mathcal Q}_X}\Bigg\{ \frac{1}{2}{\rm Tr}\, \hat{\mathcal Q}_X \mathcal Q_X + \log \int P_{X^0}(X^0) \prod_{a=1}^n P_X(X^a)\, \exp\Big(-\frac{1}{2} X^{\rm T} \hat{\mathcal Q}_X X\Big) \prod_{a=0}^n dX^a \Bigg\} ,   (219)

and extr_Θ{···} denotes the operation of extremization with respect to Θ. The matrices Q̂_F = (q̂^{ab}_F) ∈ ℝ^{(n+1)×(n+1)} are introduced for the saddle-point evaluation of V_F(Q_F), with F = (F^a) ∈ ℝ^{n+1}, and similarly for Q̂_X ∈ ℝ^{(n+1)×(n+1)} and X = (X^a) ∈ ℝ^{n+1}.

2. Replica symmetric free entropy

Literally evaluating (216) yields the correct leading-order estimate of N⁻² log [Zⁿ(Y)]_Y for each n ∈ ℕ. However, we here restrict the candidates for the dominant saddle point to

(q^{ab}_F, q^{ab}_X, \hat q^{ab}_F, \hat q^{ab}_X) = \begin{cases} (Q^0_F, Q^0_X, \hat Q^0_F, \hat Q^0_X), & a = b = 0, \\ (Q_F, Q_X, \hat Q_F, \hat Q_X), & a = b\ (a, b \neq 0), \\ (q_F, q_X, -\hat q_F, -\hat q_X), & a \neq b\ (a, b \neq 0), \\ (m_F, m_X, -\hat m_F, -\hat m_X), & a = 0,\ b \neq 0, \end{cases}   (220)

in order to obtain an analytic expression with respect to n ∈ ℝ. This restricted form is called the replica symmetric ansatz, because the different non-zero replica indices a and b play the same role. It turns out that this ansatz is equivalent to the assumptions made in the cavity method [12]. After some algebra, and after taking the n → 0 limit, this ansatz leads to a replica-symmetric free entropy whose extremization is fully equivalent to the state evolution derived above by the cavity method.

a. Identifiability: For ∆ → 0, the linear stability analysis of eqs. (230-232) around the informative fixed point m_X = ρ, m_F = 1 (i.e. the MMSEs E_F = E_X = 0) leads to an update for the perturbations, δ_F^{t+1} = (δ_F^t ρ + δ_X^t)/(απ) and δ_X^{t+1} = ρ(δ_F^t ρ + δ_X^t)/α. By computing the largest eigenvalue of the corresponding 2 × 2 matrix we obtain that this informative fixed point is stable if and only if

\pi > \pi^*(\alpha, \rho) \equiv \frac{\alpha}{\alpha - \rho} .   (235)

In other words, the zero-MSE fixed point is locally stable above the counting lower bound (24). Further, we notice that in the low-noise limit ∆ → 0, for all positive and finite η, and for m_X = ρ − δ_X, m_F = 1 − δ_F, with δ_X and δ_F being small positive constants of the same order as ∆, the free entropy (234) becomes to leading order −(πα − πρ − α) log(∆ + δ_X + ρδ_F)/2. For π > π*(α, ρ) this is a large positive value, and a large negative value for π < π*(α, ρ). Hence for noiseless measurements, ∆ = 0, the asymptotic Bayes-optimal MMSE is E_X = 0 and E_F = 0 for π > π*(α, ρ), for all η > 0. This is a remarkable result, as it implies that in the Bayes-optimal setting the dictionary is identifiable (or an exact calibration of the matrix possible) as soon as the number of samples P per signal element is larger than the value π*(α, ρ) given by the trivial counting bound (24).

b. Identifiability versus achievability gap: The next question is whether this MMSE is achievable in a computationally tractable way. To answer this we study the state evolution starting from the uninformative initialization.
First we analyze the behavior of the state evolution when m_X = δ_X and m_F = δ_F, where both δ_X and δ_F are positive and small, while we also consider η to be very large. The linear expansion of the state evolution update then leads to δ_X^{t+1} = ρ²αδ_F^t/(∆ + ρ) and δ_F^{t+1} = 1/η + πδ_X^t/(∆ + ρ). Hence for η → ∞ the uninformative initialization is in fact a stable fixed point of the state evolution equations as long as

\pi \le \pi_F \equiv \frac{(\Delta + \rho)^2}{\alpha\rho^2} .   (236)

This means that for π*(ρ, α) < π < π_F(∆, ρ, α) the MMSE is not achievable in dictionary learning (e.g. when α < 1, π*(0, α) < π_F(0, 0, α)) with the approximate message passing presented in this paper. This simple analysis leads us to the conclusion that a first-order phase transition is in play in the dictionary learning problem, and, as we will see, also in blind calibration (η < ∞) and sparse PCA (α > 1).

As a side remark, let us recall that the limit η → 0 should lead to results known from Bayesian compressed sensing. In particular, in compressed sensing with low noise the matrix X is identifiable if and only if α > ρ. To reconcile this with the previous results, notice that indeed for η = c∆ → 0 with c = O(1) the leading term of the free entropy becomes proportional to π(α − ρ) log ∆. Hence the compressed sensing result is recovered. For 1 ≫ η ≫ ∆, on the other hand, it is the dictionary learning static phase transition π* = α/(α − ρ) that is the relevant one.

FIG. 2. Left: Blind matrix calibration: the predicted MSE E_F (for the matrix estimation) and E_X (for the signal estimation) corresponding to ρ = 0.2, α = 0.5, ∆ = 0, for three values of η. The MMSE is in full lines, the AMP-MSE in dashed lines. The MMSE jumps abruptly from a finite value to zero at the phase transition point π* (235). However, the AMP-MSE matches the MMSE only when the sample complexity is larger than the spinodal transition, which takes place at a larger value π_s(η) > π*. The AMP-MSE is zero for π > π_s(η). Right: The value of the spinodal transition at which the AMP-MSE has a discontinuity, as a function of η, for α = 0.5, ∆ = 0 and two different values of the sparsity ρ. Arrows on the left mark the static transition π*, and we see that lim_{η→0} π_s(η) = π*. In the limit of dictionary learning, η → ∞, the spinodal transition converges to a finite value. Interestingly, for small values of the density, e.g. ρ = 0.05, we see a sharp phase transition in the threshold π_s(η), in this case at η ≈ 68.

c. Phase diagrams for blind calibration and dictionary learning: Due to the close link between the two problems, we shall describe the results for dictionary learning together with the case of blind matrix calibration. In both these cases we are typically trying to learn (calibrate) an overcomplete dictionary, α < 1, and a sparse signal X from as few samples P as possible. We hence first plot in Fig. 2 (left) the MSE for F (in red) and for X (in blue) as a function of π = P/N at a fixed (representative) value of the undersampling ratio α = 0.5 and density ρ = 0.2, in zero (or negligible) measurement noise, ∆ = 0. We consider several values of the matrix uncertainty η (the larger the value, the less we know about the matrix).
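For orientation, the two thresholds (235) and (236) delimiting the identifiability/achievability gap are fully explicit, so tabulating them takes a few lines; a minimal sketch (parameter values are illustrative):

```python
def pi_star(alpha, rho):
    """Static (identifiability) threshold, eq. (235)."""
    return alpha / (alpha - rho)

def pi_F(alpha, rho, Delta=0.0):
    """Threshold below which the uninformative fixed point is stable, eq. (236)."""
    return (Delta + rho)**2 / (alpha * rho**2)

alpha = 0.5
for rho in (0.05, 0.1, 0.2):
    print(f"rho = {rho:.2f}:  pi* = {pi_star(alpha, rho):.3f},"
          f"  pi_F = {pi_F(alpha, rho):.3f}")
```

Note that at ∆ = 0 the threshold (236) reduces to π_F = 1/α independently of ρ, consistent with the low-ρ limit π_s → 2 at α = 0.5 mentioned in the caption of Fig. 3 below.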
The AMP-MSE achieved from the uninformative initialization is depicted in dashed lines; the MMSE, achieved in this zero-noise case from the planted (informative) initialization, is in full lines.

Fig. 2 (right) shows the value of the spinodal transition π_s as a function of the matrix uncertainty η. We see that, as expected, lim_{η→0} π_s(η) = π*. A result that is less intuitive is that the large-η limit is also well defined and finite, lim_{η→∞} π_s(η) = π_s^DL < ∞. This means that even in dictionary learning, where no prior information about the matrix elements is available, the dictionary is identifiable with AMP for large system sizes above the spinodal transition π_s^DL.

In Fig. 2 (right) we also see an interesting behavior of the function π_s(η) for low values of ρ: there is a sharp phase transition between a low-η regime, where the AMP-MSE at the transition has a weak discontinuity towards a relatively low value of MSE, and a high-η regime, where the discontinuity is very abrupt towards a value of MSE that is close to the completely uninformative value.

FIG. 3. Left: The AMP-MSE of the signal matrix X is plotted against π and η for blind matrix calibration at ρ = 0.05, α = 0.5 and ∆ = 0. The transition as a function of the number of samples π for fixed matrix uncertainty η is always discontinuous in this figure; for larger values of η this discontinuity is, however, much more pronounced. Right: The phase diagram of dictionary learning and blind matrix calibration for α = 0.5 and ∆ = 0. In both cases, the matrix F is identifiable by the Bayes-optimal inference above the full red line π* = α/(α − ρ). However, the system undergoes the spinodal transition π_s(η), shown here with a dashed line for dictionary learning (η = ∞) and for blind calibration (η finite). Below the spinodal transition, the MMSE is not achieved with AMP. Notice that for η → 0 the spinodal line converges to π* for ρ < ρ_BP^CS. For all values of η the spinodal line diverges as ρ → ρ_BP^CS. The threshold value ρ_BP^CS = 0.317 for α = 0.5 is the (spinodal) phase transition of pure compressed sensing (when the matrix F is fully known), from [63]. Notice also that in the limit of learning a dictionary with a very small sparsity, η → ∞ and ρ → 0, the spinodal line converges to π_s = 2, whereas the information-theoretic transition is at π = 1. Such a gap in the low-ρ behavior is striking.

FIG. 4. Dictionary learning with noisy measurements. Left: The MMSE (dashed lines) and the AMP-MSE (full lines) for the signal matrix X as a function of the number of samples P = πN for α = 0.5, ρ = 0.2 and various values of the measurement noise ∆. The behavior is qualitatively the same as in the noiseless case; the only difference is that the MMSE at π > π*(∆) and the AMP-MSE at π > π_s(∆) are no longer strictly zero but rather O(∆). If the additive noise ∆ is large enough, however, the phase transition disappears and the MMSE = AMP-MSE is continuous (this is the case in this plot for ∆ = 10⁻²). Right: A surface plot of the MMSE of the signal, E_X, in the (∆, π)-plane in the case α = 0.5 and ρ = 0.2. Notice how the sharp transition disappears at large noise, where it is replaced by a smooth evolution of the MMSE.
This behavior is also illustrated in Fig. 3 (left), where we plot the AMP-MSE as a function of π and η (for fixed ρ = 0.05, α = 0.5, ∆ = 0).

Fig. 3 (right) depicts the phase diagram of dictionary learning (η → ∞) and blind matrix calibration (finite η): we plot in full red the static threshold π* (η-independent), above which the matrix F is identifiable, and in dashed lines the spinodal threshold π_s (for various values of η), above which AMP asymptotically identifies the original matrix F. Notice that π_s(ρ) diverges as ρ → ρ_BP^CS, where ρ_BP^CS is the AMP phase transition in compressed sensing, see e.g. [63]. This is expected, as for ρ > ρ_BP^CS the sparse signal cannot be recovered with AMP even if the matrix F is fully known.

From now on (also in the following subsections) we will discuss only the case η → ∞, when no prior information about the dictionary F is available. In Fig. 4 (left) we illustrate the results for dictionary learning with non-zero measurement noise. The situation is qualitatively similar to what happens in noisy compressed sensing [63]. The first-order phase transition becomes weaker as the noise grows, until some value ∆*, above which there is no phase transition anymore and AMP-MSE = MMSE for the whole range of π. In Fig. 4 (right) we then plot the AMP-MSE and the MMSE of the signal X as a function of the noise variance ∆ and π for α = 0.5 and ρ = 0.2. This surface plot demonstrates how the sharp transition disappears and is replaced by a continuous evolution of the MSE when the noise is large. This MMSE is compared to the AMP-MSE for the same case in Fig. 5.

FIG. 5. Comparison of the MMSE (left) to the AMP-MSE (right) for noisy dictionary learning with α = 0.5 and ρ = 0.2; the MSE of the signal matrix X is plotted on the (∆, π)-plane. The color scale is in decadic logarithm of the MSEs. We clearly see the region where MMSE < AMP-MSE and the one where the two are equal.

FIG. 6. Phase diagram for sparse PCA. Left: The spinodal transition π_s (upper red line, where the AMP-MSE goes to zero) and the optimal Bayes inference transition (lower blue line, π* = α* = 1 + ρ, where the MMSE goes to zero) in zero measurement noise and with α = π. Right: AMP-MSE and MMSE of the sparse matrix X in sparse PCA for ρ = 0.5, ∆ = 10⁻¹⁰ and several values of the ratio α, as a function of π.

d. Phase diagram for sparse PCA: Sparse PCA, as we set it up in Section I D, is closely related to dictionary learning, except that of the three sizes of matrices M, N and P the smallest one is N, corresponding to the matrix Y being of relatively low rank. Hence for sparse PCA we should really only consider α > 1, π > 1. The behavior of the state evolution is, for this range of parameters, qualitatively very similar to the one we just observed in the previous section. In Fig. 6 (left) we treat the case of M = P, i.e. π = α, with zero measurement noise, and compute the smallest value for which the matrices F and the ρ-sparse X are recoverable for a given ρ. Just as in dictionary learning, we obtain that the Bayes-optimal MMSE is zero everywhere above the counting bound π = α > 1 + ρ, the blue line in Fig. 6 (left). The AMP-MSE is, however, zero only above the spinodal line, depicted in red in the figure.
The gap between the two lines is not very large in this case. The right part of Fig. 6 shows the AMP-MSE and the MMSE of the signal matrix X for measurement noise variance ∆ = 10⁻¹⁰ and density ρ = 0.5.

FIG. 7. Blind source separation. At measurement noise ∆ = 0 and π = 10 we plot the spinodal ρ_s as a function of α. Above this line AMP-MSE > 0, whereas below this line AMP-MSE = 0. The static transition line is ρ* = 0.9α.

e. Phase diagram for blind source separation: In blind source separation, P corresponds to the length of the signal, N is the number of sources, and M the number of sensors. Typically the signal is very long, i.e. one has P ≫ N and P ≫ M. The particularly interesting case is when there are more sources than sensors, α < 1; in that case the signal can be reconstructed only if the density of the signal ρ is smaller than a certain value. The counting bound gives us ρ < α(π − 1)/π, and this also corresponds to the value under which the MMSE drops to zero at zero measurement noise. As in the previous cases, we also observe here a first-order phase transition, and with AMP we can reach zero error (in the noiseless case) only below ρ_s, which we depict for π = 10 as a function of α in Fig. 7.

B. Low rank matrix completion

In the remaining examples we treat cases in which neither F nor X is sparse, ρ = 1; we start with low-rank matrix completion. In matrix completion the output function g_out is eq. (228) for the known matrix elements µl (there are εMP of them), and g_out(ω, y, V) = 0 for the unknown elements µl (there are (1 − ε)MP of them). The input function f_X (226) for the non-sparse matrix X, ρ = 1, becomes

f_X(\Sigma, T) = \frac{\bar X \Sigma + T\sigma}{\Sigma + \sigma} .   (237)

In the state evolution, using the Nishimori identities, we then have

\hat m^t = \frac{\epsilon}{\Delta + \bar X^2 + \sigma - m_F^t m_X^t} ,   (238)

where ε is the fraction of known elements of Y. Moreover, for the input function (237) the state evolution equation for m_X becomes

m_X^{t+1} = \frac{\bar X^2 + (\bar X^2 + \sigma)\,\sigma\alpha m_F^t \hat m^t}{1 + \sigma\alpha m_F^t \hat m^t} .   (239)

The equation for m_F is the one of (231). It is instrumental to analyze the local stability of the informative and the uninformative initializations for low-rank matrix completion. For the informative initialization we consider m_X^t = 1 − δ_X^t and m_F^t = 1 − δ_F^t, where δ_X^t and δ_F^t are small positive numbers. The state evolution update equations at zero noise ∆ = 0 lead to δ_X^{t+1} = (δ_X^t + δ_F^t)/(εα) and δ_F^{t+1} = (δ_X^t + δ_F^t)/(επ). The largest eigenvalue of this 2 × 2 system is (α + π)/(εαπ), and hence the informative fixed point is stable for ε > ε* = (α + π)/(απ), which coincides with the counting bound eq. (28). This means that for ε > ε* in noiseless matrix completion the matrices F and X can be recovered without error (asymptotically).

For the uninformative initialization we consider m_X^t = δ_X^t and m_F^t = δ_F^t, where δ_X^t and δ_F^t are again small positive numbers. This time the state evolution equations give δ_F^{t+1} = επδ_X^t/(1 + ∆) and δ_X^{t+1} = εαδ_F^t/(1 + ∆). Hence the uninformative fixed point is stable for ε < (∆ + 1)/√(πα). This can indeed be verified in Fig. 8.
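Because the state evolution of matrix completion closes on the two scalar overlaps m_X and m_F, it can be iterated in a few lines. Below is a minimal sketch of eqs. (238-239) for X̄ = 0 and σ = 1, where we additionally assume that the m_F update takes the same form as the m_X one (this is the form (231) takes for a unit-variance Gaussian prior on F; eq. (231) itself is not reproduced here). The sketch reproduces the second-order transition at ε* = (α + π)/(απ):

```python
def se_matrix_completion(eps, alpha, pi, Delta=0.0, T=2000, init=0.999):
    """State evolution (238)-(239) for low-rank matrix completion with
    Xbar = 0, sigma = 1; the m_F update is assumed analogous to (239)."""
    m_X = m_F = init                 # informative initialization -> MMSE branch
    for _ in range(T):
        V = max(Delta + 1.0 - m_F * m_X, 1e-12)   # guard against exact zero
        m_hat = eps / V                           # eq. (238)
        a = alpha * m_F * m_hat
        m_X = a / (1.0 + a)                       # eq. (239) with Xbar=0, sigma=1
        b = pi * m_X * m_hat
        m_F = b / (1.0 + b)                       # assumed form of eq. (231)
    return 1.0 - m_X                 # MMSE of X (the prior variance is 1)

alpha = pi = 3.0
eps_star = (alpha + pi) / (alpha * pi)            # counting bound, eq. (28)
for eps in (0.4, 0.5, 0.6, eps_star, 0.8):
    print(f"eps = {eps:.3f}   MMSE_X = {se_matrix_completion(eps, alpha, pi):.4f}")
```

One can check that the printed MMSE vanishes continuously as ε → ε*, in agreement with the second-order transition of Fig. 8.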
FIG. 8. Low-rank matrix completion. The MMSE of the signal matrix X is plotted for α = π (for four different values of this parameter) in the noiseless (full lines) and noisy (dashed lines, with ∆ = 10⁻²) cases. In zero noise there is a second-order phase transition which coincides with the counting bound, eq. (28), marked by short vertical lines. With non-zero noise there is no phase transition.

In matrix completion we treat matrices Y of low rank, hence N is much smaller than both P and M. The main question concerns the fraction ε of elements that need to be known in order for the recovery of X and F to be possible. In this case we did not identify a first-order phase transition; as a result we have MMSE = AMP-MSE. At zero measurement noise we observed a phase transition from a phase of perfect recovery to a phase with positive MMSE; its position coincides with the counting bound (28), ε* = (α + π)/(απ). With non-zero measurement noise, in the scaling of noise and rank we consider in this paper, the behavior of the MMSE as a function of the other parameters is smooth and differentiable (no phase transition). In Fig. 8 we plot an example of the MMSE as a function of the fraction ε of known elements for a square matrix Y, i.e. α = π. We generated the signal elements with zero mean and unit variance, X̄ = 0, σ = 1. Our analysis suggests that, compared to the cases with non-zero sparsity, low-rank matrix completion is a much easier problem, at least in the random setting considered in the present paper. The fact that the "counting" threshold can be saturated, or nearly saturated, in the noiseless case by several algorithms can be seen e.g. in the data presented in [18].

C. Robust PCA

The input functions are eq. (226) for the matrix F and eq. (237) for the matrix X. In robust PCA as defined by (30) we get for the output function

g_{\rm out}(\omega, y, V) = \frac{y - \omega}{V}\left[1 - \frac{\epsilon\Delta_s}{\Delta_s + V} - \frac{(1-\epsilon)\Delta_l}{\Delta_l + V}\right] .   (240)

The state evolution in the Bayes-optimal setting, i.e. using the Nishimori identities, becomes

\hat m^t = \frac{1}{\bar X^2 + \sigma - m_F^t m_X^t}\left[1 - \frac{\epsilon\Delta_s}{\Delta_s + \bar X^2 + \sigma - m_F^t m_X^t} - \frac{(1-\epsilon)\Delta_l}{\Delta_l + \bar X^2 + \sigma - m_F^t m_X^t}\right] .   (241)

The equation for m_F^t is (231), and that for m_X^t is (239). In robust PCA the informative initialization is again m_X^t = 1 − δ_X^t and m_F^t = 1 − δ_F^t, where δ_X^t and δ_F^t are small positive numbers. For small noise ∆_s and ∆_l = O(1) the corresponding fixed point is stable under the same conditions as for low-rank matrix completion, i.e. for ε > ε* = (α + π)/(απ).

The uninformative fixed point is m_X^t = δ_X^t and m_F^t = δ_F^t, where δ_X^t and δ_F^t are again small positive numbers. This evolves as δ_X^{t+1} = αδ_F^t m̂ and δ_F^{t+1} = πδ_X^t m̂, with m̂ = [1 + ∆_s + ε(∆_l − ∆_s)]/[(1 + ∆_s)(1 + ∆_l)] in this limit. Hence, for instance, for ∆_s → 0, ∆_l = 1 and π = α we have that the uninformative fixed point is stable for ε < 2/α − 1, which again corresponds to what we observe in Fig. 9.
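The same scalar iteration applies to robust PCA; here is a minimal sketch combining the output update (241) with the m_X update (239) and the analogous m_F update assumed above, again for X̄ = 0 and σ = 1:

```python
def se_robust_pca(eps, alpha, pi, D_s=0.0, D_l=1.0, T=2000, init=0.999):
    """State evolution for robust PCA with Xbar = 0, sigma = 1, eq. (241);
    D_s and D_l are the small and large distortion variances."""
    m_X = m_F = init
    for _ in range(T):
        V = max(1.0 - m_F * m_X, 1e-12)           # Xbar^2 + sigma - m_F m_X
        m_hat = (1.0 - eps * D_s / (D_s + V)
                 - (1.0 - eps) * D_l / (D_l + V)) / V   # eq. (241)
        a = alpha * m_F * m_hat
        m_X = a / (1.0 + a)                       # eq. (239), Xbar=0, sigma=1
        b = pi * m_X * m_hat
        m_F = b / (1.0 + b)                       # assumed form of eq. (231)
    return 1.0 - m_X

# noiseless case D_s = 0: second-order transition at eps* = (alpha+pi)/(alpha*pi)
for eps in (0.4, 0.5, 2.0 / 3.0, 0.8):
    print(f"eps = {eps:.3f}   MMSE_X = {se_robust_pca(eps, alpha=3.0, pi=3.0):.4f}")
```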
FIG. 9. MMSE of the matrix X in robust PCA with α = π in the noiseless (full lines, ∆_s = 0, ∆_l = 1) and noisy (dashed lines, with ∆_s = 10⁻², ∆_l = 1) cases. The situation is not very different from low-rank matrix completion. Indeed, in the noiseless case there is a second-order phase transition and the MMSE is zero beyond the counting bound, marked by short vertical lines. In the presence of noise there is no phase transition.

In the example of Fig. 9 we plot the MMSE as a function of the fraction ε of undistorted elements in the case of a square matrix Y, α = π, with the variance of the large distortions ∆_l = 1 and two different values of the small measurement noise ∆_s. We see a second-order phase transition at the counting bound for ∆_s = 0 and a smooth decay of the MMSE for ∆_s > 0.

It is interesting to compare how well robust PCA can be solved with respect to matrix completion. In both cases ε is the fraction of known elements. The difference is that in matrix completion their position is known, whereas in robust PCA it is not. Intuitively, robust PCA should thus be a much harder problem. This is not confirmed by our analysis, which instead suggests that robust PCA is as easy as matrix completion, since the zero-noise phase transitions of the two coincide. Moreover, whereas at ε → 0 there is no information left in matrix completion (that is why the MMSE = 1), in robust PCA the largely distorted elements can still be exploited and the MMSE < 1. Note, however, that algorithmically it seems less easy to saturate this theoretical asymptotic performance in robust PCA, see e.g. Figure 7 in [18].

D. Factor analysis

In factor analysis, the input functions f_F and f_X are the same as in dictionary learning, (226) at ρ = 1. The output function (56) for factor analysis (31) is given by

g_{\rm out}(\omega_{\mu l}, y_{\mu l}, V_{\mu l}) = \frac{y_{\mu l} - \omega_{\mu l}}{\psi_\mu + V_{\mu l}} ,   (242)

where ψ_µ is the variance of the µ-th component of the unique factor. The variance of the unique factor, ψ_µ, depends here on the index µ and not on the index l, which leads to a slight modification in the derivation of the state evolution from Section V A. For simplicity, we assume that the ψ_µ's are known; in practice, they should be estimated by an expectation-maximization scheme in conjunction with GAMP. Then we obtain, on the Nishimori line,

\hat m^t_\mu = \frac{1}{\psi_\mu + Q^0_{F,\mu}(\bar X^2 + \sigma) - m^t_{F,\mu} m^t_X} ,   (243)

m^{t+1}_{F,\mu} = \frac{\pi Q^0_{F,\mu} m^t_X \hat m^t_\mu}{1 + \pi Q^0_{F,\mu} m^t_X \hat m^t_\mu} ,   (244)

m^{t+1}_X = \frac{\alpha\sigma^2 \langle m^t_{F,\mu} \hat m^t_\mu \rangle_\mu}{1 + \alpha\sigma \langle m^t_{F,\mu} \hat m^t_\mu \rangle_\mu} ,   (245)

where ⟨·⟩_µ means the average over ψ_µ, and the variance of the elements of the matrix F is denoted Q⁰_{F,µ}. To solve (243-245) analytically, one has to specify the distributions of ψ_µ and Q⁰_{F,µ}. We set Q⁰_{F,µ} = Q⁰_F = 1 for all µ, and suppose a two-peak distribution for ψ,

P_\psi(\psi) = \epsilon\, \delta(\psi - \psi_1) + (1-\epsilon)\, \delta(\psi - \psi_2) .   (246)

Let us also assume X̄ = 0 and σ = 1. In this case the state evolution can be summarized as

m_{F1} = \frac{\pi m_X \hat m^t_1}{1 + \pi m_X \hat m^t_1} , \qquad m_{F2} = \frac{\pi m_X \hat m^t_2}{1 + \pi m_X \hat m^t_2} ,   (247)

m_X = \frac{\alpha\{\epsilon\, m_{F1}\hat m^t_1 + (1-\epsilon)\, m_{F2}\hat m^t_2\}}{1 + \alpha\{\epsilon\, m_{F1}\hat m^t_1 + (1-\epsilon)\, m_{F2}\hat m^t_2\}} ,   (248)

where

\hat m^t_1 = \frac{1}{\psi_1 + 1 - m_{F1} m_X} , \qquad \hat m^t_2 = \frac{1}{\psi_2 + 1 - m_{F2} m_X} .   (249)

The total MMSE is given by E_F = 1 − (εm_{F1} + (1 − ε)m_{F2}) and E_X = 1 − m_X. Fig. 10 (left) shows the π-dependence of the MMSE at α = 2, ψ₁ = 0.5 and ψ₂ = 2, for ε = 0, 0.5, 0.9 and 1.

FIG. 10. Left: MMSE in factor analysis with ψ₁ = 0.5, ψ₂ = 2 and α = 2. Right: MMSE of the signal matrix X in factor analysis at four different values of α = π, at ψ₁ = 10⁻² and ψ₂ = 1 (dashed lines), and ψ₁ = 10⁻⁵ and ψ₂ = 1 (solid lines).
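Since eqs. (247-249) are fully explicit, the two-peak factor-analysis state evolution is again a few-line iteration. A minimal sketch (parameter values are illustrative); the small initialization plays the role of the infinitesimal prior information discussed after eq. (193) and probes the stability of the uninformative fixed point analyzed next:

```python
def se_factor_analysis(eps, alpha, pi, psi1, psi2, T=5000, init=1e-3):
    """Two-peak factor-analysis state evolution, eqs. (247)-(249)."""
    m_X = m_F1 = m_F2 = init
    for _ in range(T):
        m_hat1 = 1.0 / (psi1 + 1.0 - m_F1 * m_X)          # eq. (249)
        m_hat2 = 1.0 / (psi2 + 1.0 - m_F2 * m_X)
        a1, a2 = pi * m_X * m_hat1, pi * m_X * m_hat2
        m_F1, m_F2 = a1 / (1.0 + a1), a2 / (1.0 + a2)     # eq. (247)
        g = eps * m_F1 * m_hat1 + (1.0 - eps) * m_F2 * m_hat2
        m_X = alpha * g / (1.0 + alpha * g)               # eq. (248)
    E_F = 1.0 - (eps * m_F1 + (1.0 - eps) * m_F2)
    return E_F, 1.0 - m_X                                 # (E_F, E_X)

# the setting of Fig. 10 (left): alpha = 2, psi1 = 0.5, psi2 = 2
for pi in (1.0, 2.0, 4.0, 8.0):
    E_F, E_X = se_factor_analysis(eps=0.5, alpha=2.0, pi=pi, psi1=0.5, psi2=2.0)
    print(f"pi = {pi:3.0f}   E_F = {E_F:.4f}   E_X = {E_X:.4f}")
```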
We analyze again the stability of the uninformative fixed point, (m_X, m_{F1}, m_{F2}) = (0, 0, 0), of the state evolution. Small positive numbers δ_X, δ_{F1} and δ_{F2} that give the uninformative initialization m_X = δ_X, m_{F1} = δ_{F1} and m_{F2} = δ_{F2} evolve under the state evolution as

\delta^{t+1}_{F1} = \frac{\pi\delta^t_X}{\psi_1 + 1} , \qquad \delta^{t+1}_{F2} = \frac{\pi\delta^t_X}{\psi_2 + 1} ,   (250)

\delta^{t+1}_X = \alpha\pi\left[\frac{\epsilon}{(\psi_1 + 1)^2} + \frac{1-\epsilon}{(\psi_2 + 1)^2}\right]\delta^t_X .   (251)

These expressions indicate that the uninformative fixed point becomes unstable when απ[ε/(ψ₁ + 1)² + (1 − ε)/(ψ₂ + 1)²] > 1. The critical value of π given by this condition coincides with the transition point where the MMSE departs from 1, shown in Fig. 10 (left).

As an example of the ε-dependence, we show the MMSE of X at ψ₁ = 10⁻² or ψ₁ = 10⁻⁵, with ψ₂ = 1, for four different values of α = π in Fig. 10 (right). Consistently with our analysis, in these cases the uninformative initialization is always unstable. The transition associated with the stability of the informative fixed point occurs only when at least one of the ψ's tends to zero. For instance, when ψ₁ = 0, the informative initialization corresponds to δ_{F1} = 1 − m_{F1} and δ_X = 1 − m_X, which evolve as

\delta^{t+1}_{F1} = \frac{\delta^t_X + \delta^t_{F1}}{\pi} , \qquad \delta^{t+1}_X = \frac{\delta^t_{F1} + \delta^t_X}{\epsilon\alpha} ,   (252)

without depending on ψ₂. These expressions mean that when ε > ε* = π/[α(π − 1)] the informative fixed point is stable, consistently with Fig. 10 (right). The state evolution of factor analysis in the two-peak case is qualitatively similar to that of robust PCA and low-rank matrix completion, but the values of the phase transition points differ.

VII. DISCUSSION AND CONCLUSIONS

In this paper we have analyzed various examples of the matrix factorization problem. We obtain a matrix Y that is an element-wise noisy measurement of an unknown matrix Z = FX, where both Y and Z are M × P matrices, F is an M × N matrix, and X is an N × P matrix. We have considered the computational tractability of this problem in the large size limit N → ∞ with π = P/N = O(1) and α = M/N = O(1). Our analysis concerns the teacher-student scenario where X and F are generated with random independent elements from some known probability distributions, and we employ the Bayes-optimal inference scheme to recover F and X from Y. Let us summarize our contribution:

We derived the approximate message passing algorithm for matrix factorization. One version of the algorithm (for calibration and dictionary learning) was reported in [15], and a very related algorithm called BiG-AMP was discussed by [17, 18]. This algorithm is derived from belief propagation. We have presented the AMP for matrix factorization in several forms in Sections III A and III B. We have also discussed the simplifications that arise in the Bayes-optimal setting, when we can use the Nishimori identities (Section III C), or when the matrix is large and one uses the self-averaging of some of the quantities appearing in GAMP (Section III D). We focused on the theoretical properties of the AMP algorithm; for a robust practical implementation we refer the reader to the works [17, 18], which include a very complete report on its performance on a range of benchmarks.

Next to the AMP algorithm we have also derived the corresponding Bethe free entropy in Section IV. The Bethe free entropy evaluated at a fixed point of the GAMP equations approximates the log-likelihood of the corresponding problem.
We mainly use it in situations where we have more than one fixed point of GAMP; it is then the one with the largest value of the Bethe free entropy that asymptotically gives the MMSE of the Bayes-optimal inference. We also derived a variational Bethe free entropy in Section IV B. This is a useful quantity that can serve to control the convergence of the AMP approach. Alternatively, a direct maximization of this expression is a promising algorithm in itself (see [71] for an investigation of this idea for compressed sensing).

The AMP algorithm for matrix factorization is amenable to asymptotic analysis via the state evolution technique, which was carried out rigorously for approximate message passing in compressed sensing [25]. We derive the state evolution analysis for matrix factorization using tools of statistical mechanics. In particular, we use two approaches leading to equivalent results: the cavity method (Section V A) and the replica method (Section V B). Our derivation of the state evolution is not rigorous, but we conjecture that it is nevertheless asymptotically exact, as is the case in many other systems of this type, including compressed sensing. The main result of this state evolution are simple iterative equations that provide a way to compute the MMSE of the Bayes-optimal inference, as well as the MSE reached theoretically in the large size limit by the AMP algorithm. A rigorous proof of the formulas derived in this paper is obviously an important topic for future work.

The main results of this paper concern the analysis of the MMSE and the AMP-MSE for various interesting examples of the matrix factorization problem. We analyze the asymptotic phase diagrams for dictionary learning, blind matrix calibration, sparse PCA, blind source separation, matrix completion, robust PCA and factor analysis. Earlier results of this analysis appeared in [14, 15]. We find that when one of the matrices F or X is sparse, the problems undergo a first-order phase transition, which is related to an interesting algorithmic barrier known for instance from compressed sensing [63]. It is a generic observation that for most of the problems we analyzed the theoretically achievable performance is much better than the one achievable by existing algorithms. The AMP algorithm should be able to match this performance for very large systems, which is the most exciting perspective for the further development of this work. If successful, it could lead to an algorithmic revolution in various applications of matrix factorization.

In this paper we concentrate on the theoretical analysis and not on the performance of the algorithm itself. Some studies of the performance of some versions of the algorithm can be found in [15, 18]. We, however, observed that the performance depends strongly on the implementation details, and we have not yet found a way to match the theoretically predicted performance for systems of treatable (practical) size in all cases. It is worth discussing some of these algorithmic issues in this conclusion. One of the main problems is that the GAMP algorithm with parallel update presents instabilities that drive its evolution away from the so-called Nishimori line; see a recent study of this issue in the compressed sensing problem [74]. This can be seen even in the state evolution, when we do not assume explicitly that the result of the Bayes-optimal inference corresponds to a fixed point that belongs to the Nishimori line.
There are ways to avoid these issues: e.g. we observed that the difficulties basically disappear when the sequential update of the message passing algorithm from Section III A is used instead of the parallel one. This, however, does not scale very well with the system size, and our results were hence spoiled by very strong finite-size effects. When learning the M × N matrix F and the N × P matrix X we also often observed that a (very small) number of the P signals X_l were not correctly reconstructed, and these "rogue" vectors in X_l were polluting the reconstruction of F. An example of these finite-size effects can be observed in [15]. In the work of [17, 18] a part of the problems with the convergence of the corresponding algorithm was mitigated by adaptive damping (though maybe with not the most suitable cost function, see Sec. IV) and expectation-maximization learning. Why this is helpful is theoretically explained in the recent work [74] for compressed sensing. However, the implementation of [18] does not match the theoretical performance predicted in this paper either (we have explicitly tried for the dictionary learning and robust PCA examples). This shows that more work is needed in order to reach a practical algorithm able to achieve the prediction at moderate sizes. A more proper understanding of these problems, and further developments of the algorithm, are therefore the main directions of our future work.

TABLE I. Glossary of notations, with the number of the equation where each quantity was defined or first used.

variable nodes               definition  X                   definition  F
variables for elements       (1)    X_il                     (1)    F_µi
true elements                (7)    X⁰_il                    (9)    F⁰_µi
prior distribution           (4)    P_X(X_il)                (3)    P_F(F_µi)
incoming BP message          (43)   m̃_µl→il(X_il)           (44)   ñ_µl→µi(F_µi)
outgoing BP message          (41)   m_il→µl(X_il)            (42)   n_µi→µl(F_µi)
approximate-BP mean          (45)   x̂_il→µl                 (47)   f̂_µi→µl
approximate-BP variance      (46)   c_il→µl                  (48)   s_µi→µl
incoming-BP mean             (59)   B_µl→il                  (62)   R_µl→µi
incoming-BP variance         (60)   A_µl→il                  (63)   S_µl→µi
mean-input function          (66)   f_X                      (67)   f_F
variance-input function      (66)   f_c                      (67)   f_s
incoming mean                (88)   T_il                     (89)   W_µi
incoming variance            (88)   Σ_il                     (89)   Z_µi
GAMP estimate of mean        (102)  x̂_il                    (103)  f̂_µi
GAMP estimate of variance    (102)  c_il                     (103)  s_µi
SE magnetization             (160)  m_X                      (160)  m_F
SE overlap                   (161)  q_X                      (161)  q_F
SE variance                  (162)  Q_X − q_X                (162)  Q_F − q_F

factor nodes                 definition  output
matrix to factorize          (1)    z_µl
output                       (2)    y_µl
output probability           (2)    P_out(y_µl | z_µl)
output realization function  (10)   h(z, w)
cavity variance of z_µl      (53)   V_µil
cavity estimation of z_µl    (54)   ω_µil
output function              (56)   g_out(ω, y, V)
variance of z_µl             (90)   V_µl
estimation of z_µl           (91)   ω_µl
SE output q̂                 (169)  q̂
SE output χ̂                 (168)  χ̂
SE output m̂                 (177)  m̂

ACKNOWLEDGMENTS

The research leading to these results has received funding from the European Research Council under the European Union's 7th Framework Programme (FP/2007-2013)/ERC Grant Agreement 307087-SPARCS. Financial support by the JSPS Core-to-Core Program "Non-equilibrium dynamics of soft matter and information" and JSPS/MEXT KAKENHI Nos. 23-4665 and 25120013 is also acknowledged.

[1] Bengio Y., Courville A. & Vincent P. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35, 1798–1828 (2013).
[2] LeCun Y., Bengio Y. & Hinton G. Deep learning. Nature 521, 436–444 (2015).
[3] Olshausen B. A. et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
[4] Olshausen B. A. & Field D. J. Sparse coding with an overcomplete basis set: A strategy employed by V1. Vision Research 37, 3311 (1997).
[5] Kreutz-Delgado K. et al. Dictionary learning algorithms for sparse representation. Neural Computation 15, 349–396 (2003).
[6] Zou H., Hastie T. & Tibshirani R. Sparse principal component analysis. Journal of Computational and Graphical Statistics 15, 265–286 (2006).
[7] Belouchrani A., Abed-Meraim K., Cardoso J.-F. & Moulines E. A blind source separation technique using second-order statistics. Signal Processing, IEEE Transactions on 45, 434–444 (1997).
[8] Candès E. J. & Recht B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9, 717–772 (2009).
[9] Candès E. J. & Tao T. The power of convex relaxation: Near-optimal matrix completion. Information Theory, IEEE Transactions on 56, 2053–2080 (2010).
[10] Candès E. J., Li X., Ma Y. & Wright J. Robust principal component analysis? Journal of the ACM (JACM) 58, 11 (2011).
[11] Donoho D. L., Maleki A. & Montanari A. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. 106, 18914–18919 (2009).
[12] Mézard M., Parisi G. & Virasoro M. A. Spin-Glass Theory and Beyond, vol. 9 (World Scientific, Singapore, 1987).
[13] Mézard M. & Montanari A. Information, Physics, and Computation (Oxford Press, Oxford, 2009).
[14] Sakata A. & Kabashima Y. Sample complexity of Bayesian optimal dictionary learning. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, 669–673 (2013).
[15] Krzakala F., Mézard M. & Zdeborová L. Phase diagram and approximate message passing for blind calibration and dictionary learning. In IEEE International Symposium on Information Theory Proceedings (ISIT), 659–663 (2013).
[16] Schniter P., Parker J. & Cevher V. Bilinear generalized approximate message passing (BiG-AMP) for matrix recovery problems. In Workshop on Information Theory and Applications (ITA), San Diego CA (2012).
[17] Parker J., Schniter P. & Cevher V. Bilinear generalized approximate message passing - Part I: Derivation. Signal Processing, IEEE Transactions on 62, 5839–5853 (2013).
[18] Parker J., Schniter P. & Cevher V. Bilinear generalized approximate message passing - Part II: Applications. Signal Processing, IEEE Transactions on 62, 5854–5867 (2014).
[19] Elad M. Sparse and redundant representation modeling - what next? Signal Processing Letters 19 (2012).
[20] Mézard M., Parisi G. & Zecchina R. Analytic and algorithmic solution of random satisfiability problems. Science 297, 812–815 (2002).
[21] Ding J., Sly A. & Sun N. Proof of the satisfiability conjecture for large k. arXiv preprint arXiv:1411.0650 (2014).
[22] Tanaka T. A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors. IEEE Trans. Infor. Theory 48, 2888–2910 (2002).
[23] Guo D. & Verdú S. Randomly spread CDMA: Asymptotics via statistical physics. IEEE Trans. Infor. Theory 51, 1983–2010 (2005).
[24] Montanari A. & Tse D. Analysis of belief propagation for non-linear problems: The example of CDMA (or: How to prove Tanaka's formula). In Information Theory Workshop, 2006. ITW'06 Punta del Este, 160–164 (IEEE, 2006).
[25] Bayati M. & Montanari A. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory 57, 764–785 (2011).
[26] Davenport M. A., Plan Y., van den Berg E. & Wootters M. 1-bit matrix completion. Information and Inference 3, 189–223 (2014).
[27] Davenport M. & Romberg J. An overview of low-rank matrix recovery from incomplete observations (2015).
[28] Lesieur T., Krzakala F. & Zdeborová L. MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. arXiv preprint arXiv:1507.03857 (2015).
[29] Jöreskog K. G. A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 34, 183–202 (1969).
[30] Lewicki M. S. & Sejnowski T. J. Learning overcomplete representations. Neural Computation 12, 337–365 (2000).
[31] Engan K., Aase S. O. & Husoy J. H. Method of optimal directions for frame design. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2443–2446 (IEEE, 1999).
[32] Aharon M., Elad M. & Bruckstein A. M. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54, 4311 (2006).
[33] Aharon M., Elad M. & Bruckstein A. M. On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebra and its Applications 416, 48–67 (2006).
[34] Vainsencher D., Mannor S. & Bruckstein A. M. The sample complexity of dictionary learning. Journal of Machine Learning Research 12, 3259–3281 (2011).
[35] Jenatton R., Gribonval R. & Bach F. Local stability and robustness of sparse dictionary learning in the presence of noise. arXiv:1210.0685 (2012).
[36] Spielman D. A., Wang H. & Wright J. Exact recovery of sparsely-used dictionaries. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 3087–3090 (AAAI Press, 2013).
[37] Arora S., Ge R. & Moitra A. New algorithms for learning incoherent and overcomplete dictionaries. arXiv preprint arXiv:1308.6273 (2013).
[38] Agarwal A., Anandkumar A. & Netrapalli P. Exact recovery of sparsely used overcomplete dictionaries. arXiv preprint arXiv:1309.1952 (2013).
[39] Gribonval R., Jenatton R., Bach F., Kleinsteuber M. & Seibert M. Sample complexity of dictionary learning and other matrix factorizations. arXiv preprint arXiv:1312.3790 (2013).
[40] Lee T.-W., Lewicki M. S., Girolami M. & Sejnowski T. J. Blind source separation of more sources than mixtures using overcomplete representations. Signal Processing Letters, IEEE 6, 87–90 (1999).
[41] Zibulevsky M. & Pearlmutter B. A. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13, 863–882 (2001).
[42] Bofill P. & Zibulevsky M. Underdetermined blind source separation using sparse representations. Signal Processing 81, 2353–2362 (2001).
[43] Georgiev P., Theis F. & Cichocki A. Sparse component analysis and blind source separation of underdetermined mixtures. Neural Networks, IEEE Transactions on 16, 992–996 (2005).
[44] d'Aspremont A., El Ghaoui L., Jordan M. I. & Lanckriet G. R. A direct formulation for sparse PCA using semidefinite programming. SIAM Review 49, 434–448 (2007).
[45] Gribonval R., Lesage S. et al. A survey of sparse component analysis for blind source separation: principles, perspectives, and new challenges. In ESANN'06 proceedings, 14th European Symposium on Artificial Neural Networks, 323–330 (2006).
[46] Candès E. J. & Plan Y. Matrix completion with noise. Proceedings of the IEEE 98, 925–936 (2010).
[47] Keshavan R. H., Montanari A. & Oh S. Matrix completion from a few entries. Information Theory, IEEE Transactions on 56, 2980–2998 (2010).
[48] Chandrasekaran V., Sanghavi S., Parrilo P. A. & Willsky A. S. Sparse and low-rank matrix decompositions. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, 962–967 (IEEE, 2009).
[49] Yuan X. & Yang J. Sparse and low-rank matrix decomposition via alternating direction methods. Preprint (2009).
[50] Wright J., Ganesh A., Rao S., Peng Y. & Ma Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems, 2080–2088 (2009).
[51] Rangan S. & Fletcher A. K. Iterative estimation of constrained rank-one matrices in noise. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, 1246–1250 (IEEE, 2012).
[52] Deshpande Y. & Montanari A. Information-theoretically optimal sparse PCA. arXiv preprint arXiv:1402.2238 (2014).
[53] Opper M. & Haussler D. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop, 75–87 (Morgan Kaufmann, 1991).
[54] Iba Y. The Nishimori line and Bayesian statistics. Journal of Physics A: Mathematical and General 32, 3875 (1999).
[55] Nishimori H. Statistical Physics of Spin Glasses and Information Processing (Oxford University Press, Oxford, 2001).
[56] Donoho D. L., Javanmard A. & Montanari A. Information-theoretically optimal compressed sensing via spatial coupling and approximate message passing. In Proc. of the IEEE Int. Symposium on Information Theory (ISIT) (2012).
[57] Mézard M., Parisi G. & Virasoro M. A. SK model: The replica solution without replicas. Europhys. Lett. 1, 77–82 (1986).
[58] Richardson T. & Urbanke R. Modern Coding Theory (Cambridge University Press, 2008).
[59] Aldous D. J. The ζ(2) limit in the random assignment problem. Random Structures & Algorithms 18, 381–418 (2001).
[60] Montanari A. Estimating random variables from random sparse observations. European Transactions on Telecommunications 19, 385–403 (2008).
[61] Kabashima Y., Wadayama T. & Tanaka T. A typical reconstruction limit of compressed sensing based on Lp-norm minimization. J. Stat. Mech. L09003 (2009).
[62] Dempster A., Laird N. & Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 38, 1 (1977).
[63] Krzakala F., Mézard M., Sausset F., Sun Y. & Zdeborová L. Probabilistic reconstruction in compressed sensing: Algorithms, phase diagrams, and threshold achieving matrices. J. Stat. Mech. (2012).
[64] Vila J. P. & Schniter P. Expectation-maximization Bernoulli-Gaussian approximate message passing. In Proc. Asilomar Conf. on Signals, Systems, and Computers (Pacific Grove, CA) (2011).
[65] Donoho D., Maleki A. & Montanari A. Message passing algorithms for compressed sensing: I. Motivation and construction. In IEEE Information Theory Workshop (ITW), 1–5 (2010).
[66] Rangan S. Generalized approximate message passing for estimation with random linear mixing. In IEEE International Symposium on Information Theory Proceedings (ISIT), 2168–2172 (2011).
[67] Kschischang F. R., Frey B. & Loeliger H.-A. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory 47, 498–519 (2001).
[68] Thouless D. J., Anderson P. W. & Palmer R. G. Solution of 'Solvable Model of a Spin-Glass'. Phil. Mag. 35, 593–601 (1977).
[69] Rangan S. Estimation with random linear mixing, belief propagation and compressed sensing. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, 1–6 (2010).
[70] Yedidia J., Freeman W. & Weiss Y. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, 239–269 (Morgan Kaufmann, San Francisco, CA, USA, 2003).
[71] Krzakala F., Manoel A., Tramel E. W. & Zdeborová L. Variational free energies for compressed sensing. In IEEE International Symposium on Information Theory Proceedings (ISIT), 1499–1503 (2014).
[72] Vila J., Schniter P., Rangan S., Krzakala F. & Zdeborová L. Adaptive damping and mean removal for the generalized approximate message passing algorithm (2014).
[73] Rangan S., Schniter P., Riegler E., Fletcher A. & Cevher V. Fixed points of generalized approximate message passing with arbitrary matrices. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, 664–668 (IEEE, 2013).
[74] Caltagirone F., Krzakala F. & Zdeborová L. On convergence of approximate message passing. In IEEE International Symposium on Information Theory Proceedings (ISIT), 1812–1816 (2014).