Robust Matrix Estimation with Side Information∗

Anish Agarwal, Jungjun Choi†, Ming Yuan

Department of IEOR, Columbia University
Department of CS & Statistics, University of Rhode Island
Department of Statistics, Columbia University

March 27, 2026

Abstract

We introduce a flexible framework for high-dimensional matrix estimation that incorporates side information for both rows and columns. Existing approaches, such as inductive matrix completion, often impose restrictive structure (for example, an exact low-rank covariate interaction term, linear covariate effects, and limited ability to exploit components explained by only one side, row or column, or by neither) and frequently omit an explicit noise component. To address these limitations, we propose to decompose the underlying matrix as the sum of four complementary components: a (possibly nonlinear) interaction between row and column characteristics, a row-characteristic-driven component, a column-characteristic-driven component, and a residual low-rank structure unexplained by observed characteristics. By combining sieve-based projection with nuclear-norm penalization, each component can be estimated separately, and these estimates can then be aggregated to yield a final estimate. We derive convergence rates that highlight robustness across a range of model configurations depending on the informativeness of the side information. We further extend the method to partially observed matrices under both missing-at-random and missing-not-at-random mechanisms, including block-missing patterns motivated by causal panel data. Simulations and a real-data application to tobacco sales show that leveraging side information improves imputation accuracy and can enhance treatment-effect estimation relative to standard low-rank and spectral-based alternatives.
Keywords: Matrix completion, Nuclear norm penalization, Causal inference, Non-linear estimation, Covariate information

∗This research was supported by NSF grant DMS-2052955.
†Address for Correspondence: Computer Science and Statistics Department, Tyler Hall, 9 Greenhouse Road, Kingston, RI 02881. Email: jungjun.choi@uri.edu

1 Introduction

Recent technological progress has made it possible to gather and process high-volume data that are conveniently organized in matrix form, often with both dimensions scaling up rapidly. Accordingly, high-dimensional matrix estimation problems such as matrix denoising and matrix completion have attracted considerable attention, and many impressive results have been obtained from both statistical and computational perspectives. However, although side information is often available in addition to the target outcome data, traditional approaches typically use only the outcome data for matrix estimation. Incorporating side information can enrich the underlying model and improve estimation and prediction accuracy. As our ability to access auxiliary covariate data continues to grow, developing matrix estimation methods that effectively leverage side information has become an important and timely research direction.

A number of computational algorithms, along with their statistical properties, have been proposed recently. Arguably, the most popular model that incorporates additional information in matrix estimation is the Inductive Matrix Completion (IMC) model (e.g., Xu et al., 2013; Jain and Dhillon, 2013; Zhang et al., 2018). The standard IMC model takes the form
$$Y = M = X L Z^\top,$$
where $Y = [y_{it}]_{i \le N, t \le T}$ is the outcome matrix, $X = [x_1, \ldots, x_N]^\top$ is the $N \times d_1$ row-feature matrix, $Z = [z_1, \ldots, z_T]^\top$ is the $T \times d_2$ column-feature matrix, and $M$ is the target matrix. Here, $L$ is assumed to be a low-rank $d_1 \times d_2$ matrix.
A typical estimation approach is to solve
$$\min_{L \in \mathbb{R}^{d_1 \times d_2}} \big\| P_\Omega(X L Z^\top - Y) \big\|_F^2 + \lambda \|L\|_*,$$
where $\|\cdot\|_*$ denotes the nuclear norm, $P_\Omega(A) = \Omega \circ A$, and $\Omega$ is the $N \times T$ indicator matrix for observability in matrix completion.

Although IMC is a popular and useful approach to matrix estimation with side information, it has several important limitations. First, it requires that the features be present on both sides and interact linearly. Moreover, it is predicated on the informativeness of both row and column features and can break down if features are weak or irrelevant. Several extensions of IMC have been proposed in recent years to address these shortcomings. Ledent et al. (2023) incorporate a noise component $E$ into the IMC model and derive bounds on the expected $\ell$-risk. Zhong et al. (2019) allow a nonlinear relationship between $(X, Z)$ and $M$. Wang and Elhamifar (2018) consider settings in which the rank of $L$ can be large. Notably, Chiang et al. (2015) propose the so-called "dirty" IMC model, which augments the standard IMC formulation with an additional low-rank term $R$ and estimates $(L, R)$ by solving
$$\min_{L, R} \big\| P_\Omega(X L Z^\top + R - Y) \big\|_F^2 + \lambda_1 \|L\|_* + \lambda_2 \|R\|_*.$$
However, this model does not include a noise term $E$ or components explained by only one-sided characteristics. In addition, it does not allow a nonlinear relationship between $(X, Z)$ and $M$, and it still requires $L$ to be low-rank. Overall, each extension addresses only part of the limitations and leaves other issues unresolved.

Another notable line of research on matrix completion with covariates includes Mao et al. (2019) and Ma et al. (2025). These papers consider the model
$$Y = X B^\top + R + E,$$
where $B$ is an unknown coefficient matrix and $R$ is a low-rank matrix.
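To make the IMC objective above concrete, the following is a minimal sketch of solving it with proximal gradient descent in the fully observed case ($\Omega$ all ones), where the proximal step is singular-value soft-thresholding. The function names (`svt`, `imc_fit`) and the step-size choice are ours, not part of the cited papers.

```python
import numpy as np

def svt(B, tau):
    """Singular-value soft-thresholding: the prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def imc_fit(Y, X, Z, lam, n_iter=500):
    """Proximal gradient for min_L ||X L Z' - Y||_F^2 + lam * ||L||_*
    (fully observed case, i.e., Omega is all ones)."""
    d1, d2 = X.shape[1], Z.shape[1]
    # Lipschitz constant of the gradient of the squared loss
    lip = 2.0 * np.linalg.norm(X, 2) ** 2 * np.linalg.norm(Z, 2) ** 2
    eta = 1.0 / lip
    L = np.zeros((d1, d2))
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ L @ Z.T - Y) @ Z
        L = svt(L - eta * grad, eta * lam)  # gradient step + prox step
    return L
```

With a small penalty $\lambda$, the fit reproduces a noiseless $Y = X L_0 Z^\top$ nearly exactly; with missing entries one would replace the residual by its masked version, at the cost of a slower, data-dependent convergence analysis.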
Because this approach does not incorporate column characteristics $Z$, it cannot capture interaction terms involving both $X$ and $Z$ (such as $X L Z^\top$) or components explained solely by $Z$. In addition, it does not allow for nonlinear effects. As a result, the model still has some limitations.

Lastly, a related strand of work studies PCA or factor analysis in settings without missing data (see, e.g., Fan et al., 2016; Chiang et al., 2016; Niranjan et al., 2017; Zhu et al., 2016; Xue et al., 2017). For example, Fan et al. (2016) consider the model
$$Y = (G(X) + \Gamma) F^\top + E = G(X) F^\top + \Gamma F^\top + E,$$
where $G(X)$ is the part of the loadings defined by an unknown function of $X$, and $F$ denotes unobserved factors. In contrast, Zhu et al. (2016) study the model
$$Y = X B^\top + A Z^\top + R + E,$$
where $R$ is low-rank. The former framework cannot capture interaction terms involving both $X$ and $Z$ (such as $X L Z^\top$) or components explained solely by $Z$, whereas the latter does not include an interaction term between $X$ and $Z$. Moreover, these models primarily emphasize linear relationships.

To overcome the limitations of existing approaches, we consider the following model:
$$Y = M + E, \qquad M = M_1 + M_2 + M_3 + M_4, \tag{1}$$
$$M_1 = G_1(X) Q_1(Z)^\top, \quad M_2 = G_2(X) V_1^\top, \quad M_3 = W_1 Q_2(Z)^\top, \quad M_4 = W_2 V_2^\top,$$
where $G_1$, $G_2$, $Q_1$, and $Q_2$ are unknown matrix-valued functions, and $W_1$, $W_2$, $V_1$, and $V_2$ are unobserved matrices. This model is more general and nests the above models. For example, the models in Xu et al. (2013), Jain and Dhillon (2013), and Wang and Elhamifar (2018) correspond to the special case $M = M_1$, and the model in Chiang et al. (2015) corresponds to $M = M_1 + M_4$. In addition, the models in Fan et al. (2016), Mao et al. (2019), and Ma et al. (2025) can be viewed as special cases of $M = M_2 + M_4$. For instance, the model in Fan et al.
(2016) can be represented as $M = M_2 + M_4$ with $V_1 = V_2$. Hence, our estimation approach under this model is less likely to suffer from model misspecification. Moreover, if the data contain components that existing models do not account for, our estimator is expected to perform better than estimators based on those restricted models. As discussed in Section 3, the convergence rates of our estimator demonstrate the robustness of our method across models, and the simulation results in Section 5 are consistent with these theoretical findings.

Our estimation is based on a sieve projection method. Using projection matrices constructed from sieve bases for $X$ and $Z$, we estimate each component of $M$ separately and then obtain an estimator of $M$ by summing these estimates. Thanks to the sieve projection, our method can accommodate potentially nonlinear effects of $X$ and $Z$ on $M$. In addition, estimating each component separately allows us to fully exploit the model structure in (1). Together, these features make our estimator more likely to outperform methods based on more restrictive models when the data contain components that those restrictive models do not account for.

Another important feature of our approach is the use of nuclear-norm penalization, which corresponds to a soft-thresholding procedure. Hence, if some of $M_2$, $M_3$, and $M_4$ are exactly zero, then our estimators for those components are also exactly zero with high probability. This property enhances the robustness of our estimator. In contrast, if we use a spectral method to estimate each component, we must estimate the rank of each part, and existing rank estimators may produce incorrect (nonzero) estimates when the corresponding component is weak due to a low signal-to-noise ratio. As a result, spectral methods may perform poorly when some of $M_2$, $M_3$, and $M_4$ are zero or close to zero.
By comparison, our nuclear-norm-penalized estimator does not require estimating the rank or the signal strength of each component; therefore, small values of $M_2$, $M_3$, and $M_4$ do not pose a problem.

Another important contribution of this paper is that we also consider a setting with missing entries where the missingness is not at random. While many papers use side information for imputation under MAR (missing at random), to the best of our knowledge, no existing matrix completion work incorporates side information under MNAR (missing not at random). Since the seminal work of Athey et al. (2021), which demonstrated that matrix completion techniques can be very useful for causal panel data models, matrix completion has become a popular tool for estimating unobserved potential outcomes under the untreated (control) condition. However, the potential-outcome matrix under the untreated condition usually exhibits a missingness pattern that does not follow random missingness. Consequently, matrix completion under MNAR, and its applications to causal inference, has been actively studied recently (see, e.g., Athey et al. (2021); Bai and Ng (2021); Agarwal et al. (2023); Choi and Yuan (2024); Yan and Wainwright (2024)). We propose a novel matrix completion method that leverages side information under MNAR. As shown in our real-data experiment in Section 5.2, our method outperforms standard matrix-completion approaches in imputing unobserved potential outcomes and demonstrates its usefulness for treatment-effect estimation.

The remainder of this paper is organized as follows. Section 2 introduces our model and our estimation method, which uses sieve projection with nuclear-norm penalization. Section 3 presents asymptotic error bounds for the estimator and discusses the robustness of our method across different models.
Section 4 extends our estimation strategy to the case in which the outcome matrix is partially observed. Importantly, we consider the MNAR setting as well as the MAR setting. Section 5 presents numerical studies using simulated and real data to demonstrate the advantages of our method. All proofs are relegated to the supplement due to space limitations.

2 Model and Estimation

In this paper, we consider the following panel model:
$$Y = M + E, \qquad M = M_1 + M_2 + M_3 + M_4,$$
where $Y = (y_{it})_{i \le N, t \le T}$ is the outcome matrix, $E = (\epsilon_{it})_{i \le N, t \le T}$ is the noise matrix, and $M = (m_{it})_{i \le N, t \le T}$ is the matrix of interest. We decompose $M$ into four parts: (i) $M_1$, a component well explained by both $X$ and $Z$; (ii) $M_2$, a component explained by $X$ but irrelevant to $Z$; (iii) $M_3$, a component explained by $Z$ but irrelevant to $X$; and (iv) $M_4$, a component irrelevant to both $X$ and $Z$, where $X = (x_i)_{i \le N}$ and $Z = (z_t)_{t \le T}$ are observable characteristics corresponding to the row and column indices, respectively. More specifically, each part can be represented as
$$M_1 = G_1(X) Q_1(Z)^\top, \quad M_2 = G_2(X) V_1^\top, \quad M_3 = W_1 Q_2(Z)^\top, \quad M_4 = W_2 V_2^\top, \tag{2}$$
where $G_1(X) = (g_{1,k}(x_i))_{i \le N, k \le K_1}$, $G_2(X) = (g_{2,k}(x_i))_{i \le N, k \le K_2}$, $Q_1(Z) = (q_{1,k}(z_t))_{t \le T, k \le K_1}$, and $Q_2(Z) = (q_{2,k}(z_t))_{t \le T, k \le K_3}$ for some unknown functions $g_{1,k}(\cdot)$, $g_{2,k}(\cdot)$, $q_{1,k}(\cdot)$, and $q_{2,k}(\cdot)$. Here, $W_1 = (w_{1,ik})_{i \le N, k \le K_3}$ and $W_2 = (w_{2,ik})_{i \le N, k \le K_4}$ capture the components not explained by $X$, while $V_1 = (v_{1,tk})_{t \le T, k \le K_2}$ and $V_2 = (v_{2,tk})_{t \le T, k \le K_4}$ capture the components not explained by $Z$. This model is general and encompasses many existing models.

Estimation. To properly accommodate and exploit the structure of our model in (2), we propose estimating $M$ using a sieve projection method.
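To fix ideas, the four-component model (1)-(2) is easy to simulate. The particular nonlinear functions and dimensions below are illustrative choices of ours, not specifications from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 80
K1, K2, K3, K4 = 2, 1, 1, 1

# observable row/column characteristics (scalars here for simplicity)
x = rng.uniform(-1.0, 1.0, N)
z = rng.uniform(-1.0, 1.0, T)

# nonlinear characteristic-driven loadings (illustrative choices)
G1 = np.column_stack([np.sin(np.pi * x), x**2])   # N x K1
Q1 = np.column_stack([np.cos(np.pi * z), z])      # T x K1
G2 = np.exp(x)[:, None]                           # N x K2
Q2 = (z**3)[:, None]                              # T x K3

# latent factors unexplained by the observed characteristics
V1 = rng.standard_normal((T, K2))
W1 = rng.standard_normal((N, K3))
W2 = rng.standard_normal((N, K4))
V2 = rng.standard_normal((T, K4))

M1 = G1 @ Q1.T        # explained by both X and Z
M2 = G2 @ V1.T        # explained by X only
M3 = W1 @ Q2.T        # explained by Z only
M4 = W2 @ V2.T        # explained by neither
M = M1 + M2 + M3 + M4
Y = M + 0.5 * rng.standard_normal((N, T))   # model (1): Y = M + E
```

Note that $M$ has rank at most $K_1 + K_2 + K_3 + K_4$, which is what a generic low-rank method would have to estimate without exploiting the structure.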
For two sets of basis functions $\{\phi_1(x), \ldots, \phi_J(x)\}$ and $\{\psi_1(z), \ldots, \psi_J(z)\}$ (e.g., B-splines, Fourier series, wavelets, or polynomial series), define
$$\phi(x_i) = \big( \phi_1(x_{i1}), \ldots, \phi_J(x_{i1}), \ldots, \phi_1(x_{id_1}), \ldots, \phi_J(x_{id_1}) \big)^\top \in \mathbb{R}^{J d_1},$$
$$\psi(z_t) = \big( \psi_1(z_{t1}), \ldots, \psi_J(z_{t1}), \ldots, \psi_1(z_{td_2}), \ldots, \psi_J(z_{td_2}) \big)^\top \in \mathbb{R}^{J d_2},$$
where $d_1$ and $d_2$ are the dimensions of $x_i$ and $z_t$, respectively. The corresponding projection matrices are
$$P_X = \Phi(X) \big( \Phi(X)^\top \Phi(X) \big)^{-1} \Phi(X)^\top, \qquad P_Z = \Psi(Z) \big( \Psi(Z)^\top \Psi(Z) \big)^{-1} \Psi(Z)^\top,$$
where $\Phi(X) = \big( \phi(x_1), \ldots, \phi(x_N) \big)^\top$ and $\Psi(Z) = \big( \psi(z_1), \ldots, \psi(z_T) \big)^\top$. Note that as long as $g_{1,k}(\cdot)$, $g_{2,k}(\cdot)$, $q_{1,k}(\cdot)$, and $q_{2,k}(\cdot)$ are sufficiently smooth, for any $\iota \in \{1, 2\}$ we have
$$P_X G_\iota(X) \approx G_\iota(X), \qquad P_Z Q_\iota(Z) \approx Q_\iota(Z). \tag{3}$$
Moreover, $\|P_X E P_Z\|_F$ can be much smaller than $\|E\|_F$ due to the orthogonality between $(X, Z)$ and $E$. Leveraging this property, we propose estimating $M$ as follows.

Algorithm 1 Estimation procedure
Step 1: Compute $\widehat{M}_1 = P_X Y P_Z$.
Step 2: Compute the following nuclear-norm-penalized estimators:
$$\widehat{M}_2 := \arg\min_{A \in \mathbb{R}^{N \times T}} \big\| P_X Y (I_T - P_Z) - A \big\|_F^2 + \nu_2 \|A\|_*,$$
$$\widehat{M}_3 := \arg\min_{A \in \mathbb{R}^{N \times T}} \big\| (I_N - P_X) Y P_Z - A \big\|_F^2 + \nu_3 \|A\|_*,$$
$$\widehat{M}_4 := \arg\min_{A \in \mathbb{R}^{N \times T}} \big\| (I_N - P_X) Y (I_T - P_Z) - A \big\|_F^2 + \nu_4 \|A\|_*,$$
where $\nu_2 = C_2 \sqrt{T}$, $\nu_3 = C_3 \sqrt{N}$, and $\nu_4 = C_4 \sqrt{N + T}$ for some sufficiently large constants $C_2, C_3, C_4 > 0$.
Step 3: Form the final estimator $\widehat{M} = \widehat{M}_1 + \widehat{M}_2 + \widehat{M}_3 + \widehat{M}_4$.
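Algorithm 1 is straightforward to implement because each penalized subproblem has a closed-form solution: soft-thresholding the singular values at $\nu/2$. The sketch below is ours (default constants $C_2, C_3, C_4$ are illustrative, and in practice they would be tuned):

```python
import numpy as np

def projector(B):
    """Orthogonal projector onto the column space of a basis matrix B."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

def svt(B, nu):
    """Closed-form minimizer of ||B - A||_F^2 + nu * ||A||_*:
    soft-threshold the singular values of B at nu / 2."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - nu / 2.0, 0.0)) @ Vt

def algorithm1(Y, PhiX, PsiZ, C2=2.0, C3=2.0, C4=2.0):
    """Sieve projection plus nuclear-norm penalization (Algorithm 1)."""
    N, T = Y.shape
    PX, PZ = projector(PhiX), projector(PsiZ)
    RX, RZ = np.eye(N) - PX, np.eye(T) - PZ     # residual projectors
    M1 = PX @ Y @ PZ                            # Step 1
    M2 = svt(PX @ Y @ RZ, C2 * np.sqrt(T))      # Step 2
    M3 = svt(RX @ Y @ PZ, C3 * np.sqrt(N))
    M4 = svt(RX @ Y @ RZ, C4 * np.sqrt(N + T))
    return M1 + M2 + M3 + M4                    # Step 3
```

When $M = M_1$ and the signal lies in the sieve span, the three thresholded components are exactly zero with high probability and the output reduces to the double-projection estimate $P_X Y P_Z$, illustrating the thresholding property discussed above.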
To understand how this estimator works, note that by (3) and basic properties of nuclear-norm penalization, we have
$$\widehat{M}_1 \approx G_1 Q_1^\top + G_2 V_1^\top P_Z + P_X W_1 Q_2^\top + P_X W_2 V_2^\top P_Z,$$
$$\widehat{M}_2 \approx G_2 V_1^\top (I_T - P_Z) + P_X W_2 V_2^\top (I_T - P_Z),$$
$$\widehat{M}_3 \approx (I_N - P_X) W_1 Q_2^\top + (I_N - P_X) W_2 V_2^\top P_Z,$$
$$\widehat{M}_4 \approx (I_N - P_X) W_2 V_2^\top (I_T - P_Z),$$
under suitable conditions on the noise and on the smoothness of $g_{1,k}(\cdot)$, $g_{2,k}(\cdot)$, $q_{1,k}(\cdot)$, and $q_{2,k}(\cdot)$. Importantly, the terms involving $P_X W_\iota$ or $P_Z V_\iota$ cancel out when we sum the four estimators. Hence, without imposing any orthogonality conditions between $X$ and $W$ (or between $Z$ and $V$), our final estimator $\widehat{M}$ can estimate $M$ well. In addition, because nuclear-norm penalization acts as a thresholding estimator, when some (or all) of $M_2$, $M_3$, and $M_4$ are zero or sufficiently small, it helps us obtain a tighter bound.

3 Asymptotic Results

In this section, we present the convergence rate of our estimator. We begin by imposing the following conditions.

Assumption 3.1 (Noise). The random variables $(\epsilon_{it})_{i \le N, t \le T}$ are independent, mean-zero, sub-Gaussian, and satisfy $\mathbb{E}[\epsilon_{it}^2] \le \sigma^2 \le C_1$ and $\mathbb{E}[\exp(s \epsilon_{it})] \le \exp(C_2 s^2 \sigma^2)$ for all $s \in \mathbb{R}$, for some constants $C_1, C_2 > 0$. In addition, $(\epsilon_{it})_{i \le N, t \le T}$ are independent of $X$ and $Z$.

The independence and sub-Gaussianity assumptions are used to derive tight bounds for $\|P_X E P_Z\|$, $\|P_X E\|$, and $\|E P_Z\|$. We can generalize this condition to weakly dependent noise with suitable moment conditions, at the cost of additional $J$-dependent terms in the bound in Theorem 3.1.

Assumption 3.2 (Basis functions).
(i) There exist constants $c, C > 0$ such that, with probability approaching one,
$$c < \lambda_{\min}\big( N^{-1} \Phi(X)^\top \Phi(X) \big) \le \lambda_{\max}\big( N^{-1} \Phi(X)^\top \Phi(X) \big) < C,$$
$$c < \lambda_{\min}\big( T^{-1} \Psi(Z)^\top \Psi(Z) \big) \le \lambda_{\max}\big( T^{-1} \Psi(Z)^\top \Psi(Z) \big) < C,$$
where $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote the largest and smallest singular values of $A$, respectively. (ii) $\max_{j \le J, i \le N, l \le d_1} \mathbb{E}[\phi_j(x_{il})^4] < \infty$ and $\max_{j \le J, t \le T, l \le d_2} \mathbb{E}[\psi_j(z_{tl})^4] < \infty$.

This condition is standard in the sieve-estimation literature (e.g., Fan et al. (2016); Chen et al. (2023)). Because we focus on the case where $J d_1 \ll N$ and $J d_2 \ll T$, it follows from the law of large numbers and is therefore not overly restrictive.

Assumption 3.3 (Sieve approximation). (i) There exist constants $\gamma_{G_1}, \gamma_{G_2}, \gamma_{Q_1}, \gamma_{Q_2} \ge 2$ such that, for some sieve coefficient vectors $b_{1,k}, b_{2,k} \in \mathbb{R}^{J d_1}$ and $a_{1,k}, a_{2,k} \in \mathbb{R}^{J d_2}$, the sieve approximations satisfy
$$\sup_{x \in \mathcal{X}} \big| g_{\iota,k}(x) - b_{\iota,k}^\top \phi(x) \big| = O\big( J^{-\gamma_{G_\iota}} \big), \qquad \sup_{z \in \mathcal{Z}} \big| q_{\iota,k}(z) - a_{\iota,k}^\top \psi(z) \big| = O\big( J^{-\gamma_{Q_\iota}} \big),$$
where $\mathcal{X}$ and $\mathcal{Z}$ are the supports of $x_i$ and $z_t$, respectively. (ii) The sieve dimension $J$ satisfies
$$\sqrt{T} K_1 / J^{\gamma_{G_1}} \to 0, \quad \sqrt{N} K_1 / J^{\gamma_{Q_1}} \to 0, \quad \max\{\sqrt{N}, \sqrt{T}\} K_2 / J^{\gamma_{G_2}} \to 0, \quad \max\{\sqrt{N}, \sqrt{T}\} K_3 / J^{\gamma_{Q_2}} \to 0.$$

Condition (i) is a standard assumption in sieve estimation. For example, if $g_{\iota,k}(\cdot)$ has an additive form $g_{\iota,k}(x_i) = \sum_{l=1}^{d_1} g_{\iota,kl}(x_{il})$ and each $g_{\iota,kl}(\cdot)$ belongs to the Hölder class $\mathcal{H}(\rho_{G_\iota}, \tau_{G_\iota})$, where
$$\mathcal{H}(\rho, \tau) = \big\{ h : \big| h^{(\rho)}(s) - h^{(\rho)}(t) \big| \le C |s - t|^\tau \big\}$$
for some $C > 0$, then $\gamma_{G_\iota} = \rho_{G_\iota} + \tau_{G_\iota}$ for typical choices of basis functions (see, e.g., Chen (2007)). On the other hand, condition (ii) requires sufficient smoothness of the functions $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$. Note that $\gamma_{G_\iota}$ and $\gamma_{Q_\iota}$ can be viewed as smoothness parameters for $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$, respectively.
Therefore, if $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth, then $\gamma_{G_\iota}$ and $\gamma_{Q_\iota}$ will be large, and condition (ii) can be satisfied even when $J$ increases slowly. Lastly, we impose the following moment conditions.

Assumption 3.4 (Moments). (i) For all $i$ and $t$, $\mathbb{E}[m_{it}^4]$ is bounded. (ii) There exists a constant $C_1 > 0$ such that for all $i, t, k$,
$$\mathbb{E}\big[ g_{1,k}^2(x_i) \big], \ \mathbb{E}\big[ g_{2,k}^2(x_i) \big], \ \mathbb{E}\big[ q_{1,k}^2(z_t) \big], \ \mathbb{E}\big[ q_{2,k}^2(z_t) \big] \le C_1.$$
(iii) There exists a constant $C_2 > 0$ such that for all $i, t, k$, $\mathbb{E}[w_{1,ik}^2], \mathbb{E}[w_{2,ik}^2], \mathbb{E}[v_{1,tk}^2], \mathbb{E}[v_{2,tk}^2] \le C_2$.

We are now in a position to state the statistical properties of our estimators. The following theorem provides the convergence rate of the proposed estimator.

Theorem 3.1 (Convergence rate). Suppose that Assumptions 3.1-3.4 hold. Then,
$$\|\widehat{M} - M\|_F = O_p\Big( J + \sqrt{K_2 + K_4}\,\min\big\{ \sqrt{T},\ \|M_2\|_F + \|P_X M_4\|_F \big\} + \sqrt{K_3 + K_4}\,\min\big\{ \sqrt{N},\ \|M_3\|_F + \|M_4 P_Z\|_F \big\} + \sqrt{K_4}\,\min\big\{ \sqrt{N + T},\ \|M_4\|_F \big\} + \sqrt{NT}\big( K_1 J^{-\gamma_{G_1}} + K_1 J^{-\gamma_{Q_1}} + K_2 J^{-\gamma_{G_2}} + K_3 J^{-\gamma_{Q_2}} \big) \Big).$$

Some immediate remarks are in order. First, note that the dominating part of the error bound for our estimator does not depend on $K_1$. Thus, we can allow $K_1$ to be large as long as $g_{1,k}(\cdot)$ and $q_{1,k}(\cdot)$ are sufficiently smooth. The last term, $\sqrt{NT}( K_1 J^{-\gamma_{G_1}} + K_1 J^{-\gamma_{Q_1}} + K_2 J^{-\gamma_{G_2}} + K_3 J^{-\gamma_{Q_2}} )$, arises from the sieve approximation (smoothing) error. When the functions $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth (i.e., $\gamma_{G_\iota}$ and $\gamma_{Q_\iota}$ are large), this term is small and dominated by the other terms.

Theorem 3.1 illustrates the robustness of our estimator. First, consider the most favorable case, $M = M_1$, where $M$ is well explained by $X$ and $Z$. In this case, the convergence rate of our estimator is $O_p(J)$ provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth.
This matches the rate of the "double projection" estimator $P_X Y P_Z$, which is the most suitable estimator when we know a priori that $M = M_1$. By contrast, if we estimate $M$ using a standard low-rank method such as nuclear-norm penalization in this setting, the convergence rate would be $O_p(\sqrt{K_1 (N + T)})$, which is much larger than ours. Moreover, even when $M$ contains an additional component $M_2 + M_3 + M_4$ beyond $M_1$, as long as this component is small (in the sense that $\|M_2 + M_3 + M_4\|_F \ll \sqrt{N + T}$), our rate $O_p\big( J + \sqrt{K^*} \|M_2 + M_3 + M_4\|_F \big)$ is smaller than that of the usual low-rank estimator, $O_p\big( \sqrt{K (N + T)} \big)$, where $K^* = \max\{K_2, K_3, K_4\}$ and $K$ is the rank of $M$.

Next, consider the least favorable case, $M = M_4$, where $M$ is unrelated to $X$ and $Z$ and the side information is uninformative. In this case, the convergence rate of our estimator is $O_p(\sqrt{K_4 (N + T)})$, provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth and $J \ll \sqrt{K_4 (N + T)}$. Note that this rate coincides with that of standard low-rank estimators. Hence, even in the least favorable case, the error bound for our estimator is comparable to that of a typical low-rank method. In contrast, the "double projection" estimator $P_X Y P_Z$ is inconsistent in this case. Moreover, if $M$ contains an additional small component $M_1 + M_2 + M_3$ beyond $M_4$, the convergence rate of our estimator becomes $O_p(\sqrt{K^* (N + T)})$, whereas that of a typical low-rank estimator remains $O_p(\sqrt{K (N + T)})$. Thus, when $K_1$ is large, our method can yield a tighter bound.

Lastly, consider the case $M = M_2$. In this case, the convergence rate of our estimator is $O_p(\sqrt{K_2 T})$, provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth and $J \ll \sqrt{K_2 T}$. By comparison, the convergence rate of a typical low-rank estimator is $O_p(\sqrt{K_2 (N + T)})$. Hence, when $N \gg T$, our method yields a tighter bound.
In addition, if $M$ contains an additional small component $M_1 + M_3 + M_4$ beyond $M_2$, the convergence rate of our estimator becomes $O_p(\sqrt{(K_2 + K_4) T})$, whereas that of a typical low-rank estimator is $O_p(\sqrt{K (N + T)})$. Thus, when $K_1$ is large or $N \gg T$, our method can yield a better bound. A similar discussion applies to the case $M = M_3$.

Table 1 summarizes the convergence rates of the estimators in the cases discussed above. We can see that, in every case, the convergence rate of our estimator is at least as good as that of the other estimators, provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth.

| | $M = M_1$ | $M = M_2$ | $M = M_3$ | $M = M_4$ |
|---|---|---|---|---|
| Our estimator | $J$ | $\sqrt{K_2 T}$ | $\sqrt{K_3 N}$ | $\sqrt{K_4 (N + T)}$ |
| Double projection | $J$ | $\sqrt{NT}$ | $\sqrt{NT}$ | $\sqrt{NT}$ |
| Low-rank estimation | $\sqrt{K_1 (N + T)}$ | $\sqrt{K_2 (N + T)}$ | $\sqrt{K_3 (N + T)}$ | $\sqrt{K_4 (N + T)}$ |

Table 1: Convergence rates of matrix estimators

Lastly, as a corollary, we present convergence rates for the estimated singular vectors, since the factors and loadings are often of interest to researchers (see, e.g., Bai et al. (2008)). Let $\widehat{U} \in \mathbb{R}^{N \times K}$ and $\widehat{V} \in \mathbb{R}^{T \times K}$ be the left and right singular vectors of $\widehat{M}$, respectively. Similarly, let $U$ and $V$ denote the left and right singular vectors of $M$, respectively. For notational convenience, denote the upper bound on $\|\widehat{M} - M\|_F$ in Theorem 3.1 by
$$\mathcal{R} = J + \sqrt{K_2 + K_4}\,\min\big\{ \sqrt{T},\ \|M_2\|_F + \|P_X M_4\|_F \big\} + \sqrt{K_3 + K_4}\,\min\big\{ \sqrt{N},\ \|M_3\|_F + \|M_4 P_Z\|_F \big\} + \sqrt{K_4}\,\min\big\{ \sqrt{N + T},\ \|M_4\|_F \big\} + \sqrt{NT}\big( K_1 J^{-\gamma_{G_1}} + K_1 J^{-\gamma_{Q_1}} + K_2 J^{-\gamma_{G_2}} + K_3 J^{-\gamma_{Q_2}} \big).$$
Then, we obtain the following convergence rates.

Corollary 3.2. Suppose that Assumptions 3.1-3.4 hold. In addition, assume that $\mathcal{R} / \lambda_{\min} \to_p 0$, where $\lambda_{\min}$ denotes the smallest nonzero singular value of $M$. Then,
$$\max\Big\{ \min_{R \in \mathbb{O}^{K \times K}} \|\widehat{U} - U R\|_F, \ \min_{R \in \mathbb{O}^{K \times K}} \|\widehat{V} - V R\|_F \Big\} = O_p\big( \mathcal{R} / \lambda_{\min} \big).$$
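The rotation minimization in Corollary 3.2 is an orthogonal Procrustes problem with a closed-form solution via the SVD. A minimal sketch (the helper name `procrustes_align` and the toy dimensions are ours):

```python
import numpy as np

def procrustes_align(U_hat, U):
    """Orthogonal R minimizing ||U_hat - U @ R||_F: from the SVD of
    U' U_hat = W S V', the minimizer is R = W V'."""
    W, _, Vt = np.linalg.svd(U.T @ U_hat)
    return W @ Vt

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))    # true singular vectors
R0, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # unknown rotation
U_hat = U @ R0 + 0.01 * rng.standard_normal((50, 3))  # noisy, rotated estimate
R = procrustes_align(U_hat, U)
err = np.linalg.norm(U_hat - U @ R)   # small: only the noise remains
```

After alignment the residual is at the noise level, whereas the raw distance $\|\widehat{U} - U\|_F$ can be large purely because of the rotation indeterminacy of singular vectors.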
4 Extension to the Missing-Data Case

Next, we extend our estimation strategy to the case where the outcome matrix is only partially observed. The base model is the same as in Section 2, and we additionally assume that researchers observe $\Omega \circ Y$ instead of $Y$, where $\Omega = (\omega_{it})_{i \le N, t \le T} \in \{0, 1\}^{N \times T}$.

4.1 Missing-At-Random Case

In this section, we consider the case where outcome entries are missing at random. Specifically, we assume that $(\omega_{it})_{i \le N, t \le T}$ are i.i.d. Bernoulli random variables with mean $p$, as is common in the matrix completion literature.

Estimation. As in the fully observed case above, we use the projection method to exploit the structure of the model in (2). However, when entries are missing, a key difficulty is that $P_X Y (I_T - P_Z)$ and $(I_N - P_X) Y P_Z$ cannot be computed from the observed data $\Omega \circ Y$ when we aim to estimate $P_X M (I_T - P_Z)$ or $(I_N - P_X) M P_Z$ via nuclear-norm penalization. On the other hand, we can still estimate $P_X M P_Z$ accurately using the projection estimator $p^{-1} P_X (\Omega \circ Y) P_Z$. Hence, in the presence of missing entries, we proceed as follows.

Algorithm 2 Estimation procedure for the MAR case
Step 1: Compute $\widehat{M}_1 = p^{-1} P_X (\Omega \circ Y) P_Z$.
Step 2: Apply nuclear-norm penalization to $\Omega \circ (Y - \widehat{M}_1)$:
$$\widehat{M}_{\mathrm{rest}} := \arg\min_{A : \|A\|_\infty \le M_{\max}} \big\| \Omega \circ (Y - \widehat{M}_1 - A) \big\|_F^2 + \nu \|A\|_*,$$
where $M_{\max} > 0$ is some large constant and $\nu = C p^{1/2} \sqrt{N + T}$ with a constant $C > 0$.
Step 3: Form the final estimator $\widehat{M} = \widehat{M}_1 + \widehat{M}_{\mathrm{rest}}$.

Note that, by the projection relation (3) and the usual properties of nuclear-norm penalization in matrix completion, we have
$$\widehat{M}_1 \approx G_1 Q_1^\top + G_2 V_1^\top P_Z + P_X W_1 Q_2^\top + P_X W_2 V_2^\top P_Z,$$
$$\widehat{M}_{\mathrm{rest}} \approx G_2 V_1^\top (I_T - P_Z) + (I_N - P_X) W_1 Q_2^\top + W_2 V_2^\top - P_X W_2 V_2^\top P_Z,$$
under conditions on the noise and on the smoothness of $g_{1,k}(\cdot)$, $g_{2,k}(\cdot)$, $q_{1,k}(\cdot)$, and $q_{2,k}(\cdot)$ similar to those in the previous section.
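Unlike Algorithm 1, Step 2 of Algorithm 2 has no closed form because of the mask, so an iterative solver is needed. The following is our sketch using proximal-gradient iterations, with a simple clipping step as a heuristic for the sup-norm constraint (the exact joint prox of the nuclear norm and the box constraint is not separable):

```python
import numpy as np

def projector(B):
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

def svt(B, tau):
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def algorithm2(Y, Omega, PhiX, PsiZ, p, C=2.0, M_max=50.0, n_iter=300):
    """Algorithm 2 sketch: projection for the M1 part, then iterative
    proximal-gradient steps for the masked nuclear-norm problem."""
    N, T = Y.shape
    PX, PZ = projector(PhiX), projector(PsiZ)
    M1 = PX @ (Omega * Y) @ PZ / p              # Step 1
    B = Omega * (Y - M1)                        # observed residual
    nu = C * np.sqrt(p) * np.sqrt(N + T)
    A = np.zeros((N, T))
    for _ in range(n_iter):                     # Step 2 (iterative solver)
        A = svt(A - Omega * (A - B), nu / 2.0)  # gradient + soft-threshold
        A = np.clip(A, -M_max, M_max)           # heuristic sup-norm projection
    return M1 + A                               # Step 3
```

As a sanity check, with all entries observed ($\Omega$ all ones, $p = 1$) and a noiseless signal lying in the sieve span, the output reduces to the exact double projection $P_X Y P_Z$.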
As above, the terms involving $P_X W_\iota$ or $P_Z V_\iota$ cancel out when we add the two estimators. Hence, our final estimator $\widehat{M}$ can estimate $M$ well. In particular, because we estimate the $M_1$ component using the projection method rather than a low-rank estimator, our approach can have advantages when $K_1$ is large or when $M_1$ is large relative to $M_2$, $M_3$, and $M_4$.

Asymptotic result. We now present the convergence rate of our estimator. We begin by introducing several additional assumptions.

Assumption 4.1 (Random missingness). The random variables $(\omega_{it})_{i \le N, t \le T}$ are i.i.d. Bernoulli with $\mathbb{E}[\omega_{it}] = p$. In addition, $\Omega$ is independent of $E$, $X$, $Z$, and $M$.

In addition to Assumption 4.1, we require a slightly different condition on the sieve approximation error than in the fully observed case.

Assumption 4.2 (Sieve approximation). (i) Assumption 3.3(i) holds. (ii) The sieve approximation satisfies
$$\frac{\min\{\sqrt{N}, \sqrt{T}\}}{\sqrt{p}} \big( K_1 J^{-\gamma_{G_1}} + K_1 J^{-\gamma_{Q_1}} + K_2 J^{-\gamma_{G_2}} + K_3 J^{-\gamma_{Q_2}} \big) \to 0.$$

The following theorem provides the convergence rate of our estimator.

Theorem 4.1 (Convergence rate for the MAR case). Suppose that Assumptions 3.1, 3.2, 3.4, 4.1, and 4.2 hold. In addition, assume that $J \ll p \sqrt{N + T}$. Then, if $M_{\max} \ge \| M_{\mathrm{rest}} - M_2 P_Z - P_X M_3 - P_X M_4 P_Z \|_\infty$, we have
$$\|\widehat{M} - M\|_F = O_p\Big( \frac{J}{\sqrt{p}} + \sqrt{K^*} \min\Big\{ \frac{\sqrt{N + T}\,(1 + M_{\max})}{\sqrt{p}},\ \|M_2\|_F + \|M_3\|_F + \|M_4\|_F \Big\} + \sqrt{NT}\big( K_1 J^{-\gamma_{G_1}} + K_1 J^{-\gamma_{Q_1}} + K_2 J^{-\gamma_{G_2}} + K_3 J^{-\gamma_{Q_2}} \big) \Big),$$
where $K^* = \max\{K_2, K_3, K_4\}$.

Discussions similar to the fully observed case apply here. The error bound for our estimator does not depend on $K_1$ as long as $g_{1,k}(\cdot)$ and $q_{1,k}(\cdot)$ are sufficiently smooth, because we estimate $M_1$ using the projection method rather than a low-rank estimator. Hence, we can allow $K_1$ to be large. In addition, the estimator enjoys a robustness property. For simplicity, assume that $M_{\max}$ and $\|M\|_\infty$ are bounded.
First, consider the most favorable case, $M = M_1$. In this case, the convergence rate of our estimator is $O_p(J / \sqrt{p})$, provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth. By contrast, if we estimate $M$ using standard low-rank completion methods (e.g., nuclear-norm penalization), the convergence rate would be $O_p(\sqrt{K_1 (N + T)/p})$, which is much larger than ours. Moreover, even when $M$ contains an additional component $M_2 + M_3 + M_4$ beyond $M_1$, as long as this component is small (i.e., $\|M_2 + M_3 + M_4\|_F \ll \sqrt{(N + T)/p}$), our rate $O_p\big( J + \sqrt{K^*} \|M_2 + M_3 + M_4\|_F \big)$ is smaller than that of the usual low-rank completion methods, $O_p\big( \sqrt{K (N + T)/p} \big)$, where $K$ is the rank of $M$ and $K^* = \max\{K_2, K_3, K_4\}$.

Next, consider the least favorable case, $M = M_4$, where $M$ is unrelated to $X$ and $Z$. In this case, the convergence rate of our estimator is $O_p(\sqrt{K_4 (N + T)/p})$, provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth. Note that this rate coincides with that of standard low-rank completion methods. Hence, even in the least favorable case, our estimator is comparable to typical low-rank completion methods. Moreover, if $M$ contains an additional small component $M_1 + M_2 + M_3$ beyond $M_4$, the convergence rate of our estimator becomes $O_p(\sqrt{K^* (N + T)/p})$, whereas that of a typical low-rank completion method is $O_p(\sqrt{K (N + T)/p})$. Thus, when $K_1$ is large, our method can yield a tighter bound. Similar discussions apply to the cases $M = M_2$ and $M = M_3$: our estimator attains the same rate as standard low-rank completion methods, and it can yield a better bound when $M$ also contains an $M_1$ component with large $K_1$.
4.2 Missing-Not-At-Random Case

Although the missing-at-random assumption is common in the matrix completion literature, it can be inappropriate for some important applications, such as imputing control potential outcomes in causal panel models, where treatment is assigned to a subset of units starting at a certain time (or in a staggered fashion). In such settings, it may be more appropriate to treat the missingness pattern as fixed (i.e., nonrandom). Following the literature on matrix completion under missing not at random (MNAR), e.g., Bai and Ng (2021); Choi and Yuan (2024); Yan and Wainwright (2024), we assume that the missingness pattern takes the form shown in Figure 1. In this setting, all (or some) entries in the "miss" submatrix are unobserved, while all entries in the "tall" and "wide" submatrices are observed. This pattern is prevalent in causal panel data: the "wide" submatrix corresponds to observations for the control group over all time periods, and the "tall" submatrix corresponds to observations for all units in the pre-treatment period, where the outcome is the potential outcome under control.

Figure 1: Missing pattern in the MNAR case

Estimation. Note that the entries in the "tall" and "wide" submatrices are fully observed. Hence, we can apply Algorithm 1 to the "tall" and "wide" submatrices to estimate $M_{\mathrm{tall}} = (m_{it})_{i \le N, t \le T_0}$ and $M_{\mathrm{wide}} = (m_{it})_{i \le N_0, t \le T}$. Then, as noted in Corollary 3.2, we can estimate the left and right singular vectors of $M_{\mathrm{tall}}$ and $M_{\mathrm{wide}}$, respectively. Importantly, the left singular vectors of $M_{\mathrm{tall}}$ and $M$ span the same space. Similarly, the right singular vectors of $M_{\mathrm{wide}}$ and $M$ span the same space. Hence, by combining the estimator of the left singular vectors of $M_{\mathrm{tall}}$ with the estimator of the right singular vectors of $M_{\mathrm{wide}}$, with an appropriate rotation adjustment, we can estimate $M$. Specifically, we estimate $M$ as follows.
Algorithm 3 Estimation procedure for the MNAR case

Step 1: From the "tall" submatrix $Y_{\mathrm{tall}} = (y_{it})_{i \le N,\, t \le T_0}$, obtain $\widehat{M}_{\mathrm{tall}}$ using Algorithm 1 and compute its left singular vectors $\widehat{U}_{\mathrm{tall}} \in \mathbb{R}^{N \times K}$.

Step 2: From the "wide" submatrix $Y_{\mathrm{wide}} = (y_{it})_{i \le N_0,\, t \le T}$, obtain $\widehat{M}_{\mathrm{wide}}$ using Algorithm 1 and compute its left and right singular vectors $\widehat{U}_{\mathrm{wide}} \in \mathbb{R}^{N_0 \times K}$ and $\widehat{V}_{\mathrm{wide}} \in \mathbb{R}^{T \times K}$, along with the corresponding singular values $\widehat{D}_{\mathrm{wide}} \in \mathbb{R}^{K \times K}$.

Step 3: Obtain the rotation matrix $\widehat{H}_{\mathrm{adj}} \in \mathbb{R}^{K \times K}$ by regressing $\widehat{U}_{\mathrm{wide}}$ on the submatrix of $\widehat{U}_{\mathrm{tall}}$ corresponding to $i \le N_0$.

Step 4: Form the final estimator $\widehat{M} = \widehat{U}_{\mathrm{tall}} \widehat{H}_{\mathrm{adj}} \widehat{D}_{\mathrm{wide}} \widehat{V}_{\mathrm{wide}}^{\top}$.

Because this estimator is built on Algorithm 1 for the fully observed case, we expect it to share similar advantages to those discussed in Section 3.

Asymptotic result. To make this point more precise, we present the convergence rate of our estimator. We begin with an additional assumption.

Assumption 4.3 (Block incoherence). Denote the $i$-th column of $U^{\top}$ and the $t$-th column of $V^{\top}$ by $u_i$ and $v_t$, respectively. Then there exist constants $c_1, c_2 > 0$ such that, with probability approaching one,
$$c_1 \le \lambda_{\min}\!\left(\frac{N}{N_0}\sum_{i \le N_0} u_i u_i^{\top}\right) \le \lambda_{\max}\!\left(\frac{N}{N_0}\sum_{i \le N_0} u_i u_i^{\top}\right) \le c_2,$$
$$c_1 \le \lambda_{\min}\!\left(\frac{T}{T_0}\sum_{t \le T_0} v_t v_t^{\top}\right) \le \lambda_{\max}\!\left(\frac{T}{T_0}\sum_{t \le T_0} v_t v_t^{\top}\right) \le c_2.$$

This assumption can be viewed as an incoherence condition ensuring that the left singular vectors of $M$ are not dominated by either treated or untreated units, and that the right singular vectors are not dominated by either pre-treatment or post-treatment periods. It is common in the MNAR matrix completion literature (e.g., Assumption D in Bai and Ng (2021) and Theorem 3.1(v) in Choi and Yuan (2024)) and allows us to relate the properties of the submatrices $M_{\mathrm{tall}}$ and $M_{\mathrm{wide}}$ to those of $M$.
For example, if $\{u_i\}_{i \le N}$ is stationary, then
$$\frac{N}{N_0}\sum_{i \le N_0} u_i u_i^{\top} \approx \sum_{i \le N} u_i u_i^{\top} = I_K,$$
and the condition is satisfied.

We are now in a position to state the convergence rate of the estimator. Let $\mathcal{R}_{\mathrm{tall}}$ and $\mathcal{R}_{\mathrm{wide}}$ denote the upper bounds on $\|\widehat{M}_{\mathrm{tall}} - M_{\mathrm{tall}}\|_F$ and $\|\widehat{M}_{\mathrm{wide}} - M_{\mathrm{wide}}\|_F$, respectively, as given by Theorem 3.1:
$$\begin{aligned}
\mathcal{R}_{\mathrm{tall}} ={}& J + \sqrt{K_2 + K_4}\,\min\!\big\{\sqrt{T_0},\; \|M_{2,\mathrm{tall}}\|_F + \|P_X M_{4,\mathrm{tall}}\|_F\big\} \\
&+ \sqrt{K_3 + K_4}\,\min\!\big\{\sqrt{N},\; \|M_{3,\mathrm{tall}}\|_F + \|M_{4,\mathrm{tall}} P_{Z,\mathrm{sub}}\|_F\big\} \\
&+ \sqrt{K_4}\,\min\!\big\{\sqrt{N + T_0},\; \|M_{4,\mathrm{tall}}\|_F\big\} \\
&+ \sqrt{N T_0}\Big(\sqrt{K_1}\, J^{-\gamma_{G_1}} + \sqrt{K_1}\, J^{-\gamma_{Q_1}} + \sqrt{K_2}\, J^{-\gamma_{G_2}} + \sqrt{K_3}\, J^{-\gamma_{Q_2}}\Big),
\end{aligned}$$
$$\begin{aligned}
\mathcal{R}_{\mathrm{wide}} ={}& J + \sqrt{K_2 + K_4}\,\min\!\big\{\sqrt{T},\; \|M_{2,\mathrm{wide}}\|_F + \|P_{X,\mathrm{sub}} M_{4,\mathrm{wide}}\|_F\big\} \\
&+ \sqrt{K_3 + K_4}\,\min\!\big\{\sqrt{N_0},\; \|M_{3,\mathrm{wide}}\|_F + \|M_{4,\mathrm{wide}} P_Z\|_F\big\} \\
&+ \sqrt{K_4}\,\min\!\big\{\sqrt{N_0 + T},\; \|M_{4,\mathrm{wide}}\|_F\big\} \\
&+ \sqrt{N_0 T}\Big(\sqrt{K_1}\, J^{-\gamma_{G_1}} + \sqrt{K_1}\, J^{-\gamma_{Q_1}} + \sqrt{K_2}\, J^{-\gamma_{G_2}} + \sqrt{K_3}\, J^{-\gamma_{Q_2}}\Big),
\end{aligned}$$
where $P_{X,\mathrm{sub}} = \Phi_{\mathrm{sub}}(X)\big(\Phi_{\mathrm{sub}}(X)^{\top}\Phi_{\mathrm{sub}}(X)\big)^{-1}\Phi_{\mathrm{sub}}(X)^{\top}$, $P_{Z,\mathrm{sub}} = \Psi_{\mathrm{sub}}(Z)\big(\Psi_{\mathrm{sub}}(Z)^{\top}\Psi_{\mathrm{sub}}(Z)\big)^{-1}\Psi_{\mathrm{sub}}(Z)^{\top}$, $\Phi_{\mathrm{sub}}(X) = \big(\phi(x_1), \ldots, \phi(x_{N_0})\big)^{\top}$, and $\Psi_{\mathrm{sub}}(Z) = \big(\psi(z_1), \ldots, \psi(z_{T_0})\big)^{\top}$. In addition, let $\delta_N = N_0/N$ and $\delta_T = T_0/T$. The following theorem provides the convergence rate of our estimator.

Theorem 4.2 (Convergence rate for the MNAR case). Suppose that Assumptions 3.1-3.4 hold for the submatrices $Y_{\mathrm{tall}}$ and $Y_{\mathrm{wide}}$, and that Assumption 4.3 holds. In addition, assume that
$$\frac{\max\{\mathcal{R}_{\mathrm{wide}}, \mathcal{R}_{\mathrm{tall}}\}}{\lambda_{\min}\sqrt{\delta_N \delta_T}} \to_p 0,$$
where $\lambda_{\min}$ is the smallest nonzero singular value of $M$. Then,
$$\|\widehat{M} - M\|_F = O_p\!\left(\kappa\,\frac{\max\{\mathcal{R}_{\mathrm{wide}}, \mathcal{R}_{\mathrm{tall}}\}}{\sqrt{\delta_N \delta_T}}\right),$$
where $\kappa = \lambda_{\max}/\lambda_{\min}$.

Theorem 4.2 highlights the advantage of our estimator. A discussion similar to that in Section 3 applies. For simplicity, consider a typical case in which $\kappa$ is bounded, $N_0 \ge cN$, and $T_0 \ge cT$ for some $c > 0$. First, consider the most favorable case, $M = M_1$.
In this case, the convergence rate of our estimator is $O_p(J)$ provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth. By contrast, if we estimate the submatrices $M_{\mathrm{tall}}$ and $M_{\mathrm{wide}}$ using an MNAR low-rank method (e.g., Bai and Ng (2021)), the convergence rate would be $O_p\big(\sqrt{K_1(N+T)}\big)$, which is much larger than ours. Next, consider the least favorable case, $M = M_4$. In this case, the convergence rate of our estimator is $O_p\big(\sqrt{K_4(N+T)}\big)$, provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth and $J \ll \sqrt{K_4(N+T)}$. This rate coincides with that obtained by applying an MNAR low-rank estimator to the submatrices $M_{\mathrm{tall}}$ and $M_{\mathrm{wide}}$. In addition, if $M = M_2$, the convergence rate of our estimator is $O_p(\sqrt{K_2 T})$ provided that $g_{\iota,k}(\cdot)$ and $q_{\iota,k}(\cdot)$ are sufficiently smooth and $J \ll \sqrt{K_2 T}$, whereas a low-rank approach yields the rate $O_p\big(\sqrt{K_2(N+T)}\big)$. Similarly, if $M = M_3$, the convergence rate of our estimator is $O_p(\sqrt{K_3 N})$, whereas a low-rank approach yields $O_p\big(\sqrt{K_3(N+T)}\big)$. Therefore, when $M$ is at least partially explained by the observable characteristics $X$ and $Z$, incorporating this information can substantially improve estimation accuracy.

5 Simulated Experiments

To demonstrate the practical merits and finite-sample performance of our methodology, we conducted several sets of simulation experiments.

5.1 Change in the relative size of each part

First, to study how the relative advantage of our estimator over existing methods varies with the contribution of each component, we change the component weights and compare the estimation performance across estimators. Specifically, we consider the model
$$M = \alpha_1 M_1 + \alpha_2 M_2 + \alpha_3 M_3 + \alpha_4 M_4,$$
where $\sum_{r=1}^{4} \alpha_r = 1$ and $\|M_r\|_F = 2\sqrt{NT}$ for all $1 \le r \le 4$. We vary the values of $\alpha_r$ and evaluate the mean squared error of the estimators.

Data generating process.
We consider eight characteristics, with $x_i = [x_{1,i}, x_{2,i}, x_{3,i}, x_{4,i}]^{\top}$ and $z_t = [z_{1,t}, z_{2,t}, z_{3,t}, z_{4,t}]^{\top}$. We draw $x_{1,i} \overset{i.i.d.}{\sim} \mathrm{Unif}[-1, 1]$, $x_{2,i} \overset{i.i.d.}{\sim} \mathrm{Unif}[-0.5, 0.5]$, $x_{3,i} \overset{i.i.d.}{\sim} N(0, 0.2^2)$, and $x_{4,i} \overset{i.i.d.}{\sim} N(0, 0.3^2)$. We generate $z_{1,t}, z_{2,t}, z_{3,t}, z_{4,t}$ in the same way.

For the matrix $M_1$, we set
$$g_{1,k}(x_i) = b_0^{(1,k)} + \sum_{d=1}^{4}\Big(b_{d,1}^{(1,k)} x_{d,i} + b_{d,2}^{(1,k)} x_{d,i}^2 + b_{d,3}^{(1,k)} x_{d,i}^3 + b_{d,4}^{(1,k)} x_{d,i}^4\Big), \quad k \le K_1 = 17,$$
and draw the coefficients $b_0^{(1,k)}$ and $b_{d,j}^{(1,k)}$ from the standard normal distribution. Similarly, we set
$$q_{1,k}(z_t) = a_0^{(1,k)} + \sum_{d=1}^{4}\Big(a_{d,1}^{(1,k)} z_{d,t} + a_{d,2}^{(1,k)} z_{d,t}^2 + a_{d,3}^{(1,k)} z_{d,t}^3 + a_{d,4}^{(1,k)} z_{d,t}^4\Big), \quad k \le K_1 = 17,$$
and draw the coefficients $a_0^{(1,k)}$ and $a_{d,j}^{(1,k)}$ from the standard normal distribution.

For the matrix $M_2$, we set $K_2 = 3$ and generate $G_2(X)$ using the same specification as above. In addition, we generate $v_{1,t} \in \mathbb{R}^3$ i.i.d. from $N\big(0, \mathrm{diag}(0.5, 1, 1.5)\big)$ and stack them into $V_1 = [v_{1,1}, \ldots, v_{1,T}]^{\top}$. For the matrix $M_3$, we set $K_3 = 3$ and generate $Q_2(Z)$ using the same specification as above. We generate $w_{1,i} \in \mathbb{R}^3$ i.i.d. from $N\big(0, \mathrm{diag}(0.5, 1, 1.5)\big)$ and stack them into $W_1 = [w_{1,1}, \ldots, w_{1,N}]^{\top}$. For the matrix $M_4$, we draw $w_{2,i} \in \mathbb{R}^3$ i.i.d. from $N\big(0, \mathrm{diag}(0.5, 1, 1.5)\big)$ and $v_{2,t} \in \mathbb{R}^3$ i.i.d. from $N(0, 1.5^2 I_3)$. Lastly, we normalize all four matrices so that $\|M_r\|_F = 2\sqrt{NT}$ for $r = 1, 2, 3, 4$. We generate the noise entries i.i.d. from $N(0, 0.5^2)$.

For the fully observed case, we set $N = T = 200$. For the MAR (missing at random) case, we set $N = T = 400$ and the observation probability $p = 0.6$. For the MNAR (missing not at random) case, we set $N = T = 400$ and $N_0 = T_0 = 200$.

Results.
Here, we use a polynomial sieve with $J = 5$, and we set the number of iterations to 100. We vary $\alpha_r$ under the restrictions $\sum_{r=1}^{4} \alpha_r = 1$ and $\alpha_2 = \alpha_3$. In addition, to keep the rank of $M$ constant, we restrict attention to cases in which $\alpha_r \ge 0.01$ for all $r = 1, 2, 3, 4$.

Figure 2: AMSE under different values of $\alpha_r$. Footnote: We vary $\alpha_r$ under the restrictions $\sum_{r=1}^{4} \alpha_r = 1$ and $\alpha_2 = \alpha_3$.

We first study the fully observed case. Figure 2 reports the AMSE (average mean squared error) of the estimators. Here, "Oracle" denotes the spectral estimator with known $K$; "Nuclear norm" denotes the plain nuclear-norm-penalized estimator; "Double projection" denotes $P_X Y P_Z$; and "Spectral" denotes the spectral estimator with an estimated $K$. For rank estimation, we use the eigenvalue-ratio method of Ahn and Horenstein (2013).

From the left panel, we see that, in general, the double projection estimator and the spectral estimator with an estimated rank perform poorly relative to the other estimators. From the right panel, we find that our method performs better than the spectral estimator with known $K$. The AMSE of the oracle estimator is quite stable and is not sensitive to changes in $\alpha_r$. In contrast, the AMSEs of our method and the nuclear-norm-penalized estimator are strongly affected by $\alpha_r$. Overall, our method outperforms nuclear-norm penalization except when $\alpha_1$ is very small and $\alpha_4$ is large. When $\alpha_1$ is large and $\alpha_4$ is small, our estimator performs particularly well.

Figure 3: $(\mathrm{AMSE}_{\mathrm{other}} - \mathrm{AMSE}_{\mathrm{our}})/\mathrm{AMSE}_{\mathrm{our}}$ under different values of $\alpha_r$. Footnote: In the left panel, the value at $\alpha_1 = 1$ and $\alpha_4 = 0.01$ is 16.66. In the right panel, the value at $\alpha_1 = 1$ and $\alpha_4 = 0.01$ is 27.12.

To assess the relative advantage of our estimator over others, Figure 3 plots $(\mathrm{AMSE}_{\mathrm{other}} - \mathrm{AMSE}_{\mathrm{our}})/\mathrm{AMSE}_{\mathrm{our}}$.
Relative to nuclear-norm penalization, the advantage of our estimator increases as $\alpha_1$ increases. Roughly speaking, the advantage also becomes larger as $\alpha_4$ decreases. In particular, when $\alpha_1$ is close to zero (e.g., $\alpha_1 = 0.01$), the relative performance improves as $\alpha_2 = \alpha_3$ increases and $\alpha_4$ decreases. In the right panel, which compares our estimator with the oracle estimator, the dependence on $\alpha_4$ is less clear; nevertheless, we still observe that the relative advantage increases with $\alpha_1$.

Next, we study the MAR (missing at random) case. Figure 4 reports the AMSE (average mean squared error) of the estimators as well as their relative performance. Here, we include the double projection method $p^{-1} P_X (\Omega \circ Y) P_Z$ and the nuclear-norm-penalized estimator, which is a standard approach in the MAR setting.

Figure 4: Performance comparison in the MAR case. Footnote: In the right panel, the value at $\alpha_1 = 1$ and $\alpha_4 = 0.01$ is 3.77.

We find that the double projection method performs very poorly except when $\alpha_1 = 1$. Relative to nuclear-norm penalization, the advantage of our method increases as $\alpha_1$ increases. However, the pattern with respect to $\alpha_4$ is less clear than in the fully observed case. This may be because, in the MAR setting, we cannot separately estimate $M_2$, $M_3$, and $M_4$. When $\alpha_1$ is very small and $\alpha_4$ is relatively large, nuclear-norm penalization performs better than our method.

Figure 5: Performance comparison in the MNAR case. Footnote: In the right panel, the value at $\alpha_1 = 1$ and $\alpha_4 = 0.01$ is 15.58.

Lastly, we study the MNAR (missing not at random) case. Figure 5 reports the AMSE of the estimators and the relative advantage of our estimator. Here, we assume the rank is known and compare our method with the standard spectral-based estimator for the MNAR setting in Bai and Ng (2021); Yan and Wainwright (2024).
We find that when $\alpha_1$ is large and/or $\alpha_4$ is small, our method generally outperforms the spectral-based estimator. Conversely, when $\alpha_1$ is small and/or $\alpha_4$ is large, the spectral-based estimator typically performs better. However, the relative advantage when $\alpha_1$ is large (or $\alpha_4$ is small) is substantially greater than the relative disadvantage when $\alpha_1$ is small (or $\alpha_4$ is large). Moreover, even when $\alpha_1$ is close to zero, our method can still perform better when $\alpha_2$ and $\alpha_3$ are large.

In summary, across all settings, our method performs substantially better than the competing estimators when $\alpha_1$ is large and/or $\alpha_4$ is small. When $\alpha_1$ is small and/or $\alpha_4$ is large, the disadvantage of our method is relatively mild compared to the gains achieved when $\alpha_1$ is large and/or $\alpha_4$ is small.

Additionally, to examine how our estimator's relative advantage over existing methods varies with the rank of each component, we vary the ranks and compare estimation performance. Overall, our method performs markedly better than competing estimators when $K_1$ is large. In contrast, when $K_1$ is small and $K_4$ is large, the advantage is more modest and performance is comparable to that of other estimators. For details, please refer to Section B in the Appendix.

5.2 Simulated tobacco sales experiment

In this section, we conduct a real-data experiment using the tobacco sales data in Abadie et al. (2010), which is widely used in the literature. In 1988, California introduced the first major anti-tobacco legislation in the United States (Proposition 99). To study the effect of this legislation on tobacco sales, Abadie et al. (2010) used per-capita cigarette sales data collected from 1970 to 2000 across 38 U.S. states with no anti-tobacco legislation prior to 2000 ($N = 38$, $T = 31$).
We encode these data into a $38 \times 31$ matrix $Y$, where the entry $y_{it}$ represents the "potential" outcome of per-capita cigarette sales (in packs) for state $i$ in year $t$ under "control," i.e., in the absence of any intervention. To generate missing entries, we artificially introduce interventions (i.e., missingness) for a subset of states: in each iteration, we randomly select 8 states to adopt an intervention (e.g., a tobacco control program) starting from period $T_0 + 1$. After rearranging the matrix, this yields the block-missing pattern shown in Figure 1, with an $8 \times (T - T_0)$ missing submatrix.

For state-level characteristics, we use the time-averaged retail price of cigarettes, log per-capita state personal income, the percentage of the population aged 18-24, the percentage of adults completing four years of college or more, and per-capita beer consumption. Most of these variables are averaged over the 1970-2000 period. In addition, as a proxy for a state's general preference for tobacco, we use per-capita cigarette sales in 2001. For year-level characteristics, we use log per-capita real GDP and the state-average retail price of cigarettes in each year. We also use the average per-capita cigarette sales of Florida and Michigan as a proxy for general tobacco preference in each year. Although Florida and Michigan are not included among the above 38 states because they had interventions before 2000, the effects of those interventions were relatively mild compared to other treated states. Appendix Section A provides additional details on the construction of these characteristics.

We compare the performance of our estimator with that of the spectral-based estimator in Bai and Ng (2021); Yan and Wainwright (2024), which is a standard method for block-missing patterns. This approach estimates the "tall" and "wide" submatrices using a spectral estimator.
For rank estimation, we use the eigenvalue-ratio method of Ahn and Horenstein (2013). For the projection step in our method, we use a second-order polynomial sieve ($J = 2$). We set the number of iterations to 100.

We first compare the AMSE (average mean squared error) over the missing entries. Specifically, in each iteration, we sum the squared estimation errors over all $8(T - T_0)$ missing entries and divide by $8(T - T_0)$, and then average this quantity across iterations. We compute the estimation error as the difference between the estimated value and the observed per-capita cigarette sales for each missing entry. The first two rows of Table 2 report the results for different adoption times $T_0$. We find that our method outperforms the spectral-based estimator in all cases. In particular, when the number of observed periods is relatively small (i.e., $T_0$ is small), the performance gap is larger.

Table 2: Average mean squared errors

Target parameter                          Method    T0=10    15       20       25
Each missing element                      Ours      239.28   219.69   217.05   181.76
                                          Spectral  268.28   238.57   233.12   194.19
Average of missing elements in each year  Ours      43.56    43.67    32.19    27.01
                                          Spectral  50.45    48.19    34.91    27.91
Average of all missing elements           Ours      32.89    35.77    25.28    23.75
                                          Spectral  41.35    40.41    28.69    25.47

Moreover, we consider the AMSE of (i) the average of the missing entries in each year and (ii) the average of all missing entries. For the AMSE of the year-by-year averages, in each iteration we compute the average of the missing entries in each post-intervention year, compute the squared estimation error for each such average, sum these squared errors, and divide by $T - T_0$. We then average this quantity across iterations. For the AMSE of the overall average, in each iteration we compute the average of all missing entries, compute its squared estimation error, and then average it across iterations.
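The error summaries just described amount to simple bookkeeping over the missing block. The following schematic (with hypothetical argument names; the paper's own implementation is not shown) computes the three per-iteration quantities that are then averaged across iterations:

```python
import numpy as np

def missing_block_errors(Y_obs, M_hat, treated_rows, T0):
    """Per-iteration error summaries over the block of missing entries.

    Y_obs        : (N, T) matrix of held-out observed outcomes
    M_hat        : (N, T) imputed matrix
    treated_rows : indices of the treated states (8 in the experiment)
    T0           : last pre-treatment period; columns T0..T-1 are missing
    """
    T = Y_obs.shape[1]
    diff = (M_hat - Y_obs)[np.ix_(treated_rows, range(T0, T))]

    # (i) MSE over each missing element: average of squared entrywise errors.
    mse_element = np.mean(diff ** 2)

    # (ii) squared error of the yearly average of missing entries,
    #      averaged over the T - T0 post-treatment years.
    mse_yearly = np.mean(diff.mean(axis=0) ** 2)

    # (iii) squared error of the overall average of all missing entries.
    mse_overall = diff.mean() ** 2

    return mse_element, mse_yearly, mse_overall
```

Note that for targets (ii) and (iii) the error of the average equals the average of the entrywise errors, which is why these average-type targets benefit from noise cancellation.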
These average-type targets have the advantage that the noise in outcomes is averaged out; therefore, averages of $y_{it}$ are close to averages of $m_{it}$. The last four rows of Table 2 report the results for different adoption times $T_0$. We find that our method outperforms the spectral-based estimator, and the performance gap increases when the number of observed periods is small (i.e., when $T_0$ is small).

As an alternative, we also consider different proxies for tobacco preference. For the proxy of each state's general preference for tobacco, we use the time average of per-capita cigarette sales over the full sample period when estimating the "wide" submatrix, and we use the average over the pre-intervention periods $1, \ldots, T_0$ when estimating the "tall" submatrix. Similarly, for the proxy of each year's general preference for tobacco, we use the average per-capita cigarette sales across the 30 control states in each year when estimating the "wide" submatrix, and we use the average across all 38 states in each year when estimating the "tall" submatrix.

Table 3: Average mean squared errors

Target parameter                          Method    T0=10    15       20       25
Each missing element                      Ours      218.32   211.43   212.12   175.56
                                          Spectral  268.28   238.57   233.12   194.19
Average of missing elements in each year  Ours      40.26    42.67    32.02    25.35
                                          Spectral  50.45    48.19    34.91    27.91
Average of all missing elements           Ours      30.78    34.83    25.82    22.84
                                          Spectral  41.35    40.41    28.69    25.47

Table 3 reports the results when we use these alternative proxies as characteristics. We find that the performance of our method improves in most cases, and its relative advantage becomes larger. This type of proxy is not fully consistent with the theory because it may violate the exogeneity condition; however, because the proxy averages out outcome noise, the resulting endogeneity may be negligible in practice.
In our experiment, the results using these proxies are indeed favorable. In summary, the empirical results suggest that incorporating side information can improve estimation of the control potential outcomes in causal panel settings, relative to typical low-rank methods.

6 Concluding Remarks

This paper proposes a flexible framework for high-dimensional matrix estimation that systematically incorporates rich side information on both rows and columns. By decomposing the signal into components explained jointly by $(X, Z)$, by $X$ alone, by $Z$ alone, and by neither, and by estimating these components using sieve projection combined with nuclear-norm penalization, our approach accommodates nonlinear covariate effects, avoids explicit rank selection for each component, and automatically thresholds weak or negligible signals. We establish convergence rates that demonstrate robustness across diverse model configurations, matching specialized procedures in favorable settings while remaining competitive when side information is uninformative. We further extend the method to partially observed matrices under both MAR and MNAR mechanisms, including block-missing patterns motivated by causal panel data, and show through simulations and a tobacco-sales application that leveraging side information can substantially improve imputation accuracy and enhance treatment-effect estimation.

References

Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493-505.

Agarwal, A., Dahleh, M., Shah, D., and Shen, D. (2023). Causal matrix completion. In The Thirty Sixth Annual Conference on Learning Theory, pages 3821-3826. PMLR.

Ahn, S. C. and Horenstein, A. R. (2013). Eigenvalue ratio test for the number of factors.
Econometrica, 81(3):1203-1227.

Athey, S., Bayati, M., Doudchenko, N., Imbens, G., and Khosravi, K. (2021). Matrix completion methods for causal panel data models. Journal of the American Statistical Association, 116(536):1716-1730.

Bai, J. and Ng, S. (2021). Matrix completion, counterfactuals, and factor analysis of missing data. Journal of the American Statistical Association, 116(536):1746-1763.

Bai, J., Ng, S., et al. (2008). Large dimensional factor analysis. Foundations and Trends in Econometrics, 3(2):89-163.

Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Chen, Q., Roussanov, N., and Wang, X. (2023). Semiparametric conditional factor models: Estimation and inference. Technical report, National Bureau of Economic Research.

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549-5632.

Chen, Y., Chi, Y., Fan, J., Ma, C., et al. (2021). Spectral methods for data science: A statistical perspective. Foundations and Trends in Machine Learning, 14(5):566-806.

Chernozhukov, V., Hansen, C., Liao, Y., and Zhu, Y. (2023). Inference for low-rank models. The Annals of Statistics, 51(3):1309-1330.

Chiang, K.-Y., Hsieh, C.-J., and Dhillon, I. (2016). Robust principal component analysis with side information. In International Conference on Machine Learning, pages 2291-2299. PMLR.

Chiang, K.-Y., Hsieh, C.-J., and Dhillon, I. S. (2015). Matrix completion with noisy side information. Advances in Neural Information Processing Systems, 28.

Choi, J. and Yuan, M. (2024). Matrix completion when missing is not at random and its applications in causal panel data models. Journal of the American Statistical Association, (just-accepted):1-24.

Fan, J., Liao, Y., and Wang, W. (2016). Projected principal component analysis in factor models.
Annals of Statistics, 44(1):219.

Jain, P. and Dhillon, I. S. (2013). Provable inductive matrix completion. arXiv preprint arXiv:1306.0626.

Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media.

Ledent, A., Alves, R., Lei, Y., Guermeur, Y., and Kloft, M. (2023). Generalization bounds for inductive matrix completion in low-noise settings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8447-8455.

Ma, S., Niu, P.-Y., Zhang, Y., and Zhu, Y. (2025). Statistical inference for noisy matrix completion incorporating auxiliary information. Journal of the American Statistical Association, 120(549):343-355.

Mao, X., Chen, S. X., and Wong, R. K. (2019). Matrix completion with covariate information. Journal of the American Statistical Association, 114(525):198-210.

Niranjan, U., Rajkumar, A., and Tulabandhula, T. (2017). Provable inductive robust PCA via iterative hard thresholding. arXiv preprint arXiv:1704.00367.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press.

Wang, Y. and Elhamifar, E. (2018). High rank matrix completion with side information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Xu, M., Jin, R., and Zhou, Z.-H. (2013). Speedup matrix completion with side information: Application to multi-label learning. Advances in Neural Information Processing Systems, 26.

Xue, N., Panagakis, Y., and Zafeiriou, S. (2017). Side information in robust principal component analysis: Algorithms and applications. In Proceedings of the IEEE International Conference on Computer Vision, pages 4317-4325.

Yan, Y. and Wainwright, M. J. (2024).
Entrywise inference for causal panel data: A simple and instance-optimal approach. arXiv preprint arXiv:2401.13665.

Zhang, X., Du, S., and Gu, Q. (2018). Fast and sample efficient inductive matrix completion via multi-phase Procrustes flow. In International Conference on Machine Learning, pages 5756-5765. PMLR.

Zhong, K., Song, Z., Jain, P., and Dhillon, I. S. (2019). Provable non-linear inductive matrix completion. Advances in Neural Information Processing Systems, 32.

Zhu, Y., Shen, X., and Ye, C. (2016). Personalized prediction and sparsity pursuit in latent factor models. Journal of the American Statistical Association, 111(513):241-252.

APPENDIX

A Data Descriptions

In this section, we describe the data used in our experiment and provide sources.

• Per-capita cigarette sales (in packs). Source: The Tax Burden on Tobacco by Orzechowski and Walker, from the Centers for Disease Control and Prevention (CDC).

• Time-averaged retail price of cigarettes (in dollars): For each state, we take the average of the (annual) retail price of cigarettes over the 1970-2000 period. Here, the retail price includes the average cost and sales tax in the data of 'The Tax Burden on Tobacco' (Orzechowski and Walker). We additionally converted it to 2000 dollars using the Consumer Price Index. Source: The Tax Burden on Tobacco by Orzechowski and Walker, from the Centers for Disease Control and Prevention (CDC).

• Per-capita state personal income (logged): For each state, we take the average of (annual) logged per-capita state personal income over the 1970-2000 period. We converted the data from the U.S. Bureau of Economic Analysis to 2000 dollars using the Consumer Price Index and took logs. Source: U.S. Bureau of Economic Analysis (BEA).
• Percentage of the population aged 18-24: For each state, we take the average of the percentage of the population aged 18-24 in the 1970, 1980, 1990, and 2000 U.S. Censuses. Source: Integrated Public Use Microdata Series (IPUMS USA).

• Percentage of adults completing four years of college or higher: For each state, we take the average of the percentage of adults completing four years of college or higher in the 1970, 1980, 1990, and 2000 U.S. Censuses. Source: USDA, Economic Research Service.

• Per-capita beer consumption (in gallons): For each state, we take the average of (annual) per-capita beer consumption over the 1977-2000 period, because the data start in 1977. Source: Surveillance Report #121, 'Apparent per capita alcohol consumption: national, state, and regional trends, 1977-2022,' by the National Institute on Alcohol Abuse and Alcoholism (NIH).

• Per-capita real GDP (logged): We converted the World Bank Open Data series to logged values. Source: World Bank Open Data.

• State-average retail price of cigarettes (in dollars): For each year, we take the average of the retail price of cigarettes over the 38 states. Here, the retail price includes the average cost and sales tax in the data of 'The Tax Burden on Tobacco' (Orzechowski and Walker). We additionally converted it to 2000 dollars using the Consumer Price Index. Source: The Tax Burden on Tobacco by Orzechowski and Walker, from the Centers for Disease Control and Prevention (CDC).

B Additional Simulation: Change in the size of the rank of each part

To study how the relative advantage of our estimator over existing methods varies with the rank of each component, we vary the ranks and compare the estimation performance. Specifically, we vary $K_r$ subject to the constraints $\sum_{r=1}^{4} K_r = 15$, $K_2 = K_3$, and $K_r \ge 1$ for all $r$ (if $15 - (K_1 + K_4)$ is odd, we set $K_2 = K_3 + 1$). In addition, we fix $\alpha_r$ such that $\alpha_1 = \cdots = \alpha_4 = 0.25$.
Data generating process. We generate the eight characteristics in the same way as in Section 5.1. For the matrix $M_1$, we set
$$g_{1,k}(x_i) = b_0^{(1,k)} + \sum_{d=1}^{4}\Big(b_{d,1}^{(1,k)} x_{d,i} + b_{d,2}^{(1,k)} x_{d,i}^2 + b_{d,3}^{(1,k)} x_{d,i}^3\Big), \quad \text{for } k \le K_1,$$
and draw the coefficients $b_0^{(1,k)}$ and $b_{d,j}^{(1,k)}$ from the standard normal distribution. Similarly, we set
$$q_{1,k}(z_t) = a_0^{(1,k)} + \sum_{d=1}^{4}\Big(a_{d,1}^{(1,k)} z_{d,t} + a_{d,2}^{(1,k)} z_{d,t}^2 + a_{d,3}^{(1,k)} z_{d,t}^3\Big), \quad \text{for } k \le K_1,$$
and draw the coefficients $a_0^{(1,k)}$ and $a_{d,j}^{(1,k)}$ from the standard normal distribution.

For the matrix $M_2$, we generate $G_2(X)$ using the same specification as above. For $V_1 = [v_{1,1}, \ldots, v_{1,T}]^{\top}$, we draw $v_{1,t} \in \mathbb{R}^{K_2}$ i.i.d. from the first $K_2$ coordinates of
$$N\big(0, \mathrm{diag}(1, 0.75^2, 1.25^2, 0.5^2, 1.5^2, 0.25^2, 1.75^2)\big). \tag{4}$$
Similarly, for the matrix $M_3$, we generate $Q_2(Z)$ using the same specification as above and draw $w_{1,i} \in \mathbb{R}^{K_3}$ i.i.d. from the first $K_3$ coordinates of (4). For the matrix $M_4$, we draw $w_{2,i} \in \mathbb{R}^{K_4}$ i.i.d. from the first $K_4$ coordinates of
$$N\big(0, \mathrm{diag}(1, 0.75^2, 1.25^2, 0.75^2, 1.25^2, 0.5^2, 1.5^2, 0.5^2, 1.5^2, 0.25^2, 1.75^2, 0.25^2)\big),$$
and draw $v_{2,t} \in \mathbb{R}^{K_4}$ i.i.d. from $N(0, 1.5^2 I_{K_4})$. Lastly, we normalize all matrices so that $\|M_r\|_F = 2\sqrt{NT}$ for $r = 1, 2, 3, 4$. We generate the noise entries i.i.d. from $N(0, 1.5^2)$.

Results. We use a polynomial sieve with $J = 4$. The number of iterations and the sample size are the same as in Section 5.1. We first study the fully observed case. Figure 6 reports the AMSE (average mean squared error) of the estimators. We consider the same set of estimators as in Section 5.1. The oracle estimator is the spectral estimator with known $K$.
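For reference, the oracle spectral estimator used as the benchmark here is, in essence, a rank-$K$ truncated SVD of the observed matrix; the following is a minimal sketch for the fully observed case (our own illustration, with names of our choosing, not the authors' code):

```python
import numpy as np

def spectral_estimator(Y, K):
    """Rank-K truncated SVD of Y: the spectral estimate of M with known K."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vt[:K]

rng = np.random.default_rng(2)
N, T, K = 80, 60, 3
M = rng.normal(size=(N, K)) @ rng.normal(size=(K, T))  # low-rank signal
Y = M + 0.1 * rng.normal(size=(N, T))                  # noisy observation

M_hat = spectral_estimator(Y, K)
# Truncation discards the noise outside the top-K directions, so the
# spectral estimate is much closer to M than the raw observation is.
assert np.linalg.norm(M_hat - M) < np.linalg.norm(Y - M)
```

The "Spectral" estimator with an estimated rank differs only in that $K$ is first chosen by the eigenvalue-ratio method rather than taken as known.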
From the left panel, we see that the double projection estimator performs poorly and that the spectral estimator with an estimated rank behaves unstably. From the right panel, we find that our method outperforms the other estimators. In addition, the AMSE of the oracle estimator is quite stable and is not sensitive to changes in $K_r$, whereas the AMSEs of our method and nuclear-norm penalization vary with $K_r$. Overall, as $K_1$ increases, our method tends to perform better.

Figure 6: AMSE under different values of $K_r$. Footnote: We vary $K_r$ subject to $\sum_{r=1}^{4} K_r = 15$ and $K_2 = K_3$.

Figure 7: $(\mathrm{AMSE}_{\mathrm{other}} - \mathrm{AMSE}_{\mathrm{our}})/\mathrm{AMSE}_{\mathrm{our}}$ under different values of $K_r$. Footnote: We vary $K_r$ subject to $\sum_{r=1}^{4} K_r = 15$ and $K_2 = K_3$.

To further assess the relative advantage of our estimator, Figure 7 plots $(\mathrm{AMSE}_{\mathrm{other}} - \mathrm{AMSE}_{\mathrm{our}})/\mathrm{AMSE}_{\mathrm{our}}$. Relative to nuclear-norm penalization, the advantage of our estimator becomes larger as $K_1$ increases. Roughly speaking, the advantage also increases as $K_4$ decreases. In the right panel, which compares our estimator with the oracle estimator, the dependence on $K_4$ is less clear; nevertheless, we still observe that the relative advantage increases with $K_1$.

Figure 8: Performance comparison in the MAR case

Next, we study the MAR (missing at random) case. Figure 8 reports the AMSE (average mean squared error) and the ratio between the AMSEs of our method and nuclear-norm penalization, which is a standard estimator in the MAR setting. Relative to nuclear-norm penalization, our method performs better, and its advantage increases as $K_1$ increases. However, the pattern with respect to $K_4$ is less clear.

Figure 9: Performance comparison in the MNAR case. Footnote: In the right panel, the value at $\alpha_1 = 1$ and $\alpha_4 = 0.01$ is 15.58.

Lastly, we study the MNAR (missing not at random) case.
Figure 9 reports the AMSE and the relative advantage of our estimator. Here, we consider the same estimators as in Section 5.1. The spectral-based method refers to the approach that uses the oracle estimator to estimate $M_{\text{tall}}$ and $M_{\text{wide}}$. When $K_1$ is large and/or $K_4$ is small, our method generally outperforms the spectral-based method. Conversely, when $K_1$ is small and/or $K_4$ is large, the spectral-based method performs better. However, the relative advantage in the former case is substantially larger than the relative disadvantage in the latter case. In summary, across all observation patterns, our method performs markedly better than the competing estimators when $K_1$ is large. When $K_1$ is small, the disadvantage of our method is relatively mild compared with the gains observed when $K_1$ is large.

C Proofs

C.1 Proofs of main results

C.1.1 Proof of Theorem 3.1

For $\iota \in \{1, 2\}$, define the sieve approximation error matrices as $R_{G_\iota} = G_\iota(X) - \Phi(X) B_\iota$ and $R_{Q_\iota} = Q_\iota(Z) - \Psi(Z) A_\iota$, where $A_\iota$ and $B_\iota$ are the sieve coefficient matrices consisting of the $a_{\iota,k}$ and $b_{\iota,k}$ in Assumption 3.3, respectively. First, we note that
$$\widehat{M}_1 = P_X G_1 Q_1^\top P_Z + P_X G_2 V_1^\top P_Z + P_X W_1 Q_2^\top P_Z + P_X W_2 V_2^\top P_Z + P_X E P_Z \tag{5}$$
$$= G_1 Q_1^\top + G_2 V_1^\top P_Z + P_X W_1 Q_2^\top + P_X W_2 V_2^\top P_Z + P_X E P_Z + \textit{smoothing error}_1,$$
where
$$\textit{smoothing error}_1 = (P_X - I_N) R_{G_1} Q_1^\top + G_1 R_{Q_1}^\top (P_Z - I_T) + (P_X - I_N) R_{G_1} R_{Q_1}^\top (P_Z - I_T) + (P_X - I_N) R_{G_2} V_1^\top P_Z + P_X W_1 R_{Q_2}^\top (P_Z - I_T).$$
Then, because, by Assumption 3.3, we have $\|R_{G_1}\|_F = O_p(\sqrt{N K_1}\, J^{-\gamma_{G_1}})$, $\|R_{G_2}\|_F = O_p(\sqrt{N K_2}\, J^{-\gamma_{G_2}})$, $\|R_{Q_1}\|_F = O_p(\sqrt{T K_1}\, J^{-\gamma_{Q_1}})$, and $\|R_{Q_2}\|_F = O_p(\sqrt{T K_3}\, J^{-\gamma_{Q_2}})$, we know
$$\|\textit{smoothing error}_1\|_F = O_p\!\left( \sqrt{NT}\, K_1 J^{-\gamma_{G_1}} + \sqrt{NT}\, K_1 J^{-\gamma_{Q_1}} + \sqrt{NT}\, K_2 J^{-\gamma_{G_2}} + \sqrt{NT}\, K_3 J^{-\gamma_{Q_2}} \right).$$
(6)

Next, for $\widehat{M}_2$, note that
$$P_X Y (I_T - P_Z) = P_X G_1 Q_1^\top (I_T - P_Z) + P_X G_2 V_1^\top (I_T - P_Z) + P_X W_1 Q_2^\top (I_T - P_Z) + P_X W_2 V_2^\top (I_T - P_Z) + P_X E (I_T - P_Z)$$
$$= G_2 V_1^\top (I_T - P_Z) + P_X W_2 V_2^\top (I_T - P_Z) + P_X E (I_T - P_Z) + \textit{smoothing error}_2,$$
where
$$\textit{smoothing error}_2 = P_X G_1 R_{Q_1}^\top (I_T - P_Z) + (P_X - I_N) R_{G_2} V_1^\top (I_T - P_Z) + P_X W_1 R_{Q_2}^\top (I_T - P_Z).$$
By Assumption 3.3, we have
$$\|\textit{smoothing error}_2\| = O_p\!\left( \sqrt{NT}\, K_1 J^{-\gamma_{Q_1}} + \sqrt{NT}\, K_2 J^{-\gamma_{G_2}} + \sqrt{NT}\, K_3 J^{-\gamma_{Q_2}} \right) = o_p(\sqrt{T}).$$
Then, because $\|P_X E (I_T - P_Z)\| \lesssim \sqrt{T}$ with high probability by Lemma C.4, we have $\|P_X E (I_T - P_Z) + \textit{smoothing error}_2\| \le \nu_2 = C_2 \sqrt{T}$ for some large $C_2 > 0$ with high probability. Hence, by setting $S = P_X E (I_T - P_Z) + \textit{smoothing error}_2$ and $L = M_2 (I_T - P_Z) + P_X M_4 (I_T - P_Z)$, we get by Lemma C.1 that
$$\left\| \widehat{M}_2 - M_2 (I_T - P_Z) - P_X M_4 (I_T - P_Z) \right\|_F = O_p\!\left( \sqrt{K_2 + K_4}\, \min\left\{ \sqrt{T},\; \|M_2\|_F + \|P_X M_4\|_F \right\} \right). \tag{7}$$
For $\widehat{M}_3$, note that
$$(I_N - P_X) Y P_Z = (I_N - P_X) W_1 Q_2^\top + (I_N - P_X) W_2 V_2^\top P_Z + (I_N - P_X) E P_Z + \textit{smoothing error}_3,$$
where
$$\textit{smoothing error}_3 = (I_N - P_X) R_{G_1} Q_1^\top P_Z + (I_N - P_X) R_{G_2} V_1^\top P_Z + (I_N - P_X) W_1 R_{Q_2}^\top (P_Z - I_T).$$
By Assumption 3.3, we have
$$\|\textit{smoothing error}_3\| = O_p\!\left( \sqrt{NT}\, K_1 J^{-\gamma_{G_1}} + \sqrt{NT}\, K_2 J^{-\gamma_{G_2}} + \sqrt{NT}\, K_3 J^{-\gamma_{Q_2}} \right) = o_p(\sqrt{N}).$$
Then, because $\|(I_N - P_X) E P_Z\| \lesssim \sqrt{N}$ with high probability by Lemma C.4, we have $\|(I_N - P_X) E P_Z + \textit{smoothing error}_3\| \le \nu_3 = C_3 \sqrt{N}$ for some large $C_3 > 0$ with high probability. Hence, by setting $S = (I_N - P_X) E P_Z + \textit{smoothing error}_3$ and $L = (I_N - P_X) W_1 Q_2^\top + (I_N - P_X) W_2 V_2^\top P_Z$, we can derive by Lemma C.1 that
$$\left\| \widehat{M}_3 - (I_N - P_X) M_3 - (I_N - P_X) M_4 P_Z \right\|_F = O_p\!\left( \sqrt{K_3 + K_4}\, \min\left\{ \sqrt{N},\; \|M_3\|_F + \|M_4 P_Z\|_F \right\} \right).$$
(8)

Lastly, for $\widehat{M}_4$, note that
$$(I_N - P_X) Y (I_T - P_Z) = (I_N - P_X) W_2 V_2^\top (I_T - P_Z) + (I_N - P_X) E (I_T - P_Z) + \textit{smoothing error}_4,$$
where
$$\textit{smoothing error}_4 = (I_N - P_X) R_{G_1} R_{Q_1}^\top (I_T - P_Z) + (I_N - P_X) R_{G_2} V_1^\top (I_T - P_Z) + (I_N - P_X) W_1 R_{Q_2}^\top (I_T - P_Z).$$
Then, by Assumption 3.3, we have
$$\|\textit{smoothing error}_4\| = O_p\!\left( \sqrt{NT}\, K_1 J^{-(\gamma_{G_1} + \gamma_{Q_1})} + \sqrt{NT}\, K_2 J^{-\gamma_{G_2}} + \sqrt{NT}\, K_3 J^{-\gamma_{Q_2}} \right) = o_p(\sqrt{N} + \sqrt{T}).$$
Then, because $\|(I_N - P_X) E (I_T - P_Z)\| \lesssim \sqrt{N} + \sqrt{T}$ with high probability by Lemma C.4, we have $\|(I_N - P_X) E (I_T - P_Z) + \textit{smoothing error}_4\| \le \nu_4 = C_4 \sqrt{N + T}$ for some large $C_4 > 0$ with high probability. Hence, by setting $S = (I_N - P_X) E (I_T - P_Z) + \textit{smoothing error}_4$ and $L = (I_N - P_X) W_2 V_2^\top (I_T - P_Z)$, we can derive by Lemma C.1 that
$$\left\| \widehat{M}_4 - (I_N - P_X) M_4 (I_T - P_Z) \right\|_F = O_p\!\left( \sqrt{K_4}\, \min\left\{ \sqrt{N + T},\; \|M_4\|_F \right\} \right). \tag{9}$$
From the relation (5), we have
$$\widehat{M} - M = \widehat{M}_1 + \widehat{M}_2 + \widehat{M}_3 + \widehat{M}_4 - (M_1 + M_2 + M_3 + M_4)$$
$$= \big( \widehat{M}_2 - M_2 (I_T - P_Z) - P_X M_4 (I_T - P_Z) \big) + \big( \widehat{M}_3 - (I_N - P_X) M_3 - (I_N - P_X) M_4 P_Z \big) + \big( \widehat{M}_4 - (I_N - P_X) M_4 (I_T - P_Z) \big) + P_X E P_Z + \textit{smoothing error}_1.$$
Hence, we have from the bounds (6), (7), (8), and (9), together with Lemma C.4, that
$$\left\| \widehat{M} - M \right\|_F = O_p\Bigg( J + \sqrt{K_2 + K_4}\, \min\Big\{ \sqrt{T},\; \|M_2\|_F + \|P_X M_4\|_F \Big\} + \sqrt{K_3 + K_4}\, \min\Big\{ \sqrt{N},\; \|M_3\|_F + \|M_4 P_Z\|_F \Big\}$$
$$+ \sqrt{K_4}\, \min\Big\{ \sqrt{N + T},\; \|M_4\|_F \Big\} + \sqrt{NT}\, \big( K_1 J^{-\gamma_{G_1}} + K_1 J^{-\gamma_{Q_1}} + K_2 J^{-\gamma_{G_2}} + K_3 J^{-\gamma_{Q_2}} \big) \Bigg). \qquad \square$$

C.1.2 Proof of Corollary 3.2

The result follows readily from the Davis–Kahan theorem (see, e.g., Corollary 2.8 and Theorem 2.9 of Chen et al. (2021)).
□

C.1.3 Proof of Theorem 4.1

First, note that
$$\widehat{M}_1 = \frac{1}{p} P_X (\Omega \circ M) P_Z + \frac{1}{p} P_X (\Omega \circ E) P_Z \tag{10}$$
$$= P_X M P_Z + \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z + \frac{1}{p} P_X (\Omega \circ E) P_Z$$
$$= G_1 Q_1^\top + G_2 V_1^\top P_Z + P_X W_1 Q_2^\top + P_X W_2 V_2^\top P_Z + \textit{smoothing error}_1 + \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z + \frac{1}{p} P_X (\Omega \circ E) P_Z,$$
where $\textit{smoothing error}_1$ is defined in (5). Note that
$$\left\| \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z \right\|_F \le \frac{1}{p} \left\| \Phi (\Phi^\top \Phi)^{-1} \right\| \left\| \Phi^\top \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) \Psi \right\|_F \left\| \Psi (\Psi^\top \Psi)^{-1} \right\|.$$
By Assumption 3.2, we know $\|\Phi(\Phi^\top \Phi)^{-1}\| \lesssim 1/\sqrt{N}$ and $\|\Psi(\Psi^\top \Psi)^{-1}\| \lesssim 1/\sqrt{T}$. In addition, because
$$\mathbb{E}\left[ \left\| \Phi^\top \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) \Psi \right\|_F^2 \,\Big|\, M, X, Z \right] = \sum_{j_1, j_2} \mathbb{E}\left[ \Big( \sum_{i,t} (\omega_{it} - p)\, m_{it}\, \phi_{i,j_1} \psi_{t,j_2} \Big)^2 \,\Big|\, M, X, Z \right]$$
$$= \sum_{j_1, j_2} \sum_{i,t} \mathbb{E}\big[ (\omega_{it} - p)^2 \big]\, m_{it}^2\, \phi_{i,j_1}^2 \psi_{t,j_2}^2 \le p \sum_{j_1, j_2} \sum_{i,t} m_{it}^2\, \phi_{i,j_1}^2 \psi_{t,j_2}^2 = p \sum_{i,t} m_{it}^2 \|\phi_i\|^2 \|\psi_t\|^2 = O_p(p N T J^2)$$
by the assumption that $\mathbb{E}[m_{it}^4]$, $\mathbb{E}[\phi_{ij}^4]$, and $\mathbb{E}[\psi_{tj}^4]$ are bounded, we have $\left\| \Phi^\top \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) \Psi \right\|_F = O_p(\sqrt{pNT}\, J)$. Hence, we have
$$\left\| \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z \right\|_F = O_p\!\left( \frac{J}{\sqrt{p}} \right).$$
By the same token, we have $\left\| \Phi^\top (\Omega \circ E) \Psi \right\|_F = O_p(\sqrt{pNT}\, J)$ and
$$\left\| \frac{1}{p} P_X (\Omega \circ E) P_Z \right\|_F = O_p\!\left( \frac{J}{\sqrt{p}} \right).$$
In addition, from (10), we have
$$Y - \widehat{M}_1 = G_2 V_1^\top (I_T - P_Z) + (I_N - P_X) W_1 Q_2^\top + \big( W_2 V_2^\top - P_X W_2 V_2^\top P_Z \big) + E - \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z - \frac{1}{p} P_X (\Omega \circ E) P_Z - \textit{smoothing error}_1.$$
Note that, under our assumptions, $\left\| \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z \right\|_F$, $\left\| \frac{1}{p} P_X (\Omega \circ E) P_Z \right\|_F$, and $\|\textit{smoothing error}_1\|_F$ are $o_p\big( p^{1/2} \sqrt{N + T} \big)$. Hence, we have
$$\|\Omega \circ S'\| \le \|\Omega \circ S'\|_F \le \|S'\|_F = o_p\!\left( p^{1/2} \sqrt{N + T} \right),$$
where
$$S' = \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z + \frac{1}{p} P_X (\Omega \circ E) P_Z + \textit{smoothing error}_1.$$
In addition, we have $\|\Omega \circ E\| \lesssim p^{1/2} \sqrt{N + T}$ with high probability by Lemma C.3, since $\|\omega_{it} \epsilon_{it}\|_{\psi_2} \le p^{1/2} \sigma$. Hence, $\|\Omega \circ (E - S')\| \lesssim p^{1/2} \sqrt{N + T}$ with high probability, and $\|\Omega \circ (E - S')\| \le \nu = C p^{1/2} \sqrt{N + T}$ for some large $C > 0$ with high probability.
Then, by setting $S = E - S'$ and $L = M_2 (I_T - P_Z) + (I_N - P_X) M_3 + M_4 - P_X M_4 P_Z$, we can get from Lemma C.2 that
$$\left\| \widehat{M}_{\text{rest}} - \big( M_2 (I_T - P_Z) + (I_N - P_X) M_3 + M_4 - P_X M_4 P_Z \big) \right\|_F = O_p\!\left( \sqrt{K^*}\, \min\left\{ \frac{\sqrt{N + T}\,(1 + M_{\max})}{\sqrt{p}},\; \|M_2\|_F + \|M_3\|_F + \|M_4\|_F \right\} \right).$$
Then, because
$$\left\| \widehat{M}_1 + \widehat{M}_{\text{rest}} - M \right\|_F \le \left\| \widehat{M}_{\text{rest}} - \big( M_2 (I_T - P_Z) + (I_N - P_X) M_3 + M_4 - P_X M_4 P_Z \big) \right\|_F + \left\| \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z + \frac{1}{p} P_X (\Omega \circ E) P_Z \right\|_F + \|\textit{smoothing error}_1\|_F,$$
and
$$\left\| \frac{1}{p} P_X \big( (\Omega - p \mathbf{1}\mathbf{1}^\top) \circ M \big) P_Z + \frac{1}{p} P_X (\Omega \circ E) P_Z \right\|_F = O_p\!\left( \frac{J}{\sqrt{p}} \right),$$
$$\|\textit{smoothing error}_1\|_F = O_p\!\left( \sqrt{NT}\, K_1 J^{-\gamma_{G_1}} + \sqrt{NT}\, K_1 J^{-\gamma_{Q_1}} + \sqrt{NT}\, K_2 J^{-\gamma_{G_2}} + \sqrt{NT}\, K_3 J^{-\gamma_{Q_2}} \right),$$
as noted in (6), we have the desired result. □

C.1.4 Proof of Theorem 4.2

For notational simplicity, denote the subscripts 'tall' and 'wide' by T and W. Here, $(U_T, D_T, V_T)$, $(U_W, D_W, V_W)$, and $(U, D, V)$ denote the SVDs of the 'tall', 'wide', and 'full' matrices, respectively. First, by applying Corollary 3.2 to the tall and wide matrices, respectively, we have
$$\left\| \widehat{U}_W O_W - U_W \right\|_F = O_p\!\left( \frac{R_W}{\lambda_{\min, W}} \right), \qquad \left\| \widehat{U}_T O_T - U_T \right\|_F = O_p\!\left( \frac{R_T}{\lambda_{\min, T}} \right).$$
In addition, by Lemma C.6, we have $U_W = U_{(N_0)} H_W$, where $U_{(N_0)} = [u_1, \cdots, u_{N_0}]^\top$ and $H_W = (U_{(N_0)}^\top U_{(N_0)})^{-1/2} G_W$ for some $K \times K$ orthogonal matrix $G_W$. Then, we have by Lemma C.6 that
$$\left\| \widehat{U}_W O_W - U_W \right\|_F = \left\| \widehat{U}_W O_W - U_{(N_0)} H_W \right\|_F = \left\| \widehat{U}_W - U_{(N_0)} Q_W^{-1} \right\|_F = O_p\!\left( \frac{R_W}{\lambda_{\min, W}} \right),$$
$$\left\| \widehat{U}_W Q_W - U_{(N_0)} \right\|_F \le \left\| \widehat{U}_W - U_{(N_0)} Q_W^{-1} \right\|_F \|Q_W\| = O_p\!\left( \frac{\sqrt{N_0}}{\sqrt{N}} \cdot \frac{R_W}{\lambda_{\min, W}} \right),$$
where $Q_W^{-1} = H_W O_W^\top$. Similarly, by Lemma C.6, we have $U_T = U H_T$, where $H_T = (U^\top U)^{-1/2} G_T$ for some $K \times K$ orthogonal matrix $G_T$. Then we have
$$\left\| \widehat{U}_T O_T - U_T \right\|_F = \left\| \widehat{U}_T O_T - U H_T \right\|_F = \left\| \widehat{U}_T - U Q_T^{-1} \right\|_F = O_p\!\left( \frac{R_T}{\lambda_{\min, T}} \right),$$
$$\left\| \widehat{U}_T Q_T - U \right\|_F \le \left\| \widehat{U}_T - U Q_T^{-1} \right\|_F \|Q_T\| = O_p\!\left( \frac{R_T}{\lambda_{\min, T}} \right),$$
where $Q_T^{-1} = H_T O_T^\top$.
Define $R_1 = \widehat{U}_W Q_W - U_{(N_0)}$ and $R_2 = \widehat{U}_{T,(N_0)} Q_T - U_{(N_0)}$, where $U_{T,(N_0)} = [u_{T,1}, \cdots, u_{T,N_0}]^\top$ and $\widehat{U}_{T,(N_0)} = [\widehat{u}_{T,1}, \cdots, \widehat{u}_{T,N_0}]^\top$. Then, we have
$$\widehat{U}_W Q_W - R_1 = \widehat{U}_{T,(N_0)} Q_T - R_2 \;\Longrightarrow\; \widehat{U}_W = \widehat{U}_{T,(N_0)} H_{\mathrm{adj}} + R_1 Q_W^{-1} - R_2 Q_W^{-1},$$
where $H_{\mathrm{adj}} = Q_T Q_W^{-1}$. Hence, we have
$$\widehat{H}_{\mathrm{adj}} - H_{\mathrm{adj}} = \big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big)^{-1} \widehat{U}_{T,(N_0)}^\top \widehat{U}_W - H_{\mathrm{adj}} = \big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big)^{-1} \widehat{U}_{T,(N_0)}^\top R_1 Q_W^{-1} - \big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big)^{-1} \widehat{U}_{T,(N_0)}^\top R_2 Q_W^{-1}.$$
Then, since
$$\left\| \big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big)^{-1} \widehat{U}_{T,(N_0)}^\top \right\| = \left\| \big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big)^{-1} \right\|^{1/2} = O_p\!\left( \frac{\sqrt{N}}{\sqrt{N_0}} \right)$$
by Lemma C.6, and
$$\|R_1\| = O_p\!\left( \frac{\sqrt{N_0}}{\sqrt{N}} \cdot \frac{R_W}{\lambda_{\min, W}} \right), \qquad \|R_2\| = O_p\!\left( \frac{R_T}{\lambda_{\min, T}} \right)$$
by the above bounds, we have by Lemma C.6 that
$$\left\| \widehat{H}_{\mathrm{adj}} - H_{\mathrm{adj}} \right\| = O_p\!\left( \frac{\sqrt{N}}{\sqrt{N_0}} \cdot \frac{R_W}{\lambda_{\min, W}} + \frac{N}{N_0} \cdot \frac{R_T}{\lambda_{\min, T}} \right).$$
In addition, because $\|H_{\mathrm{adj}}\| = O_p\big( \frac{\sqrt{N}}{\sqrt{N_0}} \big)$ by Lemma C.6, we have
$$\left\| \widehat{H}_{\mathrm{adj}} \right\| \le \|H_{\mathrm{adj}}\| + \left\| \widehat{H}_{\mathrm{adj}} - H_{\mathrm{adj}} \right\| = O_p\!\left( \frac{\sqrt{N}}{\sqrt{N_0}} \right).$$
Moreover, note that
$$\left\| \widehat{V}_W \widehat{D}_W - V_W D_W O_W^\top \right\|_F = \left\| \widehat{D}_W \widehat{V}_W^\top - O_W D_W V_W^\top \right\|_F = \left\| \widehat{U}_W^\top \widehat{M}_W - O_W U_W^\top M_W \right\|_F = \left\| \widehat{U}_W^\top \big( \widehat{M}_W - M_W \big) + \big( \widehat{U}_W^\top - O_W U_W^\top \big) M_W \right\|_F$$
$$\le \left\| \widehat{M}_W - M_W \right\|_F + \left\| \widehat{U}_W^\top - O_W U_W^\top \right\|_F \|M_W\| = O_p\!\left( R_W + \frac{\lambda_{\max, W}}{\lambda_{\min, W}} R_W \right).$$
Then, because
$$D_W V_W^\top = U_W^\top M_W = U_W^\top \big( U_{(N_0)} H_W \big) H_W^{-1} D V^\top = U_W^\top U_W H_W^{-1} D V^\top = H_W^{-1} D V^\top,$$
we have
$$\left\| \widehat{V}_W \widehat{D}_W - V D Q_W^\top \right\|_F = \left\| \widehat{V}_W \widehat{D}_W - V_W D_W O_W^\top \right\|_F = O_p\!\left( R_W + \frac{\lambda_{\max, W}}{\lambda_{\min, W}} R_W \right).$$
Lastly, we have the following decomposition:
$$\left\| \widehat{M} - M \right\|_F = \left\| \widehat{U}_T \widehat{H}_{\mathrm{adj}} \widehat{D}_W \widehat{V}_W^\top - U D V^\top \right\|_F \lesssim \left\| U Q_T^{-1} \widehat{H}_{\mathrm{adj}} \big( \widehat{V}_W \widehat{D}_W - V D Q_W^\top \big)^\top \right\|_F + \left\| \big( \widehat{U}_T - U Q_T^{-1} \big) \widehat{H}_{\mathrm{adj}} Q_W D V^\top \right\|_F + \left\| U \big( Q_T^{-1} \widehat{H}_{\mathrm{adj}} Q_W - I_K \big) D V^\top \right\|_F.$$
The first term can be bounded as
$$\left\| U Q_T^{-1} \widehat{H}_{\mathrm{adj}} \big( \widehat{V}_W \widehat{D}_W - V D Q_W^\top \big)^\top \right\|_F \le \left\| Q_T^{-1} \right\| \left\| \widehat{H}_{\mathrm{adj}} \right\| \left\| \widehat{V}_W \widehat{D}_W - V D Q_W^\top \right\|_F = O_p\!\left( \frac{\sqrt{N}}{\sqrt{N_0}}\, \kappa_W R_W \right),$$
where $\kappa_W = \lambda_{\max, W} / \lambda_{\min, W}$.
The second term can be bounded as
$$\left\| \big( \widehat{U}_T - U Q_T^{-1} \big) \widehat{H}_{\mathrm{adj}} Q_W D V^\top \right\|_F \le \left\| \widehat{U}_T - U Q_T^{-1} \right\|_F \left\| \widehat{H}_{\mathrm{adj}} \right\| \|Q_W\| \|D\| = O_p\!\left( \frac{\lambda_{\min}}{\lambda_{\min, T}}\, \kappa R_T \right).$$
In addition, the last term can be bounded as
$$\left\| U \big( Q_T^{-1} \widehat{H}_{\mathrm{adj}} Q_W - I_K \big) D V^\top \right\|_F = \left\| U Q_T^{-1} \big( \widehat{H}_{\mathrm{adj}} - H_{\mathrm{adj}} \big) Q_W D V^\top \right\|_F \le \left\| Q_T^{-1} \right\| \left\| \widehat{H}_{\mathrm{adj}} - H_{\mathrm{adj}} \right\| \|Q_W\| \|D\| = O_p\!\left( \kappa \frac{\lambda_{\min}}{\lambda_{\min, W}} R_W + \kappa \frac{\sqrt{N}}{\sqrt{N_0}} \frac{\lambda_{\min}}{\lambda_{\min, T}} R_T \right).$$
Moreover, by Lemma F.1 of Choi and Yuan (2024), we have
$$\kappa_W \lesssim \kappa, \qquad \frac{\lambda_{\min}}{\lambda_{\min, T}} \asymp \frac{\sqrt{T}}{\sqrt{T_0}}, \qquad \frac{\lambda_{\min}}{\lambda_{\min, W}} \asymp \frac{\sqrt{N}}{\sqrt{N_0}}.$$
Summing up, we have the desired result. □

C.2 Auxiliary lemmas

Consider the following generic model:
$$Z = L + S, \qquad \operatorname{rank}(L) = K_L.$$
Denote the nuclear-norm penalized estimator by
$$\widehat{L} := \arg\min_A \|Z - A\|_F^2 + \lambda \|A\|_*.$$
Then, we have the following bound for the estimator.

Lemma C.1. Let $\lambda \ge C \|S\|$ for some large constant $C > 0$. Then, we have
$$\left\| \widehat{L} - L \right\|_F \lesssim \min\left\{ \sqrt{K_L}\, \lambda,\; \sqrt{K_L}\, \|L\|_F \right\}.$$
In addition, if $L = 0$, then $\widehat{L} = 0$.

Proof. Let $\Delta = \widehat{L} - L$. Then, we have
$$\left\| Z - \widehat{L} \right\|_F^2 = \|S - \Delta\|_F^2 = \|S\|_F^2 + \|\Delta\|_F^2 - 2\operatorname{tr}(\Delta^\top S).$$
In addition, for some constant $0 < c < 1$, we have
$$2\operatorname{tr}(\Delta^\top S) \le 2 \|\Delta\|_* \|S\| \le (1 - c)\, \lambda \|\Delta\|_*,$$
since $\lambda \ge \frac{2}{1 - c} \|S\|$. Then, we have
$$\left\| Z - \widehat{L} \right\|_F^2 + \lambda \big\| \widehat{L} \big\|_* \le \|Z - L\|_F^2 + \lambda \|L\|_*,$$
$$\|\Delta\|_F^2 - 2\operatorname{tr}(\Delta^\top S) + \lambda \big\| \widehat{L} \big\|_* \le \lambda \|L\|_*,$$
$$\|\Delta\|_F^2 - (1 - c)\lambda \|\Delta\|_* + \lambda \big\| \widehat{L} \big\|_* \le \lambda \|L\|_*. \tag{11}$$
(1) When $L = 0$: since $\widehat{L} = \Delta$ and $\|L\|_* = 0$, we have by (11) that
$$\|\Delta\|_F^2 + c\lambda \|\Delta\|_* \le 0.$$
Since $c > 0$, we have $\|\Delta\|_F = 0$.
(2) When $L \ne 0$: note that $\big\| \widehat{L} \big\|_* = \|\Delta + L\|_* \ge \|\Delta\|_* - \|L\|_*$. Hence, we have by (11) that
$$\|\Delta\|_F^2 - (1 - c)\lambda \|\Delta\|_* + \lambda \|\Delta\|_* - \lambda \|L\|_* \le \|\Delta\|_F^2 - (1 - c)\lambda \|\Delta\|_* + \lambda \big\| \widehat{L} \big\|_* \le \lambda \|L\|_*,$$
so that
$$\|\Delta\|_F^2 + c\lambda \|\Delta\|_* \le 2\lambda \|L\|_*.$$
So, we have $\|\Delta\|_* \le \frac{2}{c} \|L\|_*$. Then, we have
$$\|\Delta\|_F \le \|\Delta\|_* \le \frac{2}{c} \|L\|_* \le \frac{2}{c} \sqrt{K_L}\, \|L\|_F.$$
Next, we derive the bound $\sqrt{K_L}\, \lambda$.
Denote the singular value decomposition of $L$ by $L = U D V^\top$, where $U = (U_o, U_c)$ and $V = (V_o, V_c)$. Here, $(U_c, V_c)$ are the columns of $U$ and $V$ corresponding to the zero singular values, while $(U_o, V_o)$ denote the columns associated with the nonzero singular values. In addition, let
$$\mathcal{P}(A) = U_c U_c^\top A V_c V_c^\top, \qquad \mathcal{M}(A) = A - \mathcal{P}(A).$$
Note that
$$\big\| \widehat{L} \big\|_* = \|L + \Delta\|_* = \|L + \mathcal{P}(\Delta) + \mathcal{M}(\Delta)\|_* \ge \|L + \mathcal{P}(\Delta)\|_* - \|\mathcal{M}(\Delta)\|_* = \|L\|_* + \|\mathcal{P}(\Delta)\|_* - \|\mathcal{M}(\Delta)\|_*. \tag{12}$$
So, using this relation with (11) and the fact that $(1 - c)\lambda \|\Delta\|_* \le (1 - c)\lambda \|\mathcal{P}(\Delta)\|_* + (1 - c)\lambda \|\mathcal{M}(\Delta)\|_*$, we have
$$\|\Delta\|_F^2 + c\lambda \|\mathcal{P}(\Delta)\|_* \le (2 - c)\lambda \|\mathcal{M}(\Delta)\|_*.$$
Therefore, we have
$$\|\Delta\|_F^2 \le (2 - c)\lambda \|\mathcal{M}(\Delta)\|_* \lesssim \lambda \|\mathcal{M}(\Delta)\|_F \sqrt{2 K_L} \le \lambda \|\Delta\|_F \sqrt{2 K_L}. \qquad \square$$
On the other hand, if we can only observe $\Omega \circ Z$ instead of $Z$, where $\Omega = (\omega_{it})_{i \le N, t \le T}$ and $\omega_{it} = 1\{z_{it} \text{ is observed}\}$, then the nuclear-norm penalized estimator becomes
$$\widehat{L} := \arg\min_{A \in \mathcal{A}} \|\Omega \circ (Z - A)\|_F^2 + \lambda \|A\|_*,$$
where $\mathcal{A} = \{A : \|A\|_\infty \le L_{\max}\}$. Then, we have the following bound for the estimator.

Lemma C.2. Let $\lambda \ge C \|\Omega \circ S\|$ for some large constant $C > 0$. Then, if $\|L\|_\infty \le L_{\max}$, with probability converging to 1, we have
$$\left\| \widehat{L} - L \right\|_F \lesssim \min\left\{ \frac{\sqrt{K_L}\, \lambda}{p} + \frac{\sqrt{K_L (N + T)}\, L_{\max}}{\sqrt{p}},\; \sqrt{K_L}\, \|L\|_F \right\},$$
where $p = \mathbb{E}[\omega_{it}]$.

Proof. (1) Let $\Delta = \widehat{L} - L$. Then, we have
$$\left\| \Omega \circ \big( Z - \widehat{L} \big) \right\|_F^2 = \|\Omega \circ (S - \Delta)\|_F^2 = \|\Omega \circ S\|_F^2 + \|\Omega \circ \Delta\|_F^2 - 2\operatorname{tr}\big( (\Omega \circ \Delta)^\top (\Omega \circ S) \big).$$
In addition, for some constant $0 < c < 1$, we have
$$2\operatorname{tr}\big( (\Omega \circ \Delta)^\top (\Omega \circ S) \big) = 2\operatorname{tr}\big( \Delta^\top (\Omega \circ S) \big) \le 2 \|\Delta\|_* \|\Omega \circ S\| \le (1 - c)\lambda \|\Delta\|_*,$$
since $\lambda \ge \frac{2}{1 - c} \|\Omega \circ S\|$. Then, we have
$$\left\| \Omega \circ \big( Z - \widehat{L} \big) \right\|_F^2 + \lambda \big\| \widehat{L} \big\|_* \le \|\Omega \circ (Z - L)\|_F^2 + \lambda \|L\|_*,$$
$$\|\Omega \circ \Delta\|_F^2 - 2\operatorname{tr}\big( (\Omega \circ \Delta)^\top (\Omega \circ S) \big) + \lambda \big\| \widehat{L} \big\|_* \le \lambda \|L\|_*,$$
$$\|\Omega \circ \Delta\|_F^2 - (1 - c)\lambda \|\Delta\|_* + \lambda \big\| \widehat{L} \big\|_* \le \lambda \|L\|_*. \tag{13}$$
Note that $\big\| \widehat{L} \big\|_* = \|\Delta + L\|_* \ge \|\Delta\|_* - \|L\|_*$.
Hence, we have by (13) that
$$\|\Omega \circ \Delta\|_F^2 - (1 - c)\lambda \|\Delta\|_* + \lambda \|\Delta\|_* - \lambda \|L\|_* \le \|\Omega \circ \Delta\|_F^2 - (1 - c)\lambda \|\Delta\|_* + \lambda \big\| \widehat{L} \big\|_* \le \lambda \|L\|_*,$$
so that
$$\|\Omega \circ \Delta\|_F^2 + c\lambda \|\Delta\|_* \le 2\lambda \|L\|_*.$$
So, we have $\|\Delta\|_* \le \frac{2}{c} \|L\|_*$. Then, we have
$$\|\Delta\|_F \le \|\Delta\|_* \le \frac{2}{c} \|L\|_* \le \frac{2}{c} \sqrt{K_L}\, \|L\|_F.$$
(2) Using the relations (12) and (13) with the fact that $(1 - c)\lambda \|\Delta\|_* \le (1 - c)\lambda \|\mathcal{P}(\Delta)\|_* + (1 - c)\lambda \|\mathcal{M}(\Delta)\|_*$, we have
$$c\lambda \|\mathcal{P}(\Delta)\|_* \le \|\Omega \circ \Delta\|_F^2 + c\lambda \|\mathcal{P}(\Delta)\|_* \le (2 - c)\lambda \|\mathcal{M}(\Delta)\|_*.$$
Hence, we have $\|\mathcal{P}(\Delta)\|_* \le \frac{2 - c}{c} \|\mathcal{M}(\Delta)\|_*$. In addition, we know $\|\Delta\|_\infty \le 2 L_{\max}$. Set $\bar{L} = 2 L_{\max}$ and $C_1 = \frac{2 - c}{c}$. If $\|\Delta\|_F^2 > \frac{2 B \bar{L}^2}{p} \sqrt{NT}$ for some sufficiently large $B > 0$, then $\Delta \in \mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big)$, and we have, with high probability, that
$$p \|\Delta\|_F^2 < 2 \|\Omega \circ \Delta\|_F^2 + 8 B K_L (N + T) L_{\max}^2$$
by Lemma C.5. Then, since $\|\Omega \circ \Delta\|_F^2 \le (2 - c)\lambda \|\mathcal{M}(\Delta)\|_*$, we have
$$p \|\Delta\|_F^2 < 2(2 - c)\lambda \|\mathcal{M}(\Delta)\|_* + 8 B K_L (N + T) L_{\max}^2 < 2(2 - c)\lambda \sqrt{2 K_L}\, \|\Delta\|_F + 8 B K_L (N + T) L_{\max}^2.$$
If the first term dominates the second term, we have
$$p \|\Delta\|_F^2 < 4(2 - c)\lambda \sqrt{2 K_L}\, \|\Delta\|_F \;\Longrightarrow\; \|\Delta\|_F < 4(2 - c)\lambda \sqrt{2 K_L}/p.$$
If the second term dominates the first term, we have
$$p \|\Delta\|_F^2 < 16 B K_L (N + T) L_{\max}^2 \;\Longrightarrow\; \|\Delta\|_F < 4 B^{1/2} \sqrt{K_L (N + T)}\, L_{\max} / \sqrt{p}.$$
In addition, if $\|\Delta\|_F^2 \le \frac{2 B \bar{L}^2}{p} \sqrt{NT}$, we have
$$\|\Delta\|_F \le 2\sqrt{2}\, B^{1/2} L_{\max} \frac{\sqrt{N + T}}{\sqrt{p}}.$$
Hence, with probability converging to 1, we have
$$\left\| \widehat{L} - L \right\|_F \lesssim \frac{\sqrt{K_L}\, \lambda}{p} + \frac{\sqrt{K_L (N + T)}\, L_{\max}}{\sqrt{p}}. \qquad \square$$

Lemma C.3. Let $E$ be an $N \times T$ matrix of independent sub-Gaussian entries such that $\|\epsilon_{it}\|_{\psi_2} \le \sigma$. In addition, let $L$ and $R$ be $N \times J_1$ and $T \times J_2$ orthonormal matrices, respectively. Then, we have
$$\left\| L^\top E R \right\| \lesssim \sigma \sqrt{J_1 + J_2 + \log(\max\{N, T\})},$$
with probability at least $1 - O(\max\{N, T\}^{-5})$.

Proof.
First, by Corollary 4.2.13 of Vershynin (2018), we can find a $1/4$-net $\mathcal{N}$ of the unit sphere $S^{J_1 - 1}$ and a $1/4$-net $\mathcal{M}$ of the unit sphere $S^{J_2 - 1}$ with cardinalities $|\mathcal{N}| \le 9^{J_1}$ and $|\mathcal{M}| \le 9^{J_2}$. Note that
$$\left\| L^\top E R \right\| \le 2 \max_{a \in \mathcal{N},\, b \in \mathcal{M}} \left| \sum_{i=1}^{N} \sum_{t=1}^{T} \epsilon_{it}\, (a^\top L_i)(b^\top R_t) \right|,$$
where $L_i^\top$ is the $i$-th row of $L$ and $R_t^\top$ is the $t$-th row of $R$ (see Section 4.4.1 of Vershynin (2018)). Fix $a_o \in \mathcal{N}$ and $b_o \in \mathcal{M}$. Then, by Hoeffding's inequality with the independent sub-Gaussian assumption, we have, with probability at least $1 - u$,
$$\left| \sum_{i=1}^{N} \sum_{t=1}^{T} \epsilon_{it}\, (a_o^\top L_i)(b_o^\top R_t) \right| \lesssim \sigma \|L a_o\|_2 \|R b_o\|_2 \sqrt{\log(u^{-1})} = \sigma \sqrt{\log(u^{-1})},$$
and by setting $u = \max\{N, T\}^{-5}\, 9^{-J_1 - J_2}$, we have, with probability at least $1 - \max\{N, T\}^{-5}\, 9^{-J_1 - J_2}$,
$$\left| \sum_{i=1}^{N} \sum_{t=1}^{T} \epsilon_{it}\, (a_o^\top L_i)(b_o^\top R_t) \right| \lesssim \sigma \sqrt{J_1 + J_2 + \log(\max\{N, T\})}.$$
Then, because
$$P\left( \max_{a \in \mathcal{N},\, b \in \mathcal{M}} \left| \sum_{i=1}^{N} \sum_{t=1}^{T} \epsilon_{it}\, (a^\top L_i)(b^\top R_t) \right| > c \right) \le \sum_{a \in \mathcal{N},\, b \in \mathcal{M}} P\left( \left| \sum_{i=1}^{N} \sum_{t=1}^{T} \epsilon_{it}\, (a^\top L_i)(b^\top R_t) \right| > c \right) \le 9^{J_1 + J_2}\, P\left( \left| \sum_{i=1}^{N} \sum_{t=1}^{T} \epsilon_{it}\, (a^\top L_i)(b^\top R_t) \right| > c \right),$$
by setting $c = \sigma \sqrt{J_1 + J_2 + \log(\max\{N, T\})}$, we have
$$P\left( \max_{a \in \mathcal{N},\, b \in \mathcal{M}} \left| \sum_{i=1}^{N} \sum_{t=1}^{T} \epsilon_{it}\, (a^\top L_i)(b^\top R_t) \right| > \sigma \sqrt{J_1 + J_2 + \log(\max\{N, T\})} \right) \lesssim \max\{N, T\}^{-5}. \qquad \square$$

Lemma C.4. With probability converging to 1, we have (i) $\|P_X E P_Z\| \lesssim \sigma \sqrt{J}$; (ii) $\|P_X E\| \lesssim \sigma \sqrt{T}$; (iii) $\|E P_Z\| \lesssim \sigma \sqrt{N}$; (iv) $\|E\| \lesssim \sigma \sqrt{N + T}$.

Proof. (i) By Assumption 3.2, with probability converging to 1, $\|\Phi(\Phi^\top \Phi)^{-1/2}\|$ and $\|\Psi(\Psi^\top \Psi)^{-1/2}\|$ are bounded. Hence, by Lemma C.3, with probability converging to 1, we have
$$\|P_X E P_Z\| \le \left\| \Phi(\Phi^\top \Phi)^{-1/2} \right\| \left\| \big( (\Phi^\top \Phi)^{-1/2} \Phi^\top \big) E \big( \Psi(\Psi^\top \Psi)^{-1/2} \big) \right\| \left\| (\Psi^\top \Psi)^{-1/2} \Psi^\top \right\| \lesssim \sigma \sqrt{J},$$
because $\Phi(\Phi^\top \Phi)^{-1/2}$ and $\Psi(\Psi^\top \Psi)^{-1/2}$ are $N \times J$ and $T \times J$ orthonormal matrices, respectively, and $\log(\max\{N, T\}) \lesssim J$.
(ii) By Lemma C.3, with probability converging to 1, we have
$$\|P_X E\| \le \left\| \Phi(\Phi^\top \Phi)^{-1/2} \right\| \left\| \big( (\Phi^\top \Phi)^{-1/2} \Phi^\top \big) E I_T \right\| \lesssim \sigma \sqrt{T},$$
because $\Phi(\Phi^\top \Phi)^{-1/2}$ and $I_T$ are $N \times J$ and $T \times T$ orthonormal matrices, respectively, and $J \lesssim T$.
(iii) The proof is symmetric to that of (ii).
(iv) It follows from Lemma C.3 with $L = I_N$ and $R = I_T$. □

Lemma C.5. Define the restricted set of directions as
$$\mathcal{C}(c_1, c_2) = \left\{ A \in \mathcal{A}^* : \|\mathcal{P}(A)\|_* \le c_1 \|\mathcal{M}(A)\|_*,\; \|A\|_F^2 > c_2 \sqrt{NT} \right\},$$
where $\mathcal{A}^* = \{A \in \mathbb{R}^{N \times T} : \|A\|_{\max} \le \bar{L}\}$. Then, for any $C_1 > 0$ and sufficiently large $B > 0$, we have, with probability converging to 1, that uniformly for $A \in \mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big)$,
$$\|\Omega \circ A\|_F^2 > 0.5\, p \|A\|_F^2 - B K_L (N + T) \bar{L}^2.$$

Proof. It is an extension of Lemma A.2 of Chernozhukov et al. (2023). First, let $\Omega(A) = \|\Omega \circ A\|_F^2 = \sum_{it} \omega_{it}^2 A_{it}^2$. Then, we have $\mathbb{E}\,\Omega(A) = p \sum_{it} A_{it}^2 = p \|A\|_F^2$. In addition, define the event
$$\mathcal{E}(A) = \left\{ |\Omega(A) - \mathbb{E}\,\Omega(A)| > 0.5 \cdot \mathbb{E}\,\Omega(A) + B K_L (N + T) \bar{L}^2 \right\}.$$
Then, we want to show that $P\big( \exists A \in \mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big) : \mathcal{E}(A) \text{ holds} \big) \to 0$. To use the standard peeling argument, define
$$\Gamma_l = \left\{ A \in \mathcal{C}\Big( C_1, \frac{2 B \bar{L}^2}{p} \Big) : 2^l v_n \le \mathbb{E}\,\Omega(A) \le 2^{l+1} v_n \right\},$$
where $v_n = B \bar{L}^2 \sqrt{NT}$ and $l \in \mathbb{N}$.

Part 1. We want to show $\mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big) \subset \cup_{l=1}^{\infty} \Gamma_l$. If $A \in \mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big)$, we have $\mathbb{E}\,\Omega(A) = p \|A\|_F^2 \ge 2 B \bar{L}^2 \sqrt{NT} = 2 v_n$. Hence, there is $l \in \mathbb{N}$ such that $A \in \Gamma_l$ for any $A \in \mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big)$.

Part 2. Let
$$\mathcal{D}(x) = \left\{ A \in \mathcal{C}\Big( C_1, \frac{2 B \bar{L}^2}{p} \Big) : \|A\|_F^2 \le x \right\}, \qquad \mathcal{F}(A) = \left\{ |\Omega(A) - \mathbb{E}\,\Omega(A)| - B K_L (N + T) \bar{L}^2 > 0.25 \cdot 2^{l+1} v_n \right\}.$$
We want to show that if $A \in \Gamma_l$ and $\mathcal{E}(A)$ holds, then $A \in \mathcal{D}(x_l)$, where $x_l = p^{-1} 2^{l+1} v_n$, and $\mathcal{F}(A)$ holds. This is because
$$|\Omega(A) - \mathbb{E}\,\Omega(A)| - B K_L (N + T) \bar{L}^2 > 0.5 \cdot \mathbb{E}\,\Omega(A) \ge 0.25 \cdot 2^{l+1} v_n, \qquad \text{and} \qquad \|A\|_F^2 = p^{-1}\, \mathbb{E}\,\Omega(A) \le p^{-1} 2^{l+1} v_n.$$

Part 3. Let
$$Q(x) = \sup_{A \in \mathcal{D}(x)} \frac{1}{NT} \left| \sum_{it} \big( \omega_{it}^2 A_{it}^2 - p A_{it}^2 \big) \right|.$$
We bound $\mathbb{E}\,Q(x)$. First, note that for any $A \in \mathcal{D}(x) \subset \mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big)$, we have
$$\|A\|_* = \|\mathcal{P}(A) + \mathcal{M}(A)\|_* \le (1 + C_1) \|\mathcal{M}(A)\|_* \le (1 + C_1) \sqrt{K_L}\, \|\mathcal{M}(A)\|_F \le (1 + C_1) \sqrt{K_L}\, \|A\|_F \le (1 + C_1) \sqrt{K_L\, x}.$$
Let the $u_{it}$ be i.i.d. Rademacher random variables. Then, $\mathbb{E}\|\Omega_u\| \lesssim p^{1/2} \sqrt{N + T}$, where $\Omega_u = (\omega_{it} u_{it})_{N \times T}$. Hence, by using the symmetrization argument with the concentration inequality (e.g., (2.3) of Koltchinskii (2011)), we have
$$\mathbb{E}\,Q(x) \le 2\, \mathbb{E} \sup_{A \in \mathcal{D}(x)} \frac{1}{NT} \left| \sum_{it} \omega_{it}^2 A_{it}^2 u_{it} \right| \le c_3 \bar{L}\, \mathbb{E} \sup_{A \in \mathcal{D}(x)} \frac{1}{NT} \left| \sum_{it} \omega_{it} A_{it} u_{it} \right| = c_3 \bar{L}\, \mathbb{E} \sup_{A \in \mathcal{D}(x)} \frac{1}{NT} \left| \operatorname{tr}\big( \Omega_u A^\top \big) \right|$$
$$\le c_3 \bar{L}\, \mathbb{E} \sup_{A \in \mathcal{D}(x)} \frac{1}{NT} \|\Omega_u\| \|A\|_* \le c_4\, \frac{p^{1/2} \sqrt{N + T}}{NT}\, \bar{L} \sup_{A \in \mathcal{D}(x)} \|A\|_* \le c_5\, \frac{p^{1/2} \sqrt{N + T}}{NT}\, \bar{L} \sqrt{K_L\, x}$$
$$= 2 c_5 \sqrt{8}\, \frac{\bar{L} \sqrt{K_L (N + T)}}{\sqrt{NT}} \times \sqrt{\frac{p\, x}{32\, NT}} \le \frac{p\, x}{32\, NT} + \frac{32 c_5^2\, \bar{L}^2 K_L (N + T)}{NT} \le \frac{p\, x}{32\, NT} + \frac{B\, \bar{L}^2 K_L (N + T)}{NT},$$
for sufficiently large $B > 0$. Here, we use the fact that $\omega_{it}$ is bounded by 1 and $|A_{it}|$ is bounded by $\bar{L}$.

Part 4. Next, we bound the tail probability of $Q(x) - \mathbb{E}\,Q(x)$. Since $A_{it}^2 / \bar{L}^2$ is bounded, we can use the Massart inequality (e.g., Theorem 14.2 of Bühlmann and van de Geer (2011)) to obtain
$$P\left( \frac{1}{\bar{L}^2} Q(x) > \frac{1}{\bar{L}^2} \mathbb{E}\,Q(x) + t \right) \le \exp(-c_6 N T t^2).$$
Set $t = \frac{7 x p}{32 \bar{L}^2 N T}$. Then, because
$$\frac{\mathbb{E}\,Q(x)}{\bar{L}^2} \le \frac{p\, x}{32\, \bar{L}^2 N T} + \frac{B K_L (N + T)}{NT},$$
we have
$$P\left( Q(x) > \frac{B K_L (N + T)}{NT} \bar{L}^2 + 0.25 \cdot \frac{x p}{NT} \right) = P\left( \frac{1}{\bar{L}^2} Q(x) > \frac{B K_L (N + T)}{NT} + 0.25 \cdot \frac{x p}{\bar{L}^2 N T} \right) \le \exp\left( - c_7\, \frac{p^2 x^2}{\bar{L}^4 N T} \right).$$

Part 5. Finally, we use the peeling argument. Note that
$$P\left( \exists A \in \mathcal{C}\Big( C_1, \frac{2 B \bar{L}^2}{p} \Big) : \mathcal{E}(A) \text{ holds} \right) \le \sum_{l=1}^{\infty} P\big( \exists A \in \Gamma_l : \mathcal{E}(A) \text{ holds} \big) \le \sum_{l=1}^{\infty} P\big( \exists A \in \mathcal{D}(x_l) : \mathcal{F}(A) \text{ holds} \big)$$
$$\le \sum_{l=1}^{\infty} P\left( \sup_{A \in \mathcal{D}(x_l)} |\Omega(A) - \mathbb{E}\,\Omega(A)| > B K_L (N + T) \bar{L}^2 + 0.25\, p x_l \right) = \sum_{l=1}^{\infty} P\left( Q(x_l) > \frac{B K_L (N + T)}{NT} \bar{L}^2 + \frac{0.25\, p x_l}{NT} \right)$$
$$\le \sum_{l=1}^{\infty} \exp\left( - c_7\, \frac{p^2 x_l^2}{\bar{L}^4 N T} \right) = \sum_{l=1}^{\infty} \exp\left( - c_7\, 4^{l+1} B^2 \right) \le \frac{\exp(-16 c_7 B^2)}{1 - \exp(-16 c_7 B^2)} < \varepsilon$$
for any $\varepsilon > 0$ and sufficiently large $B$. Here, we use the relations $x_l^2 = p^{-2} 4^{l+1} v_n^2$ and $v_n^2 = B^2 \bar{L}^4 N T$. Therefore, we have, with probability converging to 1, that uniformly for all $A \in \mathcal{C}\big( C_1, \frac{2 B \bar{L}^2}{p} \big)$,
$$|\Omega(A) - \mathbb{E}\,\Omega(A)| \le 0.5 \cdot \mathbb{E}\,\Omega(A) + B K_L (N + T) \bar{L}^2,$$
which implies that
$$\Omega(A) \ge 0.5 \cdot \mathbb{E}\,\Omega(A) - B K_L (N + T) \bar{L}^2.$$
Then, the desired result follows from the definitions of $\Omega(A)$ and $\mathbb{E}\,\Omega(A)$. □

Lemma C.6. (i) We have $U_W = U_{(N_0)} H_W$, where $U_{(N_0)} = [u_1, \cdots, u_{N_0}]^\top$ and $H_W = (U_{(N_0)}^\top U_{(N_0)})^{-1/2} G_W$ for some $K \times K$ orthogonal matrix $G_W$; (ii) we have $U_T = U H_T$, where $H_T = (U^\top U)^{-1/2} G_T$ for some $K \times K$ orthogonal matrix $G_T$; (iii) $\|H_W\| \lesssim \frac{\sqrt{N}}{\sqrt{N_0}}$, $\|H_W^{-1}\| \lesssim \frac{\sqrt{N_0}}{\sqrt{N}}$, $\|Q_W\| \lesssim \frac{\sqrt{N_0}}{\sqrt{N}}$, and $\|Q_W^{-1}\| \lesssim \frac{\sqrt{N}}{\sqrt{N_0}}$ with probability converging to 1; in addition, $\|H_T\|$, $\|H_T^{-1}\|$, $\|Q_T\|$, and $\|Q_T^{-1}\|$ are bounded; (iv) $\big\| \big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big)^{-1} \big\| = O_p\big( \frac{N}{N_0} \big)$.

Proof. (i) Let $\Omega_W = (U_{(N_0)}^\top U_{(N_0)})^{1/2} D^2 (U_{(N_0)}^\top U_{(N_0)})^{1/2}$, and let $G_W$ be the $K \times K$ matrix whose columns are the eigenvectors of $\Omega_W$, so that $\Lambda_W = G_W^\top \Omega_W G_W$ is the diagonal matrix of the eigenvalues of $\Omega_W$ in descending order. Define $H_W = (U_{(N_0)}^\top U_{(N_0)})^{-1/2} G_W$. Then, we have
$$\big( U_{(N_0)} D^2 U_{(N_0)}^\top \big) U_{(N_0)} H_W = U_{(N_0)} \big( U_{(N_0)}^\top U_{(N_0)} \big)^{-1/2} \Big[ \big( U_{(N_0)}^\top U_{(N_0)} \big)^{1/2} D^2 \big( U_{(N_0)}^\top U_{(N_0)} \big)^{1/2} G_W \Big]$$
$$= U_{(N_0)} \big( U_{(N_0)}^\top U_{(N_0)} \big)^{-1/2} \Omega_W G_W = U_{(N_0)} \big( U_{(N_0)}^\top U_{(N_0)} \big)^{-1/2} G_W \Lambda_W = U_{(N_0)} H_W \Lambda_W.$$
In addition, note that
$$\big( U_{(N_0)} H_W \big)^\top \big( U_{(N_0)} H_W \big) = H_W^\top U_{(N_0)}^\top U_{(N_0)} H_W = G_W^\top G_W = I_K.$$
Therefore, the columns of $U_{(N_0)} H_W$ are the eigenvectors of $U_{(N_0)} D^2 U_{(N_0)}^\top$ and the left singular vectors of $U_{(N_0)} D V^\top$.
(ii) The proof is the same as that of (i).
(iii) Since $\|H_W\| \le \big\| (U_{(N_0)}^\top U_{(N_0)})^{-1/2} \big\| \|G_W\|$ and $G_W$ is an eigenvector matrix, we have $\|H_W\| \lesssim \frac{\sqrt{N}}{\sqrt{N_0}}$ and $\|H_W^{-1}\| \lesssim \frac{\sqrt{N_0}}{\sqrt{N}}$ with high probability by Assumption 4.3. Similarly, $\|H_T\|$ and $\|H_T^{-1}\|$ are bounded since $U^\top U = I_K$. In addition, because $Q_W^{-1} = H_W O_W^\top$, we have $\|Q_W\| \lesssim \frac{\sqrt{N_0}}{\sqrt{N}}$ and $\|Q_W^{-1}\| \lesssim \frac{\sqrt{N}}{\sqrt{N_0}}$ with high probability. Because $Q_T^{-1} = H_T O_T^\top$, $\|Q_T\|$ and $\|Q_T^{-1}\|$ are bounded.
(iv) First, note that
$$\left\| \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} - Q_T^{-\top} U_{(N_0)}^\top U_{(N_0)} Q_T^{-1} \right\| \lesssim \left\| U_{(N_0)} Q_T^{-1} \right\| \left\| \widehat{U}_{T,(N_0)} - U_{(N_0)} Q_T^{-1} \right\|_F = O_p\!\left( \frac{\sqrt{N_0}}{\sqrt{N}} \cdot \frac{R_T}{\lambda_{\min, T}} \right) = o_p\!\left( \frac{N_0}{N} \right),$$
since $\|U_{(N_0)}\| = O_p\big( \frac{\sqrt{N_0}}{\sqrt{N}} \big)$ and $\big\| \widehat{U}_T - U Q_T^{-1} \big\|_F = O_p\big( \frac{R_T}{\lambda_{\min, T}} \big)$. Then, because
$$\lambda_{\min}\big( Q_T^{-\top} U_{(N_0)}^\top U_{(N_0)} Q_T^{-1} \big) \ge \lambda_{\min}^2\big( Q_T^{-1} \big)\, \lambda_{\min}\big( U_{(N_0)}^\top U_{(N_0)} \big) \ge c\, \frac{N_0}{N}$$
for some constant $c > 0$, we have, with probability converging to 1, that
$$\lambda_{\min}\big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big) \ge \lambda_{\min}\big( Q_T^{-\top} U_{(N_0)}^\top U_{(N_0)} Q_T^{-1} \big) - \left\| \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} - Q_T^{-\top} U_{(N_0)}^\top U_{(N_0)} Q_T^{-1} \right\| \ge \frac{c}{2}\, \frac{N_0}{N}.$$
Hence, $\big\| \big( \widehat{U}_{T,(N_0)}^\top \widehat{U}_{T,(N_0)} \big)^{-1} \big\| = O_p\big( \frac{N}{N_0} \big)$. □
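As a concrete illustration of the estimator analyzed in Lemma C.1, the minimizer of $\|Z - A\|_F^2 + \lambda \|A\|_*$ has a well-known closed form: soft-threshold the singular values of $Z$ at level $\lambda/2$. The sketch below is not the paper's code; the matrix sizes, the noise scale, and the multiplier in the choice of $\lambda$ are illustrative assumptions, while the condition $\lambda \ge C\|S\|$ follows the lemma.

```python
import numpy as np

def nuclear_norm_estimate(Z, lam):
    """Closed-form minimizer of ||Z - A||_F^2 + lam * ||A||_*:
    soft-threshold the singular values of Z at level lam / 2."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_shrunk = np.maximum(s - lam / 2.0, 0.0)
    return U @ (s_shrunk[:, None] * Vt)

rng = np.random.default_rng(1)
N, T, K_L = 80, 60, 2
L = rng.normal(size=(N, K_L)) @ rng.normal(size=(K_L, T))  # rank-K_L signal
S = 0.1 * rng.normal(size=(N, T))                          # perturbation
Z = L + S

# Lemma C.1 requires lam >= C * ||S|| for a large constant C; C = 3 here.
lam = 3.0 * np.linalg.norm(S, ord=2)
L_hat = nuclear_norm_estimate(Z, lam)
```

By Weyl's inequality, every singular value of $Z$ beyond the $K_L$-th is at most $\|S\| < \lambda/2$, so the estimate has rank at most $K_L$; and when $L = 0$ and $\lambda > 2\|S\|$, every singular value of $Z = S$ falls below the threshold and the estimate is exactly zero, matching the last claim of the lemma.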