Goodness-of-Fit Tests for Latent Class Models with Ordinal Categorical Data
Huan Qing*
School of Economics and Finance, Chongqing University of Technology, Chongqing, 400054, China
*Corresponding author. Email: qinghuan@u.nus.edu, qinghuan@cqut.edu.cn

Abstract

Ordinal categorical data are widely collected in psychology, education, and other social sciences, appearing commonly in questionnaires, assessments, and surveys. Latent class models provide a flexible framework for uncovering unobserved heterogeneity by grouping individuals into homogeneous classes based on their response patterns. A fundamental challenge in applying these models is determining the number of latent classes, which is unknown and must be inferred from data. In this paper, we propose a test statistic for this problem. The statistic centers the largest singular value of a normalized residual matrix by a simple sample-size adjustment. Under the null hypothesis that the candidate number of latent classes is correct, its upper bound converges to zero in probability; under an under-fitted alternative, the statistic itself exceeds a fixed positive constant with probability approaching one. This sharp dichotomous behavior of the test statistic yields two sequential testing algorithms that consistently estimate the true number of latent classes. Extensive experimental studies confirm the theoretical findings and demonstrate the accuracy and reliability of the proposed methods in determining the number of latent classes.

Keywords: Ordinal categorical data, latent class model, goodness-of-fit, estimation of number of latent classes

1. Introduction

Ordinal categorical data are commonly encountered in psychology, education, political science, and many other fields. In psychological surveys, respondents rate their agreement on a Likert scale, with options coded as 0, 1, 2, 3, 4 representing "strongly disagree," "disagree," "neutral," "agree," and "strongly agree." In educational assessments, student performance is often classified into ordered proficiency levels such as 0 (below basic), 1 (basic), 2 (proficient), and 3 (advanced). In political polls, individuals indicate their level of support for a policy using a similar ordered scale. Such data can be organized into an $N \times J$ response matrix $R$, where $N$ is the number of subjects (individuals), $J$ is the number of items, and each entry $R(i,j)$ records subject $i$'s response to item $j$. Responses take values in $\{0, 1, \ldots, M\}$, with 0 denoting the lowest intensity and $M$ the highest. When $M = 1$, the data are dichotomous (also known as binary); when $M \geq 2$, they are polytomous. A key feature of ordinal categorical data is that while the categories follow a natural order, the distances between them are not necessarily equal or interpretable. Any valid statistical analysis must respect this ordinal nature without imposing unwarranted assumptions about equal spacing [1].

The latent class model (LCM) provides a flexible and interpretable framework for uncovering latent population structure from such data [9]. This model is widely used in the psychological, behavioral, and social sciences [12, 27, 7, 11]. The LCM posits that the population consists of $K$ distinct latent classes (also known as groups) and that, conditional on class membership, an individual's responses to different items are independent.
This assumption accounts for the associations observed among items, as any dependence between responses arises solely from their shared latent class membership. For ordinal responses, a natural specification models each item response as a Binomial random variable with $M$ trials and a class-specific success probability. When $M = 1$, this reduces to the Bernoulli distribution commonly used for binary data [6, 32, 20]. For $M \geq 2$, the Binomial formulation directly captures the polytomous nature by allowing the expected response to increase with the success probability, thereby shifting the distribution toward higher categories. As discussed extensively in the literature on categorical data analysis [1], the Binomial distribution is a fundamental tool for modeling discrete responses, and its mean is directly linked to the underlying probability of success.

Historically, parameter estimation has relied on the Expectation-Maximization (EM) algorithm, which treats class memberships as missing data [8]. However, the EM algorithm can be computationally demanding for large datasets and is sensitive to initialization, often requiring multiple random starts to avoid local optima [5]. To address these limitations, alternative methods have been developed in recent years, including spectral clustering that exploits the low-rank structure of the data matrix [21, 22, 23, 19, 20], tensor decomposition methods that operate on low-order moments [32], and regularized estimation techniques that perform simultaneous parameter estimation and model selection [6]. Other frameworks that can model ordinal categorical data with polytomous responses exist, such as the general diagnostic model proposed by [26].

A fundamental and unresolved challenge in applying LCMs is determining the number of latent classes $K$. In real-world applications, $K$ is rarely known a priori and must be inferred from the observed response matrix $R$. Selecting too few classes can obscure meaningful heterogeneity, while selecting too many can lead to overfitting and scientifically spurious conclusions. Traditional approaches for selecting $K$ include information criteria such as the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) [2, 24], and likelihood-ratio tests, typically implemented with the EM algorithm. These methods inherit the computational burden of the EM algorithm, making them expensive when many candidate values of $K$ must be evaluated. Moreover, their theoretical properties in high-dimensional settings, where the number of items $J$ grows with the sample size $N$, remain largely unexplored. The consistency of these criteria often relies on regularity conditions that may not hold in complex latent variable models, a point examined by Keribin [16] in the context of mixture models, which include LCMs as a special case. More recent approaches address these limitations. Regularized latent class analysis, proposed by Chen et al. [6], learns the latent structure by shrinking small parameter differences toward zero, thereby revealing the underlying number of latent classes.
Their framework uses a generalized information criterion (GIC) to select both the regularization parameter and the number of classes. Spectral methods offer another direction. By exploiting the low-rank structure of the expected response matrix, the number of latent classes can be estimated by thresholding the singular values of the data matrix. For binary responses, Lyu and Gu [20] developed such a method, which is computationally efficient and theoretically grounded, but yields only a point estimate of $K$ without a formal goodness-of-fit assessment. While promising, these existing methods lack rigorous goodness-of-fit guarantees, especially for ordinal categorical data. This gap motivates the present work.

In this paper, we introduce novel goodness-of-fit testing procedures that directly address the problem of estimating $K$ in LCMs for ordinal categorical data. Our approaches transform a challenging model selection problem into a sequence of simple spectral checks, offering both computational efficiency and statistical rigor. The main contributions of this work are as follows:

• We develop a goodness-of-fit framework for determining the number of latent classes in ordinal categorical data under the latent class model. The framework introduces a test statistic constructed from a normalized residual matrix by a simple sample-size adjustment. We prove that its upper bound converges to zero in probability under the null hypothesis that the candidate number of latent classes $K_0$ equals the true $K$, while the statistic itself exceeds a fixed positive constant with probability approaching one under an under-fitted alternative. These results are obtained using matrix concentration inequalities and a perturbation analysis that controls the error from estimated parameters.

• Based on the dichotomous behavior of the test statistic, we develop two sequential testing procedures that consistently estimate the true number of latent classes under the latent class model. The consistency theorems specify the required growth rates of the thresholds relative to the sample dimensions.

• We evaluate both procedures through extensive simulations. The results show that our methods significantly outperform existing approaches in accuracy. A real-data application further demonstrates their practical value.

The remainder of this paper is organized as follows. Section 2 defines the latent class model for ordinal categorical data and formally states the problem of estimating the number of latent classes. Section 3 introduces the test statistic and establishes its asymptotic properties under both the null and under-fitted alternatives. Section 4 presents two sequential testing procedures based on these statistics and proves their consistency. Section 5 reports simulation studies and a real data application. Section 6 concludes the paper and discusses future research directions. Technical proofs are given in the Appendix.

2. Model and problem

This section introduces the latent class model for ordinal categorical data and then states the problem of estimating the number of latent classes.

2.1. Latent class model

We now give a formal definition of the latent class model.
Recall that the observed data form an $N \times J$ response matrix $R$, where $N$ is the number of subjects and $J$ is the number of items. Each entry $R(i,j)$ records the response of subject $i$ to item $j$, taking an ordinal value in the set $\{0, 1, \ldots, M\}$, with $M \geq 1$ fixed and representing the highest response category. Suppose all $N$ subjects are divided into $K$ distinct latent classes, which we denote by $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K$. Let $\ell(i) \in [K]$ be the class membership of subject $i$; that is, $\ell(i) = k$ precisely when $i \in \mathcal{C}_k$. The classification matrix $Z \in \{0,1\}^{N \times K}$ is defined by $Z(i,k) = \mathbb{1}\{\ell(i) = k\}$. Hence each row of $Z$ contains exactly one 1, and $Z^\top Z = \mathrm{diag}(N_1, \ldots, N_K)$, where $N_k = |\mathcal{C}_k|$ denotes the number of subjects in class $k$ for $k \in [K]$. We also set $N_{\min} = \min_{k \in [K]} N_k$ and $N_{\max} = \max_{k \in [K]} N_k$.

For each item $j$ and each latent class $k$, let $\Theta(j,k) \in [0, M]$ be the expected response of a subject in class $k$ to item $j$. These values form the item parameter matrix $\Theta \in [0, M]^{J \times K}$. Define the $N \times J$ expected response matrix $\mathcal{R} := \mathbb{E}[R]$ as $\mathcal{R} = Z\Theta^\top$. By definition, $\mathcal{R}(i,j) = \Theta(j, \ell(i))$.

We now specify the data-generating mechanism of the observed response matrix $R$. Following the framework of generalized linear models for categorical data [1], conditional on the latent structure (i.e., on $Z$), we assume that the entries of $R$ are independent and each follows a Binomial distribution. The distribution depends on $Z$ and $\Theta$ only through the expected response matrix $\mathcal{R}$. Explicitly,
\[
R(i,j) \sim \mathrm{Binomial}\left(M, \frac{\mathcal{R}(i,j)}{M}\right), \quad i \in [N],\ j \in [J],
\]
independently across all pairs $(i,j)$. Equivalently, for each $m \in \{0, 1, \ldots, M\}$, we have
\[
\mathbb{P}\big(R(i,j) = m\big) = \binom{M}{m} \left(\frac{\mathcal{R}(i,j)}{M}\right)^{m} \left(1 - \frac{\mathcal{R}(i,j)}{M}\right)^{M-m}, \quad i \in [N],\ j \in [J].
\]
It follows immediately that
\[
\mathbb{E}[R(i,j)] = \mathcal{R}(i,j), \qquad \mathrm{Var}[R(i,j)] = \mathcal{R}(i,j)\left(1 - \frac{\mathcal{R}(i,j)}{M}\right), \quad i \in [N],\ j \in [J].
\]

We now formally define the latent class model via the following definition.

Definition 1 (Latent class model for ordinal categorical data). For fixed positive integers $N, J, M, K \in \mathbb{N}$ with $M \geq 1$ and $K \geq 1$, the latent class model with $K$ latent classes, denoted LCM($K$), is a statistical model for the observed response matrix $R \in \{0, 1, \ldots, M\}^{N \times J}$ with parameters $(Z, \Theta) \in \{0,1\}^{N \times K} \times [0, M]^{J \times K}$, where $Z$ is the classification matrix induced by a partition $\{\mathcal{C}_1, \ldots, \mathcal{C}_K\}$ of $[N]$ and $\Theta$ is the item parameter matrix. The expected response matrix is $\mathcal{R} = Z\Theta^\top$. Conditional on $Z$ and $\Theta$, the entries of $R$ are independent and satisfy
\[
R(i,j) \sim \mathrm{Binomial}\left(M, \frac{\mathcal{R}(i,j)}{M}\right), \quad i \in [N],\ j \in [J].
\]

The formulation of the latent class model given in Definition 1 follows the seminal framework introduced by [9], where the population is assumed to consist of $K$ latent classes and, conditional on class membership, the responses to different items are independent. The LCM considered in this paper extends that classical model to accommodate ordinal categorical responses through a Binomial specification. When $M = 1$, the Binomial distribution reduces to the Bernoulli, and LCM($K$) becomes the classical latent class model for binary responses, which has been extensively studied in the literature [6, 30, 11, 10, 32, 20].
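To make the data-generating mechanism in Definition 1 concrete, the following is a minimal simulation sketch in Python with NumPy. The function name `simulate_lcm` and its interface are our own illustration and do not appear in the paper.

```python
import numpy as np

def simulate_lcm(N, J, K, M, delta, seed=0):
    """Draw (R, Z, Theta) from LCM(K) as in Definition 1."""
    rng = np.random.default_rng(seed)
    # Class memberships ell(i) drawn uniformly from {1, ..., K}.
    ell = rng.integers(K, size=N)
    Z = np.zeros((N, K), dtype=int)
    Z[np.arange(N), ell] = 1
    # Item parameters satisfying Assumption 1: delta <= Theta(j,k)/M <= 1 - delta.
    Theta = rng.uniform(delta * M, (1 - delta) * M, size=(J, K))
    # Expected response matrix and entrywise-independent Binomial responses.
    expected = Z @ Theta.T                 # N x J matrix with entries Theta(j, ell(i))
    R = rng.binomial(M, expected / M)      # R(i,j) ~ Binomial(M, expected(i,j)/M)
    return R, Z, Theta

R, Z, Theta = simulate_lcm(N=200, J=60, K=4, M=5, delta=0.2)
```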
2.2. Problem statement

Throughout this paper, we use $K$ to denote the true number of latent classes in the latent class model defined in Definition 1, and use $K_0$ to denote a hypothetical number of latent classes. The goal is to estimate $K$ from the observed response matrix $R$. We adopt a sequential goodness-of-fit testing framework. This approach was first introduced for stochastic block models by Bickel and Sarkar [4], who developed a recursive bipartitioning algorithm based on the limiting Tracy-Widom distribution of the largest eigenvalue of the centered adjacency matrix. Subsequent work extended this idea to test $H_0: K = K_0$ directly using various test statistics with known asymptotic null distributions [17, 13, 14, 29]. For a candidate value $K_0$, we test
\[
H_0: K = K_0 \quad \text{versus} \quad H_1: K > K_0.
\]
If $H_0$ is not rejected, we set $\hat{K} = K_0$; otherwise we increase $K_0$ and repeat. The key is to construct a test statistic $T_{K_0}$ whose behavior under the null and alternative hypotheses is sharply different. The following sections develop such a statistic and establish its asymptotic properties.

3. Test statistic

The core of our method is a test statistic whose behavior is fundamentally different when the model is correctly specified compared to when it is under-fitted. We construct this statistic in two stages. First, we consider an idealized version that uses the true, unknown parameters; this ideal version provides the theoretical foundation. We then replace the unknowns with estimates to obtain a practical, data-driven statistic. Finally, we analyze the asymptotic properties of this practical statistic under both the null and alternative hypotheses.

3.1. Ideal normalized residual matrix

To build intuition, we first construct the ideal normalized residual matrix using the true parameters, which serves as the theoretical foundation for our test statistic. Define the ideal normalized residual matrix $R^* \in \mathbb{R}^{N \times J}$ as
\[
R^*(i,j) = \frac{R(i,j) - \mathcal{R}(i,j)}{\sqrt{N\, V(i,j)}}, \qquad V(i,j) = \mathcal{R}(i,j)\left(1 - \frac{\mathcal{R}(i,j)}{M}\right). \tag{1}
\]
For this ideal normalized residual matrix $R^*$ to be well-defined and for its entries to have stable variance, the denominator must be bounded away from zero. This leads to our first assumption.

Assumption 1 (Parameter boundedness). There exists a constant $\delta \in (0, 1/2]$ such that for all $j \in [J]$ and $k \in [K]$,
\[
\delta \leq \frac{\Theta(j,k)}{M} \leq 1 - \delta.
\]

Assumption 1 guarantees that $V(i,j)$ is bounded below by a positive constant, making the normalization in Equation (1) valid. Under this assumption, it is straightforward to verify that $\mathbb{E}[R^*(i,j)] = 0$ and $\mathrm{Var}(R^*(i,j)) = 1/N$ for all $i, j$. Meanwhile, we call $\delta$ the signal-strength parameter in this paper for the following reason: $\delta$ controls the range of the expected item responses $\Theta(j,k)$ through the constraint $\delta \leq \Theta(j,k)/M \leq 1 - \delta$. When $\delta$ is small, the admissible interval $[\delta M, (1-\delta)M]$ is wide, allowing the class-specific parameters to take values near the extremes 0 or $M$. This creates substantial differences between classes, resulting in a strong signal that facilitates accurate estimation of $K$. As $\delta$ increases toward 0.5, the interval shrinks and becomes symmetric around $M/2$. The class-specific parameters are then confined to a narrow central region, reducing the between-class differences and weakening the signal.
Consequently, distinguishing between latent classes becomes more difficult, and estimating the true number of latent classes $K$ becomes increasingly challenging. Thus, smaller values of $\delta$ correspond to stronger signals and easier estimation.

We are interested in the spectral norm of this ideal residual matrix. The following lemma characterizes the asymptotic behavior of $\sigma_1(R^*)$, the largest singular value of $R^*$ (i.e., its spectral norm).

Lemma 1 (Spectral norm of ideal residual matrix). When Assumption 1 holds, for any $\epsilon > 0$, we have
\[
\lim_{N \to \infty} \mathbb{P}\left(\|R^*\| \leq 1 + \sqrt{\frac{J}{N}} + \epsilon\right) = 1.
\]

Lemma 1 implies that $\sigma_1(R^*)$ is asymptotically no larger than $1 + \sqrt{J/N}$. This motivates the definition of an ideal test statistic,
\[
T_{\mathrm{ideal}} := \sigma_1(R^*) - \left(1 + \sqrt{\frac{J}{N}}\right),
\]
which, by Lemma 1, satisfies $\mathbb{P}(T_{\mathrm{ideal}} > \epsilon) \to 0$ for any $\epsilon > 0$.

3.2. Practical test statistic

In practice, the true parameters $Z$ and $\Theta$ in the ideal test statistic $T_{\mathrm{ideal}}$ are unknown. We now construct a practical test statistic using estimated parameters and establish its asymptotic properties under both the null and alternative hypotheses. Given a candidate number of latent classes $K_0$, apply a consistent classification estimator $\mathcal{M}$ to obtain the estimated classification matrix $\hat{Z}$, the estimated item parameter matrix $\hat{\Theta}$, and the fitted expected response matrix $\hat{\mathcal{R}} = \hat{Z}\hat{\Theta}^\top$. With these estimates, we define our practical normalized residual matrix $\tilde{R} \in \mathbb{R}^{N \times J}$ as
\[
\tilde{R}(i,j) = \begin{cases} \dfrac{R(i,j) - \hat{\mathcal{R}}(i,j)}{\sqrt{N \hat{V}(i,j)}}, & \text{if } \hat{V}(i,j) > 0, \\[1ex] 0, & \text{otherwise}, \end{cases} \tag{2}
\]
where $\hat{V}(i,j) = \hat{\mathcal{R}}(i,j)\big(1 - \hat{\mathcal{R}}(i,j)/M\big)$.

To ensure that $\tilde{R}$ is a good proxy for the ideal normalized residual matrix $R^*$ under the null hypothesis, we need to control the error introduced by parameter estimation. This requires several regularity conditions on the latent class structure and the estimator.

Assumption 2 (Class balance). There exists a constant $c_0 > 0$ such that for all $k \in [K]$,
\[
c_0 \frac{N}{K} \leq N_k \leq \frac{1}{c_0} \frac{N}{K}.
\]

Assumption 2 ensures that no latent class is too small, which is necessary for obtaining uniform concentration bounds across classes, a key ingredient in bounding the estimation error of $\hat{\mathcal{R}}$.

Assumption 3 (Separation condition). There exist two constants $\zeta > 0$ and $c_1 > 0$ such that for any distinct true classes $k$ and $l$, the set $\mathcal{T}_{kl} = \{j \in [J] : |\Theta(j,k) - \Theta(j,l)| \geq \zeta/2\}$ satisfies $|\mathcal{T}_{kl}| \geq c_1 J$.

Assumption 3 guarantees that any two distinct latent classes are distinguished by at least a constant fraction of the items, with a minimum difference in expectation. This separation plays a key role both in establishing consistency of the estimator (under the null) and in creating a detectable signal under under-fitted alternatives.

Assumption 4 (Growth conditions). The number of items $J$ and the number of latent classes $K$ satisfy the following growth rates as $N \to \infty$: (i) $\frac{K^2 \log(N+J)}{N} \to 0$; (ii) $\frac{J K \log(JK)}{N} \to 0$.

Assumption 4 provides precise growth rates for the dimensions. They are needed to ensure that accumulated estimation errors vanish asymptotically.
Remark 1. Assumptions 1-3 are mild and widely adopted in the analysis of latent class models with high-dimensional data. Specifically, the boundedness condition in Assumption 1 and the class separation condition in Assumption 3 are essential for establishing the consistency of estimators in binary LCMs when both the number of subjects and the number of items increase [32]. Assumption 2, which ensures a minimum class size, is a typical requirement for obtaining uniform concentration bounds across all latent classes. In contrast, Assumption 4 provides precise growth rates on the dimensions $J$ and $K$ relative to the sample size $N$ that are necessary for the technical proofs of our main theorems, particularly to control the accumulated estimation errors.

Finally, the asymptotic analysis under the null hypothesis requires a condition on the classifier itself.

Assumption 5 (Consistency of classification estimator). When $K_0 = K$, the classification estimation method $\mathcal{M}$ satisfies
\[
\mathbb{P}\big(\hat{Z} = Z\Pi\big) \to 1,
\]
where $\Pi \in \{0,1\}^{K \times K}$ is a permutation matrix.

Assumption 5 requires a classifier that achieves exact recovery: $\hat{Z} = Z\Pi$ with probability tending to one. The spectral clustering with likelihood refinement (SOLA) proposed by Lyu and Gu [20] provides such a guarantee, but it is designed for binary responses and does not directly handle the polytomous ordinal responses in our model. To obtain a practical estimator for our setting, we introduce a spectral clustering algorithm called SC-LCM in Appendix C. Theorem 6 shows that SC-LCM is consistent for estimating the class membership matrix $Z$ in the sense that the clustering error $\mathrm{err}(\hat{Z}, Z)$ defined in Appendix C converges to zero in probability, but it does not meet the exact recovery requirement of Assumption 5.

This gap between a theoretical condition and the actual performance of an estimator is not uncommon. A similar situation occurs in the stochastic block model literature for network analysis. In developing a goodness-of-fit test for stochastic block models, Lei [17] assumed the existence of a community estimator that exactly recovers the true partition. In their numerical work, however, they used the spectral clustering algorithm studied by Lei and Rinaldo [18], which is only known to be consistent (i.e., its misclassification proportion tends to zero). Despite this discrepancy, their testing procedure performed very well in simulations. Recently, important theoretical advances suggest that exact recovery might also be achievable for SC-LCM under stronger conditions. The leave-one-out singular subspace perturbation analysis developed by Zhang and Zhou [33] provides a powerful tool for obtaining entrywise bounds on the difference between empirical and population singular vectors. Building on this, Lyu and Gu [20] proved that their SOLA algorithm achieves exact recovery for binary latent class models when the signal is sufficiently strong. This indicates that SC-LCM could potentially be shown to achieve exact recovery under analogous strengthened conditions. A rigorous proof, however, would require a detailed entrywise analysis that is beyond the scope of the present paper, whose main focus is to develop a goodness-of-fit test for the number of latent classes. We leave this meaningful theoretical question for future work.

Inspired by this precedent in network analysis and supported by recent theoretical insights, we adopt SC-LCM as the practical implementation of $\mathcal{M}$ in our sequential testing algorithms. As the numerical studies in Section 5 demonstrate, the resulting sequential testing procedures estimate the true number of latent classes with high accuracy across a wide range of settings, providing empirical support for using SC-LCM as a reliable plug-in estimator even though its exact recovery property has not been formally established here.
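Since Appendix C is not reproduced here, the following is only a plausible sketch of what a spectral plug-in estimator $\mathcal{M}$ of this kind might look like: a rank-$K_0$ SVD followed by k-means on the left singular vectors, with $\hat\Theta$ taken as class-wise item means. The function name `spectral_fit` and all implementation details are our own illustration, not the paper's SC-LCM algorithm.

```python
import numpy as np
from scipy.sparse.linalg import svds
from scipy.cluster.vq import kmeans2

def spectral_fit(R, K0, seed=0):
    """Return (Z_hat, Theta_hat, R_hat) for a candidate number of classes K0."""
    N, J = R.shape
    X = R.astype(float)
    if K0 < min(N, J):
        U, _, _ = svds(X, k=K0)                 # leading K0 left singular vectors
    else:
        U = np.linalg.svd(X, full_matrices=False)[0][:, :K0]
    # Cluster subjects in the K0-dimensional spectral embedding.
    _, labels = kmeans2(U, K0, minit='++', seed=seed)
    Z_hat = np.zeros((N, K0), dtype=int)
    Z_hat[np.arange(N), labels] = 1
    # Theta_hat(j,k): average response of estimated class k on item j.
    sizes = np.maximum(Z_hat.sum(axis=0), 1)    # guard against empty clusters
    Theta_hat = (X.T @ Z_hat) / sizes
    return Z_hat, Theta_hat, Z_hat @ Theta_hat.T
```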
With these assumptions in place, we can quantify the difference between the practical and ideal normalized residual matrices under the null hypothesis.

Lemma 2 (Perturbation control of normalized residual matrix). When $K = K_0$ and Assumptions 1-5 hold, we have
\[
\|\tilde{R} - R^*\| = o_P(1). \tag{3}
\]

Lemma 2 shows that under the null hypothesis, the empirical normalized residual matrix $\tilde{R}$ is asymptotically equivalent to its oracle counterpart $R^*$ in spectral norm. Hence, $\tilde{R}$ asymptotically inherits the behavior of $R^*$ shown in Lemma 1. This means that after correctly fitting a $K_0$-class model, the empirical residual matrix is asymptotically indistinguishable from a pure noise matrix. Based on this lemma, our test statistic for a candidate $K_0$ is defined as
\[
T_{K_0} = \sigma_1(\tilde{R}) - \left(1 + \sqrt{\frac{J}{N}}\right). \tag{4}
\]

The following theorem establishes the behavior of $T_{K_0}$ when the model is correctly specified.

Theorem 1 (Null behavior of test statistic). When $K = K_0$ and Assumptions 1-5 hold, for any $\epsilon > 0$, we have
\[
\lim_{N \to \infty} \mathbb{P}\big(T_{K_0} < \epsilon\big) = 1.
\]

Theorem 1 establishes that under the null hypothesis, $T_{K_0}$ is bounded above by any positive constant with probability tending to one. Hence, the test statistic is asymptotically negligible, providing no evidence against the null. This result alone, however, does not suffice for model selection. We must also understand the behavior of $T_{K_0}$ when the candidate number of latent classes is too small. When $K_0 < K$, the fitted expected response matrix $\hat{\mathcal{R}}$ cannot capture the full structure of the data, leaving a deterministic (i.e., non-random) signal in the residual matrix that we can detect. To quantify this signal, we introduce the final condition.

Assumption 6 (Signal-to-noise ratio). Define the constants
\[
c := \frac{\sqrt{c_0^3 c_1}}{2\sqrt{2}}, \qquad C_{\mathrm{signal}} := \frac{2 c \zeta}{\sqrt{M}}, \qquad C_{\mathrm{noise}} := \frac{5}{\delta}.
\]
There exists a constant $\eta_0 > 0$ such that for all sufficiently large $N$,
\[
\frac{C_{\mathrm{signal}} \sqrt{J}}{\sqrt{K K_0}} \geq C_{\mathrm{noise}} + 1 + 3\eta_0.
\]

Assumption 6 quantifies the strength of the signal: the left-hand side is the minimal deterministic signal (after the scaling in $\tilde{R}$), while the right-hand side aggregates the maximal possible noise and the centering term, plus a buffer. When this holds, the signal dominates the noise, guaranteeing that the test rejects $H_0$ with probability tending to one. We are now ready to state the power guarantee.

Theorem 2 (Alternative behavior of the test statistic). Under $K_0 < K$ and Assumptions 1-4 and 6, we have
\[
\lim_{N \to \infty} \mathbb{P}\big(T_{K_0} > 2\eta_0\big) = 1.
\]

Theorem 2 shows that under the alternative hypothesis $K_0 < K$, the test statistic $T_{K_0}$ exceeds a fixed positive constant $2\eta_0$ with probability tending to one. Hence, the test has power against any under-fitted model. Together with Theorem 1, we have established the fundamental dichotomy: the upper bound of $T_{K_0}$ is near zero under correct specification, but the statistic itself is large under under-fitting. This sharp dichotomous behavior of $T_{K_0}$ under correct and under-fitted specifications provides the foundation for the consistent sequential estimation of the true number of latent classes $K$.
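In code, Equations (2) and (4) translate directly into a few lines. The sketch below is our own illustration; the helper name `test_statistic` is hypothetical, and the fitted matrix $\hat{\mathcal{R}}$ can come from any plug-in estimator $\mathcal{M}$, such as the spectral sketch above.

```python
import numpy as np

def test_statistic(R, R_hat, M):
    """T_{K0} of Equation (4), built from the normalized residuals of Equation (2)."""
    N, J = R.shape
    V_hat = R_hat * (1.0 - R_hat / M)           # entrywise variance estimate V_hat(i,j)
    R_tilde = np.zeros((N, J))
    mask = V_hat > 0                            # entries with V_hat(i,j) = 0 stay zero
    R_tilde[mask] = (R - R_hat)[mask] / np.sqrt(N * V_hat[mask])
    sigma1 = np.linalg.svd(R_tilde, compute_uv=False)[0]
    return sigma1 - (1.0 + np.sqrt(J / N))      # center by 1 + sqrt(J/N)
```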
4. Algorithms

The sharp dichotomy in the behavior of $T_{K_0}$ under the null and alternative hypotheses motivates two natural sequential testing procedures for estimating $K$: the first stops at the smallest $K_0$ for which $T_{K_0}$ falls below a threshold; the second stops when the ratio of successive test statistics exceeds a diverging threshold. This section presents the two sequential algorithms and proves their estimation consistency.

4.1. GoF-LCM algorithm

Our first algorithm directly implements the dichotomous behavior of $T_{K_0}$ for estimating $K$. Algorithm 1 sequentially tests candidates $K_0 = 1, 2, \ldots, K_{\max}$. It accepts the first $K_0$ for which $T_{K_0}$ falls below a threshold $\tau_N$ that decays to zero. The maximum candidate $K_{\max}$ is chosen so that any $K_0 \leq K_{\max}$ respects the growth condition in Assumption 4(i). A convenient default is $K_{\max} = \lfloor\sqrt{N/\log(N+J)}\rfloor$. This choice ensures that, for all candidates considered, $K_0^2 \log(N+J)/N \to 0$, which is required for the technical arguments in the consistency proof.

Algorithm 1 GoF-LCM: Goodness-of-Fit Testing for Latent Class Models
Require: Observed response matrix $R \in \{0, 1, \ldots, M\}^{N \times J}$, maximum candidate number $K_{\max}$ (default $\lfloor\sqrt{N/\log(N+J)}\rfloor$), threshold sequence $\tau_N$ (default $\tau_N = N^{-1/5}$), and a classification estimator $\mathcal{M}$.
Ensure: Estimated number of latent classes $\hat{K}$.
1: Initialize $\hat{K} \leftarrow K_{\max}$
2: for $K_0 = 1, 2, \ldots, K_{\max}$ do
3:   Apply $\mathcal{M}$ to $R$ with candidate $K_0$ to obtain $\hat{Z}$, $\hat{\Theta}$, and $\hat{\mathcal{R}} = \hat{Z}\hat{\Theta}^\top$
4:   Compute $T_{K_0}$ via Equation (4)
5:   if $T_{K_0} < \tau_N$ then
6:     $\hat{K} \leftarrow K_0$
7:     break
8:   end if
9: end for
10: return $\hat{K}$

The following theorem establishes the consistency of Algorithm 1 in estimating the true number of latent classes $K$ under the latent class model.

Theorem 3 (Consistency of GoF-LCM). Let the true number of latent classes be $K$ (which may grow with $N$ subject to Assumption 4). Suppose Assumptions 1-6 hold and the threshold sequence $\{\tau_N\}_{N \geq 1}$ used in Algorithm 1 satisfies
\[
\text{(Con1)} \quad \tau_N \xrightarrow[N \to \infty]{} 0 \quad \text{and} \quad \frac{N \tau_N^2}{\max\big(JK\log(JK),\ \log(N+J)\big)} \xrightarrow[N \to \infty]{} \infty.
\]
Then the estimator $\hat{K}$ produced by Algorithm 1 satisfies
\[
\lim_{N \to \infty} \mathbb{P}\big(\hat{K} = K\big) = 1.
\]

Theorem 3 ensures that GoF-LCM correctly identifies the true number of latent classes with probability approaching 1 as the sample sizes grow. Condition (Con1) quantifies how slowly $\tau_N$ must decay. The first requirement, $\tau_N \to 0$, ensures that under the null $K_0 = K$ the test statistic $T_K$ eventually falls below the threshold with high probability (Theorem 1). The second requirement guarantees that for every under-fitted $K_0 < K$ the statistic $T_{K_0}$ exceeds $\tau_N$ with probability tending to one (Theorem 2). The sequential nature of GoF-LCM makes it computationally efficient, typically requiring only a few iterations before stopping at the true $K$.
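A compact sketch of Algorithm 1 follows. It assumes a fitting routine `fit(R, K0)` returning $(\hat Z, \hat\Theta, \hat{\mathcal{R}})$, such as the spectral sketch above, and the `test_statistic` helper from Section 3; both names are our own illustrations, not the paper's code.

```python
import numpy as np

def gof_lcm(R, M, fit, test_statistic, K_max=None, tau=None):
    """Accept the first candidate K0 whose test statistic falls below tau_N."""
    N, J = R.shape
    if K_max is None:
        K_max = int(np.sqrt(N / np.log(N + J)))   # default maximum candidate
    if tau is None:
        tau = N ** (-1 / 5)                       # default threshold tau_N = N^(-1/5)
    for K0 in range(1, K_max + 1):
        R_hat = fit(R, K0)[2]                     # fitted expected response matrix
        if test_statistic(R, R_hat, M) < tau:     # accept H0: K = K0
            return K0
    return K_max                                  # no candidate accepted

# Example usage: K_hat = gof_lcm(R, M=5, fit=spectral_fit, test_statistic=test_statistic)
```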
Remark 2 (Choice of $\tau_N$). A simple and theoretically valid default is $\tau_N = N^{-1/5}$. Under Assumption 4, we have $JK\log(JK) = o(N)$ and $\log(N+J) = o(N)$. Hence $N\tau_N^2 = N^{3/5}$ grows faster than both terms in the denominator, satisfying (Con1) provided that $JK\log(JK)$ and $\log(N+J)$ do not approach the order of $N$ too rapidly. In practice, the algorithm is robust to moderate variations of $\tau_N$ as long as it decays slowly. Other choices such as $\tau_N = (\log N)^{-1}$ are also admissible under (Con1) when $JK\log(JK)$ grows sufficiently slowly.

4.2. RGoF-LCM algorithm

In this section, we develop a ratio-based goodness-of-fit test for the latent class model. This method complements the GoF-LCM algorithm by using the ratio of successive test statistics, which often exhibits more robust finite-sample behaviour. Recall that for a candidate number of latent classes $K_0$, the test statistic $T_{K_0}$ is defined in Equation (4). For $K_0 \geq 2$, we introduce the ratio
\[
r_{K_0} := \left|\frac{T_{K_0 - 1}}{T_{K_0}}\right|, \tag{5}
\]
with the convention that $r_1$ is not defined. The absolute value handles the possibility that $T_K$ may be negative under the true model, since Theorem 1 only bounds $T_K$ from above. The following theorem characterizes the asymptotic behaviour of the ratio statistic $r_{K_0}$.

Theorem 4 (Asymptotic behaviour of the ratio statistic). Assume that Assumptions 1-6 hold and $K^3 = o(N)$. We have:
1. (Divergence at the true model.) For the true candidate $K_0 = K$, $r_K \xrightarrow{P} \infty$ as $N \to \infty$.
2. (Upper bound under under-fitting.) For every $K_0$ with $2 \leq K_0 < K$,
\[
\lim_{N \to \infty} \mathbb{P}\left(r_{K_0} > \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K(K-1)}\right) = 0,
\]
where $c_{\mathrm{low}}$ is the constant from Lemma 7 (depending only on the model parameters $\delta, c_0, c_1, M, \zeta$), and the factor $\sqrt{K(K-1)}$ reflects the dependence on the true number of latent classes.

In contrast to the original test statistic $T_{K_0}$, whose upper bound converges to zero under the null (Theorem 1) while the statistic itself exceeds a fixed positive constant under under-fitting (Theorem 2), the ratio $r_{K_0}$ amplifies this difference. It diverges to infinity at the true model $K_0 = K$, but for every under-fitted $K_0 < K$ it remains bounded above by a quantity depending on $K$. Thus $r_{K_0}$ exhibits a sharp peak at the true number of latent classes, suggesting a natural sequential stopping rule: compute $r_{K_0}$ for increasing $K_0$ and stop at the first value for which the ratio exceeds a diverging threshold $\gamma_N$.

We now propose a sequential algorithm based on this idea. The algorithm first checks the candidate $K_0 = 1$ using the original test statistic with a decaying threshold $\tau_N$; if accepted, it returns $\hat{K} = 1$. Otherwise, it proceeds to compute ratios for $K_0 = 2, 3, \ldots, K_{\max}$ and stops at the first $K_0$ for which $r_{K_0} > \gamma_N$. The thresholds must satisfy the conditions in Theorem 5, given below, to ensure consistent estimation of $K$.
Algorithm 2 RGoF-LCM: Ratio-based Goodness-of-Fit for Latent Class Models
Require: Observed response matrix $R \in \{0, 1, \ldots, M\}^{N \times J}$, maximum candidate number $K_{\max}$ (default $\lfloor\sqrt{N/\log(N+J)}\rfloor$), threshold sequences $\tau_N$ (default $\tau_N = N^{-1/5}$) and $\gamma_N$ (default $\gamma_N = \log N$), and a classification estimator $\mathcal{M}$.
Ensure: Estimated number of latent classes $\hat{K}$.
1: Compute $T_1$ with $K_0 = 1$
2: if $T_1 < \tau_N$ then
3:   return $\hat{K} = 1$
4: end if
5: for $K_0 = 2, 3, \ldots, K_{\max}$ do
6:   Compute $r_{K_0}$ via Equation (5)
7:   if $r_{K_0} > \gamma_N$ then
8:     return $\hat{K} = K_0$
9:   end if
10: end for
11: return $\hat{K} = K_{\max}$

We now prove that under appropriate conditions on the threshold sequences, the RGoF-LCM algorithm consistently estimates the true number of latent classes $K$. For this theorem, we assume that $K$ is fixed (does not grow with $N$) to simplify the analysis and the choice of the threshold $\gamma_N$.

Theorem 5 (Consistency of RGoF-LCM). Let the true number of latent classes be $K$ (fixed, i.e., not growing with $N$). Assume that Assumptions 1-6 hold. Let the threshold sequence $\gamma_N$ satisfy
\[
\text{(Con2)} \quad \gamma_N \xrightarrow[N \to \infty]{} \infty \quad \text{and} \quad \frac{\gamma_N}{\sqrt{N/\log J}} \xrightarrow[N \to \infty]{} 0.
\]
Then the estimator $\hat{K}$ produced by Algorithm 2 satisfies
\[
\lim_{N \to \infty} \mathbb{P}(\hat{K} = K) = 1.
\]

Theorem 5 guarantees that RGoF-LCM consistently recovers the true number of latent classes when $K$ is fixed. Condition (Con2) ensures that the diverging threshold $\gamma_N$ grows fast enough to be exceeded by $r_K$ at the true model (Theorem 4, part 1), yet slowly enough that the under-fitted ratios $r_{K_0}$ ($K_0 < K$) stay below $\gamma_N$ with high probability (Theorem 4, part 2). The two requirements together prevent both under-estimation and over-estimation asymptotically.

Remark 3 (Why $K$ is assumed fixed). Theorem 5 assumes that the true number of latent classes $K$ does not grow with the sample size. This assumption is made for theoretical convenience: the upper bound for the under-fitted ratios $r_{K_0}$ in Theorem 4 contains a factor $\sqrt{K(K-1)}$ that depends on $K$. When $K$ is fixed, this factor is a constant, and the condition on $\gamma_N$ can be expressed in a simple form independent of $K$. If $K$ were allowed to increase with $N$, the bound would grow with $K$, and the choice of a diverging threshold $\gamma_N$ would have to depend on the growth rate of $K$ as well. Such an extension is possible in principle but would make the analysis considerably more complex. We therefore restrict ourselves to the fixed-$K$ setting. In practice, as long as $K$ is small relative to the sample size, the fixed-$K$ analysis provides reliable guidance for selecting $\gamma_N$.

Remark 4 (Choice of $\gamma_N$). A simple and theoretically valid default is $\gamma_N = \log N$. Indeed $\log N \to \infty$, and under Assumption 4 we have $J = o(N)$, hence $\log J \leq \log N$ for all sufficiently large $N$. Consequently,
\[
\frac{\log N}{\sqrt{N/\log J}} \leq \frac{\log N}{\sqrt{N/\log N}} = \frac{(\log N)^{3/2}}{\sqrt{N}} \xrightarrow[N \to \infty]{} 0,
\]
so $\gamma_N = \log N$ satisfies Condition (Con2). In practice, other slowly diverging sequences such as $\log\log N$ are also admissible as long as they meet the growth restrictions.
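The ratio-based procedure admits an equally short sketch, again assuming the hypothetical `fit` and `test_statistic` helpers from the earlier sketches; this is our illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def rgof_lcm(R, M, fit, test_statistic, K_max=None, tau=None, gamma=None):
    """Stop at the first K0 >= 2 whose ratio r_{K0} exceeds gamma_N."""
    N, J = R.shape
    if K_max is None:
        K_max = int(np.sqrt(N / np.log(N + J)))
    tau = N ** (-1 / 5) if tau is None else tau   # threshold for the K0 = 1 check
    gamma = np.log(N) if gamma is None else gamma # diverging ratio threshold
    T_prev = test_statistic(R, fit(R, 1)[2], M)   # T_1, checked against tau_N
    if T_prev < tau:
        return 1
    for K0 in range(2, K_max + 1):
        T_cur = test_statistic(R, fit(R, K0)[2], M)
        if abs(T_prev / T_cur) > gamma:           # r_{K0} = |T_{K0-1}/T_{K0}|, Eq. (5)
            return K0
        T_prev = T_cur
    return K_max
```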
5. Numerical Studies

In this section, we conduct comprehensive experimental studies to evaluate the performance of the proposed goodness-of-fit test and the two sequential estimation algorithms GoF-LCM and RGoF-LCM. Our numerical experiments are designed to empirically validate the theoretical properties established in Theorems 1 to 5. Specifically, we investigate:

1. The behavior of the test statistic $T_{K_0}$ and the ratio statistic $r_{K_0}$ under both the null hypothesis ($H_0: K = K_0$) and the alternative hypothesis ($H_1: K > K_0$).
2. The acceptance rates of GoF-LCM and RGoF-LCM under the true model and their rejection rates under under-fitted models.
3. The accuracy of GoF-LCM and RGoF-LCM in estimating the true number of latent classes $K$ under various combinations of sample size $N$, number of items $J$, signal-strength parameter $\delta$, and true number of latent classes $K$.
4. The computational efficiency of the two proposed methods as the sample size $N$ increases.
5. The sensitivity of the algorithms to their thresholds: $\tau_N$ for GoF-LCM and $\gamma_N$ for RGoF-LCM.
6. The robustness of GoF-LCM and RGoF-LCM to the number of items $J$ when $J$ grows much faster than $N$, thereby violating the growth condition $J = o(N)$ required by Assumption 4.

5.1. General simulation setup

Data are generated from the latent class model for ordinal categorical data as defined in Definition 1. The generation process follows the steps outlined below, with all parameters chosen to satisfy Assumptions 1-3 unless otherwise noted.

Class membership matrix $Z$. For a given true number of latent classes $K$, we assign each of the $N$ subjects to one of the $K$ classes independently with equal probability $1/K$. This yields a membership vector $\ell \in [K]^N$, and the classification matrix $Z \in \{0,1\}^{N \times K}$ is defined by $Z(i,k) = \mathbb{1}\{\ell(i) = k\}$. This random assignment ensures that Assumption 2 (class balance) holds with high probability for sufficiently large $N$.

Item parameter matrix $\Theta$. We generate $\Theta \in [0, M]^{J \times K}$ with controlled signal strength and class separation as follows. Let $M = 5$ be fixed throughout all experiments. For each $j \in [J]$ and $k \in [K]$, independently draw
\[
\theta_{jk} \sim \mathrm{Uniform}[\delta M, (1-\delta)M],
\]
where $\delta \in (0, 0.5]$ is the signal-strength parameter. This construction guarantees $\delta \leq \theta_{jk}/M \leq 1 - \delta$ by definition, thereby satisfying Assumption 1.

Assumption 3 requires that for any two distinct classes $k_1, k_2$, at least a fraction $c_1$ of the items exhibit a difference in expectation of at least $\zeta/2$. We set $c_1 = 0.3$ throughout and define the separation threshold as $\zeta = (1-2\delta)M/2$, which is half the length of the uniform interval. In our experiments, the number of items is at least $J = 60$ and the maximum number of classes is $K = 8$. For a fixed pair of distinct classes $k_1, k_2$, consider the indicator
\[
X_j = \mathbb{1}\big(|\theta_{j,k_1} - \theta_{j,k_2}| \geq \zeta/2\big), \quad j = 1, \ldots, J.
\]
Because $\theta_{j,k_1}$ and $\theta_{j,k_2}$ are independent and uniformly distributed, a simple geometric argument yields
\[
\mathbb{P}(X_j = 1) = \left(1 - \frac{\zeta/2}{(1-2\delta)M}\right)^2 = \left(\frac{3}{4}\right)^2 = \frac{9}{16},
\]
which is independent of $\delta$. Hence $\mathbb{E}\big[\sum_{j=1}^J X_j\big] = 9J/16$, which for $J = 60$ equals 33.75, well above the required $c_1 J = 18$. By Hoeffding's inequality, the probability that a single pair fails to meet the condition is bounded by
\[
\mathbb{P}\left(\sum_{j=1}^{J} X_j < 0.3 J\right) \leq \exp\left(-2J\left(\frac{9}{16} - 0.3\right)^2\right) = \exp(-0.1378125\, J).
\]
For $J = 60$ this bound is $\exp(-8.26875) \approx 2.57 \times 10^{-4}$. A union bound over all $\binom{K}{2}$ class pairs (with $K = 8$, $\binom{8}{2} = 28$) gives an overall failure probability of at most
\[
\binom{8}{2} \exp(-0.1378125 \times 60) \approx 0.0072.
\]
Thus, with probability exceeding 0.9928, a randomly generated $\Theta$ matrix automatically satisfies Assumption 3. For larger $J$ this probability rapidly approaches 1 (e.g., for $J = 70$ it exceeds 0.998). Given this strong theoretical guarantee, no rejection sampling or verification step is needed; all matrices are used directly in the simulations. The signal-strength parameter $\delta$ directly controls the width of the admissible interval $[\delta M, (1-\delta)M]$: larger $\delta$ reduces $\zeta$, weakening the signal, exactly as intended.
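The geometric claim $\mathbb{P}(X_j = 1) = 9/16$ is easy to verify numerically. The following quick Monte Carlo check is our own, not from the paper, and uses $M = 5$ and $\delta = 0.2$ for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M, delta = 5, 0.2
L = (1 - 2 * delta) * M                     # length of the interval [delta*M, (1-delta)*M]
t1 = rng.uniform(delta * M, (1 - delta) * M, size=10**6)
t2 = rng.uniform(delta * M, (1 - delta) * M, size=10**6)
# zeta/2 = (1 - 2*delta)*M/4 = L/4, and P(|t1 - t2| >= L/4) = (3/4)^2 = 0.5625
print(np.mean(np.abs(t1 - t2) >= L / 4))    # prints approximately 0.5625
```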
Response matrix $R$. Given $Z$ and $\Theta$, the expected response matrix is $\mathcal{R} = Z\Theta^\top$. For each $i \in [N]$ and $j \in [J]$, the response $R(i,j)$ is independently drawn from a Binomial distribution with $M$ trials and success probability $\mathcal{R}(i,j)/M$.

Estimation and evaluation. For each simulated dataset, we apply the SC-LCM algorithm (Algorithm 3 in Appendix C) as the classification estimator $\mathcal{M}$ to obtain $\hat{Z}$ and $\hat{\Theta}$. The test statistic $T_{K_0}$ is then computed using Equation (4), and the ratio $r_{K_0}$ is computed via Equation (5). For the sequential algorithms, we use the default thresholds $\tau_N = N^{-1/5}$ for GoF-LCM and $\gamma_N = \log N$ for RGoF-LCM, unless otherwise stated. The maximum candidate number is set to $K_{\max} = \lfloor\sqrt{N/\log(N+J)}\rfloor$. All results are based on 200 independent Monte Carlo replications.

5.2. Experiment 1: behavior of $T_{K_0}$ and $r_{K_0}$ under $H_0$ and $H_1$

This experiment empirically verifies the sharp dichotomous behavior of $T_{K_0}$ (Theorems 1 and 2) and $r_{K_0}$ (Theorem 4). We fix the true number of latent classes $K = 4$, the number of items $J = 60$, and the signal-strength parameter $\delta = 0.2$. We then compute $T_{K_0}$ for $K_0 = 1, 2, 3, 4$ and the ratio statistics $r_2 = |T_1/T_2|$, $r_3 = |T_2/T_3|$, and $r_4 = |T_3/T_4|$. We vary the sample size $N \in \{200, 400, 600, 800, 1000\}$. For each $N$ and each candidate $K_0$, we compute the mean and standard deviation of $T_{K_0}$ over 200 replications.

Table 1 reports the results. Under the correctly specified model ($K_0 = K = 4$), the mean of $T_4$ is close to zero and its absolute value decreases as $N$ grows, with the standard deviation also shrinking. This confirms Theorem 1. For the under-fitted models ($K_0 = 1, 2, 3$), $T_{K_0}$ takes large positive values that increase with $N$, consistent with the fixed positive lower bound guaranteed by Theorem 2. The ratios $r_2 = |T_1/T_2|$ and $r_3 = |T_2/T_3|$ remain bounded between 1.16 and 1.21 as $N$ increases, consistent with part 2 of Theorem 4 for $K_0 < K$. In contrast, $r_4 = |T_3/T_4|$ diverges because $T_3$ grows while $|T_4|$ tends to zero, confirming part 1 of Theorem 4 for the true candidate $K_0 = K$.

Table 1: Behavior of $T_{K_0}$ and ratios $r_2, r_3, r_4$ under $H_0$ ($K_0 = 4$) and under-fitted models ($K_0 = 1, 2, 3$) for true $K = 4$. Values are mean (standard deviation) over 200 replications.

N    | K0=1 (H1)     | K0=2 (H1)     | K0=3 (H1)     | K0=4 (H0)      | r2=|T1/T2|  | r3=|T2/T3|  | r4=|T3/T4|
200  | 2.020 (0.171) | 1.706 (0.147) | 1.418 (0.149) | -0.037 (0.026) | 1.19 (0.12) | 1.21 (0.13) | 255.42 (2952.98)
400  | 2.135 (0.167) | 1.835 (0.147) | 1.551 (0.153) | -0.024 (0.018) | 1.17 (0.10) | 1.19 (0.12) | 157.51 (486.75)
600  | 2.214 (0.174) | 1.891 (0.147) | 1.620 (0.157) | -0.018 (0.014) | 1.18 (0.10) | 1.18 (0.12) | 229.62 (676.48)
800  | 2.257 (0.167) | 1.933 (0.152) | 1.662 (0.153) | -0.017 (0.011) | 1.17 (0.11) | 1.17 (0.12) | 388.22 (2547.19)
1000 | 2.278 (0.169) | 1.956 (0.150) | 1.690 (0.161) | -0.014 (0.010) | 1.17 (0.10) | 1.16 (0.12) | 450.91 (2527.84)

5.3. Experiment 2: acceptance and rejection rates

This experiment evaluates the ability of both sequential algorithms, GoF-LCM and RGoF-LCM, to correctly accept the true model and reject under-fitted models. We fix $N = 1000$, $J = 60$, $\delta = 0.2$,
and consider true numbers of latent classes $K \in \{2, 3, 4, 5, 6\}$. For each true $K$, we simulate 200 independent datasets. On each dataset, we apply both algorithms:

• For GoF-LCM, we record the stopping $K_0$ as the first candidate for which $T_{K_0} < \tau_N$ (with $\tau_N = N^{-1/5}$).
• For RGoF-LCM, we record the stopping $K_0$ as the first candidate for which $r_{K_0} > \gamma_N$ (with $\gamma_N = \log N$). If $T_1 < \tau_N$, the algorithm stops at $K_0 = 1$; otherwise, it proceeds to compute $r_{K_0}$ for $K_0 \geq 2$.

We then compute the proportion of times each algorithm stops at each candidate $K_0$ over the 200 replications. Table 2 presents the results. For each true $K$, the table shows the stopping proportions for GoF-LCM and RGoF-LCM side by side, with columns covering $K_0 = 1$ to 10 to fully capture overfitting behavior. The main findings are:

• Both algorithms exhibit high acceptance rates at the true model. For $K = 2, 3, 4, 5$, both methods stop at the true $K$ in all 200 replications. For $K = 6$, GoF-LCM stops at $K_0 = 6$ in all replications, while RGoF-LCM does so in 99.5% of the cases, with a small overfitting probability of 0.5% of stopping at $K_0 = 7$.
• The probability of stopping at an under-fitted model ($K_0 < K$) is effectively zero for both algorithms in every case.
• Overfitting ($K_0 > K$) occurs with very low probability, observed only in the $K = 6$ case for RGoF-LCM.

These results empirically confirm the consistency of both sequential algorithms.

Table 2: Proportion of times GoF-LCM and RGoF-LCM stop at each candidate $K_0$ for true $K = 2, 3, 4, 5, 6$. Columns cover $K_0 = 1$ to 10.

True K | Algorithm | K0=1  | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10
2      | GoF-LCM   | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
2      | RGoF-LCM  | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
3      | GoF-LCM   | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
3      | RGoF-LCM  | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
4      | GoF-LCM   | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
4      | RGoF-LCM  | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
5      | GoF-LCM   | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
5      | RGoF-LCM  | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
6      | GoF-LCM   | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000
6      | RGoF-LCM  | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.995 | 0.005 | 0.000 | 0.000 | 0.000

5.4. Experiment 3: accuracy in estimating $K$

We now assess the estimation accuracy of the proposed methods GoF-LCM and RGoF-LCM, and compare them with the spectral thresholding method (denoted Spec) proposed in Equation (17) of Lyu and Gu [20], which estimates $K$ by counting the singular values exceeding $2.01(\sqrt{J} + \sqrt{N})$. The parameters are $N \in \{200, 600\}$, $J \in \{60, 100\}$, $\delta \in \{0.1, 0.2, 0.3\}$, and $K \in \{1, 2, 3, 4\}$. For each combination, we generate 200 independent datasets and apply all three methods. Accuracy is defined as the proportion of replications in which the estimated $\hat{K}$ equals the true $K$. Standard errors are computed as $\sqrt{\mathrm{accuracy} \times (1 - \mathrm{accuracy})/200}$ and are shown in parentheses.

Table 3 reports the accuracy for all 48 parameter combinations. The main findings are:

• The case $K = 1$ is trivial: all methods achieve perfect accuracy across all settings.
• GoF-LCM and RGoF-LCM achieve near-perfect accuracy across almost all settings. For all combinations except one, both methods attain an accuracy of 1.000.
The only exception is $K = 4$, $N = 200$, $J = 60$, $\delta = 0.3$, where GoF-LCM yields 0.985 and RGoF-LCM yields 0.990. This demonstrates the strong consistency and reliability of the proposed sequential testing procedures, which are able to recover the true number of latent classes even under relatively weak signal conditions.

• The spectral thresholding method (Spec) performs unreliably, with accuracy ranging from perfect to zero depending on the signal strength. Its accuracy depends critically on the strength of the signal, which is controlled by $\delta$, $K$, $N$, and $J$. Recall that $\delta$ is a signal-strength parameter: smaller $\delta$ allows the class-specific parameters $\Theta(j,k)$ to take values near the extremes 0 or $M$, creating substantial differences between classes and a strong signal, while larger $\delta$ confines $\Theta(j,k)$ to a narrow interval around $M/2$, reducing class separation and weakening the signal. The data confirm this interpretation:
  – At $\delta = 0.1$ (strong signal), Spec performs excellently, with accuracy 1.000 in all but one case ($K = 4$, $N = 200$, $J = 60$, $\delta = 0.1$ yields 0.995).
  – At $\delta = 0.2$ (moderate signal), Spec remains good overall, but some degradation appears: $K = 3$, $N = 200$, $J = 60$, $\delta = 0.2$ gives 0.995, and $K = 4$, $N = 200$, $J = 60$, $\delta = 0.2$ drops to 0.520.
  – At $\delta = 0.3$ (weak signal), Spec frequently fails. For $K = 3$, accuracy ranges from 0.000 to 0.995; for $K = 4$, it is 0.000 in three of four combinations, reaching only 0.565 at the largest sample size ($N = 600$, $J = 100$).
• While Spec is simple and computationally cheap, its performance is highly sensitive to the signal strength. In weak-signal regimes ($\delta = 0.3$), it can completely fail. In contrast, GoF-LCM and RGoF-LCM adaptively test the fit of candidate models and maintain near-perfect accuracy across all scenarios, including the weakest signal considered. The slight advantage of RGoF-LCM over GoF-LCM in the hardest case ($K = 4$, $N = 200$, $J = 60$, $\delta = 0.3$) confirms its marginally improved robustness.

These results highlight the practical advantage of the proposed sequential testing framework: it delivers highly accurate and robust estimates of the number of latent classes without requiring manual tuning, whereas the simple spectral thresholding method can fail catastrophically when the signal is weak.

5.5. Experiment 4: computational efficiency under weak signal

We investigate the performance of GoF-LCM, RGoF-LCM, and Spec in a challenging weak-signal scenario with signal-strength parameter $\delta = 0.3$, number of items $J = 60$, true number of latent classes $K = 8$, and sample size $N$ ranging from 400 to 4000 in increments of 400. For each configuration, we generate 200 independent datasets and apply all three methods. The accuracy (proportion of correct estimates of $K$) and the average running time (in seconds) are recorded.

Figure 1 displays the results. Under this weak-signal scenario, RGoF-LCM demonstrates strong robustness, achieving high accuracy even at the smallest sample size and reaching perfect accuracy for almost all sample sizes. GoF-LCM, on the other hand, has low accuracy at $N = 400$ and $N = 800$, but its performance improves rapidly as $N$ increases, reaching perfect accuracy for $N \geq 1600$. In contrast, Spec fails completely in this weak-signal regime: its accuracy is almost always zero.
All three methods are computationally efficient. The running times of GoF-LCM and RGoF-LCM scale approximately linearly with $N$, and even at $N = 4000$ the average time remains below 0.6 seconds. Spec is extremely fast, with negligible running time across all sample sizes, due to its simple singular value thresholding procedure.

Table 3: Accuracy (with standard error in parentheses) of GoF-LCM, RGoF-LCM, and the spectral thresholding method (Spec) for all parameter combinations.

K | N   | J   | δ   | GoF-LCM       | RGoF-LCM      | Spec
1 | 200 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 200 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 200 | 60  | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 200 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 200 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 200 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 600 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 600 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 600 | 60  | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 600 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 600 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
1 | 600 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 200 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 200 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 200 | 60  | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.880 (0.023)
2 | 200 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 200 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 200 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 600 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 600 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 600 | 60  | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 600 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 600 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
2 | 600 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 200 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 200 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 0.995 (0.005)
3 | 200 | 60  | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.000 (0.000)
3 | 200 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 200 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 200 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.440 (0.035)
3 | 600 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 600 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 600 | 60  | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.450 (0.035)
3 | 600 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 600 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
3 | 600 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.995 (0.005)
4 | 200 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 0.995 (0.005)
4 | 200 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 0.520 (0.035)
4 | 200 | 60  | 0.3 | 0.985 (0.009) | 0.990 (0.007) | 0.000 (0.000)
4 | 200 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
4 | 200 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
4 | 200 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.000 (0.000)
4 | 600 | 60  | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
4 | 600 | 60  | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 0.980 (0.010)
4 | 600 | 60  | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.000 (0.000)
4 | 600 | 100 | 0.1 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
4 | 600 | 100 | 0.2 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000)
4 | 600 | 100 | 0.3 | 1.000 (0.000) | 1.000 (0.000) | 0.565 (0.035)
[Figure 1: Accuracy (left) and running time (right) of GoF-LCM, RGoF-LCM, and Spec for $K = 8$, $J = 60$, $\delta = 0.3$, with varying $N$.]

These results confirm that RGoF-LCM is the method of choice in weak-signal settings, offering both high accuracy and fast computation across the entire range of sample sizes. GoF-LCM also achieves excellent accuracy once the sample size is sufficiently large, while Spec, despite its speed, is unreliable when the signal is weak.

5.6. Experiment 5: sensitivity to thresholds

This experiment investigates the sensitivity of the two algorithms to their primary tuning parameters. We fix $K = 5$, $N = 1000$, $J = 60$, and $\delta = 0.2$.

5.6.1. Sensitivity of GoF-LCM to $\tau_N = N^{-\epsilon}$

We vary the decay parameter $\epsilon$ in the threshold $\tau_N = N^{-\epsilon}$ over the set $\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$. For each $\epsilon$, we run GoF-LCM on 200 simulated datasets and record the accuracy. The left part of Table 4 shows the results. Accuracy remains perfect (1.000) for $\epsilon \leq 0.4$, then gradually declines: 0.995 at $\epsilon = 0.5$, 0.990 at $\epsilon = 0.6$, 0.970 at $\epsilon = 0.7$, 0.960 at $\epsilon = 0.8$, and finally 0.935 and 0.930 at $\epsilon = 0.9$ and 1.0, respectively. Overall, GoF-LCM is robust to the choice of $\epsilon$ as long as it does not exceed 0.5; beyond this point, performance gradually degrades. The default $\epsilon = 0.2$ lies well within the stable region and yields perfect accuracy.

5.6.2. Sensitivity of RGoF-LCM to $\gamma_N = a \log N$

We vary the multiplier $a$ in the threshold $\gamma_N = a \log N$ over the set $\{0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0\}$. For each $a$, we run RGoF-LCM on 200 simulated datasets and record the accuracy. The right part of Table 4 shows the results. Accuracy is perfect (1.000) for all $a \leq 4.5$, and only slightly lower (0.995) at $a = 5.0$. This demonstrates that RGoF-LCM is highly robust to the choice of the multiplier $a$ over a wide range. The default $a = 1$ ($\gamma_N = \log N$) lies well within the stable region.

Table 4: Sensitivity of GoF-LCM to $\epsilon$ in $\tau_N = N^{-\epsilon}$ and of RGoF-LCM to the multiplier $a$ in $\gamma_N = a \log N$.

GoF-LCM: τ_N = N^(-ε)        RGoF-LCM: γ_N = a log N
ε     Accuracy               a     Accuracy
0.1   1.000                  0.5   1.000
0.2   1.000                  1.0   1.000
0.3   1.000                  1.5   1.000
0.4   1.000                  2.0   1.000
0.5   0.995                  2.5   1.000
0.6   0.990                  3.0   1.000
0.7   0.970                  3.5   1.000
0.8   0.960                  4.0   1.000
0.9   0.935                  4.5   1.000
1.0   0.930                  5.0   0.995

5.7. Experiment 6: performance under large $J$ (violating $J = o(N)$)

The theoretical analysis in Sections 3 and 4 relies on Assumption 4, which requires $\frac{JK\log(JK)}{N} \to 0$. In particular, this assumption forces the number of items $J$ to grow more slowly than the sample size $N$. This condition is used in the proofs to control accumulated estimation errors. Experiment 6 investigates the performance of GoF-LCM and RGoF-LCM when $J$ is allowed to be substantially larger than $N$, i.e., when $J$ grows faster than $N$ and the term $\frac{JK\log(JK)}{N}$ does not tend to zero.

We fix the sample size $N = 600$, the true number of latent classes $K = 8$, and the signal-strength parameter $\delta = 0.3$ (weak signal). The number of items $J$ is varied from 200 to 2000 in steps of 200. For each value of $J$, we generate 200 independent datasets. Figure 2 reports the accuracy of both algorithms for all considered values of $J$.
5.7. Experiment 6: performance under large $J$ (violating $J = o(N)$)

The theoretical analysis in Sections 3 and 4 relies on Assumption 4, which requires $\frac{JK\log(JK)}{N} \to 0$. In particular, this assumption forces the number of items $J$ to grow more slowly than the sample size $N$. This condition is used in the proofs to control accumulated estimation errors. Experiment 6 investigates the performance of GoF-LCM and RGoF-LCM when $J$ is allowed to be substantially larger than $N$, i.e., when $J$ grows faster than $N$ and the term $\frac{JK\log(JK)}{N}$ does not tend to zero.

We fix the sample size $N = 600$, the true number of latent classes $K = 8$, and the signal strength parameter $\delta = 0.3$ (weak signal). The number of items $J$ is varied from 200 to 2000 in steps of 200. For each value of $J$, we generate 200 independent datasets. Figure 2 reports the accuracy of both algorithms for all considered values of $J$. Remarkably, for every value of $J$ (including $J = 2000$, which is more than three times $N$), both GoF-LCM and RGoF-LCM achieve almost perfect accuracy of $1.000$.

[Figure 2: Accuracy of GoF-LCM and RGoF-LCM for $K = 8$, $N = 600$, $\delta = 0.3$, with varying $J$.]

These results show that the proposed sequential testing procedures are robust to the dimension of the item space. Although Assumption 4 requires $J = o(N)$ for the theoretical derivations, the actual performance does not deteriorate even when $J$ significantly exceeds $N$. For example, at $J = 2000$ we have $\frac{JK\log(JK)}{N} \approx 258$, which is far from zero, yet the algorithms still recover the true number of classes with probability one. This suggests that the growth condition in Assumption 4 is a sufficient condition imposed by the proof technique and may not be necessary for the consistency of the algorithms. This empirical finding opens the possibility of relaxing the growth conditions in future theoretical work.

5.8. Real data example

We illustrate the proposed sequential testing procedures using a publicly available dataset from the Open-Source Psychometrics Project. The data file `randomnumber.zip` can be downloaded from https://openpsychometrics.org/_rawdata/. The survey consisted of a cognitive task (generating random numbers) followed by a standard Big Five Personality Test (BFPT). The test comprises $J = 50$ items, each rated on a six-point scale with integer values from 0 to 5. For each item, 0 indicates the lowest level of agreement and 5 the highest. The dataset contains responses from $N = 1369$ individuals, yielding a response matrix $R \in \{0, 1, \ldots, 5\}^{1369 \times 50}$.
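Both diagnostics reported below can be reproduced directly from $R$. The sketch that follows is ours and purely illustrative: it uses `KMeans` from scikit-learn as a stand-in for the unspecified classifier $\mathcal{M}$ in Algorithm 1 Step 1, so the resulting numbers will differ somewhat from those reported here.

```python
import numpy as np
from sklearn.cluster import KMeans

def test_statistic(R, K0, M):
    """T_{K0} = sigma_1(R_tilde) - (1 + sqrt(J/N)) for candidate K0."""
    N, J = R.shape
    labels = KMeans(n_clusters=K0, n_init=10, random_state=0).fit_predict(R)
    R_hat = np.empty((N, J))
    for k in range(K0):
        R_hat[labels == k] = R[labels == k].mean(axis=0)   # class-wise item means
    V_hat = np.clip(R_hat * (1 - R_hat / M), 1e-12, None)  # guard zero variances
    R_tilde = (R - R_hat) / np.sqrt(N * V_hat)
    return np.linalg.norm(R_tilde, 2) - (1 + np.sqrt(J / N))

# R would be the 1369 x 50 BFPT matrix with entries in {0, ..., 5}:
# stats = [test_statistic(R, K0, M=5) for K0 in range(1, 14)]
# ratios = [abs(stats[i - 1] / stats[i]) for i in range(1, len(stats))]
```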
[Figure 3: Test statistic $T_{K_0}$ (left) and ratio $r_{K_0} = |T_{K_0-1}/T_{K_0}|$ (right) versus the candidate number of latent classes $K_0$ for the BFPT dataset.]

The left panel of Figure 3 plots the test statistic $T_{K_0}$ for the BFPT data ($N = 1369$, $J = 50$). All values lie between 1 and 2.2, well above GoF-LCM's default threshold $\tau_N = N^{-1/5} \approx 0.24$. Consequently, GoF-LCM never encounters a $K_0$ satisfying $T_{K_0} < \tau_N$ and therefore runs to the maximum candidate $K_{\max} = \lfloor\sqrt{N/\log(N+J)}\rfloor \approx 13$, producing $\hat K = 13$ by default. Thus, GoF-LCM yields an estimate that is likely too large and provides little insight into the underlying structure.

The right panel displays the ratio $r_{K_0} = |T_{K_0-1}/T_{K_0}|$. It attains a clear peak at $K_0 = 2$ ($r_2 \approx 1.47$), while for $K_0 \ge 3$ the ratios fluctuate between 0.9 and 1.2 without further structure. This pattern indicates that the relative change in $T_{K_0}$ is largest when moving from $K_0 = 1$ to $K_0 = 2$, suggesting that a two-class model captures the dominant heterogeneity in the data. Such a profile is exactly what RGoF-LCM is designed to exploit. Theorem 5 requires the stopping threshold $\gamma_N$ to satisfy $\gamma_N \to \infty$ and $\gamma_N = o(\sqrt{N/\log J})$; any such sequence is admissible. The default $\gamma_N = \log N \approx 7.22$ exceeds $r_2$, so the formal sequential procedure does not stop at $K_0 = 2$. Nevertheless, the pronounced maximum provides valuable diagnostic information: it suggests that a two-class model offers the most appropriate description of the BFPT data, even though the default threshold is too conservative to trigger a stop.

This observation highlights a key difference between the two approaches. GoF-LCM relies on an absolute threshold and fails when model assumptions are violated. RGoF-LCM, by focusing on relative improvements, can reveal dominant structural transitions through the shape of $r_{K_0}$, even when all $T_{K_0}$ values are inflated. In Figure 3, this leads to the clear peak at $K_0 = 2$, a conclusion that would be missed by examining $T_{K_0}$ alone or by a mechanical application of the default threshold rule. From a broader perspective, these results illustrate the value of diagnostic tools sensitive to relative rather than absolute fit. GoF-LCM enjoys clean asymptotic theory under correct specification, but RGoF-LCM demonstrates greater empirical robustness in practice. For the BFPT data, the evidence supports $\hat K = 2$ as a reasonable description, a conclusion that emerges clearly from the ratio profile.

6. Conclusion

We develop a goodness-of-fit framework for determining the number of latent classes in ordinal categorical data. The framework introduces a test statistic exhibiting a sharp dichotomous behavior that leads to consistent sequential estimation. By transforming a challenging model selection problem into simple sequential tests, the proposed methods are both computationally efficient and theoretically principled. Simulation studies and real-data applications demonstrate their accuracy and reliability.

Several directions for future research emerge from this work. Relaxing the current modeling assumptions is a natural extension. The binomial response assumption could be generalized to accommodate other exponential-family distributions, such as Poisson for count data or multinomial for polytomous responses. The boundedness and separation conditions might be weakened through more refined concentration arguments. Extending the framework to richer latent structures presents another important direction. Mixed-membership formulations, such as the Grade-of-Membership (GoM) models of Woodbury et al. [28], allow individuals partial membership across multiple latent profiles. Degree-heterogeneous latent class models (Dh-LCM) of Lyu et al. [19] introduce additional individual-specific parameters to capture varying levels of activity. For these more complex models, a fundamental first step is to estimate the number of latent classes $K$, which remains largely unexplored and could build upon the sequential testing framework developed in this work, though adapting it would require a fundamental reconsideration of the residual concept. Hierarchical latent structures, where classes themselves are organized into higher-order groupings, pose a further challenge that may necessitate layered testing procedures. Finally, extending to dynamic settings, such as latent transition models for longitudinal data, would enable studying how class memberships evolve over time. Pursuing these directions promises to extend our framework to more complex data environments.

CRediT authorship contribution statement

Huan Qing is the sole author of this article.
Declaration of competing interest

The author declares no competing interests.

Data availability

Data and code will be made available on request.

Appendix A. Technical proofs

Appendix A.1. Proof of Lemma 1

Proof. For any fixed $(i,j)$, we have
\[ E[R^*(i,j)] = \frac{E[R(i,j)] - \mathcal{R}(i,j)}{\sqrt{N V(i,j)}} = 0. \]
The variance is $\mathrm{Var}(R^*(i,j)) = \frac{\mathrm{Var}(R(i,j))}{N V(i,j)}$. Since $\mathrm{Var}(R(i,j)) = V(i,j)$, we obtain $\mathrm{Var}(R^*(i,j)) = \frac{V(i,j)}{N V(i,j)} = \frac{1}{N}$.

By Assumption 1, we have $p_{ij} := \mathcal{R}(i,j)/M \in [\delta, 1-\delta]$. Then
\[ V(i,j) = \mathcal{R}(i,j)\Big(1 - \frac{\mathcal{R}(i,j)}{M}\Big) = M p_{ij}(1 - p_{ij}). \]
The function $g(p) = p(1-p)$ is continuous and concave on $[\delta, 1-\delta]$. Since $g(p)$ is symmetric about $p = 1/2$ and $\delta \le 1/2$, its minimum on $[\delta, 1-\delta]$ is attained at both endpoints, i.e., $g(p) \ge g(\delta) = \delta(1-\delta)$ for all $p \in [\delta, 1-\delta]$. Consequently, we have
\[ V(i,j) \ge M\delta(1-\delta) \quad \text{for all } i \in [N],\ j \in [J]. \]
Because $R(i,j)$ takes values in $\{0, 1, \ldots, M\}$, we have $|R(i,j) - \mathcal{R}(i,j)| \le M$. Hence, we get
\[ |R^*(i,j)| = \frac{|R(i,j) - \mathcal{R}(i,j)|}{\sqrt{N V(i,j)}} \le \frac{M}{\sqrt{N \cdot M\delta(1-\delta)}} = \sqrt{\frac{M}{N\delta(1-\delta)}}. \]
Hence, the maximum entrywise magnitude (denoted by $\sigma_*$ in matrix concentration inequalities) satisfies
\[ \sigma_* := \max_{i,j}\|R^*(i,j)\|_\infty \le \sqrt{\frac{M}{N\delta(1-\delta)}}. \]
For each row $i \in [N]$, we have $\sum_{j=1}^{J} E[(R^*(i,j))^2] = \sum_{j=1}^{J}\frac{1}{N} = \frac{J}{N}$, which gives
\[ \sigma_1 := \max_{i\in[N]}\sqrt{\sum_{j=1}^{J} E[(R^*(i,j))^2]} = \sqrt{\frac{J}{N}}. \]
For each column $j \in [J]$, we have $\sum_{i=1}^{N} E[(R^*(i,j))^2] = \sum_{i=1}^{N}\frac{1}{N} = 1$, which gives
\[ \sigma_2 := \max_{j\in[J]}\sqrt{\sum_{i=1}^{N} E[(R^*(i,j))^2]} = 1. \]
The following lemma is obtained from the statements after Corollary 3.12 of [3].

Lemma 3. Let $X$ be any $n_1 \times n_2$ random matrix with independent entries satisfying $E[X_{ij}] = 0$. Define
\[ \sigma_1 = \max_i\sqrt{\sum_j E[X_{ij}^2]}, \qquad \sigma_2 = \max_j\sqrt{\sum_i E[X_{ij}^2]}, \qquad \sigma_* = \max_{i,j}\|X_{ij}\|_\infty. \]
Then for any $0 < \eta \le \frac{1}{2}$, there exists a constant $C_\eta > 0$ such that for all $t \ge 0$,
\[ P\big(\|X\| \ge (1+\eta)(\sigma_1 + \sigma_2) + t\big) \le (n_1 + n_2)\exp\Big(-\frac{t^2}{C_\eta\sigma_*^2}\Big). \]

We apply Lemma 3 with $X = R^*$, $n_1 = N$, $n_2 = J$, and $\sigma_1, \sigma_2, \sigma_*$ as computed above. Substituting these values, for any $t \ge 0$ we obtain
\[ P\Big(\|R^*\| \ge (1+\eta)\Big(\sqrt{\frac{J}{N}} + 1\Big) + t\Big) \le (N+J)\exp\Big(-\frac{t^2\delta(1-\delta)N}{C_\eta M}\Big). \]
Let $\gamma > 2$ be a fixed constant (e.g., $\gamma = 3$). Choose $t = \sqrt{\frac{\gamma C_\eta M\log(N+J)}{\delta(1-\delta)N}}$. Substituting this $t$ into the right-hand side of the above inequality gives
\[ (N+J)\exp\Big(-\frac{\gamma C_\eta M\log(N+J)}{\delta(1-\delta)N}\cdot\frac{\delta(1-\delta)N}{C_\eta M}\Big) = (N+J)^{1-\gamma}. \]
Since $\gamma > 2$, $(N+J)^{1-\gamma} = o(1)$ as $N \to \infty$. Therefore, with probability at least $1 - o(1)$, we have
\[ \|R^*\| \le (1+\eta)\Big(\sqrt{\frac{J}{N}} + 1\Big) + \sqrt{\frac{\gamma C_\eta M\log(N+J)}{\delta(1-\delta)N}}. \]
Because the term $\sqrt{\frac{\gamma C_\eta M\log(N+J)}{\delta(1-\delta)N}} = O\big(\sqrt{\frac{\log N}{N}}\big) = o(1)$ and the parameter $\eta$ can be chosen arbitrarily small, we obtain that with probability approaching 1,
\[ \|R^*\| \le 1 + \sqrt{\frac{J}{N}} + o(1). \]
Thus, for any fixed $\epsilon > 0$, we have
\[ \lim_{N\to\infty} P\Big(\|R^*\| \le 1 + \sqrt{\frac{J}{N}} + \epsilon\Big) = 1. \]
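Lemma 1 is straightforward to probe numerically. The sketch below is ours and purely illustrative (the two-class setup and all parameter values are arbitrary): it draws binomial responses, forms the ideal normalized residual $R^*$ from the true $\mathcal{R}$ and $V$, and compares $\|R^*\|$ with $1 + \sqrt{J/N}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, M, K, delta = 2000, 100, 4, 2, 0.2

# Item parameters with entries in [delta*M, (1-delta)*M], as in Assumption 1
Theta = rng.uniform(delta * M, (1 - delta) * M, size=(J, K))
labels = rng.integers(0, K, size=N)

calR = Theta[:, labels].T                 # expected responses, N x J
R = rng.binomial(M, calR / M)             # observed binomial responses
V = calR * (1 - calR / M)                 # true entrywise variances

R_star = (R - calR) / np.sqrt(N * V)      # ideal normalized residual
print(np.linalg.norm(R_star, 2), 1 + np.sqrt(J / N))
# the two printed values should be close, with the first not exceeding
# the second by more than a vanishing margin as N grows
```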
Appendix A.2. Proof of Lemma 2

Proof. We prove this lemma via five parts.

Part 1: A high-probability event. By Lemma 4 and the fact that a finite intersection of events with probability tending to 1 still has probability tending to 1, there exists an event $F_N$ with $P(F_N) \to 1$ such that on $F_N$ the following hold simultaneously:
(i) $\hat Z = Z\Pi$;
(ii) $\sum_{k=1}^{K}\sum_{j=1}^{J}\big(\hat\Theta(j,\pi(k)) - \Theta(j,k)\big)^2 \le \frac{C_1 JK^2}{N}$;
(iii) $\max_{i\in[N], j\in[J]}\big|\hat R(i,j) - \mathcal{R}(i,j)\big| \le C_2\sqrt{\frac{K\log(JK)}{N}}$;
(iv) $\hat V(i,j) \ge v_{\min}$ for all $i, j$;
(v) $V(i,j) \ge \delta(1-\delta)M$ for all $i, j$.
(Here $V(i,j) = \mathcal{R}(i,j)(1 - \mathcal{R}(i,j)/M)$ and $\hat V(i,j) = \hat R(i,j)(1 - \hat R(i,j)/M)$.) All subsequent estimates are performed pointwise on $F_N$ and are therefore deterministic on this event.

Part 2: Decomposition of the difference matrix. Set $\Delta := \tilde R - R^* \in \mathbb{R}^{N\times J}$. On $F_N$, $\hat V(i,j) > 0$ by (iv), so the definition in Equation (2) reduces to the standard case with a positive denominator. For a fixed pair $(i,j)$, write
\[ A := R(i,j), \quad p := \mathcal{R}(i,j), \quad \hat p := \hat R(i,j), \quad v := V(i,j), \quad \hat v := \hat V(i,j). \]
Then, we have
\[ \Delta(i,j) = \frac{A - \hat p}{\sqrt{N\hat v}} - \frac{A - p}{\sqrt{Nv}} = (A - p)\Big(\frac{1}{\sqrt{N\hat v}} - \frac{1}{\sqrt{Nv}}\Big) + \frac{p - \hat p}{\sqrt{N\hat v}}. \]
Define two $N \times J$ matrices $E_1, E_2$ entrywise by
\[ E_1(i,j) := (A - p)\Big(\frac{1}{\sqrt{N\hat v}} - \frac{1}{\sqrt{Nv}}\Big), \qquad E_2(i,j) := \frac{p - \hat p}{\sqrt{N\hat v}}. \]
Thus $\Delta = E_1 + E_2$ on $F_N$.

Part 3: Spectral norm of $E_2$. Because $\hat p = \hat R(i,j) = \hat\Theta(j,\pi(\ell(i)))$ is constant on each true class, $E_2(i,j)$ depends only on the class of subject $i$ and the item $j$. Let $\mathcal{C}_k := \{i : \ell(i) = k\}$ for $k \in [K]$. On $F_N$, for $i \in \mathcal{C}_k$, we have
\[ \hat V(i,j) = \hat\Theta(j,\pi(k))\Big(1 - \frac{\hat\Theta(j,\pi(k))}{M}\Big) =: \hat v_k(j), \]
which is independent of $i \in \mathcal{C}_k$ and, by (iv), satisfies $\hat v_k(j) \ge v_{\min}$. Hence for $i \in \mathcal{C}_k$, we have
\[ E_2(i,j) = \frac{\Theta(j,k) - \hat\Theta(j,\pi(k))}{\sqrt{N\hat v_k(j)}}. \]
Construct $V \in \mathbb{R}^{J\times K}$ by $V(j,k) := \frac{\Theta(j,k) - \hat\Theta(j,\pi(k))}{\sqrt{N\hat v_k(j)}}$. Since $Z(i,k) = 1\{i\in\mathcal{C}_k\}$, we have $(ZV^\top)_{ij} = V(j,\ell(i)) = E_2(i,j)$. Thus, we get $E_2 = ZV^\top$ on $F_N$. $Z^\top Z = \mathrm{diag}(N_1,\ldots,N_K)$ gives $\|Z\| = \sqrt{\max_k N_k} =: \sqrt{N_{\max}}$. Assumption 2 gives $c_0 N/K \le N_k \le N/(c_0 K)$, so $N_{\max} \le N/(c_0 K)$ and $\|Z\| \le \sqrt{\frac{N}{c_0 K}}$. Using $\hat v_k(j) \ge v_{\min}$ gives
\[ \|V\|_F^2 = \sum_{k=1}^{K}\sum_{j=1}^{J}\frac{\big(\Theta(j,k) - \hat\Theta(j,\pi(k))\big)^2}{N\hat v_k(j)} \le \frac{1}{Nv_{\min}}\sum_{k=1}^{K}\sum_{j=1}^{J}\big(\Theta(j,k) - \hat\Theta(j,\pi(k))\big)^2. \]
On $F_N$, (ii) bounds the double sum by $C_1 JK^2/N$. Thus, we have
\[ \|V\|_F^2 \le \frac{1}{Nv_{\min}}\cdot\frac{C_1 JK^2}{N} = \frac{C_1 v_{\min}^{-1}JK^2}{N^2}, \qquad \|V\|_F \le \sqrt{C_1 v_{\min}^{-1}}\,\frac{K\sqrt{J}}{N}. \]
Submultiplicativity of the spectral norm gives
\[ \|E_2\| \le \|Z\|\,\|V\| \le \|Z\|\,\|V\|_F \le \sqrt{\frac{N}{c_0 K}}\,\sqrt{C_1 v_{\min}^{-1}}\,\frac{K\sqrt{J}}{N} = \underbrace{\sqrt{\frac{C_1}{c_0 v_{\min}}}}_{=:\,C_{E_2}}\sqrt{\frac{JK}{N}}. \]
By Assumption 4, we have $\sqrt{JK/N} = o(1)$. Consequently, for any $\varepsilon > 0$ there exists $N_0$ such that for all $N \ge N_0$, on $F_N$ we have $\|E_2\| < \varepsilon$, which gives $\|E_2\| = o(1)$ on $F_N$.

Part 4: Spectral norm of $E_1$. Set $f(x) = 1/\sqrt{x}$ for $x > 0$. We have $f'(x) = -\frac{1}{2}x^{-3/2}$ and $|f'(x)| = \frac{1}{2}x^{-3/2}$. Fix $(i,j)$. On $F_N$, $\hat v \ge v_{\min}$ by (iv). From (v), $v \ge \delta(1-\delta)M$. Because $\delta \le \frac{1}{2}$, one has $\delta(1-\delta) \ge \delta^2/4$, hence $v \ge v_{\min}$ as well. Thus $\hat v, v \in [v_{\min}, \infty)$. Apply the mean value theorem to $f$ on the interval between $\hat v$ and $v$: there exists $\xi$ strictly between $\hat v$ and $v$ such that
\[ \frac{1}{\sqrt{\hat v}} - \frac{1}{\sqrt{v}} = f'(\xi)(\hat v - v). \]
Since $\xi \ge v_{\min}$, we have
\[ \Big|\frac{1}{\sqrt{\hat v}} - \frac{1}{\sqrt{v}}\Big| = |f'(\xi)|\,|\hat v - v| \le \frac{1}{2}v_{\min}^{-3/2}|\hat v - v|. \]
Define $h(x) = x(1 - x/M)$ for $x \in [0, M]$. Then, we have $\hat v = h(\hat p)$ and $v = h(p)$. Since $|h'(x)| = |1 - 2x/M| \le 1$ for all $x \in [0, M]$, the mean value theorem yields
\[ |\hat v - v| = |h(\hat p) - h(p)| \le \max_{z\in[0,M]}|h'(z)|\,|\hat p - p| \le |\hat p - p|. \]
Thus, we have
\[ \Big|\frac{1}{\sqrt{\hat v}} - \frac{1}{\sqrt{v}}\Big| \le \frac{1}{2}v_{\min}^{-3/2}|\hat p - p|. \]
Because $R(i,j) \in \{0,1,\ldots,M\}$ and $\mathcal{R}(i,j) \in [0,M]$, we have $|A - p| \le M$. Therefore, we get
\[ |E_1(i,j)| \le |A - p|\,\frac{1}{\sqrt{N}}\Big|\frac{1}{\sqrt{\hat v}} - \frac{1}{\sqrt{v}}\Big| \le \frac{M}{2\sqrt{N}v_{\min}^{3/2}}|\hat p - p|. \]
On $F_N$, (iii) gives $\max_{i,j}|\hat p - p| \le C_2\sqrt{\frac{K\log(JK)}{N}}$. Hence, for every $(i,j)$, we have
\[ |E_1(i,j)| \le \underbrace{\frac{MC_2}{2v_{\min}^{3/2}}}_{=:\,C_{E_1}}\sqrt{\frac{K\log(JK)}{N^2}}. \]
For any matrix, $\|E_1\| \le \|E_1\|_F$ and $\|E_1\|_F = \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{J}E_1(i,j)^2} \le \sqrt{NJ}\max_{i,j}|E_1(i,j)|$. Consequently, on $F_N$, we have
\[ \|E_1\| \le \sqrt{NJ}\,C_{E_1}\sqrt{\frac{K\log(JK)}{N^2}} = C_{E_1}\sqrt{\frac{JK\log(JK)}{N}}. \]
Assumption 4 directly gives $JK\log(JK)/N \to 0$, so $\sqrt{JK\log(JK)/N} = o(1)$. Thus for any $\varepsilon > 0$, there exists $N_0$ such that for all $N \ge N_0$, on $F_N$ we have $\|E_1\| < \varepsilon$.

Part 5: Conversion to $o_P(1)$. On the event $F_N$, the triangle inequality yields
\[ \|\Delta\| \le \|E_1\| + \|E_2\| \le C_{E_1}\sqrt{\frac{JK\log(JK)}{N}} + C_{E_2}\sqrt{\frac{JK}{N}}. \]
Both terms are $o(1)$. Therefore, for every $\varepsilon > 0$ there exists $N_0$ (depending only on $\varepsilon$ and the model constants) such that for all $N \ge N_0$ and every realization belonging to $F_N$, $\|\Delta\| < \varepsilon$. Recall $P(F_N) \to 1$. For an arbitrary $\varepsilon > 0$, we have
\[ P(\|\Delta\| > \varepsilon) = P(\{\|\Delta\| > \varepsilon\}\cap F_N) + P(\{\|\Delta\| > \varepsilon\}\cap F_N^c) \le P(\{\|\Delta\| > \varepsilon\}\cap F_N) + P(F_N^c). \]
From the previous results, we know that for sufficiently large $N$ the set $\{\|\Delta\| > \varepsilon\}\cap F_N$ is empty, hence its probability is 0. Thus, we have
\[ \limsup_{N\to\infty} P(\|\Delta\| > \varepsilon) \le \lim_{N\to\infty} P(F_N^c) = 0, \]
and since $\varepsilon > 0$ was arbitrary, $\|\Delta\| \xrightarrow{P} 0$, which is exactly Equation (3).

Appendix A.3. Proof of Theorem 1

Proof. For any two real matrices of identical dimensions, Weyl's inequality for singular values asserts $|\sigma_1(A) - \sigma_1(B)| \le \|A - B\|$. Applying it with $A = \tilde R$ and $B = R^*$ (both are $N \times J$ matrices) yields the deterministic bound
\[ \sigma_1(\tilde R) \le \sigma_1(R^*) + \|\tilde R - R^*\|. \]
Fix an arbitrary $\epsilon > 0$. We have
\[ \big\{\sigma_1(\tilde R) > 1 + \sqrt{J/N} + \epsilon\big\} \subseteq \big\{\sigma_1(R^*) > 1 + \sqrt{J/N} + \epsilon/2\big\} \cup \big\{\|\tilde R - R^*\| > \epsilon/2\big\}, \]
which yields
\[ P\big(\sigma_1(\tilde R) > 1 + \sqrt{J/N} + \epsilon\big) \le P\big(\sigma_1(R^*) > 1 + \sqrt{J/N} + \epsilon/2\big) + P\big(\|\tilde R - R^*\| > \epsilon/2\big). \]
By Lemmas 1 and 2, both probabilities on the right-hand side converge to zero as $N \to \infty$. Consequently, we get
\[ \lim_{N\to\infty} P\big(\sigma_1(\tilde R) > 1 + \sqrt{J/N} + \epsilon\big) = 0. \]
Recall the definition of $T_{K_0}$: $T_{K_0} = \sigma_1(\tilde R) - \big(1 + \sqrt{J/N}\big)$. So for any $\epsilon > 0$, we have $\{T_{K_0} > \epsilon\} = \{\sigma_1(\tilde R) > 1 + \sqrt{J/N} + \epsilon\}$, which immediately implies $\lim_{N\to\infty} P(T_{K_0} > \epsilon) = 0$. Hence, we have $P(T_{K_0} < \epsilon) \to 1$ as $N \to \infty$. This completes the proof of Theorem 1.
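Theorem 1 and the forthcoming Theorem 2 together predict a sharp dichotomy, which is easy to observe in simulation. The sketch below is ours and illustrative only: it uses oracle class labels in place of the classifier $\mathcal{M}$, so it isolates the behavior of the statistic itself; with the correctly specified $K$ the printed value is near zero, while merging two classes (an under-fitted candidate) inflates it.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, M, K, delta = 1500, 60, 4, 3, 0.2

Theta = rng.uniform(delta * M, (1 - delta) * M, size=(J, K))
labels = rng.integers(0, K, size=N)
R = rng.binomial(M, Theta[:, labels].T / M)

def T_stat(R, labels, M):
    """T = sigma_1(R_tilde) - (1 + sqrt(J/N)) for given class labels."""
    N, J = R.shape
    R_hat = np.empty((N, J))
    for k in np.unique(labels):
        R_hat[labels == k] = R[labels == k].mean(axis=0)
    V_hat = np.clip(R_hat * (1 - R_hat / M), 1e-12, None)
    R_tilde = (R - R_hat) / np.sqrt(N * V_hat)
    return np.linalg.norm(R_tilde, 2) - (1 + np.sqrt(J / N))

print(T_stat(R, labels, M))                        # K0 = K: close to 0
merged = np.where(labels == K - 1, K - 2, labels)  # fuse two true classes
print(T_stat(R, merged, M))                        # K0 = K-1: bounded away from 0
```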
Appendix A.4. Proof of Theorem 2

Proof of Theorem 2. We work conditionally on the output of the classifier $\mathcal{M}$. Lemma 5 holds deterministically for every possible $\hat Z$ whenever $K_0 < K$.

Step 1. Deterministic objects from Lemma 5. Because $K_0 < K$, Lemma 5 guarantees the existence of two distinct true classes $k_1, k_2 \in [K]$, subsets $S_1 \subseteq \mathcal{C}_{k_1}$, $S_2 \subseteq \mathcal{C}_{k_2}$, $S := S_1\cup S_2 \subseteq [N]$, $T \subseteq [J]$, and unit vectors $u \in \mathbb{R}^{|S|}$, $v \in \mathbb{R}^{|T|}$ with the specific forms constructed in the lemma (see its proof) such that the following deterministic properties hold:
\[ |S_1|, |S_2| \ge \frac{c_0 N}{KK_0}, \qquad |S| = |S_1| + |S_2| \le \frac{2N}{c_0 K}, \qquad |T| \ge c_1 J, \]
\[ u_i = \begin{cases} \alpha/\sqrt{|S_1|}, & i \in S_1, \\ -\beta/\sqrt{|S_2|}, & i \in S_2, \end{cases} \qquad v_j = \frac{s_j}{\sqrt{|T|}}, \quad s_j = \mathrm{sign}\big(\Theta(j,k_1) - \Theta(j,k_2)\big), \]
where $\alpha = \sqrt{|S_2|}/\sqrt{|S_1|+|S_2|}$ and $\beta = \sqrt{|S_1|}/\sqrt{|S_1|+|S_2|}$. Moreover, from the construction in Lemma 5, we have the deterministic lower bound
\[ u^\top(\mathcal{R} - \hat R)_{S,T}\,v = \frac{\gamma_0}{\sqrt{|T|}}\sum_{j\in T}|\Theta(j,k_1) - \Theta(j,k_2)| \ge \frac{c\,\zeta\sqrt{NJ}}{\sqrt{K}K_0}, \quad (A.1) \]
where $\gamma_0 = \sqrt{|S_1||S_2|/(|S_1|+|S_2|)}$. All subjects in $S$ are assigned by $\hat Z$ to the same estimated class, say $\kappa \in [K_0]$; consequently, for every $j \in T$ the fitted value $\hat R(i,j) = \hat\Theta(j,\kappa)$ is constant on $S\times\{j\}$.

Step 2. Scaling factors. For $j \in T$, define the scaling factors
\[ \gamma_j := \frac{1}{\sqrt{N\hat V(i,j)}}, \qquad i \in S, \]
where $\hat V(i,j) = \hat R(i,j)(1 - \hat R(i,j)/M)$ and the value does not depend on the particular $i \in S$. Set $D := \mathrm{diag}(\gamma_j)_{j\in T} \in \mathbb{R}^{|T|\times|T|}$. Then the submatrix of the practical normalized residual matrix satisfies
\[ \tilde R_{S,T} = (R - \hat R)_{S,T}\,D = (W + \mathbf{M})D, \]
where $W := (R - \mathcal{R})_{S,T}$ and $\mathbf{M} := (\mathcal{R} - \hat R)_{S,T}$.

Step 3. Three high-probability events. We now construct three events whose probabilities tend to 1:

• Event $E_A$ (no zero denominator). Define $E_A := \{\gamma_j > 0 \text{ for all } j \in T\}$. Fix a column $j \in T$. Recall that for any $i \in S$ we have $\hat R(i,j) = \hat\Theta(j,\kappa)$ (a constant over $S$) and $\hat V(i,j) = \hat R(i,j)(1 - \hat R(i,j)/M)$. Since the function $h(x) = x(1 - x/M)$ satisfies $h(x) = 0$ iff $x \in \{0, M\}$, we obtain
\[ \gamma_j = \frac{1}{\sqrt{N\hat V(i,j)}} > 0 \iff \hat\Theta(j,\kappa) \notin \{0, M\}. \]
Because $\hat\Theta(j,\kappa) = \frac{1}{|S|}\sum_{i\in S} R(i,j)$ (the sample mean of the $|S|$ independent responses), the event $\{\hat\Theta(j,\kappa) = 0\}$ occurs only if every $R(i,j) = 0$; likewise, $\{\hat\Theta(j,\kappa) = M\}$ occurs only if every $R(i,j) = M$. Hence $\{\gamma_j = 0\}$ (equivalently $\hat\Theta(j,\kappa) \in \{0, M\}$) is contained in the union of the two events "all responses to item $j$ on the set $S$ are 0" and "all are $M$". We now bound the probability of each of these two extreme events. Under Assumption 1, for any true class $k \in [K]$, we have $\Theta(j,k)/M \in [\delta, 1-\delta]$. Therefore, for any individual $i$ (regardless of its true class), we have
\[ P(R(i,j) = 0) = \Big(1 - \frac{\Theta(j,\ell(i))}{M}\Big)^M \le (1-\delta)^M, \qquad P(R(i,j) = M) = \Big(\frac{\Theta(j,\ell(i))}{M}\Big)^M \le (1-\delta)^M. \]
The responses $\{R(i,j) : i \in S\}$ are mutually independent. Consequently, we have
\[ P(\text{all } R(i,j) = 0,\ i \in S) = \prod_{i\in S} P(R(i,j) = 0) \le (1-\delta)^{M|S|}, \]
and likewise $P(\text{all } R(i,j) = M,\ i \in S) \le (1-\delta)^{M|S|}$. Applying the union bound gives, conditionally on the set $S$,
\[ P(\gamma_j = 0 \mid S) \le 2(1-\delta)^{M|S|}. \]
From Lemma 5, whenever $K_0 < K$, we have the deterministic lower bound $|S| \ge \frac{2c_0 N}{KK_0}$ (it holds for every realization of $\hat Z$). Thus, we have $(1-\delta)^{M|S|} \le (1-\delta)^{2c_0 MN/(KK_0)}$ almost surely.
Taking expectations and using the fact that the bound is non-random yields
\[ P(\gamma_j = 0) = E\big[P(\gamma_j = 0 \mid S)\big] \le 2(1-\delta)^{\frac{2c_0 MN}{KK_0}}. \]
Finally, applying the union bound over the columns in $T$ (note that $|T| \le J$) yields
\[ P(E_A^c) \le \sum_{j\in T} P(\gamma_j = 0) \le 2J(1-\delta)^{2c_0 MN/(KK_0)}. \quad (A.2) \]
We now prove that the right-hand side of Equation (A.2) tends to 0. Set $\varrho := -\ln(1-\delta) > 0$ (since $0 < \delta \le 1/2$). Then $(1-\delta)^x = e^{-\varrho x}$ and Equation (A.2) becomes
\[ P(E_A^c) \le 2J\exp\Big(-\varrho\cdot\frac{2c_0 MN}{KK_0}\Big). \quad (A.3) \]
Because $K_0 < K$, we have $KK_0 \le K^2$ and consequently $N/(KK_0) \ge N/K^2$. Hence, we have
\[ P(E_A^c) \le 2J\exp\Big(-\underbrace{2\varrho c_0 M}_{=:\,c'}\cdot\frac{N}{K^2}\Big). \quad (A.4) \]
By Assumption 4, for any positive constant $\epsilon$ there exists an integer $N_0(\epsilon)$ such that for all $N \ge N_0(\epsilon)$,
\[ \frac{K^2\log(N+J)}{N} \le \frac{1}{\epsilon}, \quad \text{or equivalently} \quad \frac{N}{K^2} \ge \epsilon\log(N+J). \quad (A.5) \]
We now fix a specific value of $\epsilon$: choose $\epsilon = 2/c'$. Then for all sufficiently large $N$ (say $N \ge N_0(2/c')$), we have
\[ \frac{N}{K^2} \ge \frac{2}{c'}\log(N+J). \quad (A.6) \]
Substituting Equation (A.6) into Equation (A.4) gives
\[ P(E_A^c) \le 2J\exp\Big(-c'\cdot\frac{2}{c'}\log(N+J)\Big) = 2J(N+J)^{-2} \xrightarrow[N\to\infty]{} 0, \]
which gives $P(E_A) \to 1$.

• Event $E_B$ (the estimated class mean stays away from the boundaries). Let $\kappa$ be the estimated class containing $S$. By Lemma 5, $n_\kappa := |S| \ge 2c_0 N/(KK_0)$ deterministically. For any $j \in T$, $\hat\Theta(j,\kappa) = \frac{1}{n_\kappa}\sum_{i\in S} R(i,j)$ is the average of $n_\kappa$ independent Binomial$(M, \cdot)$ variables, each bounded in $[0, M]$. By Assumption 1, $E[R(i,j)] = \Theta(j,\ell(i)) \in [\delta M, (1-\delta)M]$ for all $i \in S$; hence, $E[\hat\Theta(j,\kappa)] \in [\delta M, (1-\delta)M]$. Hoeffding's inequality gives, for $t = \delta M/2$,
\[ P\big(|\hat\Theta(j,\kappa) - E[\hat\Theta(j,\kappa)]| \ge \delta M/2\big) \le 2\exp\Big(-\frac{2n_\kappa(\delta M/2)^2}{M^2}\Big) = 2\exp\Big(-\frac{n_\kappa\delta^2}{2}\Big). \]
Since $n_\kappa \ge 2c_0 N/(KK_0) \ge 2c_0 N/K^2$ (because $K_0 < K$) and the exponential function is decreasing, the right-hand side is bounded by $2\exp(-c_0\delta^2 N/K^2)$. If the deviation is less than $\delta M/2$, then $\hat\Theta(j,\kappa) \in [\delta M/2, (1-\delta/2)M]$ because $E[\hat\Theta(j,\kappa)] \in [\delta M, (1-\delta)M]$. Define
\[ E_B := \big\{\hat\Theta(j,\kappa) \in [\delta M/2, (1-\delta/2)M] \text{ for all } j \in T\big\}. \]
The event $\{\forall j : |\hat\Theta(j,\kappa) - E[\hat\Theta(j,\kappa)]| < \delta M/2\}$ is contained in $E_B$. Therefore, we have
\[ P(E_B^c) \le P\big(\exists j : |\hat\Theta(j,\kappa) - E[\hat\Theta(j,\kappa)]| \ge \delta M/2\big) \le |T|\cdot 2\exp\Big(-\frac{c_0\delta^2 N}{K^2}\Big) \le 2J\exp\Big(-\frac{c_0\delta^2 N}{K^2}\Big). \]
By Assumption 4, we have $N/K^2 \gg \log N$ and $J = o(N)$; hence, for all sufficiently large $N$, $J \le N$, and therefore
\[ 2J\exp\Big(-\frac{c_0\delta^2 N}{K^2}\Big) \le 2N\exp\Big(-\frac{c_0\delta^2 N}{K^2}\Big) \xrightarrow[N\to\infty]{} 0. \]
Consequently, we obtain $P(E_B^c) \to 0$ and $P(E_B) \to 1$.

• Event $E_E$ (spectral norm bound for the noise matrix). Recall $X = R - \mathcal{R} \in \mathbb{R}^{N\times J}$. By Lemma 6, we have $\|X\| = O_P(\sqrt{N} + \sqrt{J})$. Assumption 4 implies $J/N \to 0$; hence, we have $\|X\| = O_P(\sqrt{N})$. More concretely, from the proof of Lemma 6, we can take the constant $C_X := 2.5\sqrt{M}$ such that $P(\|X\| \ge C_X\sqrt{N}) \to 0$, and consequently $P(\|X\| \le C_X\sqrt{N}) \to 1$. Define $E_E := \{\|X\| \le C_X\sqrt{N}\}$. Then $P(E_E) \to 1$.

Step 4. Behaviour on the intersection $E_C := E_A \cap E_B \cap E_E$. Because each of $E_A$, $E_B$, $E_E$ has probability tending to 1, $P(E_C) \to 1$.
On the event $E_C$ we have the following deterministic bounds.

(i) Lower bound for $\gamma_j$. On the event $E_A$, $\gamma_j > 0$. Since $0 \le \hat R(i,j) \le M$ always implies $\hat V(i,j) \le M/4$, we have
\[ \gamma_j = \frac{1}{\sqrt{N\hat V(i,j)}} \ge \frac{1}{\sqrt{N\cdot M/4}} = \frac{2}{\sqrt{NM}} \qquad (\forall j \in T). \]

(ii) Upper bound for $\|D\|$. On the event $E_B$, $\hat\Theta(j,\kappa) \in [\delta M/2, (1-\delta/2)M]$ for all $j \in T$. On this interval the function $h(x) = x(1 - x/M)$ satisfies $h(x) \ge h(\delta M/2) = \delta(2-\delta)M/4 \ge \delta^2 M/4$ (since $\delta \le 1/2$). Hence, we have $\hat V(i,j) \ge \delta^2 M/4$ and consequently
\[ \gamma_j \le \frac{1}{\sqrt{N\cdot\delta^2 M/4}} = \frac{2}{\delta\sqrt{NM}}, \qquad \|D\| = \max_{j\in T}\gamma_j \le \frac{2}{\delta\sqrt{NM}}. \]

(iii) Upper bound for $\|WD\|$. Because $W$ is a submatrix of $X$, $\|W\| \le \|X\|$. On the event $E_E$, $\|X\| \le C_X\sqrt{N} = 2.5\sqrt{MN}$. Thus on $E_B \cap E_E$, we have
\[ \|WD\| \le \|W\|\,\|D\| \le 2.5\sqrt{MN}\cdot\frac{2}{\delta\sqrt{NM}} = \frac{5}{\delta} = C_{\mathrm{noise}}. \]

(iv) Lower bound for $\|\mathbf{M}D\|$. As shown in the proof of Lemma 5, for each column $j \in T$ we have the deterministic identity
\[ u^\top\mathbf{M}_{:,j}\,v_j = \frac{\gamma_0}{\sqrt{|T|}}|\Theta(j,k_1) - \Theta(j,k_2)| \ge 0, \]
where $\mathbf{M}_{:,j}$ denotes the $j$-th column of $\mathbf{M}$. Using (i) and Equation (A.1) gives
\[ u^\top(\mathbf{M}D)v = \sum_{j\in T}\gamma_j(u^\top\mathbf{M}_{:,j})v_j \ge \big(\min_{j\in T}\gamma_j\big)\,u^\top\mathbf{M}v \ge \frac{2}{\sqrt{NM}}\cdot\frac{c\,\zeta\sqrt{NJ}}{\sqrt{K}K_0} = \frac{C_{\mathrm{signal}}\sqrt{J}}{\sqrt{K}K_0}. \]
Since $\|u\|_2 = \|v\|_2 = 1$, the variational characterization of the spectral norm yields
\[ \|\mathbf{M}D\| \ge |u^\top(\mathbf{M}D)v| \ge \frac{C_{\mathrm{signal}}\sqrt{J}}{\sqrt{K}K_0}. \]

Step 5. Lower bound for $\|\tilde R_{S,T}\|$ and for $\sigma_1(\tilde R)$. On the event $E_C$, the triangle inequality gives
\[ \|\tilde R_{S,T}\| = \|(\mathbf{M} + W)D\| \ge \|\mathbf{M}D\| - \|WD\| \ge \frac{C_{\mathrm{signal}}\sqrt{J}}{\sqrt{K}K_0} - C_{\mathrm{noise}}. \]
Because $\tilde R$ is an $N \times J$ matrix, $\sigma_1(\tilde R) = \|\tilde R\| \ge \|\tilde R_{S,T}\|$. Hence, on the event $E_C$, we have
\[ \sigma_1(\tilde R) \ge L_N, \qquad L_N := \frac{C_{\mathrm{signal}}\sqrt{J}}{\sqrt{K}K_0} - C_{\mathrm{noise}}. \]

Step 6. Positive constant lower bound for $T_{K_0}$. Assumption 6 asserts that for all sufficiently large $N$, $L_N \ge 1 + 3\eta_0$. Assumption 4 guarantees $\sqrt{J/N} \to 0$; therefore, there exists $N_0$ such that for all $N \ge N_0$, $\sqrt{J/N} \le \eta_0$. Now for any $N \ge N_0$ and on the event $E_C$, we have
\[ T_{K_0} = \sigma_1(\tilde R) - \Big(1 + \sqrt{\frac{J}{N}}\Big) \ge L_N - \Big(1 + \sqrt{\frac{J}{N}}\Big) \ge (1 + 3\eta_0) - (1 + \eta_0) = 2\eta_0. \]
Thus, we have
\[ P(T_{K_0} > 2\eta_0) \ge P(E_C) \to 1 \quad \text{as } N \to \infty. \]

Step 7. Implication for the testing procedure. If Algorithm 1 uses a threshold sequence $\tau_N$ with $\tau_N \to 0$ (e.g., $\tau_N = N^{-1/5}$), then for all sufficiently large $N$ we have $\tau_N < 2\eta_0$. Consequently,
\[ P(T_{K_0} > \tau_N) \ge P(T_{K_0} > 2\eta_0) \to 1. \]
Hence, under the alternative $K_0 < K$ and Assumption 6, the test statistic exceeds the threshold with probability tending to one, leading to a correct rejection of $H_0$. This completes the proof of Theorem 2.
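The exceptional probabilities driving this argument decay very quickly once $N/K^2$ is large. A quick numeric illustration of the three tail bounds (ours; all values, including the stand-in for $c_E$, are purely illustrative):

```python
import numpy as np

# Decay of 2*J*exp(-c_A*N/K^2), 2*J*exp(-c_B*N/K^2), (N+J)*exp(-c_E*N)
J, K, M, delta, c0 = 60, 4, 4, 0.2, 0.5
c_A = 2 * c0 * M * (-np.log(1 - delta))   # as defined in the proof
c_B = c0 * delta ** 2
c_E = 0.1                                  # stand-in; the proof only needs c_E > 0

for N in (2_000, 20_000, 200_000):
    print(N,
          2 * J * np.exp(-c_A * N / K**2),
          2 * J * np.exp(-c_B * N / K**2),
          (N + J) * np.exp(-c_E * N))
# the bounds are asymptotic: they become informative only once N/K^2 is large
```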
Appendix A.5. Proof of Theorem 3

Proof. The proof is organized into four parts: (I) decomposition of the error probability; (II) control of $P(G_C)$, the probability of not stopping at the true $K$; (III) control of $P(G_B)$, the probability of stopping at an under-specified $K_0$; (IV) asymptotic analysis under Assumption 4.

Part 1: Events and error decomposition. Define the events
\[ G_A := \{\hat K = K\}, \qquad G_B := \{\exists K_0 < K : T_{K_0} < \tau_N\}, \qquad G_C := \{T_K \ge \tau_N\}. \]
If neither $G_B$ nor $G_C$ occurs, then the algorithm does not stop at any $K_0 < K$ (because $T_{K_0} \ge \tau_N$ for all such $K_0$) and it does stop at $K_0 = K$ (since $T_K < \tau_N$). Hence $G_A^c \subseteq G_B \cup G_C$ and consequently
\[ P(G_A^c) \le P(G_B) + P(G_C). \quad (A.7) \]
Thus, it suffices to prove $P(G_B) \to 0$ and $P(G_C) \to 0$.

Part 2: Control of $P(G_C)$ (correct specification). When $K_0 = K$, $\tilde R$ is the practical normalized residual matrix and $R^*$ its ideal counterpart. By Weyl's inequality, we have $\sigma_1(\tilde R) \le \sigma_1(R^*) + \|\tilde R - R^*\|$, and therefore
\[ \{T_K \ge \tau_N\} = \big\{\sigma_1(\tilde R) \ge 1 + \sqrt{J/N} + \tau_N\big\} \subseteq \big\{\sigma_1(R^*) \ge 1 + \sqrt{J/N} + \tau_N/2\big\} \cup \big\{\|\tilde R - R^*\| \ge \tau_N/2\big\}. \]

Bound for $\sigma_1(R^*)$. From the proof of Lemma 1, we extract the following explicit rate. Set
\[ t_N = \sqrt{\frac{\gamma C_\eta M\log(N+J)}{\delta(1-\delta)N}}, \]
where $\gamma > 2$ is fixed and $C_\eta$ is the constant from Lemma 3. The proof of Lemma 1 shows that
\[ P\Big(\sigma_1(R^*) \ge (1+\eta)\Big(1 + \sqrt{\frac{J}{N}}\Big) + t_N\Big) \le (N+J)^{1-\gamma}. \]
Since $\gamma > 2$, the right-hand side tends to 0. Letting $\eta$ be close to zero, we get
\[ P\Big(\sigma_1(R^*) > 1 + \sqrt{\frac{J}{N}} + C_R\sqrt{\frac{\log(N+J)}{N}}\Big) \to 0. \]
Condition (Con1) implies $\frac{N\tau_N^2}{\log(N+J)} \to \infty$. Hence, $\frac{\tau_N}{2} \ge C_R\sqrt{\frac{\log(N+J)}{N}}$ holds for all sufficiently large $N$. For such $N$, we have the inclusion
\[ \big\{\sigma_1(R^*) \ge 1 + \sqrt{J/N} + \tau_N/2\big\} \subseteq \big\{\sigma_1(R^*) > 1 + \sqrt{J/N} + C_R\sqrt{\log(N+J)/N}\big\}, \]
because the left-hand threshold is larger than the right-hand threshold. Consequently, we get
\[ P\big(\sigma_1(R^*) \ge 1 + \sqrt{J/N} + \tau_N/2\big) \le P\big(\sigma_1(R^*) > 1 + \sqrt{J/N} + C_R\sqrt{\log(N+J)/N}\big) \xrightarrow[N\to\infty]{} 0. \]

Bound for $\|\tilde R - R^*\|$. Inspecting the proof of Lemma 2, there exists an event $F_N$ with $P(F_N^c) \to 0$ and a constant $C_{\mathrm{pert}} > 0$ (depending only on $\delta$, $c_0$, $M$) such that on $F_N$,
\[ \|\tilde R - R^*\| \le C_{\mathrm{pert}}\sqrt{\frac{JK\log(JK)}{N}}. \]
Condition (Con1) asserts that $\tau_N\big/\sqrt{JK\log(JK)/N} \to \infty$. Hence, there exists $N_*$ such that for all $N \ge N_*$,
\[ \frac{\tau_N}{2} > C_{\mathrm{pert}}\sqrt{\frac{JK\log(JK)}{N}}. \]
On the event $F_N$ and for $N \ge N_*$, the above inequality together with the bound on $\|\tilde R - R^*\|$ implies $\|\tilde R - R^*\| < \tau_N/2$. Therefore, we get $\{\|\tilde R - R^*\| \ge \tau_N/2\} \subseteq F_N^c$, and consequently $P(\|\tilde R - R^*\| \ge \tau_N/2) \le P(F_N^c) \to 0$.

Both probabilities in the decomposition tend to zero; hence, we have $P(G_C) \to 0$.

Part 3: Control of $P(G_B)$ (under-specification). If $K = 1$, the set $\{K_0 < K\}$ is empty and $P(G_B) = 0$ trivially. Hence, we assume $K \ge 2$. Fix an arbitrary $K_0$ with $1 \le K_0 < K$. We will derive an upper bound for $P(T_{K_0} < \tau_N)$ that does not depend on the particular $K_0$.

Step 1. Deterministic objects from Lemma 5. Because $K_0 < K$, Lemma 5 guarantees the existence of two distinct true classes $k_1, k_2 \in [K]$, subsets $S = S_1\cup S_2 \subseteq [N]$ with $S_1 \subseteq \mathcal{C}_{k_1}$, $S_2 \subseteq \mathcal{C}_{k_2}$, and a subset $T \subseteq [J]$ satisfying the deterministic bounds
\[ |S_1|, |S_2| \ge \frac{c_0 N}{KK_0}, \qquad |S| \ge \frac{2c_0 N}{KK_0}, \qquad |T| \ge c_1 J. \]
All subjects in $S$ are assigned by $\hat Z$ to the same estimated class $\kappa \in [K_0]$. Define the scaling factors $\gamma_j := 1/\sqrt{N\hat V(i,j)}$ for $i \in S$ (constant on $S$), and set $W := (R - \mathcal{R})_{S,T}$, $\mathbf{M} := (\mathcal{R} - \hat R)_{S,T}$, $D := \mathrm{diag}(\gamma_j)_{j\in T}$. Then $\tilde R_{S,T} = (W + \mathbf{M})D$.

Step 2. Events and probability bounds from Theorem 2. In the proof of Theorem 2, the following events are introduced (with the same $S$, $T$, $\kappa$):
\[ E_A := \{\gamma_j > 0\ \forall j \in T\}, \quad E_B := \{\hat\Theta(j,\kappa) \in [\delta M/2, (1-\delta/2)M]\ \forall j \in T\}, \quad E_E := \{\|R - \mathcal{R}\| \le 2.5\sqrt{MN}\}. \]
From that proof, we have the exponential bounds (valid for all sufficiently large $N$):
\[ P(E_A^c) \le 2Je^{-c_A N/(KK_0)}, \qquad P(E_B^c) \le 2Je^{-c_B N/(KK_0)}, \qquad P(E_E^c) \le (N+J)e^{-c_E N}, \quad (A.8) \]
where $c_A := 2c_0 M(-\ln(1-\delta))$, $c_B := c_0\delta^2$, and $c_E > 0$ depends only on $M$ and the constant from Lemma 3. Moreover, on the event $E_A \cap E_B \cap E_E$, the analysis in Theorem 2 yields the deterministic lower bound $T_{K_0} \ge 2\eta_0$, where $\eta_0$ is the constant from Assumption 6.

Step 3. From the event bound to a bound on $P(T_{K_0} < \tau_N)$. Condition (Con1) ensures $\tau_N < 2\eta_0$ for all large $N$. Consequently, we have
\[ E_A \cap E_B \cap E_E \subseteq \{T_{K_0} \ge 2\eta_0\} \subseteq \{T_{K_0} \ge \tau_N\}. \]
Taking complements and applying the union bound, we obtain
\[ P(T_{K_0} < \tau_N) \le P(E_A^c) + P(E_B^c) + P(E_E^c). \]

Step 4. Uniform bound independent of $K_0$. Since $K_0 \le K-1$ and $K \ge 2$, we have $KK_0 \le K(K-1) \le K^2$, which gives
\[ \frac{N}{KK_0} \ge \frac{N}{K(K-1)} \ge \frac{N}{K^2}. \]
Therefore, the exponential bounds in Equation (A.8) can be weakened to
\[ P(E_A^c) \le 2Je^{-c_A N/K^2}, \qquad P(E_B^c) \le 2Je^{-c_B N/K^2}. \]
Note that the bound for $P(E_E^c)$ in Equation (A.8) already does not involve $K_0$. Consequently, for every $K_0 < K$ and all sufficiently large $N$, we have
\[ P(T_{K_0} < \tau_N) \le 2Je^{-c_A N/K^2} + 2Je^{-c_B N/K^2} + (N+J)e^{-c_E N}. \]

Step 5. Union bound over all $K_0 < K$. Because there are at most $K-1$ candidates with $K_0 < K$, we have
\[ P(G_B) \le \sum_{K_0=1}^{K-1} P(T_{K_0} < \tau_N) \le (K-1)\Big[2Je^{-c_A N/K^2} + 2Je^{-c_B N/K^2} + (N+J)e^{-c_E N}\Big]. \]

Part 4: Asymptotic analysis under Assumption 4.
• For the term $2(K-1)Je^{-cN/K^2}$ (with $c = c_A$ or $c_B$): Assumption 4(i) states $\frac{K^2\log(N+J)}{N} \to 0$. This implies that for any fixed $m_0 > 0$, there exists $N_2$ such that for all $N \ge N_2$, $\frac{N}{K^2} \ge m_0\log(N+J)$. Choose $m_0$ such that $cm_0 > 1$. Then, for all $N \ge N_2$, we have
\[ e^{-cN/K^2} \le e^{-cm_0\log(N+J)} = (N+J)^{-cm_0} \le N^{-cm_0}. \]
Assumption 4(ii) gives $\frac{JK}{N} \to 0$. Hence, there exists $N_3$ such that for all $N \ge N_3$, $JK \le N$, which implies $(K-1)J \le KJ \le N$. Therefore, for all $N \ge \max(N_2, N_3)$, we have
\[ 2(K-1)Je^{-cN/K^2} \le 2N\cdot N^{-cm_0} = 2N^{1-cm_0} \xrightarrow[N\to\infty]{} 0, \]
since $1 - cm_0 < 0$.
• For the term $(K-1)(N+J)e^{-c_E N}$: from $K^2 = o(N)$ (a direct consequence of Assumption 4(i)), we have $K = o(N^{1/2})$, and certainly $K = o(N)$. Hence, there exists $N_4$ such that for all $N \ge N_4$, $K \le N$. Also, from $J/N \to 0$, we eventually have $J \le N$. Thus for $N \ge N_4$, we have $(K-1)(N+J) \le N\cdot 2N = 2N^2$. Because $e^{-c_E N} = o(N^{-2})$ (exponential decay dominates any polynomial), we obtain $(K-1)(N+J)e^{-c_E N} \xrightarrow[N\to\infty]{} 0$.

All components of the bound on $P(G_B)$ converge to 0; therefore $P(G_B) \to 0$.

Part 5: Conclusion. We have established $P(G_C) \to 0$ and $P(G_B) \to 0$. By the decomposition in Equation (A.7), $P(G_A^c) \to 0$, i.e., $P(\hat K = K) \to 1$. This completes the proof of Theorem 3.

Appendix A.6. Proof of Theorem 4

Proof. We prove the two statements separately. Throughout the proof, we work under the given assumptions, which allow $K$ to grow with $N$ subject to $K^3 = o(N)$.

Part 1: Divergence at the true model. Assume $K_0 = K$ is correctly specified. We first show that $T_K = o_P(1)$. From Assumption 4(ii), we have $J = o(N)$.
Since Assumption 1 holds, Lemma 9 is applicable and yields $\|R^*\| \xrightarrow{a.s.} 1$, i.e., $\|R^*\| = 1 + o_P(1)$. By Lemma 2 and Assumption 4(ii), $\|\tilde R - R^*\| = O_P\big(\sqrt{JK\log(JK)/N}\big) = o_P(1)$. Weyl's inequality then implies
\[ |\sigma_1(\tilde R) - \sigma_1(R^*)| \le \|\tilde R - R^*\| = o_P(1), \]
so $\sigma_1(\tilde R) = \sigma_1(R^*) + o_P(1) = 1 + o_P(1)$. Therefore, we get
\[ T_K = \sigma_1(\tilde R) - \Big(1 + \sqrt{\frac{J}{N}}\Big) = o_P(1) - \sqrt{\frac{J}{N}} = o_P(1), \]
because $\sqrt{J/N} \to 0$. Now consider $K_0 = K-1 < K$. Assumption 6 (assumed to hold for every $K_0 < K$) ensures that Theorem 2 applies, giving a constant $2\eta_0 > 0$ such that $P(T_{K-1} > 2\eta_0) \to 1$. Fix an arbitrary constant $M_a > 0$. $T_K = o_P(1)$ gives $P(|T_K| < 2\eta_0/M_a) \to 1$. Define two events
\[ A_N := \{T_{K-1} > 2\eta_0\}, \qquad B_N := \{|T_K| < 2\eta_0/M_a\}. \]
We have $P(A_N \cap B_N) \to 1$. On the event $A_N \cap B_N$, we get
\[ r_K = \Big|\frac{T_{K-1}}{T_K}\Big| > \frac{2\eta_0}{2\eta_0/M_a} = M_a. \]
Thus, we get $P(r_K > M_a) \ge P(A_N \cap B_N) \to 1$. Because $M_a$ is arbitrary, we conclude that $r_K \xrightarrow{P} \infty$.

Part 2: Boundedness under under-fitting. Let $c_{\mathrm{low}}$ be the constant from Lemma 7. For each $m$ with $1 \le m < K$, Lemma 7 guarantees the existence of an event $E_N^{(m)}$ such that $P(E_N^{(m)}) \to 1$ and on $E_N^{(m)}$,
\[ T_m \ge \frac{c_{\mathrm{low}}\sqrt{J}}{\sqrt{K}m}. \]
To control the probabilities of the complements, we recall the construction in the proof of Theorem 2. For a given under-fitted candidate $K_0 = m$, the proof of Theorem 2 defines three events $E_A(m)$, $E_B(m)$, and $E_E$ (the latter does not depend on $m$) such that $E_N^{(m)} = E_A(m) \cap E_B(m) \cap E_E$. From the estimates obtained there (see the bounds (A.4) and the subsequent analysis), there exist positive constants $c_A, c_B, c_E$ depending only on the model parameters $\delta$, $c_0$, $M$ (and on the constant from Lemma 3 for $c_E$) such that for all sufficiently large $N$,
\[ P(E_A^c(m)) \le 2J\exp\Big(-\frac{c_A N}{Km}\Big), \qquad P(E_B^c(m)) \le 2J\exp\Big(-\frac{c_B N}{Km}\Big), \qquad P(E_E^c) \le (N+J)e^{-c_E N}. \]
(These bounds are uniform in $m$ because the constants $c_A, c_B, c_E$ are derived from Assumptions 1–4 and do not involve $m$; the factor $K$ in the denominator appears because the class sizes scale as $N/K$ uniformly, as established in Assumption 2.) Using the union bound,
\[ P\big((E_N^{(m)})^c\big) \le P(E_A^c(m)) + P(E_B^c(m)) + P(E_E^c) \le 2J\exp\Big(-\frac{c_A N}{Km}\Big) + 2J\exp\Big(-\frac{c_B N}{Km}\Big) + (N+J)e^{-c_E N}. \]
Since $m \le K$, we have $1/(Km) \ge 1/K^2$, and therefore
\[ \exp\Big(-\frac{c_A N}{Km}\Big) \le \exp\Big(-\frac{c_A N}{K^2}\Big), \qquad \exp\Big(-\frac{c_B N}{Km}\Big) \le \exp\Big(-\frac{c_B N}{K^2}\Big). \]
Consequently, for all $m < K$, we have
\[ P\big((E_N^{(m)})^c\big) \le 4J\exp\Big(-\frac{c_{AB}N}{K^2}\Big) + (N+J)e^{-c_E N}, \]
where we set $c_{AB} := \min\{c_A, c_B\} > 0$. Now define $\tilde E_N := \bigcap_{m=1}^{K-1} E_N^{(m)}$. Applying the union bound gives
\[ P(\tilde E_N^c) = P\Big(\bigcup_{m=1}^{K-1}(E_N^{(m)})^c\Big) \le \sum_{m=1}^{K-1} P\big((E_N^{(m)})^c\big) \le (K-1)\Big[4J\exp\Big(-\frac{c_{AB}N}{K^2}\Big) + (N+J)e^{-c_E N}\Big]. \]
We show that the right-hand side tends to zero.
• First term. By Assumption 4, we get $K^2\log N \ll N$ and $\log((K-1)J) \ll \frac{c_{AB}N}{K^2}$ for sufficiently large $N$. Therefore, we have
\[ \log\Big((K-1)Je^{-c_{AB}N/K^2}\Big) \le \log\big((K-1)J\big) - \frac{c_{AB}N}{K^2} \to -\infty, \]
so $(K-1)Je^{-c_{AB}N/K^2} = o(1)$.
• Second term. Since $K \ll N^{1/2}$ and $J \ll N$ for large $N$ by Assumption 4, we have
\[ (K-1)(N+J)e^{-c_E N} \le 2N^{1.5}e^{-c_E N} = o(1), \]
as the exponential decay dominates any polynomial growth.

Thus, we have $P(\tilde E_N^c) \to 0$, i.e., $P(\tilde E_N) \to 1$.
On the event $\tilde E_N$, for every $m$ with $1 \le m < K$ we have the lower bound $T_m \ge \frac{c_{\mathrm{low}}\sqrt{J}}{\sqrt{K}m}$. In particular, for any $K_0$ with $2 \le K_0 < K$, we have
\[ T_{K_0-1} \ge \frac{c_{\mathrm{low}}\sqrt{J}}{\sqrt{K}(K_0-1)}, \qquad T_{K_0} \ge \frac{c_{\mathrm{low}}\sqrt{J}}{\sqrt{K}K_0}. \]
Lemma 8 provides the deterministic universal bound $T_m \le \sqrt{MJ}$ for every $m \ge 1$, which holds with probability 1. Thus, on $\tilde E_N$, we have
\[ r_{K_0} = \frac{T_{K_0-1}}{T_{K_0}} \le \frac{\sqrt{MJ}}{c_{\mathrm{low}}\sqrt{J}/(\sqrt{K}K_0)} = \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K}K_0 \le \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K}(K-1). \]
Therefore, we obtain
\[ P\Big(r_{K_0} > \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K}(K-1)\Big) \le P(\tilde E_N^c) \to 0, \]
which completes the proof.

Appendix A.7. Proof of Theorem 5

Proof. We treat the two possibilities $K = 1$ and $K \ge 2$ separately. Throughout the proof, all constants may depend on the fixed number $K$ and on the model parameters $\delta, c_0, c_1, M, \zeta$, but never on $N$ or $J$.

Part 1: Case $K = 1$. Under the null hypothesis $K = 1$, Theorem 1 gives $T_1 = o_P(1)$. A closer inspection of the proof (combining Lemma 2 and Lemma 1) shows that there exist a constant $\tilde C_1 > 0$ and an event $F_N^{(1)}$ with $P(F_N^{(1)}) \to 1$ such that on $F_N^{(1)}$,
\[ |T_1| \le \tilde C_1\sqrt{\frac{JK\log(JK)}{N}}. \quad (A.9) \]
Because $K$ is fixed, the right-hand side is $O\big(\sqrt{J\log J/N}\big)$. Condition (Con1) guarantees that for all sufficiently large $N$, we have $\tau_N \ge \tilde C_1\sqrt{JK\log(JK)/N}$ on a set whose probability tends to one. Consequently, we get
\[ P(T_1 < \tau_N) \ge P(F_N^{(1)}) - o(1) \to 1. \]
Thus the algorithm returns $\hat K = 1$ with probability tending to one.

Part 2: Case $K \ge 2$. We must show two facts: (i) the algorithm does not stop at any under-fitted candidate $K_0 < K$; (ii) it does stop at the true candidate $K_0 = K$.

(i) No stop at under-fitted candidates.
• Candidate $K_0 = 1$. Because $K \ge 2$, the candidate 1 is under-fitted. Theorem 2 (applied with $K_0 = 1$) provides a constant $2\eta_0 > 0$ such that $P(T_1 > 2\eta_0) \to 1$. By Condition (Con1), we have $P(T_1 < \tau_N) \le P(T_1 \le 2\eta_0) \to 0$. Hence the algorithm does not stop at $K_0 = 1$.
• Candidates $2 \le K_0 < K$. Fix any such $K_0$. Because $K$ is fixed, the condition $K^3 = o(N)$ in Theorem 4 is trivially satisfied. Hence part (b) of Theorem 4 applies, and we obtain
\[ \lim_{N\to\infty} P\Big(r_{K_0} > \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K}(K-1)\Big) = 0. \quad (A.10) \]
Since $\gamma_N \to \infty$ by Condition (Con2), for all sufficiently large $N$ we have $\gamma_N > \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K}(K-1)$. Therefore,
\[ P(r_{K_0} > \gamma_N) \le P\Big(r_{K_0} > \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K}(K-1)\Big) + 1\Big\{\gamma_N \le \frac{\sqrt{M}}{c_{\mathrm{low}}}\sqrt{K}(K-1)\Big\} \to 0. \]
Consequently, with probability tending to one, the ratio condition fails for every such $K_0$. Thus, the algorithm does not stop at any of them.

(ii) Stop at the true candidate $K_0 = K$. We now prove that $P(r_K > \gamma_N) \to 1$. From Lemma 7 applied with $K_0 = K-1$, there exists an event $A_N$ with $P(A_N) \to 1$ such that on $A_N$,
\[ T_{K-1} \ge \frac{c_{\mathrm{low}}\sqrt{J}}{\sqrt{K}(K-1)} > 0. \quad (A.11) \]
From the proof of Theorem 1 (specifically Lemma 2 and Weyl's inequality), there exist a constant $\tilde C_2 > 0$ and an event $B_N$ with $P(B_N) \to 1$ such that on $B_N$,
\[ |T_K| \le \tilde C_2\sqrt{\frac{JK\log(JK)}{N}}. \quad (A.12) \]
Define $\breve E_N := A_N \cap B_N$. Since both $A_N$ and $B_N$ have probability tending to one, $P(\breve E_N) \to 1$. On $\breve E_N$, combining Equations (A.11) and (A.12) yields
\[ r_K = \frac{T_{K-1}}{|T_K|} \ge \frac{c_{\mathrm{low}}\sqrt{J}/\big(\sqrt{K}(K-1)\big)}{\tilde C_2\sqrt{JK\log(JK)/N}} = \frac{c_{\mathrm{low}}}{\tilde C_2(K-1)K}\cdot\frac{\sqrt{N}}{\sqrt{\log(JK)}}. \quad (A.13) \]
Denote $\tilde C_3 := \frac{c_{\mathrm{low}}}{\tilde C_2(K-1)K} > 0$. Then on $\breve E_N$, we have
\[ r_K \ge \tilde C_3\frac{\sqrt{N}}{\sqrt{\log(JK)}}. \quad (A.14) \]
Condition (Con2) states that $\gamma_N = o\big(\sqrt{N/\log J}\big)$.
Given that $K$ is fixed, there exists in particular an integer $N_0$ such that for all $N \ge N_0$,
\[ \tilde C_3\frac{\sqrt{N}}{\sqrt{\log(JK)}} > \gamma_N. \]
Consequently, on the event $\breve E_N$ and for all $N \ge N_0$, inequality (A.14) yields $r_K > \gamma_N$. Hence, we obtain
\[ P(r_K > \gamma_N) \ge P(\breve E_N) \to 1. \]
We have shown that with probability tending to one the algorithm does not stop at any $K_0 < K$ and does stop at $K_0 = K$. Because the algorithm sequentially examines candidates in increasing order, this implies $P(\hat K = K) \to 1$. Thus the theorem holds for both $K = 1$ and $K \ge 2$.

Appendix B. Technical lemmas

Lemma 4 (Consistency of estimated parameters under the null). Under $H_0$ with the true $K = K_0$, suppose that Assumptions 1, 2, 4, and 5 hold. Let $\hat Z$, $\hat\Theta$, $\hat R$ be the estimates from Algorithm 1 Step 1. Then there exists a permutation matrix $\Pi$ such that, with probability tending to 1:
1. $\hat Z = Z\Pi$;
2. $\|\hat\Theta - \Theta\Pi\|_F = O_P\big(\sqrt{JK^2/N}\big)$;
3. $\|\hat R - \mathcal{R}\|_F = O_P\big(\sqrt{JK}\big)$;
4. $\max_{i,j}|\hat R(i,j) - \mathcal{R}(i,j)| = O_P\big(\sqrt{K\log(JK)/N}\big)$;
5. there exists a constant $v_{\min} > 0$ (specifically, $v_{\min} = \frac{\delta^2 M}{4}$) such that with high probability, $\hat V(i,j) \ge v_{\min}$ for all $i, j$.

Lemma 5 (Lower bound on structural residual under under-fitting). Assume $K_0 < K$, and Assumptions 2 and 3 hold. Then there exist a constant $c > 0$ and sets $S \subset [N]$, $T \subset [J]$, with $|S| \asymp N/K$, $|T| \ge c_1 J$, such that for any estimated classification matrix $\hat Z$ (and the corresponding $\hat\Theta$, $\hat R$) we have deterministically
\[ \big\|(\mathcal{R} - \hat R)_{S,T}\big\| \ge \frac{c\,\zeta\sqrt{NJ}}{\sqrt{K}K_0}, \quad (B.1) \]
where the constant $c = \frac{\sqrt{c_0^3 c_1}}{2\sqrt{2}}$ depends only on $c_0$ (Assumption 2) and $c_1$ (Assumption 3), and is independent of $N$, $J$, $K$, $K_0$.

Lemma 6 (Control of the residual matrix). Let $X = R - \mathcal{R} \in \mathbb{R}^{N\times J}$. We have
\[ \|X\| = O_P\big(\sqrt{N} + \sqrt{J}\big). \quad (B.2) \]
Consequently, for any submatrix $W = (R - \mathcal{R})_{S,T}$ with $S \subseteq [N]$, $T \subseteq [J]$, we have $\|W\| = O_P(\sqrt{N} + \sqrt{J})$.

Lemma 7 (Lower bound for $T_{K_0}$ under under-fitting). Assume $K_0 < K$ and that Assumptions 1–4 and 6 hold. Suppose additionally that $K^3 = o(N)$ as $N \to \infty$. Let the constants $c, \zeta, M, C_{\mathrm{noise}}, C_{\mathrm{signal}}, \eta_0$ be as defined in Theorem 2. Then there exists a constant
\[ c_{\mathrm{low}} := \frac{2\eta_0 C_{\mathrm{signal}}}{C_{\mathrm{noise}} + 1 + 3\eta_0} \]
such that for every $K_0 < K$,
\[ P\Big(T_{K_0} \ge \frac{c_{\mathrm{low}}\sqrt{J}}{\sqrt{K}K_0}\Big) \to 1 \quad \text{as } N \to \infty. \quad (B.3) \]

Lemma 8 (Deterministic upper bound for $T_{K_0}$). For any candidate number of latent classes $K_0$ (including the true one) and for every realization of the data and the estimation procedure, we have $T_{K_0} \le \sqrt{MJ}$, where $M$ is the maximum response category introduced in Definition 1. Consequently, $P(T_{K_0} \le \sqrt{MJ}) = 1$ for all $N$.

Lemma 9 (Lower bound for $\|R^*\|$ under $J = o(N)$). Suppose that Assumption 1 holds and $J = o(N)$. We have
\[ \|R^*\| = \frac{\|Y_N\|}{\sqrt{N}} \xrightarrow{a.s.} 1. \]
In particular, for any $\varepsilon > 0$, $\lim_{N\to\infty} P(\|R^*\| \ge 1 - \varepsilon) = 1$, and $\sigma_1(R^*) = \|R^*\| = 1 + o_P(1)$.

Appendix B.1. Proof of Lemma 4

Proof. Throughout, we denote $[n] = \{1, 2, \ldots, n\}$ for any positive integer $n$.

Part 1: Perfect recovery of class assignments. By Assumption 5, when the true number of latent classes is $K$ (i.e., $K_0 = K$), the classification estimation method $\mathcal{M}$ used in Algorithm 1 Step 1 is consistent. That is, there exists a permutation matrix $\Pi$ such that $P(\hat Z = Z\Pi) \to 1$ as $N \to \infty$. This proves statement 1 directly from the assumption.
Part 2: Frobenius norm error of $\hat\Theta$. We now condition on the high-probability event $E_N^{(1)} = \{\hat Z = Z\Pi\}$. By the construction of the estimator in Algorithm 1 Step 1 (which implements method $\mathcal{M}$), the item parameter matrix is estimated as
\[ \hat\Theta = R^\top\hat Z(\hat Z^\top\hat Z)^{-1} = R^\top Z\Pi(\Pi^\top Z^\top Z\Pi)^{-1} = R^\top Z(Z^\top Z)^{-1}\Pi, \]
where we used the facts that for a permutation matrix $\Pi$, $\Pi^{-1} = \Pi^\top$ and $(\Pi^\top Z^\top Z\Pi)^{-1} = \Pi^\top(Z^\top Z)^{-1}\Pi$. The true item parameter matrix $\Theta$ satisfies $\mathcal{R} = Z\Theta^\top$, where $\mathcal{R} = E[R]$. Hence,
\[ \Theta = \mathcal{R}^\top Z(Z^\top Z)^{-1}. \]
Therefore,
\[ \hat\Theta - \Theta\Pi = \big[R^\top Z(Z^\top Z)^{-1} - \mathcal{R}^\top Z(Z^\top Z)^{-1}\big]\Pi = (R - \mathcal{R})^\top Z(Z^\top Z)^{-1}\Pi. \]
Since $\Pi$ is an orthogonal matrix (permutation matrices are orthogonal), we have $\|\Pi\| = 1$ (spectral norm) and $\|A\Pi\|_F = \|A\|_F$ for any matrix $A$. Thus, we have
\[ \|\hat\Theta - \Theta\Pi\|_F = \|(R - \mathcal{R})^\top Z(Z^\top Z)^{-1}\|_F. \]
Define $W = R - \mathcal{R}$. The entries $W(i,j)$ are independent across $i$ and $j$ (conditional on the latent classes, and also unconditionally since the latent classes are fixed). Moreover, $E[W(i,j)] = 0$, and by the binomial variance formula,
\[ \mathrm{Var}(W(i,j)) = \mathcal{R}(i,j)\Big(1 - \frac{\mathcal{R}(i,j)}{M}\Big) \le \frac{M}{4}, \]
where the inequality holds because the function $h(x) = x(1 - x/M)$ attains its maximum $M/4$ at $x = M/2$, and $\mathcal{R}(i,j) \in [0, M]$. Let $D = (Z^\top Z)^{-1} = \mathrm{diag}(N_1^{-1}, N_2^{-1}, \ldots, N_K^{-1})$. Define $A = W^\top ZD^{1/2} \in \mathbb{R}^{J\times K}$, where $D^{1/2} = \mathrm{diag}(N_1^{-1/2}, N_2^{-1/2}, \ldots, N_K^{-1/2})$. Then
\[ \|W^\top ZD\|_F^2 = \|AD^{1/2}\|_F^2 = \mathrm{tr}\big(D^{1/2}A^\top AD^{1/2}\big) = \mathrm{tr}\big(A^\top AD\big) = \sum_{k=1}^{K}\frac{1}{N_k}\|A_{:k}\|_2^2, \]
where $A_{:k}$ denotes the $k$-th column of $A$. Now, observe that
\[ A_{:k} = \frac{1}{\sqrt{N_k}}\sum_{i\in\mathcal{C}_k} W(i,:)^\top, \]
where $W(i,:) = (W(i,1), \ldots, W(i,J))^\top$. For each $j \in [J]$, the $j$-th component of $A_{:k}$ is $A_{:k}(j) = \frac{1}{\sqrt{N_k}}\sum_{i\in\mathcal{C}_k} W(i,j)$. For fixed $k$ and $j$, the summands $\{W(i,j) : i \in \mathcal{C}_k\}$ are independent, zero-mean random variables, bounded by $|W(i,j)| \le M$ (since $R(i,j) \in \{0,1,\ldots,M\}$ and $\mathcal{R}(i,j) \in [0,M]$). Their variances satisfy $\mathrm{Var}(W(i,j)) \le M/4$ as noted above. We now compute the expected squared Euclidean norm of $A_{:k}$:
\[ E\big[\|A_{:k}\|_2^2\big] = \sum_{j=1}^{J} E\big[A_{:k}(j)^2\big] = \sum_{j=1}^{J}\mathrm{Var}(A_{:k}(j)) = \sum_{j=1}^{J}\frac{1}{N_k}\sum_{i\in\mathcal{C}_k}\mathrm{Var}(W(i,j)) \le \sum_{j=1}^{J}\frac{1}{N_k}\cdot N_k\cdot\frac{M}{4} = \frac{MJ}{4}, \]
where the second equality uses $E[A_{:k}(j)] = 0$ and the third uses independence across $i$. Thus, $E[\|A_{:k}\|_2^2] \le \frac{MJ}{4}$ for each $k \in [K]$. By Assumption 2, there exists $c_0 > 0$ such that $N_k \ge c_0 N/K$ for all $k$. Therefore, we have
\[ E\big[\|\hat\Theta - \Theta\Pi\|_F^2\big] = E\Big[\sum_{k=1}^{K}\frac{1}{N_k}\|A_{:k}\|_2^2\Big] = \sum_{k=1}^{K}\frac{1}{N_k}E\big[\|A_{:k}\|_2^2\big] \le \sum_{k=1}^{K}\frac{K}{c_0 N}\cdot\frac{MJ}{4} = \frac{MK^2 J}{4c_0 N}. \]
Now, by Markov's inequality, for any $\epsilon > 0$, we have
\[ P\big(\|\hat\Theta - \Theta\Pi\|_F \ge \epsilon\big) \le \frac{E\big[\|\hat\Theta - \Theta\Pi\|_F^2\big]}{\epsilon^2} \le \frac{MK^2 J}{4c_0 N\epsilon^2}. \]
For any fixed $\eta > 0$, choose $\epsilon_\eta = \sqrt{\frac{MK^2 J}{4c_0 N\eta}}$. Then, we have $P(\|\hat\Theta - \Theta\Pi\|_F \ge \epsilon_\eta) \le \eta$. Hence, by definition, we obtain
\[ \|\hat\Theta - \Theta\Pi\|_F = O_P\Big(\sqrt{\frac{JK^2}{N}}\Big). \]
This proves statement 2.
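In matrix form, this estimator is simply the class-wise sample mean. A minimal sketch (ours, with arbitrary illustrative sizes; it assumes every class is non-empty) verifying the algebraic identity $\hat\Theta = R^\top Z(Z^\top Z)^{-1}$ numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
N, J, K, M = 300, 20, 3, 4
labels = rng.integers(0, K, size=N)
Z = np.eye(K)[labels]                       # N x K membership matrix
R = rng.binomial(M, 0.5, size=(N, J)).astype(float)

# Matrix form of the estimator: Theta_hat = R^T Z (Z^T Z)^{-1}
Theta_hat = R.T @ Z @ np.linalg.inv(Z.T @ Z)

# Equivalent class-wise sample means; column k = mean response within class k
Theta_means = np.stack([R[labels == k].mean(axis=0) for k in range(K)], axis=1)
print(np.allclose(Theta_hat, Theta_means))  # True
```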
Part 3: Frobenius norm error of $\hat R$. Conditioning again on the event $E_N^{(1)}$ where $\hat Z = Z\Pi$ (which occurs with probability tending to 1), we have
\[ \hat R = \hat Z\hat\Theta^\top = Z\Pi\hat\Theta^\top = Z(\hat\Theta\Pi^\top)^\top. \]
Given that the true expected response matrix is $\mathcal{R} = Z\Theta^\top$, we have
\[ \hat R - \mathcal{R} = Z(\hat\Theta\Pi^\top)^\top - Z\Theta^\top = Z\big(\hat\Theta\Pi^\top - \Theta\big)^\top. \]
Because $\|\hat\Theta\Pi^\top - \Theta\|_F = \|\hat\Theta - \Theta\Pi\|_F$, we have
\[ \|\hat R - \mathcal{R}\|_F = \big\|Z(\hat\Theta\Pi^\top - \Theta)^\top\big\|_F \le \|Z\|\cdot\|\hat\Theta\Pi^\top - \Theta\|_F = \|Z\|\cdot\|\hat\Theta - \Theta\Pi\|_F. \]
The matrix $Z$ has orthogonal columns: $Z^\top Z = \mathrm{diag}(N_1, \ldots, N_K)$. Thus, its spectral norm is $\|Z\| = \sqrt{\lambda_{\max}(Z^\top Z)} = \sqrt{N_{\max}}$. By Assumption 2, $N_{\max} \le \frac{N}{c_0 K}$, so $\|Z\| \le \sqrt{\frac{N}{c_0 K}}$. Combining this with statement 2, we obtain
\[ \|\hat R - \mathcal{R}\|_F \le \sqrt{\frac{N}{c_0 K}}\cdot O_P\Big(\sqrt{\frac{JK^2}{N}}\Big) = O_P\Big(\sqrt{\frac{JK}{c_0}}\Big) = O_P\big(\sqrt{JK}\big). \]
This proves statement 3.

Part 4: Uniform entrywise error of $\hat R$. We begin by recalling that on the high-probability event $E_N^{(1)} = \{\hat Z = Z\Pi\}$ (which satisfies $P(E_N^{(1)}) \to 1$), we have $\hat R(i,j) = \hat\Theta(j,\pi(k))$ where $k = \ell(i)$ and $\pi$ is the permutation corresponding to $\Pi$. Moreover, by the construction of the estimator in Algorithm 1 Step 1, $\hat\Theta(j,\pi(k))$ is precisely the sample mean of the responses to item $j$ over all subjects in the true class $k$, i.e.,
\[ \hat\Theta(j,\pi(k)) = \frac{1}{N_k}\sum_{i\in\mathcal{C}_k} R(i,j), \]
where $\mathcal{C}_k = \{i : \ell(i) = k\}$ and $N_k = |\mathcal{C}_k|$. Fix an arbitrary pair $(j,k)$ with $j \in [J]$ and $k \in [K]$. Condition on the latent class assignment $\ell$ (which is fixed in our model) and on $E_N^{(1)}$. Under the data-generating mechanism of the LCM, the random variables $\{R(i,j) : i \in \mathcal{C}_k\}$ are mutually independent (by conditional independence given $\ell$, and unconditionally since $\ell$ is fixed), and each follows a binomial distribution with parameters $M$ and $\Theta(j,k)/M$. Consequently, each $R(i,j)$ is bounded within the interval $[0, M]$ and has mean $\Theta(j,k)$.

We now apply Hoeffding's inequality in its precise form. Let $Y_1, \ldots, Y_n$ be independent random variables such that $a_i \le Y_i \le b_i$ almost surely. Then for any $t > 0$, Hoeffding's inequality gives
\[ P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} Y_i - E\Big[\frac{1}{n}\sum_{i=1}^{n} Y_i\Big]\Big| \ge t\Big) \le 2\exp\Big(-\frac{2n^2 t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\Big). \]
In our setting, for the $N_k$ variables $\{R(i,j) : i \in \mathcal{C}_k\}$, we have $a_i = 0$, $b_i = M$, and $\sum_{i=1}^{N_k}(b_i - a_i)^2 = N_k M^2$. Therefore, we get
\[ P\Big(\big|\hat\Theta(j,\pi(k)) - \Theta(j,k)\big| \ge t \,\Big|\, E_N^{(1)}\Big) \le 2\exp\Big(-\frac{2N_k t^2}{M^2}\Big). \]
By Assumption 2, there exists a constant $c_0 > 0$ such that $N_k \ge c_0 N/K$ for all $k$. Hence, uniformly in $j, k$, we have
\[ P\Big(\big|\hat\Theta(j,\pi(k)) - \Theta(j,k)\big| \ge t \,\Big|\, E_N^{(1)}\Big) \le 2\exp\Big(-\frac{2c_0 Nt^2}{KM^2}\Big). \]
Now observe that on $E_N^{(1)}$, for any $i \in \mathcal{C}_k$, $\hat R(i,j) = \hat\Theta(j,\pi(k))$ and $\mathcal{R}(i,j) = \Theta(j,k)$. Therefore, we have
\[ \max_{i\in[N], j\in[J]}\big|\hat R(i,j) - \mathcal{R}(i,j)\big| = \max_{k\in[K], j\in[J]}\big|\hat\Theta(j,\pi(k)) - \Theta(j,k)\big|. \]
We control this maximum via a union bound over the $JK$ distinct pairs $(j,k)$:
\[ P\Big(\max_{i,j}\big|\hat R(i,j) - \mathcal{R}(i,j)\big| \ge t \,\Big|\, E_N^{(1)}\Big) \le \sum_{j=1}^{J}\sum_{k=1}^{K} P\Big(\big|\hat\Theta(j,\pi(k)) - \Theta(j,k)\big| \ge t \,\Big|\, E_N^{(1)}\Big) \le 2JK\exp\Big(-\frac{2c_0 Nt^2}{KM^2}\Big). \]
For any fixed $\eta > 0$, let
\[ t = M\sqrt{\frac{K}{2c_0 N}\log\Big(\frac{4JK}{\eta}\Big)}. \]
Then, we have $2JK\exp\big(-\frac{2c_0 Nt^2}{KM^2}\big) = 2JK\cdot\frac{\eta}{4JK} = \eta/2$, which gives
\[ P\Big(\max_{i,j}\big|\hat R(i,j) - \mathcal{R}(i,j)\big| \ge M\sqrt{\frac{K}{2c_0 N}\log\Big(\frac{4JK}{\eta}\Big)} \,\Big|\, E_N^{(1)}\Big) \le \eta/2. \]
Now we remove the conditioning on $E_N^{(1)}$. Using the law of total probability, for any $t > 0$, we have
\[ P\Big(\max_{i,j}\big|\hat R - \mathcal{R}\big| \ge t\Big) \le P\big((E_N^{(1)})^c\big) + P\Big(\max_{i,j}\big|\hat R - \mathcal{R}\big| \ge t \,\Big|\, E_N^{(1)}\Big). \]
Since $P((E_N^{(1)})^c) \to 0$, there exists $N_0$ such that for all $N \ge N_0$, $P((E_N^{(1)})^c) \le \eta/2$.
Then, for $N \ge N_0$, we have
\[ P\Big(\max_{i,j}\big|\hat R - \mathcal{R}\big| \ge M\sqrt{\frac{K}{2c_0 N}\log\Big(\frac{4JK}{\eta}\Big)}\Big) \le \eta, \]
which means
\[ \max_{i,j}\big|\hat R(i,j) - \mathcal{R}(i,j)\big| = O_P\Big(\sqrt{\frac{K\log(JK)}{N}}\Big). \]
This proves statement 4.

Part 5: Lower bound on $\hat V(i,j)$. Recall that $\hat V(i,j) = \hat R(i,j)(1 - \hat R(i,j)/M)$. Define the function
\[ h(x) = x\Big(1 - \frac{x}{M}\Big), \qquad x \in [0, M]. \]
Under Assumption 1, we have $\mathcal{R}(i,j) = \Theta(j,\ell(i)) \in [\delta M, (1-\delta)M]$. By statement 4, with probability tending to 1, for all $i, j$ we have $|\hat R(i,j) - \mathcal{R}(i,j)| \le \epsilon_N$, where $\epsilon_N = C\sqrt{\frac{K\log(JK)}{N}}$ for some constant $C > 0$. Under Assumption 4, $\epsilon_N \to 0$; hence, for sufficiently large $N$, we have $\epsilon_N \le \delta M/2$. Consequently,
\[ \hat R(i,j) \in [\delta M - \epsilon_N, (1-\delta)M + \epsilon_N] \subseteq \Big[\frac{\delta M}{2}, \Big(1 - \frac{\delta}{2}\Big)M\Big]. \]
Because $h$ is a concave function (since $h''(x) = -2/M < 0$), on the interval $[\delta M/2, (1-\delta/2)M]$ its minimum is attained at one of the endpoints. Compute
\[ h\Big(\frac{\delta M}{2}\Big) = \frac{\delta M}{2}\Big(1 - \frac{\delta}{2}\Big) = \frac{\delta(2-\delta)M}{4}, \qquad h\Big(\Big(1 - \frac{\delta}{2}\Big)M\Big) = \Big(1 - \frac{\delta}{2}\Big)M\cdot\frac{\delta}{2} = \frac{\delta(2-\delta)M}{4}. \]
Thus, both endpoints yield the same value. Therefore, for any $x \in [\delta M/2, (1-\delta/2)M]$, we have $h(x) \ge \frac{\delta(2-\delta)M}{4}$. Since $\delta \in (0, 1/2]$, we have $2 - \delta \ge \delta$, and hence $\frac{\delta(2-\delta)M}{4} \ge \frac{\delta^2 M}{4}$. Thus, with probability tending to 1, for all $i, j$, we have
\[ \hat V(i,j) \ge \frac{\delta^2 M}{4}. \]
We can therefore set $v_{\min} = \frac{\delta^2 M}{4}$. This proves statement 5.

Appendix B.2. Proof of Lemma 5

Proof. The argument is completely deterministic and uses only elementary counting, the pigeonhole principle, and basic linear algebra. No probabilistic statement is required until the final conclusion. The proof proceeds in twelve self-contained steps.

Step 1. A large sub-block inside each true class. For each true class $k \in [K]$ and each estimated class $\kappa \in [K_0]$, define
\[ N_{k\kappa} := \big|\{i \in \mathcal{C}_k : \hat Z(i,\kappa) = 1\}\big|. \]
Clearly, we have $\sum_{\kappa=1}^{K_0} N_{k\kappa} = N_k$. By the averaging principle (a direct consequence of the pigeonhole principle), there exists at least one estimated class $\kappa(k) \in [K_0]$ such that $N_{k,\kappa(k)} \ge \frac{N_k}{K_0}$. Assumption 2 gives $N_k \ge c_0 N/K$. Hence, we have
\[ N_{k,\kappa(k)} \ge \frac{c_0 N}{KK_0}. \]

Step 2. Two true classes mapped to the same estimated class. For each $k \in [K]$, Step 1 gives a non-empty set $A_k := \{\kappa \in [K_0] : N_{k,\kappa} \ge N_k/K_0\}$. Choose an arbitrary element $\kappa(k) \in A_k$; this defines a function $\kappa : [K] \to [K_0]$. Because $K_0 < K$, $\kappa$ cannot be injective. Hence, by the pigeonhole principle, there exist distinct $k_1, k_2 \in [K]$ with $\kappa(k_1) = \kappa(k_2) =: \kappa$. These two true classes are therefore merged, at least partially, into the same estimated class $\kappa$.

Step 3. Construction of the row index set $S$. Define
\[ S_1 := \{i \in \mathcal{C}_{k_1} : \hat Z(i,\kappa) = 1\}, \qquad S_2 := \{i \in \mathcal{C}_{k_2} : \hat Z(i,\kappa) = 1\}, \qquad S := S_1 \cup S_2. \]
From Step 1 we have the deterministic lower bounds
\[ |S_1| = N_{k_1,\kappa} \ge \frac{c_0 N}{KK_0}, \qquad |S_2| = N_{k_2,\kappa} \ge \frac{c_0 N}{KK_0}. \]
Using the upper bound in Assumption 2 ($N_k \le N/(c_0 K)$) and the fact that $S_1 \subseteq \mathcal{C}_{k_1}$, $S_2 \subseteq \mathcal{C}_{k_2}$, we obtain $|S_1| \le N_{k_1} \le \frac{N}{c_0 K}$ and $|S_2| \le N_{k_2} \le \frac{N}{c_0 K}$, and consequently $|S| = |S_1| + |S_2| \le \frac{2N}{c_0 K}$. Thus, we have
\[ \frac{2c_0 N}{KK_0} \le |S| \le \frac{2N}{c_0 K}. \]

Step 4. Construction of the column index set $T$. For the pair $(k_1, k_2)$, we invoke Assumption 3. Define
\[ T := \big\{j \in [J] : |\Theta(j,k_1) - \Theta(j,k_2)| \ge \zeta/2\big\}. \]
Assumption 3 asserts that $|T| \ge c_1 J$. Trivially, we have $|T| \le J$.
Step 5. The residual submatrix $\mathbf{M} := (\mathcal{R} - \hat R)_{S,T}$. Because every individual in $S$ is assigned to the same estimated class $\kappa$, the fitted expectation is constant on $S \times T$:
\[ \hat R(i,j) = \hat\Theta(j,\kappa) =: \hat\theta(j), \qquad \forall i \in S,\ j \in T. \]
For the true expectation, we have
\[ \mathcal{R}(i,j) = \begin{cases} \Theta(j,k_1), & i \in S_1, \\ \Theta(j,k_2), & i \in S_2. \end{cases} \]
Hence, $\mathbf{M}$ admits the block representation
\[ \mathbf{M} = \begin{bmatrix} \mathbf{M}_1 \\ \mathbf{M}_2 \end{bmatrix}, \qquad \mathbf{M}_1(i,j) = \Theta(j,k_1) - \hat\theta(j)\ (i \in S_1), \qquad \mathbf{M}_2(i,j) = \Theta(j,k_2) - \hat\theta(j)\ (i \in S_2). \]

Step 6. Construction of two unit vectors. We now exhibit specific unit vectors $u \in \mathbb{R}^{|S|}$ and $v \in \mathbb{R}^{|T|}$ that yield a large value of $|u^\top\mathbf{M}v|$, thereby providing a lower bound for $\|\mathbf{M}\|$ via the variational characterization $\|\mathbf{M}\| = \max_{\|u\|=\|v\|=1}|u^\top\mathbf{M}v|$.

Column vector $v$. For each $j \in T$, set
\[ s_j := \mathrm{sign}\big(\Theta(j,k_1) - \Theta(j,k_2)\big) \in \{-1, 1\}, \qquad v_j := \frac{s_j}{\sqrt{|T|}}. \]
Then, we have $\|v\|_2^2 = \sum_{j\in T} 1/|T| = 1$. The signs convert the absolute differences $|\Theta(j,k_1) - \Theta(j,k_2)|$ into a sum of non-negative terms.

Row vector $u$. Recall that every row of $\mathbf{M}_1$ equals $\big(\Theta(j,k_1) - \hat\theta(j)\big)_{j\in T}$ and every row of $\mathbf{M}_2$ equals $\big(\Theta(j,k_2) - \hat\theta(j)\big)_{j\in T}$. To eliminate the unknown common part $\hat\theta(j)$, we assign positive weights to rows in $S_1$ and negative weights to rows in $S_2$, with equal total mass in absolute value. Define the balancing coefficients
\[ \alpha := \frac{\sqrt{|S_2|}}{\sqrt{|S_1| + |S_2|}}, \qquad \beta := \frac{\sqrt{|S_1|}}{\sqrt{|S_1| + |S_2|}}, \]
and set
\[ u_i := \begin{cases} \alpha/\sqrt{|S_1|}, & i \in S_1, \\ -\beta/\sqrt{|S_2|}, & i \in S_2. \end{cases} \]
A direct computation gives
\[ \|u\|_2^2 = \sum_{i\in S_1}\frac{\alpha^2}{|S_1|} + \sum_{i\in S_2}\frac{\beta^2}{|S_2|} = \alpha^2 + \beta^2 = 1, \]
so $u$ is indeed a unit vector. The opposing signs and the specific choice of $\alpha, \beta$ guarantee that the total weight on $S_1$,
\[ \alpha\sqrt{|S_1|} = \sqrt{\frac{|S_1||S_2|}{|S_1| + |S_2|}}, \]
is exactly the opposite of the total weight on $S_2$, $-\beta\sqrt{|S_2|} = -\sqrt{\frac{|S_1||S_2|}{|S_1|+|S_2|}}$, which forces the cancellation of $\hat\theta(j)$ when we form $u^\top\mathbf{M}v$ in the next step.

Step 7. Exact evaluation of $u^\top\mathbf{M}v$. Expanding the bilinear form $u^\top\mathbf{M}v$ gives
\[ u^\top\mathbf{M}v = \sum_{i\in S_1}\sum_{j\in T}\frac{\alpha}{\sqrt{|S_1|}}\big(\Theta(j,k_1) - \hat\theta(j)\big)\frac{s_j}{\sqrt{|T|}} + \sum_{i\in S_2}\sum_{j\in T}\Big(-\frac{\beta}{\sqrt{|S_2|}}\Big)\big(\Theta(j,k_2) - \hat\theta(j)\big)\frac{s_j}{\sqrt{|T|}} \]
\[ = \alpha\sqrt{|S_1|}\cdot\frac{1}{\sqrt{|T|}}\sum_{j\in T}\big(\Theta(j,k_1) - \hat\theta(j)\big)s_j - \beta\sqrt{|S_2|}\cdot\frac{1}{\sqrt{|T|}}\sum_{j\in T}\big(\Theta(j,k_2) - \hat\theta(j)\big)s_j. \]
Observe that
\[ \alpha\sqrt{|S_1|} = \frac{\sqrt{|S_2|}}{\sqrt{|S_1|+|S_2|}}\sqrt{|S_1|} = \sqrt{\frac{|S_1||S_2|}{|S_1|+|S_2|}} =: \gamma, \qquad \beta\sqrt{|S_2|} = \frac{\sqrt{|S_1|}}{\sqrt{|S_1|+|S_2|}}\sqrt{|S_2|} = \gamma. \]
Therefore, we have
\[ u^\top\mathbf{M}v = \frac{\gamma}{\sqrt{|T|}}\sum_{j\in T}\Big[\big(\Theta(j,k_1) - \hat\theta(j)\big) - \big(\Theta(j,k_2) - \hat\theta(j)\big)\Big]s_j = \frac{\gamma}{\sqrt{|T|}}\sum_{j\in T}\big(\Theta(j,k_1) - \Theta(j,k_2)\big)s_j. \]
By the definition of $s_j$, we have $(\Theta(j,k_1) - \Theta(j,k_2))s_j = |\Theta(j,k_1) - \Theta(j,k_2)| \ge 0$, which gives
\[ u^\top\mathbf{M}v = \frac{\gamma}{\sqrt{|T|}}\sum_{j\in T}|\Theta(j,k_1) - \Theta(j,k_2)|. \]

Step 8. Lower bound using the separation condition. For every $j \in T$, the definition of $T$ in Step 4 guarantees $|\Theta(j,k_1) - \Theta(j,k_2)| \ge \zeta/2$. Thus, we have
\[ u^\top\mathbf{M}v \ge \frac{\gamma}{\sqrt{|T|}}\cdot|T|\cdot\frac{\zeta}{2} = \frac{\gamma\sqrt{|T|}\,\zeta}{2}. \]

Step 9. Transfer to the spectral norm. Since $\|u\|_2 = \|v\|_2 = 1$, the variational characterization of the spectral norm gives
\[ \|\mathbf{M}\| \ge |u^\top\mathbf{M}v| \ge \frac{\gamma\sqrt{|T|}\,\zeta}{2}. \]
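The cancellation in Steps 6 and 7 is easy to check numerically. The sketch below is ours, with arbitrary illustrative sizes: it builds $u$, $v$, and the block matrix $\mathbf{M}$, and compares $u^\top\mathbf{M}v$ with $\frac{\gamma}{\sqrt{|T|}}\sum_{j\in T}|\Theta(j,k_1) - \Theta(j,k_2)|$ for a random common part $\hat\theta$.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, T = 40, 25, 15              # |S_1|, |S_2|, |T|

theta1 = rng.uniform(0, 4, T)       # Theta(j, k1) over j in T
theta2 = rng.uniform(0, 4, T)       # Theta(j, k2)
theta_hat = rng.uniform(0, 4, T)    # common fitted value theta_hat(j)

# Rows in S_1 equal theta1 - theta_hat; rows in S_2 equal theta2 - theta_hat
Mmat = np.vstack([np.tile(theta1 - theta_hat, (n1, 1)),
                  np.tile(theta2 - theta_hat, (n2, 1))])

alpha, beta = np.sqrt(n2 / (n1 + n2)), np.sqrt(n1 / (n1 + n2))
u = np.concatenate([np.full(n1, alpha / np.sqrt(n1)),
                    np.full(n2, -beta / np.sqrt(n2))])
v = np.sign(theta1 - theta2) / np.sqrt(T)

gamma = np.sqrt(n1 * n2 / (n1 + n2))
lhs = u @ Mmat @ v                                     # theta_hat cancels out
rhs = gamma / np.sqrt(T) * np.abs(theta1 - theta2).sum()
print(np.isclose(lhs, rhs), lhs <= np.linalg.norm(Mmat, 2) + 1e-9)  # True True
```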
Step 10. Quantitative lower bound for $\gamma$. From the size estimates in Step 3, we have
$$|S_1|, |S_2| \geq \frac{c_0 N}{K K_0}, \qquad |S_1| + |S_2| \leq \frac{2N}{c_0 K}.$$
Consequently, we get
$$\gamma = \sqrt{\frac{|S_1||S_2|}{|S_1|+|S_2|}} \geq \sqrt{\frac{\left(c_0 N/(K K_0)\right)^2}{2N/(c_0 K)}} = \sqrt{\frac{c_0^3 N}{2 K K_0^2}}.$$

Step 11. Combining the lower bounds. By $|T| \geq c_1 J$ obtained in Step 4 and the estimate for $\gamma$, we get
$$\|\mathbf{M}\| \geq \frac{\zeta}{2}\sqrt{\frac{c_0^3 N}{2 K K_0^2}}\sqrt{c_1 J} = \frac{\zeta}{2}\sqrt{\frac{c_0^3 c_1 N J}{2 K K_0^2}} = \frac{\sqrt{c_0^3 c_1}}{2\sqrt{2}} \cdot \frac{\zeta\sqrt{NJ}}{\sqrt{K}\,K_0}.$$
Define the constant $c := \frac{\sqrt{c_0^3 c_1}}{2\sqrt{2}}$. Then we have established the deterministic inequality
$$\|(\mathcal{R} - \hat{R})_{S,T}\| \geq \frac{c\,\zeta\sqrt{NJ}}{\sqrt{K}\,K_0}.$$

Step 12. Probability statement. The construction of $S, T$ and the subsequent algebraic estimates depend only on the fixed $\hat{Z}$ and the model parameters; they hold for every possible realization of $\hat{Z}$ whenever $K_0 < K$. Hence, regardless of the distribution of $\hat{Z}$,
$$\mathbb{P}\left(\|(\mathcal{R} - \hat{R})_{S,T}\| \geq \frac{c\,\zeta\sqrt{NJ}}{\sqrt{K}\,K_0}\right) = 1.$$
In particular, the conclusion holds with probability tending to 1 (indeed with probability 1) for any classifier $\mathcal{M}$.

Appendix B.3. Proof of Lemma 6

Proof. We prove this lemma in six steps.

Step 1. Basic estimates. By the construction of $X$, we have $\mathbb{E}[X(i,j)] = 0$ for all $i \in [N]$, $j \in [J]$, and the entries $\{X(i,j)\}$ are mutually independent. Recall that $R(i,j) \in \{0,1,\ldots,M\}$ and $\mathcal{R}(i,j) \in [0,M]$; hence $|X(i,j)| \leq M$. Recall that
$$\mathrm{Var}(R(i,j)) = \mathcal{R}(i,j)\left(1 - \frac{\mathcal{R}(i,j)}{M}\right) \leq \max_{x \in [0,M]} x\left(1 - \frac{x}{M}\right) = \frac{M}{4},$$
and because $\mathbb{E}[X(i,j)] = 0$, we have $\mathbb{E}[X(i,j)^2] = \mathrm{Var}(R(i,j)) \leq M/4$.

Step 2. Exact quantities required in Lemma 3. Define
$$\sigma_1^* := \max_{i \in [N]} \sqrt{\sum_{j=1}^J \mathbb{E}[X(i,j)^2]}, \qquad \sigma_2^* := \max_{j \in [J]} \sqrt{\sum_{i=1}^N \mathbb{E}[X(i,j)^2]}, \qquad \sigma_{**} := \max_{i,j} \|X(i,j)\|_\infty.$$
From the bounds obtained in Step 1, we obtain deterministic upper bounds for these quantities:
$$\sigma_1^* \leq \sqrt{J \cdot \frac{M}{4}} = \frac{\sqrt{MJ}}{2}, \qquad \sigma_2^* \leq \sqrt{N \cdot \frac{M}{4}} = \frac{\sqrt{MN}}{2}, \qquad \sigma_{**} \leq M.$$
Without confusion, we introduce the symbols
$$\tilde{\sigma}_1 := \frac{\sqrt{MJ}}{2}, \qquad \tilde{\sigma}_2 := \frac{\sqrt{MN}}{2}, \qquad \tilde{\sigma}_* := M,$$
which are upper bounds for the true quantities: $\sigma_1^* \leq \tilde{\sigma}_1$, $\sigma_2^* \leq \tilde{\sigma}_2$, $\sigma_{**} \leq \tilde{\sigma}_*$.

Step 3. Applying Lemma 3 with an inflated threshold. Fix $\eta = \frac{1}{2}$; this value satisfies the condition $0 < \eta \leq \frac{1}{2}$ of Lemma 3. Let $C_\eta$ be the constant whose existence is guaranteed by that lemma (it depends only on $\eta$). For any $t \geq 0$, applying Lemma 3 to the matrix $X$ yields
$$\mathbb{P}\left(\|X\| \geq (1+\eta)(\sigma_1^* + \sigma_2^*) + t\right) \leq (N+J)\exp\left(-\frac{t^2}{C_\eta (\sigma_{**})^2}\right).$$
Because $\sigma_1^* \leq \tilde{\sigma}_1$ and $\sigma_2^* \leq \tilde{\sigma}_2$, we have $(1+\eta)(\tilde{\sigma}_1 + \tilde{\sigma}_2) + t \geq (1+\eta)(\sigma_1^* + \sigma_2^*) + t$. Hence the event $\{\|X\| \geq (1+\eta)(\tilde{\sigma}_1 + \tilde{\sigma}_2) + t\}$ is contained in the event $\{\|X\| \geq (1+\eta)(\sigma_1^* + \sigma_2^*) + t\}$. Thus, we have
$$\mathbb{P}\left(\|X\| \geq (1+\eta)(\tilde{\sigma}_1 + \tilde{\sigma}_2) + t\right) \leq \mathbb{P}\left(\|X\| \geq (1+\eta)(\sigma_1^* + \sigma_2^*) + t\right).$$
Moreover, $\sigma_{**} \leq \tilde{\sigma}_*$ implies $(\sigma_{**})^2 \leq \tilde{\sigma}_*^2$ and thus $\exp\left(-\frac{t^2}{C_\eta(\sigma_{**})^2}\right) \leq \exp\left(-\frac{t^2}{C_\eta \tilde{\sigma}_*^2}\right)$. Combining the above inequalities, we obtain a convenient upper bound expressed entirely in terms of the deterministic constants $\tilde{\sigma}_1, \tilde{\sigma}_2, \tilde{\sigma}_*$:
$$\mathbb{P}\left(\|X\| \geq (1+\eta)(\tilde{\sigma}_1 + \tilde{\sigma}_2) + t\right) \leq (N+J)\exp\left(-\frac{t^2}{C_\eta \tilde{\sigma}_*^2}\right).$$
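Remark (numerical illustration). To see the scale of this bound in practice, the following simulation (our own sketch, not part of the proof) draws $R(i,j) \sim \mathrm{Binomial}(M, p_{ij})$, one distribution whose variance equals $\mathcal{R}(i,j)(1 - \mathcal{R}(i,j)/M)$ as in Step 1; this distributional choice is our illustrative assumption. It then compares $\|X\|$ with the threshold $(2+\eta)(\tilde{\sigma}_1 + \tilde{\sigma}_2)$ used in Step 4 below.

```python
import numpy as np

# Monte Carlo check of ||X|| against the threshold 2.5 * (sigma1~ + sigma2~).
rng = np.random.default_rng(2)
N, J, M = 400, 100, 4
P = rng.uniform(0.2, 0.8, size=(N, J))      # success probabilities p_ij
R = rng.binomial(M, P)                       # observed responses in {0,...,M}
X = R - M * P                                # centered noise matrix

spec = np.linalg.norm(X, 2)                  # spectral norm of X
threshold = 2.5 * (np.sqrt(M * J) / 2 + np.sqrt(M * N) / 2)  # eta = 1/2
print(f"||X|| = {spec:.1f}, threshold = {threshold:.1f}")    # ||X|| well below
```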
Step 4. Choice of $t$ and an explicit tail estimate. Taking $t = \tilde{\sigma}_1 + \tilde{\sigma}_2$ gives
$$\mathbb{P}\left(\|X\| \geq (2+\eta)(\tilde{\sigma}_1 + \tilde{\sigma}_2)\right) \leq (N+J)\exp\left(-\frac{(\tilde{\sigma}_1 + \tilde{\sigma}_2)^2}{C_\eta \tilde{\sigma}_*^2}\right). \tag{B.4}$$
Since $\tilde{\sigma}_1 + \tilde{\sigma}_2 = \frac{\sqrt{M}}{2}(\sqrt{N} + \sqrt{J})$ and $\tilde{\sigma}_* = M$, we have $(\tilde{\sigma}_1 + \tilde{\sigma}_2)^2 = \frac{M}{4}(\sqrt{N} + \sqrt{J})^2 \geq \frac{M}{4}(N+J)$. Consequently, we get
$$\frac{(\tilde{\sigma}_1 + \tilde{\sigma}_2)^2}{C_\eta \tilde{\sigma}_*^2} \geq \frac{M}{4} \cdot \frac{N+J}{C_\eta M^2} = \frac{N+J}{4 C_\eta M}.$$
Define the constant $C_{\mathrm{opt}} := \frac{(2+\eta)\sqrt{M}}{2}$; for our choice $\eta = \frac{1}{2}$, $C_{\mathrm{opt}} = \frac{2.5\sqrt{M}}{2} = 1.25\sqrt{M}$. Using Equation (B.4) together with the monotonicity of the exponential function, we obtain
$$\mathbb{P}\left(\|X\| \geq C_{\mathrm{opt}}(\sqrt{N} + \sqrt{J})\right) \leq (N+J)\exp\left(-\frac{N+J}{4 C_\eta M}\right).$$

Step 5. Asymptotic $O_P$ bound. The quantity $(N+J)\exp\left(-\frac{N+J}{4C_\eta M}\right)$ tends to 0 as $N \to \infty$ (it decays exponentially in $N+J$). Hence, for every $\varepsilon > 0$ there exists an integer $N_0$ (depending on $\varepsilon$ and on the fixed constants $M, \eta, C_\eta$) such that for all $N \geq N_0$ and all $J$ (which may depend on $N$ or be fixed), we have
$$\mathbb{P}\left(\|X\| \geq C_{\mathrm{opt}}(\sqrt{N} + \sqrt{J})\right) < \varepsilon.$$
By the definition of the $O_P(\cdot)$ notation, this is exactly $\|X\| = O_P(\sqrt{N} + \sqrt{J})$, which establishes Equation (B.2).

Step 6. Extension to arbitrary submatrices. Let $S \subseteq [N]$ and $T \subseteq [J]$ be any subsets, and set $W := (R - \mathcal{R})_{S,T}$. For any matrix, the spectral norm of a submatrix never exceeds that of the full matrix, i.e., $\|W\| \leq \|X\|$. Combining this with Equation (B.2) yields $\|W\| = O_P(\sqrt{N} + \sqrt{J})$, which completes the proof.

Appendix B.4. Proof of Lemma 7

Proof. From the proof of Theorem 2 (in particular the analysis leading to the lower bound for $\sigma_1(\tilde{R})$), there exists an event $E_N^{(K_0)}$ with $\mathbb{P}(E_N^{(K_0)}) \to 1$ such that on $E_N^{(K_0)}$,
$$\sigma_1(\tilde{R}) \geq \frac{C_{\mathrm{signal}}\sqrt{J}}{\sqrt{K}\,K_0} - C_{\mathrm{noise}}.$$
Hence, on the same event, we have
$$T_{K_0} = \sigma_1(\tilde{R}) - \left(1 + \sqrt{\frac{J}{N}}\right) \geq \frac{C_{\mathrm{signal}}\sqrt{J}}{\sqrt{K}\,K_0} - C_{\mathrm{noise}} - 1 - \sqrt{\frac{J}{N}}. \tag{B.5}$$
Define $D_{N,K_0} := \frac{\sqrt{J}}{\sqrt{K}\,K_0}$. Then Assumption 6 is exactly $C_{\mathrm{signal}} D_{N,K_0} \geq C_{\mathrm{noise}} + 1 + 3\eta_0$. Consequently, we get
$$D_{N,K_0} \geq d_0 := \frac{C_{\mathrm{noise}} + 1 + 3\eta_0}{C_{\mathrm{signal}}}. \tag{B.6}$$
Now we use the additional condition $K^3 = o(N)$. Since $K_0 \leq K - 1 < K$, we have
$$\frac{\sqrt{K}\,K_0}{\sqrt{N}} \leq \frac{K^{3/2}}{\sqrt{N}} = \left(\frac{K^3}{N}\right)^{1/2} \longrightarrow 0.$$
Thus, there exists an integer $N_1$ (depending only on the constants $\eta_0$ and $d_0$) such that for all $N \geq N_1$, $\frac{K^{3/2}}{\sqrt{N}} \leq \frac{\eta_0}{d_0}$. Consequently, for every $K_0 < K$ and every $N \geq N_1$, we have
$$\frac{\sqrt{K}\,K_0}{\sqrt{N}} \leq \frac{\eta_0}{d_0}. \tag{B.7}$$
Observe that $\sqrt{J/N}$ can be expressed as
$$\sqrt{\frac{J}{N}} = \frac{\sqrt{K}\,K_0}{\sqrt{N}}\,D_{N,K_0}.$$
Combining this with Equation (B.7) yields, for all $N \geq N_1$ and on $E_N^{(K_0)}$,
$$\sqrt{\frac{J}{N}} \leq \frac{\eta_0}{d_0}\,D_{N,K_0}. \tag{B.8}$$
Substituting Equation (B.8) into Equation (B.5) gives
$$T_{K_0} \geq C_{\mathrm{signal}} D_{N,K_0} - C_{\mathrm{noise}} - 1 - \frac{\eta_0}{d_0} D_{N,K_0} = \left(C_{\mathrm{signal}} - \frac{\eta_0}{d_0}\right) D_{N,K_0} - (C_{\mathrm{noise}} + 1). \tag{B.9}$$
We now show that the right-hand side of Equation (B.9) is at least $c_{\mathrm{low}} D_{N,K_0}$. This is equivalent to
$$\left(C_{\mathrm{signal}} - \frac{\eta_0}{d_0} - c_{\mathrm{low}}\right) D_{N,K_0} \geq C_{\mathrm{noise}} + 1.$$
Because $D_{N,K_0} \geq d_0$ by Equation (B.6), it suffices to prove
$$\left(C_{\mathrm{signal}} - \frac{\eta_0}{d_0} - c_{\mathrm{low}}\right) d_0 \geq C_{\mathrm{noise}} + 1. \tag{B.10}$$
Computing the left-hand side of Equation (B.10) gives
$$\left(C_{\mathrm{signal}} - \frac{\eta_0}{d_0} - c_{\mathrm{low}}\right) d_0 = C_{\mathrm{signal}} d_0 - \eta_0 - c_{\mathrm{low}} d_0 = (C_{\mathrm{signal}} - c_{\mathrm{low}}) d_0 - \eta_0.$$
Recall the definition of $c_{\mathrm{low}}$:
$$c_{\mathrm{low}} = \frac{2\eta_0 C_{\mathrm{signal}}}{C_{\mathrm{noise}} + 1 + 3\eta_0}.$$
Using the expressions for $d_0$ and $c_{\mathrm{low}}$, we obtain
$$(C_{\mathrm{signal}} - c_{\mathrm{low}}) d_0 = \left(C_{\mathrm{signal}} - \frac{2\eta_0 C_{\mathrm{signal}}}{C_{\mathrm{noise}} + 1 + 3\eta_0}\right)\frac{C_{\mathrm{noise}} + 1 + 3\eta_0}{C_{\mathrm{signal}}} = C_{\mathrm{noise}} + 1 + 3\eta_0 - 2\eta_0 = C_{\mathrm{noise}} + 1 + \eta_0.$$
Therefore, we have
$$(C_{\mathrm{signal}} - c_{\mathrm{low}}) d_0 - \eta_0 = (C_{\mathrm{noise}} + 1 + \eta_0) - \eta_0 = C_{\mathrm{noise}} + 1,$$
which means that Equation (B.10) holds with equality. Consequently, for all $N \geq N_1$ and on $E_N^{(K_0)}$, we have $T_{K_0} \geq c_{\mathrm{low}} D_{N,K_0}$.

Finally, because $\mathbb{P}(E_N^{(K_0)}) \to 1$, we have
$$\mathbb{P}\left(T_{K_0} \geq \frac{c_{\mathrm{low}}\sqrt{J}}{\sqrt{K}\,K_0}\right) \geq \mathbb{P}\left(E_N^{(K_0)}\right) \longrightarrow 1,$$
which establishes Equation (B.3).
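Remark (symbolic check). The exact cancellation behind Equation (B.10) can be confirmed symbolically. The following sketch (our own check, using sympy) verifies that $\left(C_{\mathrm{signal}} - \eta_0/d_0 - c_{\mathrm{low}}\right)d_0$ simplifies to $C_{\mathrm{noise}} + 1$ under the stated definitions of $c_{\mathrm{low}}$ and $d_0$.

```python
import sympy as sp

# Symbolic verification that Equation (B.10) holds with equality.
C_sig, C_noi, eta0 = sp.symbols("C_signal C_noise eta_0", positive=True)
d0 = (C_noi + 1 + 3 * eta0) / C_sig
c_low = 2 * eta0 * C_sig / (C_noi + 1 + 3 * eta0)

lhs = sp.simplify((C_sig - eta0 / d0 - c_low) * d0)
assert sp.simplify(lhs - (C_noi + 1)) == 0   # (B.10) is an identity
print(lhs)                                    # C_noise + 1
```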
Appendix B.5. Proof of Lemma 8

Proof. Fix an arbitrary candidate number $K_0 \geq 1$. Let $\hat{Z} \in \{0,1\}^{N \times K_0}$ be the estimated classification matrix returned by the classifier $\mathcal{M}$ in Algorithm 1. For each estimated class $k \in [K_0]$, define its index set $\hat{\mathcal{C}}_k = \{i \in [N] : \hat{Z}(i,k) = 1\}$ and its size $n_k = |\hat{\mathcal{C}}_k|$. By construction of the classifier, we may assume $n_k > 0$ for all $k$; otherwise the estimate $\hat{\Theta}$ would be undefined.

The fitted expected response matrix is $\hat{R} = \hat{Z}\hat{\Theta}^\top$, where $\hat{\Theta} \in [0,M]^{J \times K_0}$. Consequently, for any item $j \in [J]$ and any estimated class $k$, the fitted value $\hat{R}(i,j)$ is constant over all individuals in $\hat{\mathcal{C}}_k$. Denote this common value by $\hat{\Theta}_{jk} \in [0,M]$. For each pair $(j,k)$, the quantity
$$\hat{V}_{jk} = \hat{\Theta}_{jk}\left(1 - \frac{\hat{\Theta}_{jk}}{M}\right)$$
is always non-negative. If $\hat{V}_{jk} > 0$, the corresponding entries of the normalized residual matrix $\tilde{R}$ (see Equation (2)) are well-defined and, for $i \in \hat{\mathcal{C}}_k$,
$$\tilde{R}(i,j) = \frac{R(i,j) - \hat{\Theta}_{jk}}{\sqrt{N\hat{V}_{jk}}}.$$
If $\hat{V}_{jk} = 0$, then necessarily $\hat{\Theta}_{jk} \in \{0, M\}$ and every $R(i,j)$ for $i \in \hat{\mathcal{C}}_k$ must equal $\hat{\Theta}_{jk}$ (otherwise the average could not be exactly 0 or $M$). In this case the numerator $\sum_{i \in \hat{\mathcal{C}}_k}(R(i,j) - \hat{\Theta}_{jk})^2$ is zero, and the definition in Equation (2) sets $\tilde{R}(i,j) = 0$ for all $i \in \hat{\mathcal{C}}_k$. Hence, regardless of the value of $\hat{V}_{jk}$, the contribution of the block $(\hat{\mathcal{C}}_k, j)$ to the squared Frobenius norm of $\tilde{R}$ can be bounded uniformly.

We now derive an upper bound for $\|\tilde{R}\|_F^2$. For a fixed block $(j,k)$, assuming first that $\hat{V}_{jk} > 0$, we have
$$\sum_{i \in \hat{\mathcal{C}}_k} \tilde{R}(i,j)^2 = \frac{1}{N\hat{V}_{jk}} \sum_{i \in \hat{\mathcal{C}}_k} \left(R(i,j) - \hat{\Theta}_{jk}\right)^2.$$
Because each $R(i,j)$ takes values in $\{0,1,\ldots,M\}$, we have the elementary inequality $R(i,j)^2 \leq M R(i,j)$. Summing over $i \in \hat{\mathcal{C}}_k$ yields
$$\sum_{i \in \hat{\mathcal{C}}_k} R(i,j)^2 \leq M \sum_{i \in \hat{\mathcal{C}}_k} R(i,j) = M n_k \hat{\Theta}_{jk}.$$
Hence, we get
$$\sum_{i \in \hat{\mathcal{C}}_k} \left(R(i,j) - \hat{\Theta}_{jk}\right)^2 = \sum_{i \in \hat{\mathcal{C}}_k} R(i,j)^2 - n_k\hat{\Theta}_{jk}^2 \leq M n_k \hat{\Theta}_{jk} - n_k \hat{\Theta}_{jk}^2 = n_k \hat{\Theta}_{jk}(M - \hat{\Theta}_{jk}).$$
Now observe that $\hat{\Theta}_{jk}(M - \hat{\Theta}_{jk}) = M\hat{V}_{jk}$. Therefore, we get
$$\sum_{i \in \hat{\mathcal{C}}_k} \left(R(i,j) - \hat{\Theta}_{jk}\right)^2 \leq n_k M \hat{V}_{jk}.$$
Substituting this into the expression for the block contribution gives
$$\sum_{i \in \hat{\mathcal{C}}_k} \tilde{R}(i,j)^2 \leq \frac{1}{N\hat{V}_{jk}} \cdot n_k M \hat{V}_{jk} = \frac{n_k M}{N}.$$
If $\hat{V}_{jk} = 0$, then by construction $\tilde{R}(i,j) = 0$ for all $i \in \hat{\mathcal{C}}_k$, so the left-hand side is zero, and the inequality $\sum_{i \in \hat{\mathcal{C}}_k} \tilde{R}(i,j)^2 \leq \frac{n_k M}{N}$ remains valid because the right-hand side is non-negative. Thus, for every block $(j,k)$, we have
$$\sum_{i \in \hat{\mathcal{C}}_k} \tilde{R}(i,j)^2 \leq \frac{n_k M}{N}.$$
Summing over all items $j = 1, \ldots, J$ and all estimated classes $k = 1, \ldots, K_0$ gives
$$\|\tilde{R}\|_F^2 = \sum_{k=1}^{K_0} \sum_{j=1}^{J} \sum_{i \in \hat{\mathcal{C}}_k} \tilde{R}(i,j)^2 \leq \sum_{k=1}^{K_0} \sum_{j=1}^{J} \frac{n_k M}{N} = \frac{M J}{N} \sum_{k=1}^{K_0} n_k.$$
Since the estimated classes partition the set of individuals, we have $\sum_{k=1}^{K_0} n_k = N$. Consequently, $\|\tilde{R}\|_F^2 \leq \frac{M}{N} \cdot N \cdot J = MJ$, and therefore $\|\tilde{R}\|_F \leq \sqrt{MJ}$. For any matrix, the spectral norm does not exceed the Frobenius norm, i.e., $\sigma_1(\tilde{R}) \leq \|\tilde{R}\|_F$. Hence, we have $\sigma_1(\tilde{R}) \leq \sqrt{MJ}$. Recall from Equation (4) that the test statistic is defined as $T_{K_0} = \sigma_1(\tilde{R}) - \left(1 + \sqrt{J/N}\right)$.
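Remark (numerical illustration). The construction in this proof translates directly into code. The sketch below (our own illustration) computes the normalized residual matrix of Equation (2) and the statistic $T_{K_0}$ of Equation (4), taking $\hat{\Theta}$ as the within-class average as the proof assumes, and confirms the bound $T_{K_0} \leq \sqrt{MJ}$ on toy data; the binomial data generator is an illustrative assumption.

```python
import numpy as np

# Sketch: normalized residuals (Equation (2)) and the statistic T_{K0}
# (Equation (4)), with Theta_hat the within-class average.
def test_statistic(R, z_hat, K0, M):
    N, J = R.shape
    R_tilde = np.zeros((N, J))
    for k in range(K0):
        rows = np.where(z_hat == k)[0]
        theta = R[rows].mean(axis=0)          # Theta_hat(., k), in [0, M]
        V = theta * (1.0 - theta / M)         # V_hat(., k) >= 0
        cols = np.where(V > 0)[0]             # entries with V_hat = 0 stay 0
        R_tilde[np.ix_(rows, cols)] = (
            (R[np.ix_(rows, cols)] - theta[cols]) / np.sqrt(N * V[cols])
        )
    return np.linalg.norm(R_tilde, 2) - (1.0 + np.sqrt(J / N))

rng = np.random.default_rng(3)
N, J, M, K0 = 500, 60, 4, 3
R = rng.binomial(M, rng.uniform(0.2, 0.8, size=(N, J)))  # toy ordinal responses
z_hat = rng.integers(0, K0, size=N)                       # any classification
assert test_statistic(R, z_hat, K0, M) <= np.sqrt(M * J)  # Lemma 8's bound
```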
Dropping the negative term yields the desired bound $T_{K_0} \leq \sqrt{MJ}$. This inequality holds for every possible realization of the data and of the estimation procedure, regardless of whether the candidate model is correctly specified or not. In particular, it holds with probability one for all sample sizes $N$.

Appendix B.6. Proof of Lemma 9

Proof. Recall from Equation (1) that
$$R^*(i,j) = \frac{R(i,j) - \mathcal{R}(i,j)}{\sqrt{N V(i,j)}}.$$
Assumption 1 guarantees $V(i,j) \geq M\delta(1-\delta) > 0$, hence each entry is well defined. Let $Y = \sqrt{N} R^*$, so that $Y$'s $(i,j)$-th entry is $Y_{ij} = \sqrt{N} R^*(i,j)$. Then the entries $\{Y_{ij}\}$ are independent (because the $R(i,j)$ are independent given the latent structure, and the latent structure is fixed) and satisfy
$$\mathbb{E}[Y_{ij}] = 0, \qquad \mathrm{Var}(Y_{ij}) = 1, \qquad |Y_{ij}| \leq C, \quad \text{where } C := \sqrt{\frac{M}{\delta(1-\delta)}} < \infty.$$

We first establish an almost sure lower bound. Consider the first column $Y(\cdot,1) = (Y_{11}, \ldots, Y_{N1})^\top$. The random variables $\{Y_{i1}^2\}_{i=1}^N$ are independent, bounded by $C^2$, and satisfy $\mathbb{E}[Y_{i1}^2] = 1$ (i.e., $Y_{i1}$ is standardized). For any $\varepsilon > 0$, Hoeffding's inequality yields
$$\mathbb{P}\left(\left|\frac{1}{N}\sum_{i=1}^N Y_{i1}^2 - 1\right| > \varepsilon\right) \leq 2\exp\left(-\frac{2N\varepsilon^2}{C^4}\right). \tag{B.11}$$
The right-hand side of Equation (B.11) is summable over $N$ because $\sum_{N=1}^\infty e^{-cN} < \infty$ for any $c > 0$. By the Borel–Cantelli lemma, for every fixed $\varepsilon > 0$, we have
$$\mathbb{P}\left(\left|\frac{1}{N}\sum_{i=1}^N Y_{i1}^2 - 1\right| > \varepsilon \text{ infinitely often}\right) = 0.$$
Taking a countable sequence $\varepsilon_m = 1/m$ ($m = 1, 2, \ldots$) and intersecting the corresponding almost sure events, we obtain, almost surely,
$$\lim_{N \to \infty} \frac{1}{N}\sum_{i=1}^N Y_{i1}^2 = 1.$$
Hence, we get
$$\frac{\|Y(\cdot,1)\|_F}{\sqrt{N}} = \left(\frac{1}{N}\sum_{i=1}^N Y_{i1}^2\right)^{1/2} \xrightarrow{a.s.} 1.$$
Since $\|Y\| \geq \|Y(\cdot,1)\|_F$, we have
$$\liminf_{N \to \infty} \frac{\|Y\|}{\sqrt{N}} \geq 1 \quad \text{almost surely.} \tag{B.12}$$

Next we derive an almost sure upper bound. For $Y$, we have
$$\tilde{\tilde{\sigma}}_1 := \max_{i \in [N]} \sqrt{\sum_{j=1}^J \mathbb{E}[Y_{ij}^2]} = \sqrt{J}, \qquad \tilde{\tilde{\sigma}}_2 := \max_{j \in [J]} \sqrt{\sum_{i=1}^N \mathbb{E}[Y_{ij}^2]} = \sqrt{N}, \qquad \tilde{\tilde{\sigma}}_* := \max_{i,j} \|Y_{ij}\|_\infty \leq C.$$
Fix any $\eta \in (0, 1/2]$. Lemma 3 guarantees the existence of a constant $C_\eta > 0$ such that for every $t \geq 0$,
$$\mathbb{P}\left(\|Y\| \geq (1+\eta)(\tilde{\tilde{\sigma}}_1 + \tilde{\tilde{\sigma}}_2) + t\right) \leq (N+J)\exp\left(-\frac{t^2}{C_\eta \tilde{\tilde{\sigma}}_*^2}\right). \tag{B.13}$$
Now fix an arbitrary $\tilde{\delta} \in (0,1]$ and choose $\eta = \tilde{\delta}/2$. Because $J = o(N)$, there exists $N_0$ such that for all $N \geq N_0$,
$$\sqrt{J} \leq \frac{\tilde{\delta}}{4(1+\tilde{\delta}/2)}\sqrt{N}.$$
Consequently, we get $(1+\eta)\sqrt{J} = (1+\tilde{\delta}/2)\sqrt{J} \leq \frac{\tilde{\delta}}{4}\sqrt{N}$. Take $t = \frac{\tilde{\delta}}{4}\sqrt{N}$ in Equation (B.13). Then for all $N \geq N_0$, we have
$$(1+\eta)(\sqrt{N} + \sqrt{J}) + t \leq \left(1 + \frac{\tilde{\delta}}{2}\right)\sqrt{N} + \frac{\tilde{\delta}}{4}\sqrt{N} + \frac{\tilde{\delta}}{4}\sqrt{N} = (1+\tilde{\delta})\sqrt{N}.$$
Hence, the event $\{\|Y\| \geq (1+\tilde{\delta})\sqrt{N}\}$ is contained in the event $\{\|Y\| \geq (1+\eta)(\sqrt{N}+\sqrt{J}) + t\}$, and by Equation (B.13), we have
$$\mathbb{P}\left(\|Y\| \geq (1+\tilde{\delta})\sqrt{N}\right) \leq (N+J)\exp\left(-\frac{t^2}{C_\eta C^2}\right) = (N+J)\exp\left(-\frac{\tilde{\delta}^2 N}{16 C_\eta C^2}\right). \tag{B.14}$$
The right-hand side of Equation (B.14) is summable over $N$ because it decays exponentially in $N$. By the Borel–Cantelli lemma, the event $\{\|Y\| \geq (1+\tilde{\delta})\sqrt{N}\}$ occurs only finitely many times almost surely. Therefore, we get
$$\limsup_{N \to \infty} \frac{\|Y\|}{\sqrt{N}} \leq 1 + \tilde{\delta} \quad \text{almost surely.}$$
Since $\tilde{\delta} > 0$ was arbitrary, we can take a countable sequence $\tilde{\delta}_m \downarrow 0$ (e.g., $\tilde{\delta}_m = 1/m$) and intersect the corresponding almost sure events to obtain
$$\limsup_{N \to \infty} \frac{\|Y\|}{\sqrt{N}} \leq 1 \quad \text{almost surely.} \tag{B.15}$$
Combining Equations (B.12) and (B.15) yields $\frac{\|Y\|}{\sqrt{N}} \xrightarrow{a.s.} 1$. Because $R^* = Y/\sqrt{N}$, we have $\|R^*\| = \|Y\|/\sqrt{N} \xrightarrow{a.s.} 1$. Almost sure convergence implies convergence in probability, so for any $\varepsilon > 0$,
$$\lim_{N \to \infty} \mathbb{P}\left(\|R^*\| \geq 1 - \varepsilon\right) = 1,$$
and in particular $\sigma_1(R^*) = \|R^*\| = 1 + o_P(1)$. This completes the proof.
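Remark (numerical illustration). Lemma 9's conclusion is easy to visualize. The following sketch (our own illustration; Rademacher entries are one convenient choice of standardized bounded noise) shows $\|Y\|/\sqrt{N}$ approaching 1 as $N$ grows with $J = o(N)$.

```python
import numpy as np

# Illustration of Lemma 9: for standardized bounded independent entries and
# J = o(N), the scaled spectral norm ||Y|| / sqrt(N) approaches 1.
rng = np.random.default_rng(4)
for N in [200, 2000, 20000]:
    J = int(N ** 0.5)                             # J = o(N), e.g. J = sqrt(N)
    Y = rng.choice([-1.0, 1.0], size=(N, J))      # mean 0, variance 1, bounded
    print(N, np.linalg.norm(Y, 2) / np.sqrt(N))   # ratio decreases toward 1
```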
Appendix C. SC-LCM algorithm and its consistency

Here, we introduce SC-LCM, a simple spectral clustering algorithm for estimating the latent class membership matrix $Z$ under the latent class model for ordinal categorical data with polytomous responses. The algorithm takes the top $K$ left singular vectors of $R$ and applies $k$-means clustering to their rows. Under mild conditions, we prove that the procedure consistently recovers the true latent classes even as $N$, $J$, and $K$ all diverge (i.e., the large-scale setting).

Our analysis begins with an oracle case where the population parameters are known. Let $\mathcal{R} = \mathbb{E}[R] = Z\Theta^\top$ be the $N \times J$ expected response matrix. To simplify our theoretical analysis, we let the rank of $\Theta$ be $K$; thus $\mathcal{R}$ has a low-rank structure with rank $K$. Since $\mathcal{R}$ has rank $K$, consider its compact singular value decomposition $\mathcal{R} = U\Sigma V^\top$, where $U \in \mathbb{R}^{N \times K}$ satisfies $U^\top U = I_K$, $V \in \mathbb{R}^{J \times K}$ satisfies $V^\top V = I_K$, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_K)$ with $\sigma_1 \geq \cdots \geq \sigma_K > 0$. The following lemma shows that $U$ has exactly $K$ distinct rows and that these rows are perfectly aligned with the latent class memberships.

Lemma 10. Under the latent class model, the left singular vectors $U$ of $\mathcal{R}$ satisfy that all rows belonging to the same latent class are identical, and for any two distinct classes $k, l$,
$$\|U(i,:) - U(j,:)\|_F = \sqrt{\frac{1}{N_k} + \frac{1}{N_l}}, \qquad i \in \mathcal{C}_k,\ j \in \mathcal{C}_l.$$

By Lemma 10, applying $k$-means with $K$ clusters to the rows of $U$ recovers the classification matrix $Z$ exactly, up to a permutation of labels. In practice we only observe $R$, not $\mathcal{R}$. Let
$$R = \hat{U}\hat{\Sigma}\hat{V}^\top + \hat{U}_\perp\hat{\Sigma}_\perp\hat{V}_\perp^\top$$
be the full SVD of $R$, where $\hat{U} \in \mathbb{R}^{N \times K}$ contains the left singular vectors corresponding to the $K$ largest singular values. The practical SC-LCM algorithm is summarized in Algorithm 3.

Algorithm 3 Spectral Clustering for Latent Class Models (SC-LCM)
Require: Observed response matrix $R \in \{0,1,\ldots,M\}^{N \times J}$, number of latent classes $K$
Ensure: Estimated classification matrix $\hat{Z}$
1: Compute the top $K$ left singular vectors $\hat{U} \in \mathbb{R}^{N \times K}$ of $R$.
2: Apply the $k$-means algorithm to the rows of $\hat{U}$ with $K$ clusters to obtain $\hat{Z}$.

The intuition behind SC-LCM is that $\hat{U}$ is a perturbed version of $U$ (up to an orthogonal rotation), and by Lemma 10 the rows of $U$ are perfectly separable. Hence, under a sufficiently small perturbation, $k$-means on $\hat{U}$ will recover the true classes with high accuracy.
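A minimal implementation of Algorithm 3 might look as follows (our own sketch, assuming numpy and scikit-learn; the toy data generator draws $R(i,j) \sim \mathrm{Binomial}(M, \Theta(j,z_i)/M)$, which is an illustrative choice consistent with $\mathbb{E}[R] = Z\Theta^\top$).

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of Algorithm 3 (SC-LCM): top-K left singular vectors of R,
# then k-means on their rows.
def sc_lcm(R, K, seed=0):
    U, _, _ = np.linalg.svd(R, full_matrices=False)
    U_K = U[:, :K]                                    # hat(U): N x K
    labels = KMeans(n_clusters=K, n_init=10,
                    random_state=seed).fit_predict(U_K)
    Z_hat = np.zeros((R.shape[0], K), dtype=int)      # one-hot membership matrix
    Z_hat[np.arange(R.shape[0]), labels] = 1
    return Z_hat

# Toy usage: responses from a 2-class latent class model.
rng = np.random.default_rng(5)
M, N, J, K = 4, 300, 40, 2
Theta = rng.uniform(0.5, 3.5, size=(J, K))            # item parameters in [0, M]
z = rng.integers(0, K, size=N)                        # true class labels
R = rng.binomial(M, Theta[:, z].T / M)                # E[R(i,j)] = Theta(j, z_i)
Z_hat = sc_lcm(R, K)
```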
To establish the consistency of SC-LCM, we introduce a parameter that governs both the signal strength and the sparsity of the data. Define
$$\rho := \max_{j \in [J], k \in [K]} \Theta(j,k),$$
which is the maximum expected response across all items and latent classes. This quantity directly affects the scale of the entries of the observed matrix $R$: a larger $\rho$ leads to larger expected responses and thus a denser observation matrix, whereas a smaller $\rho$ pushes the expected responses toward zero, making the data sparser. In our asymptotic regime we allow $\rho \to 0$, which corresponds to a sparse setting where most entries of $R$ are zero. The following assumption controls the speed of this decay.

Assumption 7 (Sparsity scaling). $\rho \max(N, J) \geq M^2 \log(N+J)$.

We also define the scaled item parameter matrix $\Theta_0 := \Theta/\rho$. By construction, every entry of $\Theta_0$ lies in $[0,1]$. Let $\sigma_K(\Theta_0)$ denote the $K$-th largest singular value of $\Theta_0$. To measure the difference between $Z$ and $\hat{Z}$, we consider the clustering error used in [15]. This metric is defined as
$$\mathrm{err}(\hat{Z}, Z) = \min_{\pi \in S_K} \max_{k \in [K]} \frac{|\mathcal{C}_k \cap \hat{\mathcal{C}}_{\pi(k)}^c| + |\hat{\mathcal{C}}_{\pi(k)} \cap \mathcal{C}_k^c|}{N_k},$$
where $\hat{\mathcal{C}}_k = \{i : \hat{Z}(i,k) = 1\}$ and $S_K$ is the set of permutations of $\{1, \ldots, K\}$. The following theorem guarantees SC-LCM's estimation consistency under the LCM.

Theorem 6 (Consistency of SC-LCM). Under Assumption 7, the estimator $\hat{Z}$ produced by Algorithm 3 satisfies, with probability tending to 1 as $N \to \infty$,
$$\mathrm{err}(\hat{Z}, Z) = O\left(\frac{K^2 N_{\max} \max(N,J)\log(N+J)}{N_{\min}^2\,\rho\,\sigma_K^2(\Theta_0)}\right).$$

When $N_{\min} \asymp N/K$, $N_{\max} \asymp N/K$, and $J = o(N)$, the bound in Theorem 6 reduces to
$$\mathrm{err}(\hat{Z}, Z) = O_P\left(\frac{K^3 \log N}{\rho\,\sigma_K^2(\Theta_0)}\right).$$
When $\rho\,\sigma_K^2(\Theta_0) \gg K^3\log N$, we have $\mathrm{err}(\hat{Z}, Z) \xrightarrow{P} 0$, which shows SC-LCM's estimation consistency.
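Remark (numerical illustration). The clustering error above minimizes over label permutations the worst per-class misassignment fraction. A direct sketch of this metric (our own illustration; the brute-force loop over permutations is feasible only for small $K$) is as follows.

```python
import numpy as np
from itertools import permutations

# Sketch of the clustering error of [15] used in Theorem 6.
def clustering_error(z_hat, z_true, K):
    N_k = np.array([(z_true == k).sum() for k in range(K)])
    best = np.inf
    for perm in permutations(range(K)):          # brute force; small K only
        worst = 0.0
        for k in range(K):
            in_Ck = z_true == k
            in_Chat = z_hat == perm[k]
            miss = np.sum(in_Ck & ~in_Chat) + np.sum(in_Chat & ~in_Ck)
            worst = max(worst, miss / N_k[k])
        best = min(best, worst)
    return best

z_true = np.array([0, 0, 1, 1, 2, 2])
z_hat = np.array([2, 2, 0, 0, 1, 1])          # a pure relabeling of z_true
print(clustering_error(z_hat, z_true, K=3))   # 0.0: perfect up to permutation
```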
Appendix C.1. Proof of Lemma 10

Proof. From the compact singular value decomposition $\mathcal{R} = U\Sigma V^\top$ and the factorization $\mathcal{R} = Z\Theta^\top$, we obtain
$$U = \mathcal{R} V \Sigma^{-1} = Z\left(\Theta^\top V \Sigma^{-1}\right).$$
Setting $X_U = \Theta^\top V \Sigma^{-1} \in \mathbb{R}^{K \times K}$, we have $U = Z X_U$ and $U^\top U = I_K$. This structure, a membership matrix $Z$ post-multiplied by a square matrix $X_U$ with orthonormal columns, is exactly the one considered in Lemma 2.1 of [18] for the eigenvectors of the stochastic block model's mean matrix. Applying that lemma directly yields the desired properties of the rows of $U$ and the distance formula.

Appendix C.2. Proof of Theorem 6

Proof. Set $W := R - \mathcal{R}$. For each pair $(i,j)$ define $W_{ij} = W(i,j)\, e_i \tilde{e}_j^\top$, where $\{e_i\}$ and $\{\tilde{e}_j\}$ are the standard basis vectors in $\mathbb{R}^N$ and $\mathbb{R}^J$. The matrices $\{W_{ij}\}$ are independent, centred, and satisfy $\|W_{ij}\| \leq M$. Moreover,
$$\mathbb{E}[W(i,j)^2] = \mathrm{Var}(R(i,j)) = \mathcal{R}(i,j)\left(1 - \frac{\mathcal{R}(i,j)}{M}\right) \leq \mathcal{R}(i,j) \leq \rho.$$
Hence, we have
$$\left\|\sum_{i,j}\mathbb{E}[W_{ij}W_{ij}^\top]\right\| \leq \rho J, \qquad \left\|\sum_{i,j}\mathbb{E}[W_{ij}^\top W_{ij}]\right\| \leq \rho N.$$
Under Assumption 7, applying the matrix Bernstein inequality [25] with a sufficiently large constant $C_3$ yields
$$\mathbb{P}\left(\|W\| \geq C_3\sqrt{\rho\max(N,J)\log(N+J)}\right) \leq (N+J)^{-2} \xrightarrow{N \to \infty} 0,$$
which gives, with probability tending to 1,
$$\|W\| \leq C_3\sqrt{\rho\max(N,J)\log(N+J)}. \tag{C.1}$$
From $\mathcal{R} = Z\Theta^\top$ and $\Theta = \rho\Theta_0$, we have
$$\sigma_K(\mathcal{R}) \geq \sqrt{N_{\min}}\,\rho\,\sigma_K(\Theta_0). \tag{C.2}$$
The Davis–Kahan $\sin\Theta$ theorem [31] guarantees the existence of an orthogonal matrix $O \in \mathbb{R}^{K \times K}$ such that
$$\|\hat{U}O - U\|_F \leq \frac{2\sqrt{2K}\,\|R - \mathcal{R}\|}{\sigma_K(\mathcal{R})}.$$
Inserting Equations (C.1) and (C.2) gives
$$\|\hat{U}O - U\|_F \leq C_4\sqrt{\frac{K\max(N,J)\log(N+J)}{N_{\min}\,\rho\,\sigma_K^2(\Theta_0)}}, \tag{C.3}$$
where $C_4 := 2\sqrt{2}C_3$.

Lemma 10 shows that the rows of $U$ are constant within each true latent class: for $i \in \mathcal{C}_k$, $U(i,:) = u_k$ for some vector $u_k \in \mathbb{R}^K$. Moreover, for any two distinct classes $k, l$, Lemma 10 gives $\|u_k - u_l\|_F = \sqrt{\frac{1}{N_k} + \frac{1}{N_l}}$. Define
$$d_{kl} := \sqrt{\frac{1}{N_k} + \frac{1}{N_l}}, \qquad \Delta_{kl} := \frac{1}{\sqrt{N_k}} + \frac{1}{\sqrt{N_l}}.$$
A simple inequality gives $\Delta_{kl} \leq \sqrt{2}\,d_{kl}$ for all $k, l$. Set
$$\varsigma := \sqrt{\frac{2K N_{\max}}{N_{\min}}}\,\|\hat{U}O - U\|_F.$$
For any distinct classes $k, l$, we have
$$\frac{\sqrt{K}}{\varsigma}\,\|\hat{U}O - U\|_F\,\Delta_{kl} = \frac{\sqrt{K}}{\sqrt{\frac{2KN_{\max}}{N_{\min}}}\,\|\hat{U}O - U\|_F}\,\|\hat{U}O - U\|_F\,\Delta_{kl} = \frac{\sqrt{N_{\min}}}{\sqrt{2N_{\max}}}\,\Delta_{kl} \leq \frac{\sqrt{N_{\min}}}{\sqrt{2N_{\max}}}\,\sqrt{2}\,d_{kl} = \frac{\sqrt{N_{\min}}}{\sqrt{N_{\max}}}\,d_{kl} \leq d_{kl}.$$
Hence, by Lemma 2 of [15], we obtain $\mathrm{err}(\hat{Z}, Z) = O(\varsigma^2)$. Substituting the definition of $\varsigma$ and Equation (C.3) gives
$$\mathrm{err}(\hat{Z}, Z) = O\left(\frac{K N_{\max}}{N_{\min}} \cdot \frac{K\max(N,J)\log(N+J)}{N_{\min}\,\rho\,\sigma_K^2(\Theta_0)}\right) = O\left(\frac{K^2 N_{\max}\max(N,J)\log(N+J)}{N_{\min}^2\,\rho\,\sigma_K^2(\Theta_0)}\right).$$

References

[1] Agresti, A., 2012. Categorical data analysis. volume 792. John Wiley & Sons.
[2] Akaike, H., 1998. Information theory and an extension of the maximum likelihood principle, in: Selected papers of Hirotugu Akaike. Springer, pp. 199–213.
[3] Bandeira, A.S., van Handel, R., 2016. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Annals of Probability 44, 2479–2506.
[4] Bickel, P.J., Sarkar, P., 2016. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society Series B: Statistical Methodology 78, 253–273.
[5] Biernacki, C., Celeux, G., Govaert, G., 2003. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis 41, 561–575.
[6] Chen, Y., Li, X., Liu, J., Ying, Z., 2017. Regularized latent class analysis with application in cognitive diagnosis. Psychometrika 82, 660–692.
[7] Collins, L.M., Lanza, S.T., 2013. Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. John Wiley & Sons.
[8] Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39, 1–22.
[9] Goodman, L.A., 1974. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61, 215–231.
[10] Gu, Y., Dunson, D.B., 2023. Bayesian pyramids: Identifiable multilayer discrete latent structure models for discrete data. Journal of the Royal Statistical Society Series B: Statistical Methodology 85, 399–426.
[11] Gu, Y., Xu, G., 2020. Partial identifiability of restricted latent class models. Annals of Statistics 48, 2082–2107.
[12] Hagenaars, J.A., McCutcheon, A.L., 2002. Applied latent class analysis. Cambridge University Press.
[13] Hu, J., Zhang, J., Qin, H., Yan, T., Zhu, J., 2021. Using maximum entry-wise deviation to test the goodness of fit for stochastic block models. Journal of the American Statistical Association 116, 1373–1382.
[14] Jin, J., Ke, Z.T., Luo, S., Wang, M., 2023. Optimal estimation of the number of network communities. Journal of the American Statistical Association 118, 2101–2116.
[15] Joseph, A., Yu, B., 2016. Impact of regularization on spectral clustering. Annals of Statistics 44, 1765–1791.
[16] Keribin, C., 2000. Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A, 49–66.
[17] Lei, J., 2016. A goodness-of-fit test for stochastic block models. Annals of Statistics 44, 401–424.
[18] Lei, J., Rinaldo, A., 2015. Consistency of spectral clustering in stochastic block models. Annals of Statistics 43, 215–237.
[19] Lyu, Z., Chen, L., Gu, Y., 2025. Degree-heterogeneous latent class analysis for high-dimensional discrete data. Journal of the American Statistical Association 120, 2435–2448.
[20] Lyu, Z., Gu, Y., 2025. Spectral clustering with likelihood refinement is optimal for latent class recovery. arXiv preprint arXiv:2506.07167.
[21] Qing, H., 2024a. Finding mixed memberships in categorical data. Information Sciences 676, 120785.
[22] Qing, H., 2024b. Grade of membership analysis for multi-layer ordinal categorical data. Statistica Sinica 38.
[23] Qing, H., 2025. Mixed membership estimation for categorical data with weighted responses. TEST 34, 612–659.
[24] Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics, 461–464.
[25] Tropp, J.A., 2012. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12, 389–434.
[26] Von Davier, M., 2008. A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology 61, 287–307.
[27] Wang, M., Hanges, P.J., 2011. Latent class procedures: Applications to organizational research. Organizational Research Methods 14, 24–31.
[28] Woodbury, M.A., Clive, J., Garson Jr, A., 1978. Mathematical typology: a grade of membership technique for obtaining disease definition. Computers and Biomedical Research 11, 277–298.
[29] Wu, Q., Hu, J., 2024. A spectral based goodness-of-fit test for stochastic block models. Statistics & Probability Letters 209, 110104.
[30] Xu, G., Shang, Z., 2018. Identifying latent structures in restricted latent class models. Journal of the American Statistical Association 113, 1284–1295.
[31] Yu, Y., Wang, T., Samworth, R.J., 2015. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika 102, 315–323.
[32] Zeng, Z., Gu, Y., Xu, G., 2023. A tensor-EM method for large-scale latent class analysis with binary responses. Psychometrika 88, 580–612.
[33] Zhang, A.Y., Zhou, H.Y., 2024. Leave-one-out singular subspace perturbation analysis for spectral clustering. Annals of Statistics 52, 2004–2033.