Necessary and Sufficient Conditions and a Provably Efficient Algorithm for Separable Topic Discovery


Authors: Weicong Ding, Prakash Ishwar (IEEE Senior Member), Venkatesh Saligrama (IEEE Senior Member)

Abstract—We develop necessary and sufficient conditions and a novel provably consistent and efficient algorithm for discovering topics (latent factors) from observations (documents) that are realized from a probabilistic mixture of shared latent factors that have certain properties. Our focus is on the class of topic models in which each shared latent factor contains a novel word that is unique to that factor, a property that has come to be known as separability. Our algorithm is based on the key insight that the novel words correspond to the extreme points of the convex hull formed by the row-vectors of a suitably normalized word co-occurrence matrix. We leverage this geometric insight to establish polynomial computation and sample complexity bounds based on a few isotropic random projections of the rows of the normalized word co-occurrence matrix. Our proposed random-projections-based algorithm is naturally amenable to an efficient distributed implementation and is attractive for modern web-scale distributed data mining applications.

Index Terms—Topic Modeling, Separability, Random Projection, Solid Angle, Necessary and Sufficient Conditions.

I. INTRODUCTION

Topic modeling refers to a family of generative models and associated algorithms for discovering the (latent) topical structure shared by a large corpus of documents. They are important for organizing, searching, and making sense of a large text corpus [1]. In this paper we describe a novel geometric approach, with provable statistical and computational efficiency guarantees, for learning the latent topics in a document collection.
This work is a culmination of a series of recent publications on certain structure-leveraging methods for topic modeling with provable theoretical guarantees [2]–[5]. We consider a corpus of M documents, indexed by m = 1, ..., M, each composed of words from a fixed vocabulary of size W. The distinct words in the vocabulary are indexed by w = 1, ..., W. Each document m is viewed as an unordered "bag of words" and is represented by an empirical W × 1 word-counts vector X_m, where X_{w,m} is the number of times that word w appears in document m [1], [5]–[7]. The entire document corpus is then represented by the W × M matrix X = [X_1, ..., X_M].¹ A "topic" is a W × 1 distribution over the vocabulary. A topic model posits the existence of K < min(W, M) latent topics that are shared among all M documents in the corpus. The topics can be collectively represented by the K columns β_1, ..., β_K of a W × K column-stochastic "topic matrix" β. Each document m is conceptually modeled as being generated independently of all other documents through a two-step process: 1) first draw a K × 1 document-specific distribution over topics θ_m from a prior distribution Pr(α) on the probability simplex with some hyper-parameters α; 2) then draw N iid words according to a W × 1 document-specific word distribution over the vocabulary given by A_m = Σ_{k=1}^K β_k θ_{k,m}, which is a convex combination (probabilistic mixture) of the latent topics. Our goal is to estimate β from the matrix of empirical observations X. To appreciate the difficulty of the problem, consider a typical benchmark dataset such as a news article collection from the New York Times (NYT) [8] that we use in our experiments.

¹When it is clear from the context, we will use X_{w,m} to represent either the empirical word-count or, by suitable column-normalization of X, the empirical word-frequency.
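The two-step generative process above is easy to simulate. The sketch below uses toy dimensions and a symmetric Dirichlet prior for Pr(α), both illustrative assumptions rather than choices fixed by the paper, and produces a W × M count matrix X from a column-stochastic β:

```python
import numpy as np

rng = np.random.default_rng(0)
W, K, M, N = 8, 3, 100, 50  # vocabulary, topics, documents, words per document (toy sizes)

# Column-stochastic topic matrix beta: each column is a distribution over words.
beta = rng.dirichlet(np.ones(W), size=K).T          # W x K

X = np.zeros((W, M))
for m in range(M):
    theta_m = rng.dirichlet(np.ones(K))             # 1) topic mixture for document m
    A_m = beta @ theta_m                            # document's word distribution
    words = rng.choice(W, size=N, p=A_m)            # 2) draw N iid words
    X[:, m] = np.bincount(words, minlength=W)       # bag-of-words counts X_m

assert np.allclose(X.sum(axis=0), N)                # each column counts exactly N words
```

Each column of X is one document's word-count vector X_m; estimating β from X alone is the problem studied in this paper.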
In this dataset, after suitable pre-processing, W = 14,943, M = 300,000, and, on average, N = 298. Thus, N ≪ W ≪ M, X is very sparse, and M is very large. Typically, K ≈ 100 ≪ min(W, M). This estimation problem in topic modeling has been extensively studied. The prevailing approach is to compute the MAP/ML estimate [1]. The true posterior of β given X, however, is intractable to compute, and the associated MAP and ML estimation problems are in fact NP-hard in the general case [9], [10]. This necessitates the use of sub-optimal methods based on approximations and heuristics such as Variational-Bayes and MCMC [6], [11]–[13]. While they produce impressive empirical results on many real-world datasets, guarantees of asymptotic consistency or efficiency for these approaches are either weak or non-existent. This makes it difficult to evaluate model fidelity: failure to produce satisfactory results in new datasets could be due to the use of approximations and heuristics or due to model mis-specification, which is more fundamental. Furthermore, these sub-optimal approaches are computationally intensive for large text corpora [5], [7]. To overcome the hardness of the topic estimation problem in its full generality, a new approach has emerged to learn the topic model by imposing additional structure on the model parameters [3], [5], [7], [9], [14], [15]. This paper focuses on a key structural property of the topic matrix β called topic separability [3], [5], [7], [15], wherein every latent topic contains at least one word that is novel to it, i.e., the word is unique to that topic and is absent from the other topics. This is, in essence, a property of the support of the latent topic matrix β.
The topic separability property can be motivated by the fact that for many real-world datasets, the empirical topic estimates produced by popular Variational-Bayes and Gibbs Sampling approaches are approximately separable [5], [7]. Moreover, it has recently been shown that the separability property will be approximately satisfied with high probability when the dimension of the vocabulary W scales sufficiently faster than the number of topics K and β is a realization of a Dirichlet prior that is typically used in practice [16]. Therefore, separability is a natural approximation for most high-dimensional topic models. Our approach exploits the following geometric implication of the key separability structure. If we associate each word in the vocabulary with a row-vector of a suitably normalized empirical word co-occurrence matrix, the set of novel words corresponds to the extreme points of the convex hull formed by the row-vectors of all words. We leverage this geometric insight and develop a provably consistent and efficient algorithm. Informally speaking, we establish the following result:

Theorem 1. If the topic matrix is separable and the mixing weights satisfy a minimum information-theoretically necessary technical condition, then our proposed algorithm runs in time polynomial in M, W, N, K, and estimates the topic matrix consistently as M → ∞ with N ≥ 2 held fixed. Moreover, our proposed algorithm can estimate β to within an ε element-wise error with probability at least 1 − δ if M ≥ Poly(W, 1/N, K, log(1/δ), 1/ε).

The asymptotic setting M → ∞ with N held fixed is motivated by text corpora in which the number of words in a single document is small while the number of documents is large.
We note that our algorithm can be applied to any family of topic models whose topic mixing weights prior Pr(α) satisfies a minimum information-theoretically necessary technical condition. In contrast, standard Bayesian approaches such as Variational-Bayes or MCMC need to be hand-designed separately for each specific topic mixing weights prior. The highlight of our approach is to identify the novel words as extreme points through appropriately defined random projections. Specifically, we project the row-vector of each word in an appropriately normalized word co-occurrence matrix along a few independent and isotropically distributed random directions. The fraction of times that a word attains the maximum value along a random direction is a measure of its degree of robustness as an extreme point. This process of random projections followed by counting the number of times a word is a maximizer can be computed efficiently and is robust to the perturbations induced by the sampling noise associated with having only a very small number of words N per document. In addition to being computationally efficient, this random-projections-based approach (1) requires the minimum information-theoretically necessary technical conditions on the topic prior for asymptotic consistency, and (2) can be naturally parallelized and distributed. As a consequence, it can provably achieve the efficiency guarantees of a centralized method while requiring insignificant communication between distributed document collections [5]. This is attractive for web-scale topic modeling of large distributed text corpora. Another advance of this paper is the identification of necessary and sufficient conditions on the mixing weights for consistent separable topic estimation.
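The projection-and-count step described above can be sketched as follows; the point set and the number of directions are illustrative assumptions, not the paper's actual word co-occurrence data. Only extreme points of the convex hull can attain the maximum projection along a direction, so interior rows accumulate a zero score:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows: 3 extreme points (stand-ins for novel words) and 2 interior points (non-novel).
rows = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.4, 0.4, 0.2],   # convex combination of the vertices: never a maximizer
    [0.2, 0.3, 0.5],
])

P = 500                                  # number of random directions
d = rng.standard_normal((3, P))          # isotropic random directions (columns)
proj = rows @ d                          # (5, P) projection values
winners = np.argmax(proj, axis=0)        # row attaining the maximum per direction

# Fraction of directions on which each row wins: its robustness score.
score = np.bincount(winners, minlength=len(rows)) / P
extreme = np.nonzero(score > 0)[0]
print(extreme)                           # -> [0 1 2]
```

Note that the per-direction maximization is embarrassingly parallel across directions and across document shards, which is the basis of the distributed implementation mentioned above.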
In previous work we showed that a simplicial condition on the mixing weights is both necessary and sufficient for consistently detecting all the novel words [4]. In this paper we complete the characterization by showing that an affine independence condition on the mixing weights is necessary and sufficient for consistently estimating a separable topic matrix. These conditions are satisfied by practical choices of topic priors such as the Dirichlet distribution [6]. All these necessary conditions are information-theoretic and algorithm-independent, i.e., they hold irrespective of the specific statistics of the observations or the algorithms that are used. The provable statistical and computational efficiency guarantees of our proposed algorithm hold true under these necessary and sufficient conditions.

The rest of this paper is organized as follows. We review related work on topic modeling as well as the separability property in various domains in Sec. II. We introduce the separability property on β, the simplicial and affine independence conditions on mixing weights, and the extreme point geometry that motivates our approach in Sec. III. We then discuss how the solid angle can be used to identify robust extreme points to deal with a finite number of samples (words per document) in Sec. IV. We describe our overall algorithm and sketch its analysis in Sec. V. We demonstrate the performance of our approach in Sec. VI on various synthetic and real-world examples. Proofs of all results appear in the appendices.

II. RELATED WORK

The idea of modeling text documents as mixtures of a few semantic topics was first proposed in [17], where the mixing weights were assumed to be deterministic. Latent Dirichlet Allocation (LDA) in the seminal work of [6] extended this to a probabilistic setting by modeling topic mixing weights using Dirichlet priors.
This setting has been further extended to include other topic priors such as the log-normal prior in the Correlated Topic Model [18]. LDA models and their derivatives have been successful on a wide range of problems in terms of achieving good empirical performance [1], [13]. The prevailing approaches for estimation and inference problems in topic modeling are based on MAP or ML estimation [1]. However, the computation of posterior distributions conditioned on observations X is intractable [6]. Moreover, the MAP estimation objective is non-convex and has been shown to be NP-hard [9], [10]. Therefore various approximation and heuristic strategies have been employed. These approaches fall into two major categories: sampling approaches and optimization approaches. Most sampling approaches are based on Markov Chain Monte Carlo (MCMC) algorithms that seek to generate (approximately) independent samples from a Markov Chain that is carefully designed to ensure that the sample distribution converges to the true posterior [11], [19]. Optimization approaches are typically based on the so-called Variational-Bayes methods. These methods optimize the parameters of a simpler parametric distribution so that it is close to the true posterior in terms of KL divergence [6], [12]. Expectation-Maximization-type algorithms are typically used in these methods. In practice, while both Variational-Bayes and MCMC algorithms have similar performance, Variational-Bayes is typically faster than MCMC [1], [20]. Nonnegative Matrix Factorization (NMF) is an alternative approach for topic estimation.
NMF-based methods exploit the fact that both the topic matrix β and the mixing weights are nonnegative and attempt to decompose the empirical observation matrix X into a product of a nonnegative topic matrix β and the matrix of mixing weights by minimizing a cost function of the form [20]–[23]

  Σ_{m=1}^M d(X_m, βθ_m) + λ ψ(β, θ_1, ..., θ_M),

where d(·,·) is some measure of closeness and ψ is a regularization term which enforces desirable properties, e.g., sparsity, on β and the mixing weights. The NMF problem, however, is also known to be non-convex and NP-hard [24] in general. Sub-optimal strategies such as alternating minimization, greedy gradient descent, and heuristics are used in practice [22]. In contrast to the above approaches, a new approach has recently emerged which is based on imposing additional structure on the model parameters [3], [5], [7], [9], [14], [15]. These approaches show that the topic discovery problem lends itself to provably consistent and polynomial-time solutions by making assumptions about the structure of the topic matrix β and the distribution of the mixing weights. In this category of approaches are methods based on a tensor decomposition of the moments of X [14], [25]. The algorithm in [25] uses second order empirical moments and is shown to be asymptotically consistent when the topic matrix β has a special sparsity structure. The algorithm in [14] uses the third order tensor of observations. It is, however, strongly tied to the specific structure of the Dirichlet prior on the mixing weights and requires knowledge of the concentration parameters of the Dirichlet distribution [14]. Furthermore, in practice these approaches are computationally intensive and require some initial coarse dimensionality reduction, gradient descent speedups, and GPU acceleration to process large-scale text corpora like the NYT dataset [14].
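To make the NMF cost function above concrete, the following sketch minimizes it with d taken as squared Euclidean distance and λ = 0, using the classical Lee–Seung multiplicative updates as one example of the sub-optimal alternating strategies mentioned; the sizes and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
W, K, M = 20, 3, 40

# Synthetic nonnegative data with an exact rank-K factorization.
beta_true = rng.random((W, K))
theta_true = rng.random((K, M))
X = beta_true @ theta_true

# Lee-Seung multiplicative updates for min ||X - beta @ theta||_F^2
# over nonnegative factors (a local-search heuristic, no global guarantee).
beta = rng.random((W, K)) + 0.1
theta = rng.random((K, M)) + 0.1
eps = 1e-12
for _ in range(500):
    theta *= (beta.T @ X) / (beta.T @ beta @ theta + eps)
    beta *= (X @ theta.T) / (beta @ theta @ theta.T + eps)

err = np.linalg.norm(X - beta @ theta) / np.linalg.norm(X)
print(err < 0.2)   # the residual shrinks, but only to a local optimum in general
```

The multiplicative form keeps both factors nonnegative at every iteration, which is why it is a popular heuristic despite the NP-hardness of the underlying problem.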
Our work falls into the family of approaches that exploit the separability property of β and its geometric implications [3], [5], [7], [9], [15], [26], [27]. An asymptotically consistent polynomial-time topic estimation algorithm was first proposed in [9]. However, this method requires solving W linear programs, each with W variables, and is computationally impractical. Subsequent work improved the computational efficiency [15], [23], but theoretical guarantees of asymptotic consistency (when N is fixed and the number of documents M → ∞) are unclear. Algorithms in [7] and [3] are both practical and provably consistent. Each requires a stronger and slightly different technical condition on the topic mixing weights than [9]. Specifically, [7] imposes a full-rank condition on the second-order correlation matrix of the mixing weights and proposes a Gram-Schmidt procedure to identify the extreme points. Similarly, [3] imposes a diagonal-dominance condition on the same second-order correlation matrix and proposes a random projections based approach. These approaches are tied to the specific conditions imposed, and they both fail to detect all the novel words and estimate topics when the imposed conditions (which are sufficient but not necessary for consistent novel word detection or topic estimation) fail to hold in some examples [5]. The random projections based algorithm proposed in [5] is both practical and provably consistent. Furthermore, it requires fewer constraints on the topic mixing weights. We note that the separability property has been exploited in other recent work as well [26], [27]. In [27], a singular value decomposition based approach is proposed for topic estimation. In [26], it is shown that the standard Variational-Bayes approximation can be asymptotically consistent if β is separable.
However, the additional constraints proposed essentially boil down to the requirement that each document contain predominantly only one topic. In addition to assuming the existence of such "pure" documents, [26] also requires a strict initialization. It is thus unclear how this can be achieved using only the observations X. The separability property has been re-discovered and exploited in the literature across a number of different fields and has found application in several problems. To the best of our knowledge, this concept was first introduced as the Pure Pixel Index assumption in the Hyperspectral Image unmixing problem [28]. This work assumes the existence of pixels in a hyper-spectral image containing predominantly one species. Separability has also been studied in the NMF literature in the context of ensuring the uniqueness of NMF [29]. Subsequent work has led to the development of NMF algorithms that exploit separability [23], [30]. The uniqueness and correctness results in this line of work have primarily focused on the noiseless case. We finally note that separability has also been recently exploited in the problem of learning multiple ranking preferences from pairwise comparisons for personal recommendation systems and information retrieval [31], [32] and has led to provably consistent and efficient estimation algorithms.

III. TOPIC SEPARABILITY, NECESSARY AND SUFFICIENT CONDITIONS, AND THE GEOMETRIC INTUITIONS

In this section, we unravel the key ideas that motivate our algorithmic approach by focusing on the ideal case where there is no "sampling-noise", i.e., each document is infinitely long (N = ∞). In the next section, we will turn to the finite N case. We recall that β and X denote the W × K topic matrix and the W × M empirical word counts/frequency matrix respectively.
Also, M, W, and K denote, respectively, the number of documents, the vocabulary size, and the number of topics. For convenience, we group the document-specific mixing weights, the θ_m's, into a K × M weight matrix θ = [θ_1, ..., θ_M] and the document-specific distributions, the A_m's, into a W × M document distribution matrix A = [A_1, ..., A_M]. The generative procedure that describes a topic model then implies that A = βθ. In the ideal case considered in this section (N = ∞), the empirical word frequency matrix X = A.

Notation: A vector a without specification will denote a column-vector, 1 the all-ones column vector of suitable size, X_i the i-th column vector and X^j the j-th row vector of matrix X, and ¯B a suitably row-normalized version (described later) of a nonnegative matrix B. Also, [n] := {1, ..., n}.

A. Key Structural Property: Topic Separability

We first introduce separability as a key structural property of a topic matrix β. Formally,

Definition 1. (Separability) A topic matrix β ∈ R^{W×K} is separable if for each topic k, there is some word i such that β_{i,k} > 0 and β_{i,l} = 0, ∀ l ≠ k.

Topic separability implies that each topic contains word(s) which appear only in that topic. We refer to these words as the novel words of the K topics. Figure 1 shows an example of a separable β with K = 3 topics.

Fig. 1. An example of a separable topic matrix β (left) and the underlying geometric structure (right) of the row space of the normalized document distribution matrix ¯A. Note: the word ordering is only for visualization and has no bearing on separability. Solid circles represent rows of ¯A. Empty circles represent rows of ¯X when N is finite (in the ideal case, ¯A = ¯X). Projections of the ¯A^w's (resp. ¯X^w's) along a random isotropic direction d can be used to identify novel words.
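Definition 1 depends only on the support of β and can be checked mechanically. A minimal sketch on a small column-stochastic matrix in the spirit of Fig. 1 (the particular entries are illustrative):

```python
import numpy as np

# Column-stochastic topic matrix with K = 3 topics (cf. Fig. 1): words 1-2 are
# novel to topic 1, words 3-4 to topic 2, word 5 to topic 3; word 6 is non-novel.
beta = np.array([
    [0.3, 0.0, 0.0],   # word 1
    [0.3, 0.0, 0.0],   # word 2
    [0.0, 0.4, 0.0],   # word 3
    [0.0, 0.4, 0.0],   # word 4
    [0.0, 0.0, 0.5],   # word 5
    [0.4, 0.2, 0.5],   # word 6 (appears in all topics)
])

support = beta > 0
# A word is novel iff it appears in exactly one topic.
novel_words = np.nonzero(support.sum(axis=1) == 1)[0]
# beta is separable iff every topic owns at least one novel word.
owned_topics = {int(np.argmax(support[w])) for w in novel_words}
is_separable = owned_topics == set(range(beta.shape[1]))
print(novel_words, is_separable)   # -> [0 1 2 3 4] True
```

On real data β is unknown, of course; the point of the paper is that the novel words can nevertheless be detected geometrically from the observations.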
Words 1 and 2 are novel to topic 1, words 3 and 4 to topic 2, and word 5 to topic 3. Other words that appear in multiple topics are called non-novel words (e.g., word 6). Identifying the novel words for the K distinct topics is the key step of our proposed approach. We note that separability has been empirically observed to be approximately satisfied by topic estimates produced by Variational-Bayes and MCMC based algorithms [5], [7], [26]. More fundamentally, in very recent work [16], it has been shown that topic separability is an inevitable consequence of having a relatively small number of topics in a very large vocabulary (high-dimensionality). In particular, when the K columns (topics) of β are independently sampled from a Dirichlet distribution (on a (W − 1)-dimensional probability simplex), the resulting topic matrix β will be (approximately) separable with probability tending to 1 as W scales to infinity sufficiently faster than K. A Dirichlet prior on β is widely used in smoothed settings of topic modeling [1]. As we will discuss next in Sec. III-C, the topic separability property combined with additional conditions on the second-order statistics of the mixing weights leads to an intuitively appealing geometric property that can be exploited to develop a provably consistent and efficient topic estimation algorithm.

B. Conditions on the Topic Mixing Weights

Topic separability alone does not guarantee that there will be a unique β that is consistent with all the observations X. This is illustrated in Fig. 2 [4]. Therefore, in an effort to develop provably consistent topic estimation algorithms, a number of different conditions have been imposed on the topic mixing weights θ in the literature [3], [5], [7], [9], [15].
Complementing the work in [4], which identifies necessary and sufficient conditions for consistent detection of novel words, in this paper we identify necessary and sufficient conditions for consistent estimation of a separable topic matrix. Our necessity results are information-theoretic and algorithm-independent in nature, meaning that they are independent of any specific statistics of the observations and the algorithms used. The novel words and the topics can only be identified up to a permutation, and this is accounted for in our results.

Let a := E(θ_m) and R := E(θ_m θ_m^⊤) be the K × 1 expectation vector and the K × K correlation matrix of the weight prior Pr(α). Without loss of generality, we can assume that the elements of a are strictly positive, since otherwise some topic(s) would not appear in the corpus. A key quantity is ¯R := diag(a)^{-1} R diag(a)^{-1}, which may be viewed as a "normalized" second-moment matrix of the weight vector. The following conditions are central to our results.

Condition 1. (Simplicial Condition) A matrix B is (row-wise) γ_s-simplicial if every row-vector of B is at a Euclidean distance of at least γ_s > 0 from the convex hull of the remaining row-vectors. A topic model is γ_s-simplicial if its normalized second-moment ¯R is γ_s-simplicial.

Condition 2. (Affine-Independence) A matrix B is (row-wise) γ_a-affine-independent if min_λ ||Σ_{k=1}^K λ_k B^k||_2 / ||λ||_2 ≥ γ_a > 0, where B^k is the k-th row of B and the minimum is over all λ ∈ R^K such that λ ≠ 0 and Σ_{k=1}^K λ_k = 0. A topic model is γ_a-affine-independent if its normalized second-moment ¯R is γ_a-affine-independent.

Here, γ_s and γ_a are called the simplicial and affine-independence constants respectively. They are condition numbers which measure the degree to which the conditions that they are respectively associated with hold.
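Condition 2 can be evaluated numerically: γ_a equals the smallest singular value of B^⊤ restricted to the zero-sum subspace {λ : Σ_k λ_k = 0}. A sketch, with illustrative test matrices; the second matrix has a row equal to a convex combination of the others, so its constant vanishes:

```python
import numpy as np

def affine_independence_constant(B):
    """gamma_a = min ||sum_k lambda_k B^k||_2 / ||lambda||_2
    over lambda != 0 with sum(lambda) = 0 (Condition 2)."""
    B = np.asarray(B, dtype=float)
    K = B.shape[0]
    # Columns of D span the zero-sum subspace; orthonormalize via QR.
    D = np.eye(K)[:, :-1] - np.eye(K)[:, 1:]       # K x (K-1), full column rank
    V, _ = np.linalg.qr(D)                         # orthonormal basis, K x (K-1)
    # For lambda = V @ mu with ||mu|| = 1, ||B^T lambda|| = ||(B^T V) mu||.
    return np.linalg.svd(B.T @ V, compute_uv=False).min()

print(round(float(affine_independence_constant(np.eye(3))), 6))          # -> 1.0
R_bad = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]])    # row 3 = 0.5*row1 + 0.5*row2
print(affine_independence_constant(R_bad) < 1e-8)                        # -> True
```

The same routine applied to an estimate of ¯R gives a rough numerical check of how comfortably a given topic prior satisfies Condition 2.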
The larger these condition numbers are, the easier it is to estimate the topic matrix. Going forward, we will say that a matrix is simplicial (resp. affine-independent) if it is γ_s-simplicial (resp. γ_a-affine-independent) for some γ_s > 0 (resp. γ_a > 0). The simplicial condition was first proposed in [9] and then further investigated in [4]. This paper is the first to identify affine-independence as both necessary and sufficient for consistent separable topic estimation. Before we discuss their geometric implications, we point out that affine-independence is stronger than the simplicial condition:

Proposition 1. ¯R is γ_a-affine-independent ⇒ ¯R is at least γ_a-simplicial. The reverse implication is false in general.

The Simplicial Condition is both Necessary and Sufficient for Novel Word Detection: We first focus on detecting all the novel words of the K distinct topics. For this task, the simplicial condition is an algorithm-independent, information-theoretic necessary condition. Formally,

Lemma 1. (Simplicial Condition is Necessary for Novel Word Detection [4, Lemma 1]) Let β be separable and W > K. If there exists an algorithm that can consistently identify all novel words of all K topics from X, then ¯R is simplicial.

The key insight behind this result is that when ¯R is non-simplicial, we can construct two distinct separable topic matrices with different sets of novel words which induce the same distribution on the empirical observations X. Geometrically, the simplicial condition guarantees that the K rows of ¯R will be extreme points of the convex hull that they themselves form. Therefore, if ¯R is not simplicial, there will exist at least one redundant topic which is just a convex combination of the other topics. It turns out that ¯R being simplicial is also sufficient for consistent novel word detection. This is a direct consequence of the consistency guarantees of our approach as outlined in Theorem 3.

Fig. 2. Example showing that topic separability alone does not guarantee a unique solution to the problem of estimating β from X. With weight matrix θ whose rows are θ^1, θ^2, and 0.5θ^1 + 0.5θ^2,

  β^{(1)} θ = [1 0 0; 0 1 0; 0 0 1; 0 0 1; ...] θ = [1 0 0; 0 1 0; 0 0 1; 0.5 0.5 0; ...] θ = β^{(2)} θ,

i.e., the document distribution matrix A = β^{(1)}θ = β^{(2)}θ is consistent with two different topic matrices β^{(1)} and β^{(2)} that are both separable.

Affine-Independence is Necessary and Sufficient for Separable Topic Estimation: We now focus on estimating a separable topic matrix β, which is a stronger requirement than detecting novel words. It naturally requires conditions that are stronger than the simplicial condition. Affine-independence turns out to be an algorithm-independent, information-theoretic necessary condition. Formally,

Lemma 2. (Affine-Independence is Necessary for Separable Topic Estimation) Let β be separable with W ≥ 2 + K. If there exists an algorithm that can consistently estimate β from X, then its normalized second-moment ¯R is affine-independent.

Similar to Lemma 1, if ¯R is not affine-independent, we can construct two distinct separable topic matrices that induce the same distribution on the observations, which makes consistent topic estimation impossible. Geometrically, every point in a convex set can be decomposed uniquely as a convex combination of its extreme points if, and only if, the extreme points are affinely independent. Hence, if ¯R is not affine-independent, a non-novel word can be assigned to different subsets of topics.
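The uniqueness argument can be made concrete. When the extreme points are affinely independent, the convex weights of an interior point are recovered exactly by least squares on the linear system augmented with the sum-to-one constraint; on exact, noise-free data nonnegativity need not be enforced because the true weights are already nonnegative. The numbers below are illustrative:

```python
import numpy as np

# Three affinely independent extreme points (rows).
extreme_pts = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
c_true = np.array([0.5, 0.3, 0.2])          # convex weights of an interior point
p = c_true @ extreme_pts                    # the interior point itself

# Solve  c^T extreme_pts = p  subject to  sum(c) = 1  by stacking the constraint.
K = extreme_pts.shape[0]
lhs = np.vstack([extreme_pts.T, np.ones((1, K))])   # (d+1) x K, full column rank
rhs = np.append(p, 1.0)
c_hat, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
print(np.round(c_hat, 6))                   # -> [0.5 0.3 0.2]
```

If a fourth row equal to an affine combination of the others were added, the stacked system would become rank-deficient and the decomposition would no longer be unique, which is exactly the failure mode described above.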
The sufficiency of the affine-independence condition for separable topic estimation is again a direct consequence of the consistency guarantees of our approach, as in Theorems 3 and 4. We note that since affine-independence implies the simplicial condition (Proposition 1), affine-independence is sufficient for novel word detection as well.

Connection to Other Conditions on the Mixing Weights: We briefly discuss other conditions on the mixing weights θ that have been exploited in the literature. In [7], [15], R (equivalently ¯R) is assumed to have full rank (with minimum eigenvalue γ_r > 0). In [3], ¯R is assumed to be diagonal-dominant, i.e., ∀ i, j, i ≠ j, ¯R_{i,i} − ¯R_{i,j} ≥ γ_d > 0. Both are sufficient conditions for detecting all the novel words of all distinct topics. The constants γ_r and γ_d are condition numbers which measure the degree to which the full-rank and diagonal-dominance conditions hold respectively. They are counterparts of γ_s and γ_a and, like them, the larger they are, the easier it is to consistently detect the novel words and estimate β. The relationships between these conditions are summarized in Proposition 2 and illustrated in Fig. 3.

Fig. 3. Relationships between the Simplicial, Affine-Independence, Full-Rank, and Diagonal-Dominance conditions on the normalized second-moment ¯R.

Proposition 2. Let ¯R be the normalized second-moment of the topic prior. Then,
1) ¯R is full rank with minimum eigenvalue γ_r ⇒ ¯R is at least γ_r-affine-independent ⇒ ¯R is at least γ_r-simplicial.
2) ¯R is γ_d-diagonal-dominant ⇒ ¯R is at least γ_d-simplicial.
3) ¯R being diagonal-dominant neither implies nor is implied by ¯R being affine-independent (or full-rank).

We note that in our earlier work [5], the provable guarantees for estimating the separable topic matrix require ¯R to have full rank.
The analysis in this paper provably extends the guarantees to the affine-independence condition.

C. Geometric Implications and Random Projections Based Algorithm

We now demonstrate the geometric implications of topic separability combined with the simplicial/affine-independence condition on the topic mixing weights. To highlight the key ideas we focus on the ideal case where N = ∞. Then, the empirical document word-frequency matrix X = A = βθ.

Novel Words are Extreme Points: To expose the underlying geometry, we normalize the rows of A and θ to obtain row-stochastic matrices ¯A := diag(A1)^{-1}A and ¯θ := diag(θ1)^{-1}θ. Then, since A = βθ, we have ¯A = ¯β¯θ, where ¯β := diag(A1)^{-1} β diag(θ1) is a row-normalized "topic matrix" which is both row-stochastic and separable with the same sets of novel words as β. Now consider the row vectors of ¯A and ¯θ. First, it can be shown that if ¯R is simplicial (cf. Condition 1) then, with high probability, no row of ¯θ will be in the convex hull of the others (see Appendix D). Next, the separability property ensures that if w is a novel word of topic k, then ¯β_{wk} = 1 and ¯β_{wj} = 0 ∀ j ≠ k, so that ¯A^w = ¯θ^k. Revisiting the example in Fig. 1, the rows of ¯A which correspond to novel words, e.g., words 1 through 5, are all row-vectors of ¯θ and together form a convex hull of K extreme points. For example, ¯A^1 = ¯A^2 = ¯θ^1 and ¯A^3 = ¯A^4 = ¯θ^2. If, however, w is a non-novel word, then ¯A^w = Σ_k ¯β_{wk} ¯θ^k lives inside the convex hull of the rows of ¯θ. In Fig. 1, row ¯A^6, which corresponds to non-novel word 6, is inside the convex hull of ¯θ^1, ¯θ^2, ¯θ^3. In summary, the novel words can be detected as extreme points of the set of all row-vectors of ¯A. Also, multiple novel words of the same topic correspond to the same extreme point (e.g., ¯A^1 = ¯A^2 = ¯θ^1). Formally,

Lemma 3.
Let R̄ be γ_s-simplicial and β be separable. Then, with probability at least 1 − 2K exp(−c₁M) − exp(−c₂M), the i-th row of Ā is an extreme point of the convex hull spanned by all the rows of Ā if, and only if, word i is novel. Here the constants are c₁ := γ_s² a_min⁴ / (4λ_max) and c₂ := γ_s⁴ a_min⁴ / (2λ_max²). The model parameters are defined as follows: a_min is the minimum element of a and λ_max is the maximum singular value of R̄.

To see how identifying novel words can help us estimate β, recall that the row vectors of Ā corresponding to novel words coincide with the rows of θ̄. Thus θ̄ is known once one novel word for each topic is known. Also, for all words w, Ā_w = Σ_k β̄_{wk} θ̄_k. Thus, if we can uniquely decompose Ā_w as a convex combination of the extreme points, then the coefficients of the decomposition will give us the w-th row of β̄. A unique decomposition exists with high probability when R̄ is affine-independent and can be found by solving a constrained linear regression problem. This gives us β̄. Finally, noting that diag(A1)β̄ = β diag(θ1), β can be recovered by suitably renormalizing the rows and then the columns of β̄. To sum up,

Lemma 4. Let A and one novel word per distinct topic be given. If R̄ is γ_a-affine-independent, then, with probability at least 1 − 2K exp(−c₁M) − exp(−c₂M), β can be recovered uniquely via constrained linear regression. Here the constants are c₁ := γ_a² a_min⁴ / (4λ_max) and c₂ := γ_a⁴ a_min⁴ / (2λ_max²). The model parameters are defined as follows: a_min is the minimum element of a and λ_max is the maximum singular value of R̄.

Lemmas 3 and 4 together provide a geometric approach for learning β from A (equivalently Ā): (1) Find the extreme points among the rows of Ā. Cluster the rows of Ā that correspond to the same extreme point into the same group.
(2) Express the remaining rows of Ā as convex combinations of the K distinct extreme points. (3) Renormalize β̄ to obtain β.

Detecting Extreme Points using Random Projections: A key contribution of our approach is an efficient random-projections-based algorithm to detect novel words as extreme points. The idea is illustrated in Fig. 1: if we project every point of a convex body onto an isotropically distributed random direction d, the maximum (or minimum) projection value must correspond to one of the extreme points with probability 1. On the other hand, the non-novel words will not attain the maximum projection value along any random direction. Therefore, by repeatedly projecting all the points onto a few isotropically distributed random directions, we can detect all the extreme points with very high probability as the number of random directions increases. An explicit bound on the number of projections needed appears in Theorem 3.

Finite N in Practice: The geometric intuition discussed above was based on the row vectors of Ā. When N = ∞, Ā = X̄, the matrix of row-normalized empirical word frequencies of all documents. If N is finite but very large, Ā can be well approximated by X̄ thanks to the law of large numbers. However, in real-world text corpora, N ≪ W (e.g., N = 298 while W = 14,943 in the NYT dataset). Therefore, the row vectors of X̄ are significantly perturbed away from the ideal rows of Ā as illustrated in Fig. 1. We discuss the effect of small N and how we address the accompanying issues next.

IV. TOPIC GEOMETRY WITH FINITE SAMPLES: WORD CO-OCCURRENCE MATRIX REPRESENTATION, SOLID ANGLE, AND RANDOM PROJECTIONS BASED APPROACH

The extreme point geometry sketched in Sec. III-C is perturbed when N is small, as highlighted in Fig. 1.
Specifically, the rows of the empirical word-frequency matrix X deviate from the rows of A. This creates several problems: (1) points inside the convex hull corresponding to non-novel words may become "outlier" extreme points (e.g., X̄₆ in Fig. 1); (2) some extreme points that correspond to novel words may no longer be extreme (e.g., X̄₃ in Fig. 1); (3) multiple novel words corresponding to the same extreme point may become multiple distinct extreme points (e.g., X̄₁ and X̄₂ in Fig. 1). Unfortunately, these issues do not vanish as M increases with N fixed (a regime which captures the characteristics of typical benchmark datasets), because the dimensionality of the rows (equal to M) also increases. There is no "averaging" effect to smooth out the sampling noise.

Our solution is to seek a new representation, a statistic of X, which can not only smooth out the sampling noise of individual documents, but also preserve the extreme point geometry induced by the separability and affine-independence conditions. In addition, we also develop an extreme point robustness measure that naturally arises within our random-projections-based framework. This robustness measure can be used to detect and exclude the "outlier" extreme points.

A. Normalized Word Co-occurrence Matrix Representation

We construct a suitably normalized word co-occurrence matrix from X as our new representation. The co-occurrence matrix converges almost surely to an ideal statistic as M → ∞ for any fixed N ≥ 2. Simultaneously, in the asymptotic limit, the original novel words continue to correspond to extreme points in the new representation and the overall extreme point geometry is preserved. The new representation is (conceptually) constructed as follows.
First, randomly divide the words in each document into two equal-sized independent halves and obtain two W × M empirical word-frequency matrices X and X′, each containing N/2 of the words of each document. Then normalize their rows as in Sec. III-C to obtain X̄ and X̄′, which are row-stochastic. The empirical word co-occurrence matrix of size W × W is then given by

Ê := M X̄′ X̄ᵀ   (1)

We note that in our random-projections-based approach, Ê is not explicitly constructed by multiplying X̄′ and X̄. Instead, we keep X̄′ and X̄ and exploit their sparsity properties to reduce the computational complexity of all subsequent processing.

Asymptotic Consistency: The first nice property of the word co-occurrence representation is its asymptotic consistency when N is fixed. As the number of documents M → ∞, the empirical Ê converges, almost surely, to an ideal word co-occurrence matrix E of size W × W. Formally,

Lemma 5. ([32, Lemma 2]) Let Ê be the empirical word co-occurrence matrix defined in Eq. (1). Then,

Ê → β̄ R̄ β̄ᵀ =: E almost surely as M → ∞   (2)

where β̄ := diag⁻¹(βa) β diag(a) and R̄ := diag⁻¹(a) R diag⁻¹(a). Furthermore, if η := min_{1≤i≤W} (βa)_i > 0, then Pr(‖Ê − E‖_∞ ≥ ε) ≤ 8W² exp(−ε² η⁴ M N / 20).

Here R̄ is the same normalized second moment of the topic prior as defined in Sec. III and β̄ is a row-normalized version of β. We note the abuse of notation for β̄, which was defined differently in Sec. III-C. It can be shown that the β̄ defined in Lemma 5 is the limit of the one defined in Sec. III-C as M → ∞. The convergence result in Lemma 5 shows that the word co-occurrence representation E can be consistently estimated by Ê as M → ∞, with the deviation vanishing exponentially in M, which is large in typical benchmark datasets.
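As a concrete illustration, the construction in Eq. (1) can be sketched in a few lines (a toy example with hypothetical random counts; numpy is assumed, and for clarity the dense product is formed explicitly even though, as noted above, our algorithm never constructs Ê explicitly):

```python
import numpy as np

def cooccurrence(X_half1, X_half2):
    """Empirical word co-occurrence matrix E_hat = M * Xbar' Xbar^T (Eq. 1).
    X_half1, X_half2: W x M word-count matrices from the two independent
    halves of each document (columns are documents)."""
    Xbar = X_half1 / X_half1.sum(axis=1, keepdims=True)    # row-normalize
    Xbar2 = X_half2 / X_half2.sum(axis=1, keepdims=True)
    M = X_half1.shape[1]
    return M * (Xbar2 @ Xbar.T)                            # W x W

rng = np.random.default_rng(1)
W, M = 6, 500
X1 = rng.integers(1, 5, size=(W, M)).astype(float)  # toy counts (assumption)
X2 = rng.integers(1, 5, size=(W, M)).astype(float)
E_hat = cooccurrence(X1, X2)
print(E_hat.shape)  # (6, 6)
```

In practice X̄ and X̄′ would be kept as sparse matrices and only products of the form M X̄′(X̄ᵀd) would be computed.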
Novel Words are Extreme Points: Another reason for using this word co-occurrence representation is that it preserves the extreme point geometry. Consider the ideal word co-occurrence matrix E = β̄(R̄β̄ᵀ). It is straightforward to show that if β̄ is separable and R̄ is simplicial, then (R̄β̄ᵀ) is also simplicial. Using these facts it is possible to establish the following counterpart of Lemma 3 for E:

Lemma 6. (Novel Words are Extreme Points [5, Lemma 1]) Let R̄ be simplicial and β be separable. Then, a word i is novel if, and only if, the i-th row of E is an extreme point of the convex hull spanned by all the rows of E.

In other words, the novel words correspond to the extreme points of all the row vectors of the ideal word co-occurrence matrix E. Consider the example in Fig. 4, which is based on the same topic matrix β as in Fig. 1. Here, E₁ = E₂, E₃ = E₄, and E₅ are the K = 3 distinct extreme points of all row vectors of E, and E₆, which corresponds to a non-novel word, is inside the convex hull. Once the novel words are detected as extreme points, we can follow the same procedure as in Lemma 4 and express each row E_w of E as a unique convex combination of the K extreme rows of E, or equivalently the rows of (R̄β̄ᵀ). The weights of the convex combination are the β̄_{wk}'s. We can then apply the same row and column renormalization to obtain β. The following result is the counterpart of Lemma 4 for E:

Lemma 7. Let E and one novel word for each distinct topic be given. If R̄ is affine-independent, then β can be recovered uniquely via constrained linear regression.

One can follow the same steps as in the proof of Lemma 4. The only additional step is to check that R̄β̄ᵀ = [R̄, R̄B] is affine-independent if R̄ is affine-independent.
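The extreme-point characterization in Lemma 6 can be checked numerically via the random-projection idea of Sec. III-C: an interior row never attains the maximum (or minimum) projection along an isotropic random direction. A minimal sketch on toy data (numpy assumed; the function name and toy points are illustrative, not from the paper):

```python
import numpy as np

def extreme_points_by_projection(points, num_dirs=50, seed=0):
    """Detect extreme points of the convex hull of `points` (rows) by
    projecting onto isotropic random directions and keeping the rows
    that attain the max or min projection along some direction."""
    rng = np.random.default_rng(seed)
    extreme = set()
    for _ in range(num_dirs):
        d = rng.standard_normal(points.shape[1])  # spherical Gaussian direction
        proj = points @ d
        extreme.add(int(np.argmax(proj)))  # a maximizer is extreme w.p. 1
        extreme.add(int(np.argmin(proj)))
    return sorted(extreme)

# Toy example: rows 0-2 are vertices of a triangle; row 3 is an interior
# point (a convex combination), mimicking a non-novel word.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.3, 0.3]])
print(extreme_points_by_projection(pts))  # [0, 1, 2]; row 3 is never selected
```

With enough directions, every vertex is found with high probability, which is the content of the projection bound in Theorem 3.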
We note that the finite sampling noise perturbation Ê − E is still not 0, but it vanishes as M → ∞ (in contrast to the X̄ representation in Sec. III-C). However, there is still a possibility of observing "outlier" extreme points if a non-novel word lies on a facet of the convex hull of the rows of E. We next introduce an extreme point robustness measure based on a certain solid angle that naturally arises in our random-projections-based approach, and discuss how it can be used to detect and distinguish between "true" novel words and such "outlier" extreme points.

B. Solid Angle Extreme Point Robustness Measure

To handle the impact of a small but nonzero perturbation ‖Ê − E‖_∞, we develop an extreme point "robustness" measure. This is necessary not only for applying our approach to real-world data but also to establish finite sample complexity bounds. Intuitively, a robustness measure should be able to distinguish between the "true" extreme points (row vectors that are novel words) and the "outlier" extreme points (row vectors of non-novel words that become extreme points due to the nonzero perturbation). Towards this goal, we leverage a key geometric quantity, namely, the normalized solid angle subtended by the convex hull of the rows of E at an extreme point. To visualize this quantity, we revisit our running example in Fig. 4 and indicate the solid angles attached to each extreme point by the shaded regions. It turns out that this geometric quantity naturally arises in the context of the random projections discussed earlier. To see this connection, observe in Fig. 4 that the shaded region attached to any extreme point

Fig. 4. An example of a separable topic matrix β (left) and the underlying geometric structure (right) in the word co-occurrence representation. Note: the word ordering is only for visualization and has no bearing on separability.
The example topic matrix β is the same as in Fig. 1. Solid circles represent the rows of E. The shaded regions depict the solid angles subtended by each extreme point. d₁, d₂, d₃ are isotropic random directions along which each extreme point attains the maximum projection value; they can be used to estimate the solid angles.

coincides precisely with the set of directions along which its projection is larger (taking sign into account) than that of any other point (whether extreme or not). For example, in Fig. 4 the projection of E₁ = E₂ along d₁ is larger than that of any other point. Thus, the solid angle attached to a point E_i (whether extreme or not) can be formally defined as the set of directions {d : ∀j such that E_j ≠ E_i, ⟨E_i, d⟩ > ⟨E_j, d⟩}. This set is nonempty only for extreme points.

The solid angle defined above is a set. To derive a scalar robustness measure from this set and tie it to the idea of random projections, we adopt a statistical perspective and define the normalized solid angle of a point as the probability that the point will have the maximum projection value along an isotropically distributed random direction. Concretely, for the i-th word (row vector), the normalized solid angle q_i is defined as

q_i := Pr(∀j : E_j ≠ E_i, ⟨E_i, d⟩ > ⟨E_j, d⟩)   (3)

where d is drawn from an isotropic distribution in ℝ^W such as the spherical Gaussian. The condition E_j ≠ E_i in Eq. (3) is introduced to exclude the multiple novel words of the same topic that correspond to the same extreme point. For instance, in Fig. 4, E₁ = E₂; hence, for q₁, j = 2 is excluded. To make it practical to handle finite sample estimation noise, we replace the condition E_j ≠ E_i by the condition ‖E_i − E_j‖ ≥ ζ for some suitably defined ζ. As illustrated in Fig. 4, the solid angles of all the extreme points are strictly positive given that R̄ is γ_s-simplicial.
On the other hand, for a non-novel word i, the corresponding solid angle q_i is zero by definition. Hence the extreme point geometry in Lemma 6 can be re-expressed in terms of solid angles as follows:

Lemma 8. (Novel Words have Positive Solid Angles) Let R̄ be simplicial and β be separable. Then, word i is a novel word if, and only if, q_i > 0.

We denote the smallest solid angle among the K distinct extreme points by q_∧ > 0. This is a robust condition number of the convex hull formed by the rows of E and is related to the simplicial constant γ_s of R̄. In a real-world dataset we have access to only an empirical estimate Ê of the ideal word co-occurrence matrix E. If we replace E with Ê, then the resulting empirical solid angle estimate q̂_i will be very close to the ideal q_i if Ê is close enough to E. Then, the solid angles of "outlier" extreme points will be close to 0 while they will be bounded away from zero for the "true" extreme points. One can then hope to correctly identify all K extreme points by rank-ordering all empirical solid angle estimates and selecting the K distinct row vectors that have the largest solid angles. This forms the basis of our proposed algorithm. The problem now boils down to efficiently estimating the solid angles and establishing the asymptotic convergence of the estimates as M → ∞. We next discuss how random projections can be used to achieve these goals.

C. Efficient Solid Angle Estimation via Random Projections

The definition of the normalized solid angle in Eq. (3) motivates an efficient algorithm based on random projections to estimate it. For convenience, we first rewrite Eq. (3) as

q_i = E[ I{∀j : ‖E_j − E_i‖ ≥ ζ, E_i d ≥ E_j d} ]   (4)

and then propose to estimate it by

q̂_i = (1/P) Σ_{r=1}^{P} I(∀j : Ê_{i,i} + Ê_{j,j} − 2Ê_{i,j} ≥ ζ/2, Ê_i d_r > Ê_j d_r)   (5)

where d₁, . . .
, d_P ∈ ℝ^{W×1} are P iid directions drawn from an isotropic distribution in ℝ^W. Algorithmically, by Eq. (5), we approximate the solid angle q_i of the i-th word (row vector) by first projecting all the row vectors onto P iid isotropic random directions and then calculating the fraction of times each row vector achieves the maximum projection value. It turns out that the condition Ê_{i,i} + Ê_{j,j} − 2Ê_{i,j} ≥ ζ/2 is equivalent to ‖E_i − E_j‖ ≥ ζ in terms of its ability to exclude multiple novel words from the same topic and is adopted for its simplicity.² This procedure of taking random projections followed by counting the number of times a word is a maximizer via Eq. (5) provides a consistent estimate of the solid angle in Eq. (3) as M → ∞ and the number of projections P increases. The high-level idea is simple: as P increases, the empirical average in Eq. (5) converges to the corresponding expectation. Simultaneously, as M increases, Ê → E almost surely. Overall, the approximation q̂_i proposed in Eq. (5) using random projections converges to q_i.

This random-projections-based approach is also computationally efficient for the following reasons. First, it enables us to avoid the explicit construction of the W × W dimensional matrix Ê: recall that each column of X and X′ has no more than N ≪ W nonzero entries. Hence X and X′ are both sparse. Since Êd = M X̄′(X̄ᵀd), the projection can be calculated using two sparse matrix-vector multiplications. Second, it turns out that the number of projections P needed to guarantee consistency is small. In fact, in Theorem 3 we provide a sufficient upper bound for P which is a polynomial function of log(W), log(1/δ), and other model parameters, where δ is the probability that the algorithm fails to detect all the distinct novel words.
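A minimal sketch of the solid angle estimator in Eq. (5) on toy data (numpy assumed; for simplicity the near-duplicate-row tolerance test with ζ is omitted, so the sketch assumes all rows are distinct):

```python
import numpy as np

def solid_angle_estimates(E_hat, P=2000, seed=0):
    """Estimate normalized solid angles q_i (Eq. 5, duplicate-row test
    omitted): the fraction of isotropic random directions along which
    row i attains the maximum projection value."""
    rng = np.random.default_rng(seed)
    q = np.zeros(E_hat.shape[0])
    D = rng.standard_normal((E_hat.shape[1], P))  # P isotropic directions
    V = E_hat @ D                                 # all projections at once
    for i in np.argmax(V, axis=0):                # winner per direction
        q[i] += 1.0 / P
    return q

# Toy: rows 0-2 are extreme; row 3 is interior (a convex combination),
# mimicking a non-novel word.
E = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [0.4, 0.3, 0.3]])
q = solid_angle_estimates(E)
print(q)  # the interior row 3 gets zero estimated solid angle
```

The estimates sum to 1 by construction, and only the extreme rows receive positive mass, matching Lemma 8.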
Parallelization, Distributed and Online Settings: Another advantage of the proposed random-projections-based approach is that it can be parallelized and is naturally amenable to online or distributed settings. This is based on the observation that each projection has an additive structure:

Êd_r = M X̄′X̄ᵀd_r = M Σ_{m=1}^{M} X̄′_m X̄_mᵀ d_r

where X̄_m denotes the m-th column of X̄. The P projections can also be computed independently. Therefore,
• In a distributed setting in which the documents are stored on distributed servers, we can first share the same random directions across the servers and then aggregate the projection values. The communication cost is only the "partial" projection values and is therefore insignificant [5]; it does not scale as the number of observations N, M increases.
• In an online setting in which the documents are streamed in an online fashion [20], we only need to keep all the projection values and update them (and hence the empirical solid angle estimates) when new documents arrive.

The additive and independent structure guarantees that the statistical efficiency of these variations is the same as that of the centralized "batch" implementation. For the rest of this paper, we only focus on the centralized version.

Outline of Overall Approach: Our overall approach can be summarized as follows. (1) Estimate the empirical solid angles using P iid isotropic random directions as in Eq. (5). (2) Select the K words with distinct word co-occurrence patterns (rows) that have the largest empirical solid angles. (3) Estimate the topic matrix using constrained linear regression as in Lemma 4. We will discuss the details of our overall approach in the next section and establish guarantees for its computational and statistical efficiency.

² We abuse the symbol ζ by using it to indicate different thresholds in these conditions.

V.
ALGORITHM AND ANALYSIS

Algorithm 1 describes the main steps of our overall random-projections-based algorithm, which we call RP. The two main steps, novel word detection and topic matrix estimation, are outlined in Algorithms 2 and 3, respectively. Algorithm 2 outlines the random projection and rank-ordering steps. Algorithm 3 describes the constrained linear regression and the renormalization steps in a combined way.

Algorithm 1 RP
Input: Text documents X̄, X̄′ (W × M); number of topics K; number of iid random projections P; tolerance parameters ζ, ε > 0.
Output: Estimate of the topic matrix β̂ (W × K).
1: Set of novel words I ← NovelWordDetect(X̄, X̄′, K, P, ζ)
2: β̂ ← EstimateTopics(I, X̄, X̄′, ε)

Computational Efficiency: We first summarize the computational efficiency of Algorithm 1:

Theorem 2. Let the number of novel words for each topic be a constant relative to M, W, N. Then, the running time of Algorithm 1 is O(MNP + WP + WK³).

This efficiency is achieved by exploiting the sparsity of X and the property that there are only a small number of novel words in a typical vocabulary. A detailed analysis of the computational complexity is presented in the appendix. Here we point out that in order to upper bound the computation time of the linear regression in Algorithm 3 we used O(WK³) for W matrix inversions, one for each of the words in the vocabulary. In practice, a gradient descent implementation can be used for the constrained linear regression, which is

Algorithm 2 NovelWordDetect (via Random Projections)
Input: X̄, X̄′; number of topics K; number of projections P; tolerance ζ.
Output: The set of all novel words of the K distinct topics, I.
1: q̂_i ← 0, ∀i = 1, . . . , W; Ê ← M X̄′X̄ᵀ
2: for all r = 1, . . . , P do
3: Sample d_r ∈ ℝ^W from an isotropic prior.
4: v ← M X̄′(X̄ᵀd_r)
5: i* ← arg max_{1≤i≤W} v_i; q̂_{i*} ← q̂_{i*} + 1/P
6: Ĵ_{i*} ← {j : Ê_{i*,i*} + Ê_{j,j} − 2Ê_{i*,j} ≥ ζ/2}
7: for all k ∈ Ĵ_{i*}ᶜ do
8: Ĵ_k ← {j : Ê_{k,k} + Ê_{j,j} − 2Ê_{k,j} ≥ ζ/2}
9: if ∀j ∈ Ĵ_k : v_k > v_j then
10: q̂_k ← q̂_k + 1/P
11: end if
12: end for
13: end for
14: I ← ∅, k ← 0, j ← 1
15: while k < K do
16: i ← index of the j-th largest value of {q̂_1, . . . , q̂_W}
17: if ∀p ∈ I : Ê_{p,p} + Ê_{i,i} − 2Ê_{i,p} ≥ ζ/2 then
18: I ← I ∪ {i}; k ← k + 1
19: end if
20: j ← j + 1
21: end while
22: Return I.

Algorithm 3 EstimateTopics
Input: I = {i₁, . . . , i_K}, the set of novel words, one for each of the K topics; Ê; precision parameter ε.
Output: β̂, the estimate of the β matrix.
1: Ê*_w = [Ê_{w,i₁}, . . . , Ê_{w,i_K}]
2: Y = (Ê*ᵀ_{i₁}, . . . , Ê*ᵀ_{i_K})ᵀ
3: for all i = 1, . . . , W do
4: Solve b* := arg min_b ‖Ê*_i − bY‖²
5: subject to b_j ≥ 0, Σ_{j=1}^{K} b_j = 1
6: using precision ε for the stopping criterion.
7: β̂_i ← ((1/M) X_i 1) b*
8: end for
9: β̂ ← column-normalize β̂

much more efficient. We also note that these W optimization problems are decoupled given the set of detected novel words. Therefore, they can be parallelized in a straightforward manner [5].

Asymptotic Consistency and Statistical Efficiency: We now summarize the asymptotic consistency and sample complexity bounds for Algorithm 1. The analysis is a combination of the consistency of the novel word detection step (Algorithm 2) and the topic estimation step (Algorithm 3). We state the results for both of these steps. First, for detecting all the novel words of the K distinct topics, we have the following result:

Theorem 3. Let the topic matrix β be separable and R̄ be γ-simplicial.
If the projection directions are iid sampled from any isotropic distribution, then Algorithm 2 can identify all the novel words of the K distinct topics as M, P → ∞. Furthermore, ∀δ > 0, if

M ≥ 20 log(2W/δ) / (N ρ² η⁴)  and  P ≥ 8 log(2W/δ) / q_∧²   (6)

then Algorithm 2 fails with probability at most δ. The model parameters are defined as follows: ρ = min{d/8, π d₂ q_∧ / (4W^{1.5})}, where d := (1 − b)²γ²/λ_max, d₂ := (1 − b)γ, λ_max is the maximum eigenvalue of R̄, b = max_{j∈C₀,k} β̄_{j,k}, and C₀ is the set of non-novel words. Finally, q_∧ is the minimum solid angle of the extreme points of the convex hull of the rows of E.

The detailed proof is presented in the appendix. The results in Eq. (6) provide a sufficient finite sample complexity bound for novel word detection. The bound is polynomial with respect to M, W, K, N, log(1/δ), and other model parameters. The number of projections P that impacts the computational complexity scales as log(W)/q_∧² in this sufficient bound, where q_∧ can be upper bounded by 1/K. In practice, we have found that setting P = O(K) is a good choice [5]. We note that the result in Theorem 3 only requires the simplicial condition, which is the minimum condition required for consistent novel word detection (Lemma 1). The theorem continues to hold if the topic prior R̄ satisfies stronger conditions such as affine-independence. We also point out that our proof in this paper holds for any isotropic distribution on the random projection directions d₁, . . . , d_P. The previous result in [5], however, only applies to some specific isotropic distributions such as the spherical Gaussian or the uniform distribution in a unit ball. In practice, we use the spherical Gaussian since sampling from such a prior is simple and requires only O(W) time for generating each random direction.
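As noted after Theorem 2, the constrained regression in Algorithm 3 (lines 4-6) can be implemented by gradient descent. A minimal sketch using projected gradient descent onto the probability simplex (numpy assumed; the step size, iteration count, and projection helper are illustrative choices, not the paper's stopping criterion):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def simplex_regression(y, Y, iters=2000):
    """Solve min_b ||y - bY||^2 s.t. b >= 0, sum(b) = 1 (the constrained
    regression of Algorithm 3) by projected gradient descent."""
    K = Y.shape[0]
    b = np.full(K, 1.0 / K)
    lr = 0.5 / np.linalg.norm(Y @ Y.T, 2)      # step size from the smoothness
    for _ in range(iters):
        grad = 2.0 * (b @ Y - y) @ Y.T
        b = project_simplex(b - lr * grad)
    return b

# Toy: y is a known convex combination of the rows of Y.
Y = np.eye(3)
y = np.array([0.6, 0.3, 0.1])
b = simplex_regression(y, Y)
print(np.round(b, 3))  # [0.6  0.3  0.1]
```

The W such problems (one per word) are decoupled, which is what makes the straightforward parallelization mentioned above possible.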
Next, given the successful detection of the set of novel words for all topics, we have the following result for the accurate estimation of the separable topic matrix β:

Theorem 4. Let the topic matrix β be separable and R̄ be γ_a-affine-independent. Given the successful detection of the novel words for all K distinct topics, the output β̂ of Algorithm 3 converges to β element-wise in probability (up to a column permutation). Specifically, if

M ≥ 2560 W² K log(W⁴K/δ) / (N γ_a² a_min² η⁴ ε²)   (7)

then, ∀i, k, β̂_{i,k} will be ε-close to β_{i,k} with probability at least 1 − δ, for any 0 < ε < 1. Here η is the same as in Theorem 3 and a_min is the minimum value in a.

We note that the sufficient sample complexity bound in Eq. (7) is again polynomial in terms of all the model parameters. Here we only require R̄ to be affine-independent. Combining Theorem 3 and Theorem 4 gives the consistency and sample complexity bounds of our overall approach in Algorithm 1.

VI. EXPERIMENTAL RESULTS

In this section, we present experimental results on both synthetic and real-world datasets. We report different performance measures that have been commonly used in the topic modeling literature. When the ground truth is available (Sec. VI-A), we use the ℓ₁ reconstruction error between the ground truth topics and the estimates after proper topic alignment. For the real-world text corpus in Sec. VI-B, we report the held-out probability, which is a standard measure used in the topic modeling literature. We also qualitatively (semantically) compare the topics extracted by the different approaches using the top probable words for each topic.

A. Semi-synthetic text corpus

In order to validate our proposed algorithm, we generate "semi-synthetic" text corpora by sampling from a synthetic, yet realistic, ground truth topic model.
To ensure that the semi-synthetic data is similar to real-world data in terms of dimensionality, sparsity, and other characteristics, we use the following generative procedure adapted from [5], [7]. We first train an LDA model (with K = 100) on a real-world dataset using a standard Gibbs sampling method with default parameters (as described in [11], [33]) to obtain a topic matrix β₀ of size W × K. The real-world dataset that we use to generate our synthetic data is derived from a New York Times (NYT) articles dataset [8]. The original vocabulary is first pruned based on document frequencies. Specifically, as is standard practice, only words that appear in more than 500 documents are retained. Thereafter, again as per standard practice, the words in the so-called stop-word list are deleted as recommended in [34]. After these steps, M = 300,000, W = 14,943, and the average document length is N = 298. We then generate semi-synthetic datasets, for various values of M, by fixing N = 300 and using β₀ and a Dirichlet topic prior. As suggested in [11] and used in [5], [7], we use symmetric hyper-parameters (0.03) for the Dirichlet topic prior.

The W × K topic matrix β₀ may not be separable. To enforce separability, we create a new separable (W + K) × K dimensional topic matrix β_sep by inserting K synthetic novel words (one per topic) having suitable probabilities in each topic. Specifically, β_sep is constructed by transforming β₀ as follows. First, for each synthetic novel word in β_sep, the value of the sole nonzero entry in its row is set to the probability of the most probable word in the topic (column) of β₀ for which it is a novel word. Then the resulting (W + K) × K dimensional nonnegative matrix is renormalized column-wise to make it column-stochastic.
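The construction of β_sep described above can be sketched as follows (numpy assumed; the toy β₀ is random rather than LDA-trained, and the helper name is illustrative):

```python
import numpy as np

def make_separable(beta0):
    """Insert one synthetic novel word per topic into a W x K topic matrix:
    each novel word's probability is set to that of its topic's most
    probable word, then columns are renormalized (the beta_sep construction)."""
    novel = np.diag(beta0.max(axis=0))      # K novel-word rows, one per topic
    beta_sep = np.vstack([novel, beta0])    # (W + K) x K nonnegative matrix
    return beta_sep / beta_sep.sum(axis=0, keepdims=True)  # column-stochastic

rng = np.random.default_rng(0)
beta0 = rng.random((10, 3))
beta0 /= beta0.sum(axis=0, keepdims=True)   # toy column-stochastic topics
beta_sep = make_separable(beta0)
print(beta_sep.shape)  # (13, 3)
```

Each of the first K rows of the result has exactly one nonzero entry, so the matrix is separable by construction.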
Finally, we generate semi-synthetic datasets, for various values of M, by fixing N = 300 and using β_sep and the same symmetric Dirichlet topic prior used for β₀. We use the name Semi-Syn to refer to datasets that are generated using β₀ and the name Semi-Syn+Novel for datasets generated using β_sep.

In our proposed random-projections-based algorithm, which we call RP, we set P = 150 × K, ζ = 0.05, and ε = 10⁻⁴. We compare RP against the provably efficient algorithm RecoverL2 in [7] and the standard Gibbs-sampling-based LDA algorithm (denoted by Gibbs) in [11], [33]. In order to measure the performance of the different algorithms in our experiments based on semi-synthetic data, we compute the ℓ₁ norm of the reconstruction error between β̂ and β. Since all column permutations of a given topic matrix correspond to the same topic model (for a corresponding permutation of the topic mixing weights), we use a bipartite graph matching algorithm to optimally match the columns of β̂ with those of β (based on minimizing the sum of ℓ₁ distances between all pairs of matching columns) before computing the ℓ₁ norm of the reconstruction error between β̂ and β.

The results on both Semi-Syn+Novel NYT and Semi-Syn NYT are summarized in Fig. 5 for all three algorithms for various choices of the number of documents M. We note that in these figures the ℓ₁ norm of the error has been normalized by the number of topics (K = 100).

Fig. 5. ℓ₁ norm of the error in estimating the topic matrix β for various M (K = 100): (Top) Semi-Syn+Novel NYT; (Bottom) Semi-Syn NYT.
RP is the proposed algorithm, RecoverL2 is a provably efficient algorithm from [7], and Gibbs is the Gibbs sampling approximation algorithm in [11]. In RP, P = 150K, ζ = 0.05, and ε = 10⁻⁴.

As Fig. 5 shows, when the separability condition is strictly satisfied (Semi-Syn+Novel), the reconstruction error of RP converges to 0 as M becomes large and outperforms the approximation-based Gibbs. When the separability condition is not strictly satisfied (Semi-Syn), the reconstruction error of RP is comparable to that of Gibbs (a practical benchmark).

Solid Angle and Model Selection: In our proposed algorithm RP, the number of topics K (the model order) needs to be specified. When K is unavailable, it needs to be estimated from the data. Although not the focus of this work, Algorithm 2, which identifies novel words by sorting and clustering the estimated solid angles of words, can be suitably modified to estimate K.

Indeed, in the ideal scenario where there is no sampling noise (M = ∞, Ê = E, and ∀i, q̂_i = q_i), only novel words have positive solid angles (q̂_i's) and the rows of Ê corresponding to the novel words of the same topic are identical, i.e., the distance between the rows is zero or, equivalently, they are within a neighborhood of size zero of each other. Thus, the number of distinct neighborhoods of size zero among the nonzero solid angle words equals K. In the nonideal case, M is finite. If M is sufficiently large, one can expect that the estimated solid angles of non-novel words will not all be zero. They are, however, likely to be much smaller than those of novel words. Thus, to reliably estimate K, one should not only exclude words with exactly zero solid angle estimates, but retain only those above some nonzero threshold.
When M is finite, the rows of Ê corresponding to the novel words of the same topic are unlikely to be identical, but if M is sufficiently large they are likely to be close to each other. Thus, if the threshold ζ in Algorithm 2, which determines the size of the neighborhood for clustering all novel words belonging to the same topic, is made sufficiently small, then each neighborhood will contain only novel words belonging to the same topic. With the two modifications discussed above, the number of distinct neighborhoods of a suitably nonzero size (determined by ζ > 0) among the words whose solid-angle estimates are larger than some threshold τ > 0 provides an estimate of K. The values of τ and ζ should, in principle, decrease to zero as M increases to infinity. Leaving the task of unraveling the dependence of τ and ζ on M to future work, here we only provide a brief empirical validation on both the Semi-Syn+Novel and Semi-Syn NYT datasets. We set M = 2,000,000 so that the reconstruction error has essentially converged (see Fig. 5) and consider different choices of the threshold ζ. We run Algorithm 2 with K = 100, P = 150 × K, and a new line of code, 16': (if {q̂_i = 0}, break);, inserted between lines 16 and 17 (this corresponds to τ = 0). The input hyperparameter K = 100 is not the actual number of estimated topics; it should be interpreted as an upper bound on the number of topics. The value of (little) k when Algorithm 2 terminates (see lines 14–21) provides an estimate of the number of topics. Figure 6 illustrates how the solid angles of all words, sorted in descending order, decay for different choices of ζ and how they can be used to detect the novel words and estimate the value of K.
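The two-threshold rule just described can be sketched as follows. This is an illustrative reimplementation, not Algorithm 2 itself: `q_hat` (estimated solid angles) and `E_hat` (estimated co-occurrence rows) are hypothetical inputs, and the greedy single-pass clustering is our simplification:

```python
import numpy as np

def estimate_num_topics(q_hat, E_hat, tau, zeta):
    """Estimate K: keep words with solid angle above tau, then count
    distinct zeta-neighborhoods among their rows of E_hat.

    q_hat : (W,) estimated solid angles.
    E_hat : (W, W) estimated normalized co-occurrence matrix.
    """
    candidates = [w for w in range(len(q_hat)) if q_hat[w] > tau]
    centers = []  # one representative row per detected topic
    for w in candidates:
        row = E_hat[w]
        # open a new cluster iff the row is not within zeta of any center
        if all(np.linalg.norm(row - c) > zeta for c in centers):
            centers.append(row)
    return len(centers)
```

Consistently with the discussion above, an overly large ζ merges clusters of distinct topics (underestimating K), while an overly small τ admits non-novel words.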
We note that on both semi-synthetic datasets, for a wide range of values of ζ (0.1–5), the modified Algorithm 2 correctly estimates the value of K as 100. When ζ is large (e.g., ζ = 10 in Fig. 6), many interior points are declared as novel words and multiple ideal novel words are grouped into one cluster. This causes K to be underestimated (46 and 41 in Fig. 6).

B. Real-world data

We now describe results on the actual real-world NYT dataset that was used in Sec. VI-A to construct the semi-synthetic datasets. Since ground-truth topics are unavailable, we measure performance using the so-called predictive held-out log-probability. This is a standard measure which is typically used to evaluate how well a learned topic model fits real-world data. To calculate it for each of the three topic estimation methods (Gibbs [11], [33], RecoverL2 [7], and RP), we first randomly select 60,000 documents to test the goodness of fit and use the remaining 240,000 documents to produce an estimate β̂ of the topic matrix. Next, we assume a Dirichlet prior on the topics and estimate its concentration hyperparameter α. In Gibbs, this estimate α̂ is a byproduct of the algorithm. In RecoverL2 and RP it can be estimated

[Figure: solid angles of all words, sorted in descending order, for ζ ∈ {0.1, 1, 5, 10}; boxed values mark the resulting estimates of K.]
Fig. 6. Solid angles (in descending order) of all 14943 + 100 words in the Semi-Syn+Novel NYT dataset (left) and all 14943 words in the Semi-Syn NYT dataset (right) estimated (for different values of ζ) by Algorithm 2 with K = 100, P = 150 × K, M = 2,000,000, and a new line of code, 16': (if {q̂_i = 0}, break);, inserted between lines 16 and 17.
The values of j and (little) k when Algorithm 2 terminates are indicated, respectively, by the position of the vertical dashed line and the rectangular box next to it for different ζ.

from β̂ and X. We then calculate the probability of observing the test documents given the learned topic model β̂ and α̂: log Pr(X_test | β̂, α̂). Since an exact evaluation of this predictive log-likelihood is intractable in general, we calculate it using the MCMC-based approximation proposed in [19], which is now a standard approximation tool [33]. For RP, we use P = 150 × K, ζ = 0.05, and ε = 10^−4 as in Sec. VI-A. We report the held-out log-probability, normalized by the total number of words in the test documents, averaged across 5 training/testing splits. The results are summarized in Table I.

TABLE I
NORMALIZED HELD-OUT LOG-PROBABILITY OF RP, RECOVERL2, AND GIBBS SAMPLING ON NYT TEST DATA. THE MEANS ± STD'S ARE CALCULATED FROM 5 DIFFERENT RANDOM TRAINING-TESTING SPLITS.

K      RecoverL2       Gibbs           RP
50     -8.22 ± 0.56    -7.42 ± 0.45    -8.54 ± 0.52
100    -7.63 ± 0.52    -7.50 ± 0.47    -7.45 ± 0.51
150    -8.03 ± 0.38    -7.31 ± 0.41    -7.84 ± 0.48
200    -7.85 ± 0.40    -7.34 ± 0.44    -7.69 ± 0.42

As shown in Table I, Gibbs has the best descriptive power for new documents. RP and RecoverL2 have similar, but somewhat lower, values than Gibbs. This may be attributed to missing novel words that appear only in the test set and are crucial to the success of RecoverL2 and RP. Specifically, in real-world examples there is a model mismatch, as a result of which the data likelihoods of RP and RecoverL2 suffer. Finally, we qualitatively assess the topics produced by our RP algorithm.
We show some example topics extracted by RP trained on the entire NYT dataset of M = 300,000 documents in Table II.³ For each topic, its most frequent words are listed. As can be seen, the estimated topics do form recognizable themes that can be assigned meaningful labels. The full list of all K = 100 topics estimated on the NYT dataset can be found in [3].

³ The zzz prefix in the NYT vocabulary is used to annotate certain special named entities. For example, zzz_nfl annotates NFL.

TABLE II
EXAMPLES OF TOPICS ESTIMATED BY RP ON NYT

Topic label    Words in decreasing order of estimated probabilities
"weather"      weather wind air storm rain cold
"feeling"      feeling sense love character heart emotion
"election"     election zzz_florida ballot vote zzz_al_gore recount
"game"         yard game team season play zzz_nfl

VII. CONCLUSION AND DISCUSSION

This paper proposed a provably consistent and efficient algorithm for topic discovery. We considered a natural structural property of the topic matrix, topic separability, and exploited its geometric implications. We established necessary and sufficient conditions that guarantee consistent novel-word detection as well as separable topic estimation. We then proposed a random-projections-based algorithm that has not only provably polynomial statistical and computational complexity but also state-of-the-art performance on semi-synthetic and real-world datasets.

While we focused on the standard centralized batch implementation in this paper, it turns out that our random-projections-based scheme is naturally amenable to an efficient distributed implementation, which is of interest when the documents are stored on a network of distributed servers.
This is because the i.i.d. isotropic projection directions can be precomputed and shared across document servers, and the counts, projections, and co-occurrence matrix computations have an additive structure which allows partial computations to be performed locally at each document server and then aggregated at a fusion center with only a small communication cost. It turns out that the distributed implementation can provably match the polynomial computational and statistical efficiency guarantees of its centralized counterpart. As a consequence, it provides a provably efficient alternative for the distributed topic estimation problem, which has been tackled using variations of MCMC or Variational Bayes in the literature [20], [35]–[37]. This is appealing for modern web-scale databases, e.g., those generated by Twitter streaming. A comprehensive theoretical and empirical investigation of the distributed variation of our algorithm can be found in [5].

Separability of general measures: We defined and studied the notion of separability for a W × K topic matrix β, which is a finite collection of K probability distributions over a finite set (of size W). It turns out that we can extend the notion of separability to a finite collection of measures over a measurable space. This necessitates a small technical modification of the definition of separability to accommodate the possibility of only having "novel subsets" of zero measure. We also show that our generalized definition of separability is equivalent to the so-called irreducibility property of a finite collection of measures, which has recently been studied in the context of mixture models to establish conditions for the identifiability of the mixing components [38], [39]. Consider a collection of K measures ν_1, ..., ν_K over a measurable space (X, F), where X is a set and F is a σ-algebra over X.
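The additive structure can be illustrated with a small sketch. This is our toy illustration, not the implementation from [5]; the server partitioning, variable names, and the use of a single matrix for both document halves are our simplifying assumptions. Each server projects its local documents onto the shared directions and ships only a W × P partial sum to the fusion center, which adds them:

```python
import numpy as np

rng = np.random.default_rng(0)
W, P = 20, 5                      # vocabulary size, number of projections
D = rng.standard_normal((W, P))   # shared i.i.d. isotropic directions

def local_stats(X1, X2, D):
    """Partial projection sums for one server's local documents.

    X1, X2 : (W, M_s) normalized halves of the local documents.
    Returns the (W, P) matrix sum_m x2_m (x1_m^T D), which is additive
    across servers because the sum over documents distributes over them.
    """
    return X2 @ (X1.T @ D)

# three servers, each holding a local batch of documents (columns)
batches = [rng.dirichlet(np.ones(W), size=m).T for m in (7, 4, 9)]
# fusion center: add the partial sums from each server
V_dist = sum(local_stats(X, X, D) for X in batches)
# centralized computation on the pooled corpus gives the same result
X_all = np.hstack(batches)
V_cent = local_stats(X_all, X_all, D)
assert np.allclose(V_dist, V_cent)
```

The communication cost per server is one W × P matrix, independent of the number of local documents, which is what makes the scheme attractive at web scale.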
We define the generalized notion of separability for measures as follows.

Definition 2 (Separability). A collection of K measures ν_1, ..., ν_K over a measurable space (X, F) is separable if for all k = 1, ..., K,

    inf_{A ∈ F : ν_k(A) > 0}  max_{j ≠ k}  ν_j(A) / ν_k(A) = 0.    (8)

Separability requires that for each measure ν_k there exists a sequence of measurable sets A_n^(k), of nonzero measure with respect to ν_k, such that, for all j ≠ k, the ratios ν_j(A_n^(k)) / ν_k(A_n^(k)) vanish asymptotically. Intuitively, this means that for each measure there exists a sequence of nonzero-measure measurable subsets that are asymptotically "novel" for that measure. When X is a finite set, as in topic modeling, this reduces to the existence of novel words as in Definition 1, and the A_n^(k) are simply the sets of novel words for topic k. The separability property just defined is equivalent to the so-called irreducibility property. Informally, a collection of measures is irreducible if only nonnegative linear combinations of them can produce a measure. Formally,

Definition 3 (Irreducibility). A collection of K measures ν_1, ..., ν_K over a measurable space (X, F) is irreducible if the following condition holds: if Σ_{k=1}^K c_k ν_k(A) ≥ 0 for all A ∈ F, then c_k ≥ 0 for all k = 1, ..., K.

For a collection of nonzero measures,⁴ these two properties are equivalent. Formally,

Lemma 9. A collection of nonzero measures ν_1, ..., ν_K over a measurable space (X, F) is irreducible if and only if it is separable. In particular, a topic matrix β is irreducible if and only if it is separable.

The proof appears in Appendix M.
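In the finite case, Definition 2 reduces to checking for novel words: topic k has a novel word w exactly when β_wk > 0 and β_wj = 0 for all j ≠ k. A minimal sketch of this check (our illustration; the tolerance parameter is an assumption for handling numerical noise):

```python
import numpy as np

def is_separable(beta, tol=0.0):
    """Check (finite) separability of a W x K topic matrix: every
    topic k must have a novel word, i.e., a word whose probability is
    positive under topic k and (up to tol) zero under every other topic."""
    W, K = beta.shape
    for k in range(K):
        others = np.delete(beta, k, axis=1)
        has_novel = np.any((beta[:, k] > tol) & np.all(others <= tol, axis=1))
        if not has_novel:
            return False
    return True
```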
Topic models like LDA discussed in this paper belong to the larger family of Mixed Membership Latent Variable Models [13], which have been successfully employed in a variety of problems that include text analysis, genetic analysis, network community detection, and ranking and preference discovery.

⁴ A measure ν is nonzero if there exists at least one measurable set A for which ν(A) > 0.

The structure-leveraging approach proposed in this paper can potentially be extended to this larger family of models. Some initial steps in this direction for rank and preference data are explored in [32]. Finally, throughout this paper the topic matrix is assumed to be separable. While exact separability may be an idealization, as shown in [16], approximate separability is both theoretically inevitable and practically encountered when W ≫ K. Extending the results of this work to approximately separable topic matrices is an interesting direction for future work. Some steps in this direction are explored in [40] in the context of learning mixed membership Mallows models for rankings.

ACKNOWLEDGMENT

This article is based upon work supported by the U.S. AFOSR under award number #FA9550-10-1-0458 (subaward #A1795) and the U.S. NSF under award numbers #1527618 and #1218992. The views and conclusions contained in this article are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the agencies.

APPENDIX

A. Proof of Lemma 1

Proof. The proof is by contradiction. We will show that if R̄ is non-simplicial, we can construct two topic matrices β^(1) and β^(2) whose sets of novel words are not identical and yet X has the same distribution under both models. The difference between the constructed β^(1) and β^(2) is not a result of a column permutation.
This will imply the impossibility of consistent novel-word detection. Suppose R̄ is non-simplicial. Then we can assume, without loss of generality, that its first row lies within the convex hull of the remaining rows, i.e., R̄_1 = Σ_{j=2}^K c_j R̄_j, where R̄_j denotes the j-th row of R̄ and c_2, ..., c_K ≥ 0, Σ_{j=2}^K c_j = 1, are convex combination weights. Compactly, e^⊤ R̄ e = 0, where e := [−1, c_2, ..., c_K]^⊤. Recalling that R̄ = diag(a)^{-1} R diag(a)^{-1}, where a is a positive vector and R = E(θ_m θ_m^⊤) by definition, we have

    0 = e^⊤ R̄ e = (diag(a)^{-1} e)^⊤ E(θ_m θ_m^⊤)(diag(a)^{-1} e) = E(‖θ_m^⊤ diag(a)^{-1} e‖_2^2),

which implies that θ_m^⊤ diag(a)^{-1} e = 0 almost surely. From this it follows that if we define two nonnegative row vectors b_1 := b [a_1^{-1}, 0, ..., 0] and b_2 := b [(1−α) a_1^{-1}, α c_2 a_2^{-1}, ..., α c_K a_K^{-1}], where b > 0 and 0 < α < 1 are constants, then b_1 θ_m = b_2 θ_m almost surely for any distribution on θ_m. Now we construct two separable topic matrices β^(1) and β^(2) as follows. Let b_1 be the first row and b_2 the second row of β^(1). Let b_2 be the first row and b_1 the second row of β^(2). Let B ∈ R^{(W−2)×K} be a valid separable topic matrix. Set the remaining (W − 2) rows of both β^(1) and β^(2) to be B(I_K − diag(b_1 + b_2)). We can choose b small enough to ensure that each element of (b_1 + b_2) is strictly less than 1. This ensures that β^(1) and β^(2) are column-stochastic and therefore valid separable topic matrices. Observe that b_2 has at least two nonzero components. Thus, word 1 is novel for β^(1) but non-novel for β^(2). By construction, β^(1) θ = β^(2) θ almost surely, i.e., the distribution of X conditioned on θ is the same for both models. Marginalizing over θ, the distribution of X under each topic matrix is the same.
Thus no algorithm can consistently distinguish between β^(1) and β^(2) based on X.

B. Proof of Lemma 2

Proof. The proof is by contradiction. Suppose that R̄ is not affine-independent. Then there exists a λ ≠ 0 with 1^⊤ λ = 0 such that λ^⊤ R̄ = 0, so that λ^⊤ R̄ λ = 0. Recalling that R̄ = diag(a)^{-1} R diag(a)^{-1}, we have

    0 = λ^⊤ R̄ λ = (diag(a)^{-1} λ)^⊤ E(θ_m θ_m^⊤)(diag(a)^{-1} λ) = E(‖θ_m^⊤ diag(a)^{-1} λ‖²),

which implies that θ_m^⊤ diag(a)^{-1} λ = 0 almost surely. Since λ ≠ 0, we can assume, without loss of generality, that the first t elements λ_1, ..., λ_t > 0, the next s elements λ_{t+1}, ..., λ_{t+s} < 0, and the remaining elements are 0, for some s, t with s > 0, t > 0, s + t ≤ K. Therefore, if we define two nonnegative and nonzero row vectors b_1 := b [λ_1 a_1^{-1}, ..., λ_t a_t^{-1}, 0, ..., 0] and b_2 := −b [0, ..., 0, λ_{t+1} a_{t+1}^{-1}, ..., λ_{t+s} a_{t+s}^{-1}, 0, ..., 0], where b > 0 is a constant, then b_1 θ_m = b_2 θ_m almost surely. Now we construct two topic matrices β^(1) and β^(2) as follows. Let b_1 be the first row and b_2 the second row of β^(1). Let b_2 be the first row and b_1 the second row of β^(2). Let B ∈ R^{(W−2)×K} be a valid topic matrix and assume that it is separable. Set the remaining (W − 2) rows of both β^(1) and β^(2) to be B(I_K − diag(b_1 + b_2)). We can choose b small enough to ensure that each element of (b_1 + b_2) is strictly less than 1. This ensures that β^(1) and β^(2) are column-stochastic and therefore valid topic matrices. We note that the supports of b_1 and b_2 are disjoint and both are nonempty; they appear in distinct topics. By construction, β^(1) θ = β^(2) θ almost surely, so the distribution of the observation X conditioned on θ is the same for both models. Marginalizing over θ, the distributions of X under the two topic matrices are the same.
Thus no algorithm can distinguish between β^(1) and β^(2) based on X.

C. Proof of Proposition 1 and Proposition 2

Propositions 1 and 2 summarize the relationships between the full-rank, affine-independence, simplicial, and diagonal-dominance conditions. Here we consider each pairwise implication separately.

(1) R̄ is γ_a-affine-independent ⇒ R̄ is at least γ_a-simplicial.

Proof. By the definition of affine independence, ‖Σ_{k=1}^K λ_k R̄_k‖_2 ≥ γ_a ‖λ‖_2 > 0 for all λ ∈ R^K such that Σ_{k=1}^K λ_k = 0 and λ ≠ 0. If for each i ∈ [K] we set λ_i = 1 and choose λ_k ≤ 0 for all k ≠ i, then (i) ‖λ‖_2 ≥ 1, (ii) {−λ_k, k ≠ i} are convex weights, i.e., they are nonnegative and sum to 1, and (iii) Σ_{k=1}^K λ_k R̄_k = R̄_i − Σ_{k≠i} (−λ_k) R̄_k. Therefore, for all i ∈ [K], ‖R̄_i − Σ_{k≠i} (−λ_k) R̄_k‖_2 ≥ γ_a > 0, which proves that R̄ is at least γ_a-simplicial.

For the reverse implication, consider

    R̄ = [ 1    0    0.5  0.5
           0    1    0.5  0.5
           0.5  0.5  1    0
           0.5  0.5  0    1 ].

It is simplicial but not affine-independent (the (1, 1, −1, −1) combination of its 4 rows is 0).

(2) R̄ is full rank with minimum eigenvalue γ_r ⇒ R̄ is at least γ_r-affine-independent.

Proof. The Rayleigh-quotient characterization of the minimum eigenvalue of a symmetric positive-definite matrix R̄ gives min_{λ≠0} ‖λ^⊤ R̄‖_2 / ‖λ‖_2 = γ_r > 0. Therefore, min_{λ≠0, 1^⊤λ=0} ‖λ^⊤ R̄‖_2 / ‖λ‖_2 ≥ γ_r > 0. One can construct examples that contradict the reverse implication:

    R̄ = [ 1  0  1
           0  1  1
           1  1  2 ],

which is affine-independent but not linearly independent.

(3) R̄ is γ_d-diagonal-dominant ⇒ R̄ is at least γ_d-simplicial.

Proof. Noting that R̄_{i,i} − R̄_{i,j} ≥ γ_d > 0 for all i ≠ j, the distance of the first row R̄_1 of R̄ to any convex combination Σ_{j=2}^K c_j R̄_j of the remaining rows, where c_2, . . .
, c_K are convex combination weights, can be lower bounded by

    ‖R̄_1 − Σ_{j=2}^K c_j R̄_j‖_2 ≥ |R̄_{1,1} − Σ_{j=2}^K c_j R̄_{j,1}| = |Σ_{j=2}^K c_j (R̄_{1,1} − R̄_{j,1})| ≥ γ_d > 0.

Therefore, R̄ is at least γ_d-simplicial. It is straightforward to construct examples that contradict the reverse implication:

    R̄ = [ 1  0  1
           0  1  1
           1  1  2 ],

which is affine-independent, hence simplicial, but not diagonal-dominant.

(4) R̄ being diagonal-dominant neither implies nor is implied by R̄ being affine-independent.

Proof. Consider the two examples above: the 3 × 3 matrix is affine-independent but not diagonal-dominant, while the 4 × 4 matrix is diagonal-dominant but not affine-independent. Together they establish both sides of this assertion.

D. Proof of Lemma 3

Proof. Recall that Ā = β̄ θ̄, where Ā and θ̄ are the row-normalized versions of A and θ, and β̄ := diag(A1)^{-1} β diag(θ1). β̄ is row-stochastic and is separable if β is separable. If w is a novel word of topic k, then β̄_{wk} = 1 and β̄_{wj} = 0 for all j ≠ k, and hence Ā_w = θ̄_k. If w is a non-novel word, Ā_w = Σ_k β̄_{wk} θ̄_k is a convex combination of the rows of θ̄. We next prove that if R̄ is γ_s-simplicial for some constant γ_s > 0, then the random matrix θ̄ is also simplicial with high probability, i.e., for any c ∈ R^K such that c_k = 1, c_j ≤ 0 for j ≠ k, Σ_{j≠k} −c_j = 1, k ∈ [K], the M-dimensional vector c^⊤ θ̄ is not all-zero with high probability. In other words, we need to show that the maximum absolute value of the M entries of c^⊤ θ̄ is strictly positive.
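The two counterexample matrices above can be checked numerically; the following is a quick sanity check of ours, not part of the original proofs:

```python
import numpy as np

R4 = np.array([[1.0, 0.0, 0.5, 0.5],
               [0.0, 1.0, 0.5, 0.5],
               [0.5, 0.5, 1.0, 0.0],
               [0.5, 0.5, 0.0, 1.0]])
# (1, 1, -1, -1) sums to zero and annihilates the rows, so R4 is
# affine-DEPENDENT even though it is simplicial.
lam = np.array([1, 1, -1, -1])
assert lam.sum() == 0 and np.allclose(lam @ R4, 0)
# Yet R4 is diagonal-dominant: R4[i,i] - R4[i,j] >= 0.5 for j != i.
off = R4.max(axis=1, where=~np.eye(4, dtype=bool), initial=-np.inf)
assert np.all(np.diag(R4) - off >= 0.5)

R3 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0],
               [1.0, 1.0, 2.0]])
# Rank 2, so the rows are not linearly independent ...
assert np.linalg.matrix_rank(R3) == 2
# ... yet R3 is affine-independent: the only annihilating row
# combination (up to scaling) is (1, 1, -1), which does not sum to 0.
assert np.allclose(np.array([1, 1, -1]) @ R3, 0)
assert not np.isclose(np.array([1, 1, -1]).sum(), 0)
```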
Noting that the m-th entry of c^⊤ θ̄ (scaled by M) is

    M c^⊤ θ̄_m = c^⊤ diag(a)^{-1} θ_m + c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m,

its absolute value can be lower bounded as follows:

    |M c^⊤ θ̄_m| ≥ |c^⊤ diag(a)^{-1} θ_m| − |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m|.   (9)

The key ideas are: (i) as M increases, the second term in Eq. (9) converges to 0, and (ii) the maximum of the first term in Eq. (9) over m = 1, ..., M is strictly above zero with high probability.

For (i), recall that a = E(θ_m) and 0 ≤ θ_{mk} ≤ 1. By Hoeffding's lemma, for all t > 0,

    Pr(‖(1/M) Σ_d θ_d − a‖_∞ ≥ t) ≤ 2K exp(−2Mt²).

Also note that for all 0 < ε < 1,

    ‖(1/M) Σ_d θ_d − a‖_∞ ≤ ε a_min² / 2
      ⇒ ‖diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}‖_∞ ≤ ε
      ⇒ |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| ≤ ε,

where a_min is the minimum entry of a. The last inequality holds since Σ_{k=1}^K θ_{mk} = 1. In sum, we have

    Pr(|c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| > ε) ≤ 2K exp(−M ε² a_min⁴ / 2).   (10)

For (ii), recall that R̄ is γ_s-simplicial and ‖c^⊤ R̄‖ ≥ γ_s. Therefore, c^⊤ R̄ c = c^⊤ R̄ R̄^† R̄ c ≥ γ_s² / λ_max, where λ_max is the maximum singular value of R̄. Noting that R̄ = diag(a)^{-1} E(θ_m θ_m^⊤) diag(a)^{-1}, we get

    E(|c^⊤ diag(a)^{-1} θ_m|²) ≥ γ_s² / λ_max.   (11)

For convenience, let x_m := |c^⊤ diag(a)^{-1} θ_m|² ≤ 1/a_min². Then, by Hoeffding's lemma,

    Pr(E(x_m) − (1/M) Σ_{m=1}^M x_m ≥ γ_s² / (2λ_max)) ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)).

Combining with Eq. (11), we get

    Pr((1/M) Σ_{m=1}^M x_m ≤ γ_s² / (2λ_max)) ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)).

Hence

    Pr(max_{m=1,...,M} x_m ≤ γ_s² / (2λ_max)) ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)),   (12)

i.e., the maximum absolute value of the first term in Eq. (9) is greater than γ_s / √(2λ_max) with high probability. To sum up, if we set ε = γ_s / √(2λ_max) in Eq.
(10), we get

    Pr(max_{m=1,...,M} |c^⊤ θ̄_m| = 0)
      ≤ Pr(max_{m=1,...,M} x_m ≤ γ_s² / (2λ_max)) + Pr(|c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| > ε)
      ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)) + 2K exp(−M γ_s² a_min⁴ / (4λ_max)).

To summarize, the probability that θ̄ is not simplicial is at most exp(−M γ_s⁴ a_min⁴ / (2λ_max²)) + 2K exp(−M γ_s² a_min⁴ / (4λ_max)), which converges to 0 exponentially fast as M → ∞. Therefore, with high probability, all the row vectors of θ̄ are extreme points of the convex hull they form. This concludes our proof.

E. Proof of Lemma 4

Proof. We first show that if R̄ is γ_a-affine-independent, then θ̄ is also affine-independent with high probability, i.e., for all c ∈ R^K such that c ≠ 0 and Σ_k c_k = 0, c^⊤ θ̄ is not the all-zero vector with high probability. Our proof is similar to that of Lemma 3. We first rewrite the m-th entry of c^⊤ θ̄ (with some scaling) as

    M c^⊤ θ̄_m = c^⊤ diag(a)^{-1} θ_m + c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m

and lower bound its absolute value by

    |M c^⊤ θ̄_m| ≥ |c^⊤ diag(a)^{-1} θ_m| − |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m|.   (13)

We will then show that: (i) as M increases, the second term in Eq. (13) converges to 0, and (ii) the maximum of the first term in Eq. (13) among the M i.i.d. samples is strictly above zero with high probability.

For (i), by the Cauchy-Schwarz inequality,

    |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m|
      ≤ ‖c‖_2 ‖(diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m‖_2
      ≤ ‖c‖_2 ‖diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}‖_∞.

Here the last inequality holds since θ_{mk} ≤ 1 and Σ_k θ_{mk} = 1. Similarly to Eq. (10), we have

    Pr(|c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| ≥ ‖c‖_2 ε) ≤ 2K exp(−M ε² a_min⁴ / 4)   (14)

for any 0 < ε < 1, where a_min is the minimum entry of a.

For (ii), recall that by definition ‖c^⊤ R̄‖_2 ≥ γ_a ‖c‖_2. Hence c^⊤ R̄ c ≥ γ_a² ‖c‖_2² / λ_max.
Therefore, by the construction of R̄, we have

    E(|c^⊤ diag(a)^{-1} θ_m|² / ‖c‖_2²) ≥ γ_a² / λ_max.   (15)

For convenience, let x_m := |c^⊤ diag(a)^{-1} θ_m|² / ‖c‖_2² ≤ 1/a_min². Following the same procedure as in Eq. (12), we have

    Pr(max_{m=1,...,M} x_m ≤ γ_a² / (2λ_max)) ≤ exp(−M γ_a⁴ a_min⁴ / (2λ_max²)).   (16)

Therefore, if we set ε = γ_a / √(2λ_max) in Eq. (14), we get

    Pr(max_{m=1,...,M} |c^⊤ θ̄_m| = 0) ≤ exp(−M γ_a⁴ a_min⁴ / (2λ_max²)) + 2K exp(−M γ_a² a_min⁴ / (4λ_max)).

In summary, if R̄ is γ_a-affine-independent, then θ̄ is also affine-independent with high probability.

Now we turn to the proof of Lemma 4. By Lemma 3, detecting K distinct novel words for the K topics is equivalent to knowing θ̄ up to a row permutation. Noting that Ā_w = Σ_k β̄_{wk} θ̄_k, it follows that β̄_{wk}, k = 1, ..., K, is an optimal solution of the following constrained optimization problem:

    min ‖Ā_w − Σ_{k=1}^K b_k θ̄_k‖_2   s.t.   b_k ≥ 0,  Σ_{k=1}^K b_k = 1.

Since θ̄ is affine-independent with high probability, this optimal solution is unique with high probability. If this were not true, there would exist two distinct solutions b¹_1, ..., b¹_K and b²_1, ..., b²_K such that Ā_w = Σ_{k=1}^K b¹_k θ̄_k = Σ_{k=1}^K b²_k θ̄_k with Σ_k b¹_k = Σ_k b²_k = 1. We would then obtain

    Σ_{k=1}^K (b¹_k − b²_k) θ̄_k = 0,

where the coefficients b¹_k − b²_k are not all zero and Σ_k (b¹_k − b²_k) = 0. This would contradict the definition of affine independence.

Finally, we check the renormalization steps. Recall that since diag(A1) β̄ = β diag(θ1), diag(A1) can be directly obtained from the observations. So we can first renormalize the rows of β̄. Removing diag(θ1) is then simply a column renormalization operation (recall that β is column-stochastic). It is not necessary to know the exact value of diag(θ1).
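The constrained linear regression above, a least-squares fit over the probability simplex, can be sketched with a generic solver. This is our illustration under stated assumptions: SciPy's SLSQP is just one convenient choice of solver, and the variable names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def simplex_regression(A_w, Theta):
    """Solve  min_b || A_w - b @ Theta ||_2^2  s.t.  b >= 0, sum(b) = 1.

    A_w   : (M,) a row of the normalized document matrix.
    Theta : (K, M) rows corresponding to the K detected novel words.
    """
    K = Theta.shape[0]
    res = minimize(
        lambda b: np.sum((A_w - b @ Theta) ** 2),
        x0=np.full(K, 1.0 / K),                # start at the simplex center
        bounds=[(0.0, 1.0)] * K,               # b_k >= 0
        constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```

When the rows of θ̄ are affine-independent, the minimizer is unique and recovers the row β̄_w of the normalized topic matrix.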
To sum up, by solving a constrained linear regression followed by suitable row renormalization, we can obtain a unique solution, which is the ground-truth topic matrix. This concludes the proof of Lemma 4.

F. Proof of Lemma 5

Lemma 5 establishes the second-order co-occurrence estimator in Eq. (1). We first provide a generic method for establishing an explicit convergence bound for a function ψ(X) of d random variables X_1, ..., X_d, and then apply it to establish Lemma 5.

Proposition 3. Let X = [X_1, ..., X_d] be d random variables and a = [a_1, ..., a_d] be positive constants. Let E := ∪_{i∈I} {|X_i − a_i| ≥ δ_i} for some constants δ_i > 0, and let ψ(X) be continuously differentiable on C := E^c. If Pr(|X_i − a_i| ≥ ε) ≤ f_i(ε), i = 1, ..., d, are the individual convergence rates and max_{X∈C} |∂_i ψ(X)| ≤ C_i, then

    Pr(|ψ(X) − ψ(a)| ≥ ε) ≤ Σ_{i∈I} f_i(δ_i) + Σ_{i=1}^d f_i(ε / (d C_i)).

Proof. Since ψ(X) is continuously differentiable on C, for all X ∈ C there exists λ ∈ (0, 1) such that

    ψ(X) − ψ(a) = ∇^⊤ ψ((1 − λ)a + λX) · (X − a).

Therefore,

    Pr(|ψ(X) − ψ(a)| ≥ ε)
      ≤ Pr(X ∈ E) + Pr(Σ_{i=1}^d |∂_i ψ((1 − λ)a + λX)| |X_i − a_i| ≥ ε | X ∈ C)
      ≤ Σ_{i∈I} Pr(|X_i − a_i| ≥ δ_i) + Σ_{i=1}^d Pr(max_{x∈C} |∂_i ψ(x)| |X_i − a_i| ≥ ε/d)
      ≤ Σ_{i∈I} f_i(δ_i) + Σ_{i=1}^d f_i(ε / (d C_i)).

Now we turn to the proof of Lemma 5. Recall that X̄ and X̄′ are obtained from X by first splitting each document's words into two independent halves and then rescaling the rows to make them row-stochastic; hence X̄ = diag^{-1}(X1) X. Also recall that β̄ = diag^{-1}(βa) β diag(a), R̄ = diag^{-1}(a) R diag^{-1}(a), and β̄ is row-stochastic.
For any 1 ≤ i, j ≤ W,

    Ê_{i,j} = [ (1/M) Σ_{m=1}^M X′_{i,m} X_{j,m} ] / [ ((1/M) Σ_{m=1}^M X′_{i,m}) ((1/M) Σ_{m=1}^M X_{j,m}) ].

Writing the numerator and denominators in terms of word indicators,

    F_{i,j}(M, N) := (1/(MN²)) Σ_{m=1}^M Σ_{n=1}^N Σ_{n′=1}^N I(w_{m,n} = i) I(w′_{m,n′} = j),
    G_i(M, N) := (1/(MN)) Σ_{m,n} I(w_{m,n} = i),    H_j(M, N) := (1/(MN)) Σ_{m,n′} I(w′_{m,n′} = j),

so that Ê_{i,j} = F_{i,j}(M, N) / (G_i(M, N) H_j(M, N)). From the strong law of large numbers and the generative topic modeling procedure,

    F_{i,j}(M, N) → E(I(w_{m,n} = i) I(w′_{m,n′} = j)) = (β R β^⊤)_{i,j} =: p_{i,j}   a.s.,
    G_i(M, N) → E(I(w_{m,n} = i)) = (βa)_i =: p_i   a.s.,
    H_j(M, N) → E(I(w′_{m,n′} = j)) = (βa)_j =: p_j   a.s.,

and (β R β^⊤)_{i,j} / ((βa)_i (βa)_j) = E_{i,j} by definition. Using McDiarmid's inequality, we obtain

    Pr(|F_{i,j} − p_{i,j}| ≥ ε) ≤ 2 exp(−ε² M N),
    Pr(|G_i − p_i| ≥ ε) ≤ 2 exp(−2ε² M N),
    Pr(|H_j − p_j| ≥ ε) ≤ 2 exp(−2ε² M N).

In order to bound Pr(|F_{i,j}/(G_i H_j) − p_{i,j}/(p_i p_j)| ≥ ε), we apply Proposition 3. Let ψ(x_1, x_2, x_3) = x_1/(x_2 x_3) with x_1, x_2, x_3 > 0, and a_1 = p_{i,j}, a_2 = p_i, a_3 = p_j. Let I = {2, 3}, δ_2 = γ p_i, and δ_3 = γ p_j. Then |∂_1 ψ| = 1/(x_2 x_3), |∂_2 ψ| = x_1/(x_2² x_3), and |∂_3 ψ| = x_1/(x_2 x_3²). With x_1 = F_{i,j}, x_2 = G_i, and x_3 = H_j, we have F_{i,j} ≤ G_i and F_{i,j} ≤ H_j. Then note that

    C_1 = max_C |∂_1 ψ| = max_C 1/(G_i H_j) ≤ 1/((1−γ)² p_i p_j),
    C_2 = max_C |∂_2 ψ| = max_C F_{i,j}/(G_i² H_j) ≤ max_C 1/(G_i H_j) ≤ 1/((1−γ)² p_i p_j),
    C_3 = max_C |∂_3 ψ| = max_C F_{i,j}/(G_i H_j²) ≤ max_C 1/(G_i H_j) ≤ 1/((1−γ)² p_i p_j).

By applying Proposition 3, we get

    Pr(|F_{i,j}/(G_i H_j) − p_{i,j}/(p_i p_j)| ≥ ε)
      ≤ exp(−2γ² p_i² M N) + exp(−2γ² p_j² M N) + 2 exp(−ε² (1−γ)⁴ (p_i p_j)² M N / 9) + 4 exp(−2ε² (1−γ)⁴ (p_i p_j)² M N / 9)
      ≤ 2 exp(−2γ² η² M N) + 6 exp(−ε² (1−γ)⁴ η⁴ M N / 9),

where η = min_{1≤i≤W} p_i.
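The estimator Ê_{i,j} = F_{i,j}/(G_i H_j) can be written directly in terms of the two halves' count matrices. The sketch below is ours, under the assumptions that `X1` and `X2` are W × M word-count matrices for the two halves, each document contributes the same number of words per half, and every word occurs at least once (so no division by zero):

```python
import numpy as np

def cooccurrence_estimate(X1, X2):
    """E_hat[i, j] = F_ij / (G_i * H_j), where F is the empirical joint
    frequency of word i in the first half and word j in the second half
    of the same document, and G, H are the empirical marginal word
    frequencies in each half.  X1, X2 : (W, M) word-count matrices."""
    M = X1.shape[1]
    N1 = X1.sum(axis=0)[0]            # words per document in half 1
    N2 = X2.sum(axis=0)[0]            # words per document in half 2
    F = (X1 @ X2.T) / (M * N1 * N2)   # joint frequencies, -> p_ij
    G = X1.sum(axis=1) / (M * N1)     # marginals in half 1, -> p_i
    H = X2.sum(axis=1) / (M * N2)     # marginals in half 2, -> p_j
    return F / np.outer(G, H)
```

In the degenerate case where every word is equally frequent and the halves are independent, the estimate is the all-ones matrix, as expected from Ê_{i,j} = p_{i,j}/(p_i p_j).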
There are many strategies for optimizing the free parameter γ. We set 2γ² = (1−γ)⁴/9 and solve for γ to obtain

    Pr(|F_{i,j}/(G_i H_j) − p_{i,j}/(p_i p_j)| ≥ ε) ≤ 8 exp(−ε² η⁴ M N / 20).

Finally, by applying the union bound to the W² entries of Ê, we obtain the claimed result.

G. Proof of Lemma 6

Proof. We first show that when R̄ is γ_s-simplicial and β is separable, Y = R̄ β̄^⊤ is at least γ_s-simplicial. Without loss of generality, we assume that words 1, ..., K are the novel words for topics 1 to K. By definition, β̄^⊤ = [I_K, B], hence Y = R̄ β̄^⊤ = [R̄, R̄B]. Therefore, for convex combination weights c_2, ..., c_K ≥ 0 such that Σ_{j=2}^K c_j = 1,

    ‖Y_1 − Σ_{j=2}^K c_j Y_j‖ ≥ ‖R̄_1 − Σ_{j=2}^K c_j R̄_j‖ ≥ γ_s > 0.

Therefore, the first row vector Y_1 is at least γ_s distant from the convex hull of the remaining rows. Similarly, any row of Y is at least γ_s distant from the convex hull of the remaining rows; hence Y is at least γ_s-simplicial. The rest of the proof is exactly the same as that for Lemma 3.

H. Proof of Lemma 7

Proof. We first show that when R̄ is γ_a-affine-independent and β is separable, Y = R̄ β̄^⊤ is at least γ_a-affine-independent. As in the proof of Lemma 6, we assume that words 1, ..., K are the novel words for topics 1 to K. By definition, β̄^⊤ = [I_K, B], hence Y = R̄ β̄^⊤ = [R̄, R̄B]. For all λ ∈ R^K such that λ ≠ 0 and Σ_{k=1}^K λ_k = 0,

    ‖Σ_{k=1}^K λ_k Y_k‖_2 / ‖λ‖_2 ≥ ‖Σ_{k=1}^K λ_k R̄_k‖_2 / ‖λ‖_2 ≥ γ_a.

Hence Y is affine-independent. The rest of the proof is exactly the same as that for Lemma 4. We note that once the novel words for the K topics are detected, we can use only the corresponding columns of E for the linear regression. Formally, let E* be the W × K matrix formed by the columns of E that correspond to K distinct novel words. Then E* = β̄ R̄.
The rest of the proof is again the same as that for Lemma 4.

I. Proof of Lemma 8

Proof. We first check that if $q_w > 0$, then $w$ must be a novel word. Without loss of generality, let words $1, \ldots, K$ be novel words of the $K$ distinct topics. For all $w$, $E_w = \sum_k \bar{\beta}_{wk} E_k$. Hence, for all $d \in \mathbb{R}^W$,
$$
\langle E_w, d \rangle = \sum_k \bar{\beta}_{wk} \langle E_k, d \rangle \le \max_k \langle E_k, d \rangle,
$$
and the last inequality holds with equality if, and only if, there exists some $k$ such that $\bar{\beta}_{wk} = 1$, which implies that $w$ is a novel word.

We then show that for a novel word $w$, $q_w > 0$. We need to show that for each topic $k$, when $d$ is sampled from an isotropic distribution in $\mathbb{R}^W$, there exists a set of directions $d$ of nonzero probability such that $\langle E_k, d \rangle > \langle E_l, d \rangle$ for $l = 1, \ldots, K$, $l \ne k$. First, one can check by definition that $Y = (E_1^\top, \ldots, E_K^\top)^\top = \bar{R}\bar{\beta}^\top$ is at least $\gamma_s$-simplicial if $\bar{R}$ is $\gamma_s$-simplicial. Let $E_1^*$ be the projection of $E_1$ onto the simplex formed by the remaining row vectors $E_2, \ldots, E_K$. By the orthogonality principle, $\langle E_1 - E_1^*, E_k - E_1^* \rangle \le 0$ for $k = 2, \ldots, K$. Therefore, for $d_1 = E_1^\top - E_1^{*\top}$,
$$
E_1 d_1 - E_k d_1 = \|d_1\|^2 - (E_k - E_1^*) d_1 \ge \gamma_s^2 > 0.
$$
By the continuity of the inner product, there exists a neighborhood on the unit sphere around $d_1 / \|d_1\|_2$ over which $E_1$ attains the maximum projection value. This concludes our proof.

J. Proof of Theorem 2

Proof. We first consider the random projection steps (steps 3 to 12 in Alg. 2). For a projection along direction $d_r$, we first calculate the projection values $v = \bar{X}' \bar{X}^\top d_r$, find the maximizer index $i^*$ and the corresponding set $\hat{J}_{i^*}$, and then evaluate $\mathbb{I}(\forall j \in \hat{J}_w,\ v_w > v_j)$ for all the words $w$ in $\hat{J}_{i^*}^c = \{1, \ldots, W\} \setminus \hat{J}_{i^*}$. (I) The set $\hat{J}_{i^*}^c$ has up to $|\mathcal{C}_k|$ elements asymptotically, where $k$ is the topic associated with word $i^*$.
This is considered a small constant, $O(1)$. (II) Note that $\widehat{E} d_r = M \bar{X}'(\bar{X}^\top d_r)$ and each column of $\bar{X}$ has at most $N \ll W$ nonzero entries. Calculating the $W \times 1$ projection value vector $v$ therefore requires two sparse matrix–vector multiplications and takes $O(MN)$ time. Finding the maximum requires $O(W)$ running time. (III) To evaluate one set $\hat{J}_i \leftarrow \{ j : \widehat{E}_{i,i} + \widehat{E}_{j,j} - 2\widehat{E}_{i,j} \ge \zeta/2 \}$, we need to calculate $\widehat{E}_{i,j}$ for $j = 1, \ldots, W$. This can be viewed as projecting $\widehat{E}$ along $d = \mathbf{e}_i$ and takes $O(MN)$ time. We also note that the diagonal entries $\widehat{E}_{w,w}$, $w = 1, \ldots, W$, can be calculated once using $O(W)$ time. To sum up, these steps take $O(MNP + WP)$ running time.

We then consider the detecting and clustering steps (steps 14 to 21 in Alg. 2). All the conditions in step 17 have been calculated in the previous steps, and since the number of novel words per topic is a small constant, this step requires $O(K^2)$ running time.

We last consider the topic estimation steps in Algorithm 3. Here all the corresponding inputs for the linear regression have already been computed in the projection step. Each linear regression has $K$ variables, and we upper bound its running time by $O(K^3)$. Calculating the row-normalization factors $\frac{1}{M}\mathbf{X}\mathbf{1}$ requires $O(MN)$ time. The row and column re-normalizations each require at most $O(WK)$ running time. Overall, we need $O(WK^3 + MN)$ running time.

The other steps are also efficient. Splitting each document into two independent halves takes time linear in $N$ for each document, since we can achieve it using a random permutation over $N$ items. Generating each random direction $d_r$ requires $O(W)$ time if we use the spherical Gaussian prior.
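To make the running-time claims concrete, the per-direction work of steps 3 to 12 reduces to two sparse matrix–vector products, and each set $\hat{J}_i$ is a thresholded row of the distance statistic. The sketch below is our own illustrative code (assuming `scipy.sparse` inputs), not the paper's reference implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

def projection_values(Xbar, Xbar_p, d):
    """Projection of the (scaled) co-occurrence matrix along direction d
    without materializing the W x W matrix: Xbar_p @ (Xbar.T @ d) costs
    O(M N) when each column of the (W, M) sparse matrices has at most
    N << W nonzero entries."""
    return Xbar_p @ (Xbar.T @ d)

def J_set(E_hat, i, zeta):
    """J_i = { j : E_ii + E_jj - 2 E_ij >= zeta / 2 }.  The row E_hat[i]
    can itself be obtained by projecting along the basis vector e_i."""
    diag = np.diag(E_hat)                       # computed once, O(W)
    stat = E_hat[i, i] + diag - 2.0 * E_hat[i]  # distance statistic
    return np.flatnonzero(stat >= zeta / 2.0)
```

Given the projection values, the maximizer $i^*$ is found with a single $O(W)$ pass, e.g. `i_star = int(np.argmax(projection_values(Xbar, Xbar_p, d)))`.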
While we could directly sort the empirical estimated solid angles (in $O(W \log W)$ time), we only search for the words with the largest solid angles, whose number is a constant with respect to $W$; therefore this takes only $O(W)$ time.

K. Proof of Theorem 3

We focus on the case where the random projection directions are sampled from an arbitrary isotropic distribution. Our proof is not tied to the specific form of the distribution, only to its isotropic nature. We first provide some useful propositions. We denote by $\mathcal{C}_k$ the set of all novel words of topic $k$, for $k \in [K]$, and by $\mathcal{C}_0$ the set of all non-novel words. We first show:

Proposition 4. Let $E_i$ be the $i$-th row of $E$. Suppose $\beta$ is separable and $\bar{R}$ is $\gamma_s$-simplicial. Then, for all $k \in [K]$:
- if $i \in \mathcal{C}_k$ and $j \in \mathcal{C}_k$, then $\|E_i - E_j\| = 0$ and $E_{i,i} - 2E_{i,j} + E_{j,j} = 0$;
- if $i \in \mathcal{C}_k$ and $j \notin \mathcal{C}_k$, then $\|E_i - E_j\| \ge (1-b)\gamma_s$ and $E_{i,i} - 2E_{i,j} + E_{j,j} \ge (1-b)^2 \gamma_s^2 / \lambda_{\max}$;

where $b = \max_{j \in \mathcal{C}_0,\, l} \bar{\beta}_{j,l}$ and $\lambda_{\max} > 0$ is the maximum eigenvalue of $\bar{R}$.

Proof. We focus on the case $k = 1$, since the proofs for the other values of $k$ are analogous. Let $\bar{\beta}_i$ be the $i$-th row vector of the matrix $\bar{\beta}$. To show the above results, recall that $E = \bar{\beta}\bar{R}\bar{\beta}^\top$. Then
$$
\|E_i - E_j\| = \|(\bar{\beta}_i - \bar{\beta}_j)\bar{R}\bar{\beta}^\top\|, \qquad
E_{i,i} - 2E_{i,j} + E_{j,j} = (\bar{\beta}_i - \bar{\beta}_j)\bar{R}(\bar{\beta}_i - \bar{\beta}_j)^\top.
$$
It is clear that when $i, j \in \mathcal{C}_1$, i.e., they are both novel words of the same topic, $\bar{\beta}_i = \bar{\beta}_j = \mathbf{e}_1$. Hence $\|E_i - E_j\| = 0$ and $E_{i,i} - 2E_{i,j} + E_{j,j} = 0$. When $i \in \mathcal{C}_1$ and $j \notin \mathcal{C}_1$, we have $\bar{\beta}_i = [1, 0, \ldots, 0]$ and $\bar{\beta}_j = [\bar{\beta}_{j,1}, \bar{\beta}_{j,2}, \ldots, \bar{\beta}_{j,K}]$ with $\bar{\beta}_{j,1} < 1$. Then
$$
\bar{\beta}_i - \bar{\beta}_j = [1 - \bar{\beta}_{j,1}, -\bar{\beta}_{j,2}, \ldots, -\bar{\beta}_{j,K}] = (1 - \bar{\beta}_{j,1})[1, -c_2, \ldots, -c_K] := (1 - \bar{\beta}_{j,1})\,\mathbf{e}^\top,
$$
with $\sum_{l=2}^{K} c_l = 1$.
Therefore, defining $Y := \bar{R}\bar{\beta}^\top$, we get
$$
\|E_i - E_j\|_2 = (1 - \bar{\beta}_{j,1}) \Big\| Y_1 - \sum_{l=2}^{K} c_l Y_l \Big\|_2.
$$
Noting that $Y$ is at least $\gamma_s$-simplicial, we have $\|E_i - E_j\|_2 \ge (1-b)\gamma_s$, where $b = \max_{j \in \mathcal{C}_0,\, k} \bar{\beta}_{j,k} < 1$. Similarly, note that $\|\mathbf{e}^\top \bar{R}\| \ge \gamma_s$, and let $\bar{R} = U\Sigma U^\top$ be its singular value decomposition. If $\lambda_{\max}$ is the maximum eigenvalue of $\bar{R}$, then we have
$$
E_{i,i} - 2E_{i,j} + E_{j,j} = (1 - \bar{\beta}_{j,1})^2 \, (\mathbf{e}^\top \bar{R})\, U \Sigma^{-1} U^\top (\mathbf{e}^\top \bar{R})^\top \ge (1-b)^2 \gamma_s^2 / \lambda_{\max}.
$$
The inequality in the last step follows from the observation that $\mathbf{e}^\top \bar{R}$ lies within the column space spanned by $U$.

The results in Proposition 4 provide two statistics for identifying novel words of the same topic: $\|E_i - E_j\|$ and $E_{i,i} - 2E_{i,j} + E_{j,j}$. While the first is straightforward, the latter is more efficient to calculate in practice, with better computational complexity. Specifically, its empirical version, the set $\mathcal{J}_i$ in Algorithm 2,
$$
\mathcal{J}_i = \{ j : \widehat{E}_{i,i} - \widehat{E}_{i,j} - \widehat{E}_{j,i} + \widehat{E}_{j,j} \ge d/2 \},
$$
can be used to discover the set of novel words of the same topic asymptotically. Formally:

Proposition 5. If $\|\widehat{E} - E\|_\infty \le (1-b)^2 \gamma_s^2 / (8\lambda_{\max})$, then
1) for a novel word $i \in \mathcal{C}_k$, $\mathcal{J}_i = \mathcal{C}_k^c$;
2) for a non-novel word $j \in \mathcal{C}_0$, $\mathcal{J}_j \supset \mathcal{C}_0^c$.

Now we show that Algorithm 2 can detect all the novel words of the $K$ distinct topics consistently. As illustrated in Lemma 8, we detect the novel words by ranking the solid angles $q_i$. We denote the minimum solid angle of the $K$ extreme points by $q_\wedge$. Our proof shows that the estimated solid angle in Eq. (5),
$$
\hat{p}_i = \frac{1}{P} \sum_{r=1}^{P} \mathbb{I}\{ \forall j \in \mathcal{J}_i,\ \widehat{E}_j d_r \le \widehat{E}_i d_r \}, \tag{17}
$$
converges to the ideal solid angle
$$
q_i = \Pr\{ \forall j \in \mathcal{S}(i),\ (E_i - E_j) d \ge 0 \} \tag{18}
$$
as $M, P \to \infty$, where $d_1, \ldots, d_P$ are i.i.d. directions drawn from an isotropic distribution. For a novel word $i \in \mathcal{C}_k$, $k = 1, \ldots$
$, K$, let $\mathcal{S}(i) = \mathcal{C}_k^c$, and for a non-novel word $i \in \mathcal{C}_0$, let $\mathcal{S}(i) = \mathcal{C}_0^c$. To show the convergence of $\hat{p}_i$ to $q_i$, we consider an intermediate quantity,
$$
p_i(\widehat{E}) = \Pr\{ \forall j \in \mathcal{J}_i,\ (\widehat{E}_i - \widehat{E}_j) d \ge 0 \}.
$$
First, by Hoeffding's lemma, we have the following result.

Proposition 6. For all $t \ge 0$ and all $i$,
$$
\Pr\{ |\hat{p}_i - p_i(\widehat{E})| \ge t \} \le 2\exp(-2Pt^2). \tag{19}
$$

Next we show the convergence of $p_i(\widehat{E})$ to the solid angle $q_i$:

Proposition 7. Consider the case when $\|\widehat{E} - E\|_\infty \le d/8$ and $\bar{R}$ is $\gamma_s$-simplicial. If $i$ is a novel word, then
$$
q_i - p_i(\widehat{E}) \le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty.
$$
Similarly, if $j$ is a non-novel word, we have
$$
p_j(\widehat{E}) - q_j \le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty,
$$
where $d_2 \triangleq (1-b)\gamma_s$ and $d \triangleq (1-b)^2 \gamma_s^2 / \lambda_{\max}$.

Proof. First note that, by the definition of $\mathcal{J}_i$ and Proposition 4, if $\|\widehat{E} - E\|_\infty \le d/8$, then for a novel word $i \in \mathcal{C}_k$, $\mathcal{J}_i = \mathcal{S}(i)$, and for a non-novel word $i \in \mathcal{C}_0$, $\mathcal{J}_i \supseteq \mathcal{S}(i)$. For convenience, let
$$
A_j = \{ d : (\widehat{E}_i - \widehat{E}_j) d \ge 0 \}, \quad A = \bigcap_{j \in \mathcal{J}_i} A_j, \qquad
B_j = \{ d : (E_i - E_j) d \ge 0 \}, \quad B = \bigcap_{j \in \mathcal{S}(i)} B_j.
$$
For $i$ a novel word, we consider
$$
q_i - p_i(\widehat{E}) = \Pr\{B\} - \Pr\{A\} \le \Pr\{B \cap A^c\}.
$$
Noting that $\mathcal{J}_i = \mathcal{S}(i)$ when $\|\widehat{E} - E\|_\infty \le d/8$,
$$
\Pr\{B \cap A^c\}
= \Pr\Big\{ B \cap \Big( \bigcup_{j \in \mathcal{S}(i)} A_j^c \Big) \Big\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\Big\{ \Big( \bigcap_{l \in \mathcal{S}(i)} B_l \Big) \cap A_j^c \Big\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\{ B_j \cap A_j^c \}
= \sum_{j \in \mathcal{S}(i)} \Pr\{ (\widehat{E}_i - \widehat{E}_j) d < 0 \ \text{and}\ (E_i - E_j) d \ge 0 \}
= \sum_{j \in \mathcal{S}(i)} \frac{\phi_j}{2\pi},
$$
where $\phi_j$ is the angle between $e_j = E_i - E_j$ and $\hat{e}_j = \widehat{E}_i - \widehat{E}_j$, for any isotropic distribution on $d$. Noting that $\phi \le \tan(\phi)$,
$$
\Pr\{B \cap A^c\} \le \sum_{j \in \mathcal{S}(i)} \frac{\tan(\phi_j)}{2\pi}
\le \sum_{j \in \mathcal{S}(i)} \frac{1}{2\pi} \frac{\|\hat{e}_j - e_j\|_2}{\|e_j\|_2}
\le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty,
$$
where the last inequality follows from the relationship between the $\ell_\infty$ and $\ell_2$ norms, and the fact that for $j \in \mathcal{S}(i)$, $\|e_j\|_2 = \|E_i - E_j\|_2 \ge d_2 \triangleq (1-b)\gamma_s$.
Therefore, for a novel word $i$, we have
$$
q_i - p_i(\widehat{E}) \le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty.
$$
Similarly, for a non-novel word $i \in \mathcal{C}_0$, $\mathcal{J}_i \supseteq \mathcal{S}(i)$, and
$$
p_i(\widehat{E}) - q_i = \Pr\{A\} - \Pr\{B\} = \Pr\{A \cap B^c\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\Big\{ \Big( \bigcap_{l \in \mathcal{J}_i} A_l \Big) \cap B_j^c \Big\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\{ A_j \cap B_j^c \}
\le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty.
$$
A direct implication of Proposition 7 is:

Proposition 8. For all $\epsilon > 0$, let $\rho = \min\{ d/8,\ \pi d_2 \epsilon / W^{1.5} \}$. If $\|\widehat{E} - E\|_\infty \le \rho$, then $q_i - p_i(\widehat{E}) \le \epsilon$ for a novel word $i$, and $p_j(\widehat{E}) - q_j \le \epsilon$ for a non-novel word $j$.

We now prove Theorem 3. In order to correctly detect all the novel words of the $K$ distinct topics, we decompose the error event into the union of the following two types:
1) Sorting error, i.e., $\exists i \in \bigcup_{k=1}^{K} \mathcal{C}_k$ and $\exists j \in \mathcal{C}_0$ such that $\hat{p}_i < \hat{p}_j$. This event is denoted by $A_{i,j}$; let $A = \bigcup A_{i,j}$.
2) Clustering error, i.e., $\exists k$ and $\exists i, j \in \mathcal{C}_k$ such that $i \notin \mathcal{J}_j$. This event is denoted by $B_{i,j}$; let $B = \bigcup B_{i,j}$.
We point out that the events $A, B$ here are different from the notation used in Proposition 7. Following Proposition 8, we define $\rho = \min\{ d/8,\ \pi d_2 q_\wedge / (4 W^{1.5}) \}$ and the event $C = \{ \|E - \widehat{E}\|_\infty \ge \rho \}$. We note that $B \subset C$. Therefore,
$$
P_e = \Pr\{A \cup B\} \le \Pr\{A \cap C^c\} + \Pr\{C\}
\le \sum_{i\ \mathrm{novel},\ j\ \mathrm{non\text{-}novel}} \Pr\{A_{i,j} \cap C^c\} + \Pr\{C\}
\le \sum_{i,j} \Pr\big( \{\hat{p}_i - \hat{p}_j < 0\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big) + \Pr\big( \|\widehat{E} - E\|_\infty > \rho \big).
$$
The second term can be bounded by Lemma 5. Now we focus on the first term.
Note that
$$
\hat{p}_i - \hat{p}_j = \{\hat{p}_i - p_i(\widehat{E})\} + \{p_i(\widehat{E}) - q_i\} + \{p_j(\widehat{E}) - \hat{p}_j\} + \{q_j - p_j(\widehat{E})\} + q_i - q_j,
$$
and use the fact that $q_i - q_j \ge q_\wedge$. Then
$$
\Pr\big( \{\hat{p}_i < \hat{p}_j\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
\le \Pr\big( p_i(\widehat{E}) - \hat{p}_i \ge q_\wedge/4 \big) + \Pr\big( \hat{p}_j - p_j(\widehat{E}) \ge q_\wedge/4 \big)
$$
$$
\quad + \Pr\big( \{q_i - p_i(\widehat{E}) \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
+ \Pr\big( \{p_j(\widehat{E}) - q_j \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
$$
$$
\le 2\exp(-P q_\wedge^2 / 8)
+ \Pr\big( \{q_i - p_i(\widehat{E}) \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
+ \Pr\big( \{p_j(\widehat{E}) - q_j \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big).
$$
The last inequality is by Proposition 6. By Proposition 8, the last two terms are $0$. Therefore, applying Lemma 5, we obtain
$$
P_e \le 2W^2 \exp(-P q_\wedge^2 / 8) + 8W^2 \exp(-\rho^2 \eta^4 MN / 20).
$$
This concludes the proof of Theorem 3.

L. Proof of Theorem 4

Without loss of generality, let $1, \ldots, K$ be the novel words of topics $1$ to $K$. We first consider the solution of the constrained linear regression. To simplify the notation, we let $E_i = [E_{i,1}, \ldots, E_{i,K}]$ denote the first $K$ entries of a row vector, without the superscripts used in Algorithm 3.

Proposition 9. Let $\bar{R}$ be $\gamma_a$-affine-independent. The solution to the following optimization problem,
$$
\hat{b}^* = \arg\min_{b_j \ge 0,\ \sum b_j = 1} \Big\| \widehat{E}_i - \sum_{j=1}^{K} b_j \widehat{E}_j \Big\|,
$$
converges to the $i$-th row of $\bar{\beta}$, $\bar{\beta}_i$, as $M \to \infty$. Moreover,
$$
\Pr\big( \|\hat{b}^* - \bar{\beta}_i\|_\infty \ge \epsilon \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{320K} \Big),
$$
where $\eta$ is defined as in Lemma 5.

Proof. We note that $\bar{\beta}_i$ is the optimal solution to the following problem with the ideal word co-occurrence statistics:
$$
b^* = \arg\min_{b_j \ge 0,\ \sum b_j = 1} \Big\| E_i - \sum_{j=1}^{K} b_j E_j \Big\|.
$$
Define $f(E, b) = \| E_i - \sum_{j=1}^{K} b_j E_j \|$ and note that $f(E, b^*) = 0$. Let $Y = [E_1^\top, \ldots, E_K^\top]^\top$.
Then,
$$
f(E, b) - f(E, b^*) = \Big\| E_i - \sum_{j=1}^{K} b_j E_j \Big\| - 0 = \Big\| \sum_{j=1}^{K} (b_j - b^*_j) E_j \Big\| = \sqrt{(b - b^*) Y Y^\top (b - b^*)^\top} \ge \|b - b^*\|\, \gamma_a.
$$
The last inequality holds by the definition of affine-independence. Next, note that
$$
| f(E, b) - f(\widehat{E}, b) | \le \Big\| E_i - \widehat{E}_i + \sum_j b_j (\widehat{E}_j - E_j) \Big\| \le \| E_i - \widehat{E}_i \| + \sum_j b_j \| \widehat{E}_j - E_j \| \le 2 \max_w \| \widehat{E}_w - E_w \|.
$$
Combining the above inequalities, we obtain
$$
\| \hat{b}^* - b^* \| \le \frac{1}{\gamma_a} \{ f(E, \hat{b}^*) - f(E, b^*) \}
= \frac{1}{\gamma_a} \{ f(E, \hat{b}^*) - f(\widehat{E}, \hat{b}^*) + f(\widehat{E}, \hat{b}^*) - f(\widehat{E}, b^*) + f(\widehat{E}, b^*) - f(E, b^*) \}
$$
$$
\le \frac{1}{\gamma_a} \{ f(E, \hat{b}^*) - f(\widehat{E}, \hat{b}^*) + f(\widehat{E}, b^*) - f(E, b^*) \}
\le \frac{4 K^{0.5}}{\gamma_a} \| \widehat{E} - E \|_\infty,
$$
where the last term converges to $0$ almost surely. The convergence rate follows directly from Lemma 5.

We next consider the row renormalization. Let $\hat{b}^*(i)$ be the optimal solution in Proposition 9 for the $i$-th word, and consider
$$
\widehat{B}_i := \hat{b}^*(i)^\top \Big( \frac{1}{M} \mathbf{X} \mathbf{1}_{M \times 1} \Big)_i \to \beta_i \,\mathrm{diag}(a). \tag{20}
$$
To obtain the convergence rate of the above equation, it is straightforward to apply the result of Lemma 5.

Proposition 10. For the row-scaled estimate $\widehat{B}_i$ as in Eq. (20), we have
$$
\Pr\big( | \widehat{B}_{i,k} - \beta_{i,k} a_k | \ge \epsilon \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280K} \Big).
$$

Proof. By Proposition 9, we have
$$
\Pr\big( | \hat{b}^*(i)_k - \bar{\beta}_{i,k} | \ge \epsilon/2 \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280K} \Big).
$$
Recall that in Lemma 5, by McDiarmid's inequality, we have
$$
\Pr\Big( \Big| \frac{1}{M} (\mathbf{X} \mathbf{1}_{M \times 1})_i - (\beta a)_i \Big| \ge \epsilon/2 \Big) \le \exp(-\epsilon^2 MN / 2).
$$
Therefore,
$$
\Pr\big( | \widehat{B}_{i,k} - \beta_{i,k} a_k | \ge \epsilon \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280K} \Big) + \exp(-\epsilon^2 MN / 2),
$$
where the second term is dominated by the first.

Finally, we consider the column normalization step to remove the effect of $\mathrm{diag}(a)$:
$$
\hat{\beta}_{i,k} := \widehat{B}_{i,k} \Big/ \sum_{w=1}^{W} \widehat{B}_{w,k}, \tag{21}
$$
and $\sum_{w=1}^{W} \widehat{B}_{w,k} \to a_k$ for $k = 1, \ldots, K$.
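For concreteness, the simplex-constrained regression of Proposition 9 and the column renormalization of Eq. (21) can be sketched in numpy as follows. The choice of projected gradient descent with a Euclidean simplex projection is our own illustrative solver; the analysis above only requires that the constrained least-squares problem be solved exactly:

```python
import numpy as np

def project_simplex(b):
    """Euclidean projection onto the probability simplex
    { b : b >= 0, sum(b) = 1 }."""
    u = np.sort(b)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, b.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(b - theta, 0.0)

def simplex_regression(Ei, Y, iters=2000):
    """Minimize || Ei - b @ Y || over the simplex by projected gradient.
    Y : (K, K) matrix whose rows are E_1, ..., E_K;  Ei : (K,) row E_i."""
    K = Y.shape[0]
    lr = 1.0 / np.linalg.norm(Y @ Y.T, 2)     # step = 1 / Lipschitz const.
    b = np.full(K, 1.0 / K)                   # start at the barycenter
    for _ in range(iters):
        grad = (b @ Y - Ei) @ Y.T             # gradient of 0.5 * ||.||^2
        b = project_simplex(b - lr * grad)
    return b

def renormalize(B_hat):
    """Column normalization of Eq. (21): divide each column by its sum
    to remove the effect of diag(a)."""
    return B_hat / B_hat.sum(axis=0, keepdims=True)
```

Since the objective is a convex quadratic over a compact convex set, this projected-gradient sketch converges to the constrained minimizer $\hat{b}^*$ studied above.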
A worst-case analysis of its convergence is
$$
\Pr\Big( \Big| \sum_{w=1}^{W} \widehat{B}_{w,k} - a_k \Big| > \epsilon \Big)
\le W \Pr\big( | \widehat{B}_{i,k} - \beta_{i,k} a_k | \ge \epsilon/W \big)
\le 8W^3 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280 W^2 K} \Big).
$$
Combining all the results above, we can show that, for all $i = 1, \ldots, W$ and $k = 1, \ldots, K$,
$$
\Pr\big( | \hat{\beta}_{i,k} - \beta_{i,k} | > \epsilon \big) \le 8W^4 K \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 a_{\min}^2 \eta^4}{2560 W^2 K} \Big),
$$
where $a_{\min} > 0$ is the minimum entry of $a$. This concludes the proof of Theorem 4.

M. Proof of Lemma 9

Proof. We first show that irreducibility implies separability or, equivalently, that if the collection is not separable, then it is not irreducible. Suppose that $\{\nu_1, \ldots, \nu_K\}$ is not separable. Then there exist some $k \in [K]$ and a $\delta > 0$ such that
$$
\inf_{A :\ \nu_k(A) > 0}\ \max_{j : j \ne k} \frac{\nu_j(A)}{\nu_k(A)} = \delta > 0.
$$
Then, for all $A \in \mathcal{F}$ with $\nu_k(A) > 0$, $\max_{j : j \ne k} \nu_j(A)/\nu_k(A) \ge \delta$. This implies that, for all $A \in \mathcal{F}$ with $\nu_k(A) > 0$,
$$
\sum_{j : j \ne k} \nu_j(A) - \delta \nu_k(A) \ge 0.
$$
On the other hand, for all $A \in \mathcal{F}$ with $\nu_k(A) = 0$, we have
$$
\sum_{j : j \ne k} \nu_j(A) - \delta \nu_k(A) = \sum_{j : j \ne k} \nu_j(A) \ge 0.
$$
Thus the linear combination $\sum_{j \ne k} \nu_j - \delta \nu_k$, which has one strictly negative coefficient $-\delta$, is nonnegative over all measurable $A$. This implies that the collection of measures $\{\nu_1, \ldots, \nu_K\}$ is not irreducible.

We next show that separability implies irreducibility. If the collection of measures $\{\nu_1, \ldots, \nu_K\}$ is separable, then by the definition of separability, for every $k$ there exist $A_n^{(k)} \in \mathcal{F}$, $n = 1, 2, \ldots$, such that $\nu_k(A_n^{(k)}) > 0$ and, for all $j \ne k$, $\nu_j(A_n^{(k)}) / \nu_k(A_n^{(k)}) \to 0$ as $n \to \infty$. Now consider any linear combination of measures $\sum_{i=1}^{K} c_i \nu_i$ that is nonnegative over all measurable sets, i.e., $\sum_{i=1}^{K} c_i \nu_i(A) \ge 0$ for all $A \in \mathcal{F}$. Then for all $k = 1, \ldots$
$, K$ and all $n \ge 1$, we have
$$
\sum_{i=1}^{K} c_i \nu_i(A_n^{(k)}) \ge 0
\;\Rightarrow\; \nu_k(A_n^{(k)}) \Big( c_k + \sum_{j \ne k} c_j \frac{\nu_j(A_n^{(k)})}{\nu_k(A_n^{(k)})} \Big) \ge 0
\;\Rightarrow\; c_k \ge -\sum_{j \ne k} c_j \frac{\nu_j(A_n^{(k)})}{\nu_k(A_n^{(k)})} \to 0 \quad \text{as } n \to \infty.
$$
Therefore $c_k \ge 0$ for all $k$, and the collection of measures is irreducible.

REFERENCES

[1] D. Blei, "Probabilistic topic models," Commun. of the ACM, vol. 55, no. 4, pp. 77–84, 2012.
[2] W. Ding, M. Rohban, P. Ishwar, and V. Saligrama, "A new geometric approach to latent topic modeling and discovery," in IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013.
[3] W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, "Topic discovery through data dependent and random projections," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, USA, Jun. 2013.
[4] W. Ding, P. Ishwar, M. H. Rohban, and V. Saligrama, "Necessary and sufficient conditions for novel word detection in separable topic models," in Advances in Neural Information Processing Systems (NIPS), Workshop on Topic Models: Computation, Application, Lake Tahoe, NV, USA, Dec. 2013.
[5] W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, "Efficient distributed topic modeling with provable guarantees," in Proc. of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, Apr. 2014.
[6] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003.
[7] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, "A practical algorithm for topic modeling with provable guarantees," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, USA, Jun. 2013.
[8] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[9] S. Arora, R.
Ge, and A. Moitra, "Learning topic models – going beyond SVD," in Proc. of the IEEE 53rd Annual Symposium on Foundations of Computer Science, New Brunswick, NJ, USA, Oct. 2012.
[10] D. Sontag and D. Roy, "Complexity of inference in latent Dirichlet allocation," in NIPS, 2011, pp. 1008–1016.
[11] T. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, pp. 5228–5235, 2004.
[12] M. Wainwright and M. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[13] E. Airoldi, D. Blei, E. Erosheva, and S. Fienberg, Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC, 2014.
[14] A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade, "A tensor approach to learning mixed membership community models," J. Mach. Learn. Res., vol. 15, no. 1, pp. 2239–2312, 2014.
[15] A. Kumar, V. Sindhwani, and P. Kambadur, "Fast conical hull algorithms for near-separable non-negative matrix factorization," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, Jun. 2013.
[16] W. Ding, P. Ishwar, and V. Saligrama, "Most large topic models are approximately separable," in ITA, 2015.
[17] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57.
[18] D. Blei and J. Lafferty, "A correlated topic model of science," The Annals of Applied Statistics, vol. 1, no. 1, pp. 17–35, 2007.
[19] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno, "Evaluation methods for topic models," in Proc. of the 26th International Conference on Machine Learning, Montreal, Canada, Jun. 2009.
[20] M. Hoffman, F. R. Bach, and D. M.
Blei, "Online learning for latent Dirichlet allocation," in Advances in Neural Information Processing Systems, 2010, pp. 856–864.
[21] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.
[22] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.
[23] B. Recht, C. Re, J. Tropp, and V. Bittorf, "Factoring nonnegative matrices with linear programs," in Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, Dec. 2012, pp. 1223–1231.
[24] S. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM J. on Optimization, vol. 20, no. 3, pp. 1364–1377, Oct. 2009.
[25] A. Anandkumar, D. Hsu, A. Javanmard, and S. Kakade, "Learning linear Bayesian networks with latent variables," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, Jun. 2013.
[26] P. Awasthi and A. Risteski, "On some provably correct cases of variational inference for topic models," arXiv:1503.06567 [cs.LG], 2015.
[27] T. Bansal, C. Bhattacharyya, and R. Kannan, "A provable SVD-based algorithm for learning topics in dominant admixture corpus," in Advances in Neural Information Processing Systems, 2014, pp. 1997–2005.
[28] J. Boardman, "Automating spectral unmixing of AVIRIS data using convex geometry concepts," in Proc. Ann. JPL Airborne Geoscience Workshop, 1993, p. 1114.
[29] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press, 2004, pp. 1141–1148.
[30] N. Gillis and S. A. Vavasis, "Fast and robust recursive algorithms for separable nonnegative matrix factorization," IEEE Trans.
on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 698–714, 2014.
[31] V. Farias, S. Jagabathula, and D. Shah, "A data-driven approach to modeling choice," in Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2009.
[32] W. Ding, P. Ishwar, and V. Saligrama, "A topic modeling approach to ranking," in Proc. of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, May 2015.
[33] A. McCallum, "MALLET: A machine learning for language toolkit," 2002, http://mallet.cs.umass.edu.
[34] D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004.
[35] L. Yao, D. Mimno, and A. McCallum, "Efficient methods for topic model inference on streaming document collections," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 937–946.
[36] D. Newman, A. Asuncion, P. Smyth, and M. Welling, "Distributed algorithms for topic models," The Journal of Machine Learning Research, vol. 10, pp. 1801–1828, 2009.
[37] A. Asuncion, P. Smyth, and M. Welling, "Asynchronous distributed learning of topic models," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 81–88.
[38] G. Blanchard and C. Scott, "Decontamination of mutually contaminated models," in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014, pp. 1–9.
[39] C. Scott, "A rate of convergence for mixture proportion estimation, with application to learning from noisy labels," in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015, pp. 838–846.
[40] W. Ding, P. Ishwar, and V.
Saligrama, "Learning mixed membership Mallows model from pairwise comparisons," arXiv:1504.00757 [cs.LG], 2015.
