Necessary and Sufficient Conditions and a Provably Efficient Algorithm for Separable Topic Discovery


Authors: Weicong Ding, Prakash Ishwar (IEEE Senior Member), Venkatesh Saligrama (IEEE Senior Member)

Abstract—We develop necessary and sufficient conditions and a novel provably consistent and efficient algorithm for discovering topics (latent factors) from observations (documents) that are realized from a probabilistic mixture of shared latent factors that have certain properties. Our focus is on the class of topic models in which each shared latent factor contains a novel word that is unique to that factor, a property that has come to be known as separability. Our algorithm is based on the key insight that the novel words correspond to the extreme points of the convex hull formed by the row-vectors of a suitably normalized word co-occurrence matrix. We leverage this geometric insight to establish polynomial computation and sample complexity bounds based on a few isotropic random projections of the rows of the normalized word co-occurrence matrix. Our proposed random-projections-based algorithm is naturally amenable to an efficient distributed implementation and is attractive for modern web-scale distributed data mining applications.

Index Terms—Topic Modeling, Separability, Random Projection, Solid Angle, Necessary and Sufficient Conditions.

I. INTRODUCTION

Topic modeling refers to a family of generative models and associated algorithms for discovering the (latent) topical structure shared by a large corpus of documents. They are important for organizing, searching, and making sense of a large text corpus [1]. In this paper we describe a novel geometric approach, with provable statistical and computational efficiency guarantees, for learning the latent topics in a document collection.
This work is a culmination of a series of recent publications on certain structure-leveraging methods for topic modeling with provable theoretical guarantees [2]–[5]. We consider a corpus of M documents, indexed by m = 1, ..., M, each composed of words from a fixed vocabulary of size W. The distinct words in the vocabulary are indexed by w = 1, ..., W. Each document m is viewed as an unordered "bag of words" and is represented by an empirical W × 1 word-counts vector X_m, where X_{w,m} is the number of times that word w appears in document m [1], [5]–[7]. The entire document corpus is then represented by the W × M matrix X = [X_1, ..., X_M].¹ A "topic" is a W × 1 distribution over the vocabulary. A topic model posits the existence of K < min(W, M) latent topics that are shared among all M documents in the corpus. The topics can be collectively represented by the K columns β_1, ..., β_K of a W × K column-stochastic "topic matrix" β. Each document m is conceptually modeled as being generated independently of all other documents through a two-step process: 1) first draw a K × 1 document-specific distribution over topics θ_m from a prior distribution Pr(α) on the probability simplex with some hyper-parameters α; 2) then draw N iid words according to a W × 1 document-specific word distribution over the vocabulary given by A_m = Σ_{k=1}^K β_k θ_{k,m}, which is a convex combination (probabilistic mixture) of the latent topics. Our goal is to estimate β from the matrix of empirical observations X. To appreciate the difficulty of the problem, consider a typical benchmark dataset such as a news article collection from the New York Times (NYT) [8] that we use in our experiments.

¹When it is clear from the context, we will use X_{w,m} to represent either the empirical word-count or, by suitable column-normalization of X, the empirical word-frequency.
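The two-step generative process above is easy to simulate. The sketch below uses toy dimensions and a symmetric Dirichlet prior for Pr(α), both illustrative assumptions rather than choices fixed by the paper, and produces a W × M count matrix X from a column-stochastic β:

```python
import numpy as np

rng = np.random.default_rng(0)
W, K, M, N = 8, 3, 100, 50  # vocabulary, topics, documents, words per document (toy sizes)

# Column-stochastic topic matrix beta: each column is a distribution over words.
beta = rng.dirichlet(np.ones(W), size=K).T          # W x K

X = np.zeros((W, M))
for m in range(M):
    theta_m = rng.dirichlet(np.ones(K))             # 1) topic mixture for document m
    A_m = beta @ theta_m                            # document's word distribution
    words = rng.choice(W, size=N, p=A_m)            # 2) draw N iid words
    X[:, m] = np.bincount(words, minlength=W)       # bag-of-words counts X_m

assert np.allclose(X.sum(axis=0), N)                # each column counts exactly N words
```

Each column of X is one document's word-count vector X_m; estimating β from X alone is the problem studied in this paper.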
In this dataset, after suitable pre-processing, W = 14,943, M = 300,000, and, on average, N = 298. Thus, N ≪ W ≪ M, X is very sparse, and M is very large. Typically, K ≈ 100 ≪ min(W, M). This estimation problem in topic modeling has been extensively studied. The prevailing approach is to compute the MAP/ML estimate [1]. The true posterior of β given X, however, is intractable to compute, and the associated MAP and ML estimation problems are in fact NP-hard in the general case [9], [10]. This necessitates the use of sub-optimal methods based on approximations and heuristics such as Variational-Bayes and MCMC [6], [11]–[13]. While they produce impressive empirical results on many real-world datasets, guarantees of asymptotic consistency or efficiency for these approaches are either weak or non-existent. This makes it difficult to evaluate model fidelity: failure to produce satisfactory results in new datasets could be due to the use of approximations and heuristics or due to model mis-specification, which is more fundamental. Furthermore, these sub-optimal approaches are computationally intensive for large text corpora [5], [7]. To overcome the hardness of the topic estimation problem in its full generality, a new approach has emerged to learn the topic model by imposing additional structure on the model parameters [3], [5], [7], [9], [14], [15]. This paper focuses on a key structural property of the topic matrix β called topic separability [3], [5], [7], [15], wherein every latent topic contains at least one word that is novel to it, i.e., the word is unique to that topic and is absent from the other topics. This is, in essence, a property of the support of the latent topic matrix β.
The topic separability property can be motivated by the fact that for many real-world datasets, the empirical topic estimates produced by popular Variational-Bayes and Gibbs Sampling approaches are approximately separable [5], [7]. Moreover, it has recently been shown that the separability property will be approximately satisfied with high probability when the dimension of the vocabulary W scales sufficiently faster than the number of topics K and β is a realization of a Dirichlet prior that is typically used in practice [16]. Therefore, separability is a natural approximation for most high-dimensional topic models. Our approach exploits the following geometric implication of the key separability structure. If we associate each word in the vocabulary with a row-vector of a suitably normalized empirical word co-occurrence matrix, the set of novel words corresponds to the extreme points of the convex hull formed by the row-vectors of all words. We leverage this geometric insight and develop a provably consistent and efficient algorithm. Informally speaking, we establish the following result:

Theorem 1. If the topic matrix is separable and the mixing weights satisfy a minimum information-theoretically necessary technical condition, then our proposed algorithm runs in time polynomial in M, W, N, K, and estimates the topic matrix consistently as M → ∞ with N ≥ 2 held fixed. Moreover, our proposed algorithm can estimate β to within an ε element-wise error with probability at least 1 − δ if M ≥ Poly(W, 1/N, K, log(1/δ), 1/ε).

The asymptotic setting M → ∞ with N held fixed is motivated by text corpora in which the number of words in a single document is small while the number of documents is large.
We note that our algorithm can be applied to any family of topic models whose topic mixing weights prior Pr(α) satisfies a minimum information-theoretically necessary technical condition. In contrast, standard Bayesian approaches such as Variational-Bayes or MCMC need to be hand-designed separately for each specific topic mixing weights prior. The highlight of our approach is to identify the novel words as extreme points through appropriately defined random projections. Specifically, we project the row-vector of each word in an appropriately normalized word co-occurrence matrix along a few independent and isotropically distributed random directions. The fraction of times that a word attains the maximum value along a random direction is a measure of its degree of robustness as an extreme point. This process of random projections followed by counting the number of times a word is a maximizer can be computed efficiently and is robust to the perturbations induced by the sampling noise associated with having only a very small number of words N per document. In addition to being computationally efficient, this random-projections-based approach (1) requires the minimum information-theoretically necessary technical conditions on the topic prior for asymptotic consistency, and (2) can be naturally parallelized and distributed. As a consequence, it can provably achieve the efficiency guarantees of a centralized method while requiring insignificant communication between distributed document collections [5]. This is attractive for web-scale topic modeling of large distributed text corpora. Another advance of this paper is the identification of necessary and sufficient conditions on the mixing weights for consistent separable topic estimation.
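The projection-and-count step described above can be sketched as follows; the point set and the number of directions are illustrative assumptions, not the paper's actual word co-occurrence data. Only extreme points of the convex hull can attain the maximum projection along a direction, so interior rows accumulate a zero score:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows: 3 extreme points (stand-ins for novel words) and 2 interior points (non-novel).
rows = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.4, 0.4, 0.2],   # convex combination of the vertices: never a maximizer
    [0.2, 0.3, 0.5],
])

P = 500                                  # number of random directions
d = rng.standard_normal((3, P))          # isotropic random directions (columns)
proj = rows @ d                          # (5, P) projection values
winners = np.argmax(proj, axis=0)        # row attaining the maximum per direction

# Fraction of directions on which each row wins: its robustness score.
score = np.bincount(winners, minlength=len(rows)) / P
extreme = np.nonzero(score > 0)[0]
print(extreme)                           # -> [0 1 2]
```

Note that the per-direction maximization is embarrassingly parallel across directions and across document shards, which is the basis of the distributed implementation mentioned above.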
In previous work we showed that a simplicial condition on the mixing weights is both necessary and sufficient for consistently detecting all the novel words [4]. In this paper we complete the characterization by showing that an affine independence condition on the mixing weights is necessary and sufficient for consistently estimating a separable topic matrix. These conditions are satisfied by practical choices of topic priors such as the Dirichlet distribution [6]. All these necessary conditions are information-theoretic and algorithm-independent, i.e., they hold irrespective of the specific statistics of the observations or the algorithms that are used. The provable statistical and computational efficiency guarantees of our proposed algorithm hold true under these necessary and sufficient conditions.

The rest of this paper is organized as follows. We review related work on topic modeling as well as the separability property in various domains in Sec. II. We introduce the separability property on β, the simplicial and affine independence conditions on mixing weights, and the extreme point geometry that motivates our approach in Sec. III. We then discuss how the solid angle can be used to identify robust extreme points to deal with a finite number of samples (words per document) in Sec. IV. We describe our overall algorithm and sketch its analysis in Sec. V. We demonstrate the performance of our approach in Sec. VI on various synthetic and real-world examples. Proofs of all results appear in the appendices.

II. RELATED WORK

The idea of modeling text documents as mixtures of a few semantic topics was first proposed in [17], where the mixing weights were assumed to be deterministic. Latent Dirichlet Allocation (LDA) in the seminal work of [6] extended this to a probabilistic setting by modeling topic mixing weights using Dirichlet priors.
This setting has been further extended to include other topic priors such as the log-normal prior in the Correlated Topic Model [18]. LDA models and their derivatives have been successful on a wide range of problems in terms of achieving good empirical performance [1], [13]. The prevailing approaches for estimation and inference problems in topic modeling are based on MAP or ML estimation [1]. However, the computation of posterior distributions conditioned on observations X is intractable [6]. Moreover, the MAP estimation objective is non-convex and has been shown to be NP-hard [9], [10]. Therefore various approximation and heuristic strategies have been employed. These approaches fall into two major categories: sampling approaches and optimization approaches. Most sampling approaches are based on Markov Chain Monte Carlo (MCMC) algorithms that seek to generate (approximately) independent samples from a Markov Chain that is carefully designed to ensure that the sample distribution converges to the true posterior [11], [19]. Optimization approaches are typically based on the so-called Variational-Bayes methods. These methods optimize the parameters of a simpler parametric distribution so that it is close to the true posterior in terms of KL divergence [6], [12]. Expectation-Maximization-type algorithms are typically used in these methods. In practice, while both Variational-Bayes and MCMC algorithms have similar performance, Variational-Bayes is typically faster than MCMC [1], [20]. Nonnegative Matrix Factorization (NMF) is an alternative approach for topic estimation.
NMF-based methods exploit the fact that both the topic matrix β and the mixing weights are nonnegative and attempt to decompose the empirical observation matrix X into a product of a nonnegative topic matrix β and the matrix of mixing weights by minimizing a cost function of the form [20]–[23]

  Σ_{m=1}^M d(X_m, βθ_m) + λ ψ(β, θ_1, ..., θ_M),

where d(·,·) is some measure of closeness and ψ is a regularization term which enforces desirable properties, e.g., sparsity, on β and the mixing weights. The NMF problem, however, is also known to be non-convex and NP-hard [24] in general. Sub-optimal strategies such as alternating minimization, greedy gradient descent, and heuristics are used in practice [22]. In contrast to the above approaches, a new approach has recently emerged which is based on imposing additional structure on the model parameters [3], [5], [7], [9], [14], [15]. These approaches show that the topic discovery problem lends itself to provably consistent and polynomial-time solutions by making assumptions about the structure of the topic matrix β and the distribution of the mixing weights. In this category of approaches are methods based on a tensor decomposition of the moments of X [14], [25]. The algorithm in [25] uses second order empirical moments and is shown to be asymptotically consistent when the topic matrix β has a special sparsity structure. The algorithm in [14] uses the third order tensor of observations. It is, however, strongly tied to the specific structure of the Dirichlet prior on the mixing weights and requires knowledge of the concentration parameters of the Dirichlet distribution [14]. Furthermore, in practice these approaches are computationally intensive and require some initial coarse dimensionality reduction, gradient descent speedups, and GPU acceleration to process large-scale text corpora like the NYT dataset [14].
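To make the NMF cost function above concrete, the following sketch minimizes it with d taken as squared Euclidean distance and λ = 0, using the classical Lee–Seung multiplicative updates as one example of the sub-optimal alternating strategies mentioned; the sizes and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
W, K, M = 20, 3, 40

# Synthetic nonnegative data with an exact rank-K factorization.
beta_true = rng.random((W, K))
theta_true = rng.random((K, M))
X = beta_true @ theta_true

# Lee-Seung multiplicative updates for min ||X - beta @ theta||_F^2
# over nonnegative factors (a local-search heuristic, no global guarantee).
beta = rng.random((W, K)) + 0.1
theta = rng.random((K, M)) + 0.1
eps = 1e-12
for _ in range(500):
    theta *= (beta.T @ X) / (beta.T @ beta @ theta + eps)
    beta *= (X @ theta.T) / (beta @ theta @ theta.T + eps)

err = np.linalg.norm(X - beta @ theta) / np.linalg.norm(X)
print(err < 0.2)   # the residual shrinks, but only to a local optimum in general
```

The multiplicative form keeps both factors nonnegative at every iteration, which is why it is a popular heuristic despite the NP-hardness of the underlying problem.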
Our work falls into the family of approaches that exploit the separability property of β and its geometric implications [3], [5], [7], [9], [15], [26], [27]. An asymptotically consistent polynomial-time topic estimation algorithm was first proposed in [9]. However, this method requires solving W linear programs, each with W variables, and is computationally impractical. Subsequent work improved the computational efficiency [15], [23], but theoretical guarantees of asymptotic consistency (when N is fixed and the number of documents M → ∞) are unclear. Algorithms in [7] and [3] are both practical and provably consistent. Each requires a stronger and slightly different technical condition on the topic mixing weights than [9]. Specifically, [7] imposes a full-rank condition on the second-order correlation matrix of the mixing weights and proposes a Gram-Schmidt procedure to identify the extreme points. Similarly, [3] imposes a diagonal-dominance condition on the same second-order correlation matrix and proposes a random projections based approach. These approaches are tied to the specific conditions imposed, and they both fail to detect all the novel words and estimate topics when the imposed conditions (which are sufficient but not necessary for consistent novel word detection or topic estimation) fail to hold in some examples [5]. The random projections based algorithm proposed in [5] is both practical and provably consistent. Furthermore, it requires fewer constraints on the topic mixing weights. We note that the separability property has been exploited in other recent work as well [26], [27]. In [27], a singular value decomposition based approach is proposed for topic estimation. In [26], it is shown that the standard Variational-Bayes approximation can be asymptotically consistent if β is separable.
However, the additional constraints proposed essentially boil down to the requirement that each document contain predominantly only one topic. In addition to assuming the existence of such "pure" documents, [26] also requires a strict initialization. It is thus unclear how this can be achieved using only the observations X. The separability property has been re-discovered and exploited in the literature across a number of different fields and has found application in several problems. To the best of our knowledge, this concept was first introduced as the Pure Pixel Index assumption in the Hyperspectral Image unmixing problem [28]. This work assumes the existence of pixels in a hyper-spectral image containing predominantly one species. Separability has also been studied in the NMF literature in the context of ensuring the uniqueness of NMF [29]. Subsequent work has led to the development of NMF algorithms that exploit separability [23], [30]. The uniqueness and correctness results in this line of work have primarily focused on the noiseless case. We finally note that separability has also been recently exploited in the problem of learning multiple ranking preferences from pairwise comparisons for personal recommendation systems and information retrieval [31], [32] and has led to provably consistent and efficient estimation algorithms.

III. TOPIC SEPARABILITY, NECESSARY AND SUFFICIENT CONDITIONS, AND THE GEOMETRIC INTUITIONS

In this section, we unravel the key ideas that motivate our algorithmic approach by focusing on the ideal case where there is no "sampling-noise", i.e., each document is infinitely long (N = ∞). In the next section, we will turn to the finite N case. We recall that β and X denote the W × K topic matrix and the W × M empirical word counts/frequency matrix respectively.
Also, M, W, and K denote, respectively, the number of documents, the vocabulary size, and the number of topics. For convenience, we group the document-specific mixing weights, the θ_m's, into a K × M weight matrix θ = [θ_1, ..., θ_M] and the document-specific distributions, the A_m's, into a W × M document distribution matrix A = [A_1, ..., A_M]. The generative procedure that describes a topic model then implies that A = βθ. In the ideal case considered in this section (N = ∞), the empirical word frequency matrix X = A.

Notation: A vector a without specification will denote a column-vector, 1 the all-ones column vector of suitable size, X_i the i-th column vector and X^j the j-th row vector of matrix X, and ¯B a suitably row-normalized version (described later) of a nonnegative matrix B. Also, [n] := {1, ..., n}.

A. Key Structural Property: Topic Separability

We first introduce separability as a key structural property of a topic matrix β. Formally,

Definition 1. (Separability) A topic matrix β ∈ R^{W×K} is separable if for each topic k, there is some word i such that β_{i,k} > 0 and β_{i,l} = 0, ∀ l ≠ k.

Topic separability implies that each topic contains word(s) which appear only in that topic. We refer to these words as the novel words of the K topics. Figure 1 shows an example of a separable β with K = 3 topics.

Fig. 1. An example of a separable topic matrix β (left) and the underlying geometric structure (right) of the row space of the normalized document distribution matrix ¯A. Note: the word ordering is only for visualization and has no bearing on separability. Solid circles represent rows of ¯A. Empty circles represent rows of ¯X when N is finite (in the ideal case, ¯A = ¯X). Projections of the ¯A^w's (resp. ¯X^w's) along a random isotropic direction d can be used to identify novel words.
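Definition 1 depends only on the support of β and can be checked mechanically. A minimal sketch on a small column-stochastic matrix in the spirit of Fig. 1 (the particular entries are illustrative):

```python
import numpy as np

# Column-stochastic topic matrix with K = 3 topics (cf. Fig. 1): words 1-2 are
# novel to topic 1, words 3-4 to topic 2, word 5 to topic 3; word 6 is non-novel.
beta = np.array([
    [0.3, 0.0, 0.0],   # word 1
    [0.3, 0.0, 0.0],   # word 2
    [0.0, 0.4, 0.0],   # word 3
    [0.0, 0.4, 0.0],   # word 4
    [0.0, 0.0, 0.5],   # word 5
    [0.4, 0.2, 0.5],   # word 6 (appears in all topics)
])

support = beta > 0
# A word is novel iff it appears in exactly one topic.
novel_words = np.nonzero(support.sum(axis=1) == 1)[0]
# beta is separable iff every topic owns at least one novel word.
owned_topics = {int(np.argmax(support[w])) for w in novel_words}
is_separable = owned_topics == set(range(beta.shape[1]))
print(novel_words, is_separable)   # -> [0 1 2 3 4] True
```

On real data β is unknown, of course; the point of the paper is that the novel words can nevertheless be detected geometrically from the observations.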
Words 1 and 2 are novel to topic 1, words 3 and 4 to topic 2, and word 5 to topic 3. Other words that appear in multiple topics are called non-novel words (e.g., word 6). Identifying the novel words for the K distinct topics is the key step of our proposed approach. We note that separability has been empirically observed to be approximately satisfied by topic estimates produced by Variational-Bayes and MCMC based algorithms [5], [7], [26]. More fundamentally, in very recent work [16], it has been shown that topic separability is an inevitable consequence of having a relatively small number of topics in a very large vocabulary (high-dimensionality). In particular, when the K columns (topics) of β are independently sampled from a Dirichlet distribution (on a (W − 1)-dimensional probability simplex), the resulting topic matrix β will be (approximately) separable with probability tending to 1 as W scales to infinity sufficiently faster than K. A Dirichlet prior on β is widely used in smoothed settings of topic modeling [1]. As we will discuss next in Sec. III-C, the topic separability property combined with additional conditions on the second-order statistics of the mixing weights leads to an intuitively appealing geometric property that can be exploited to develop a provably consistent and efficient topic estimation algorithm.

B. Conditions on the Topic Mixing Weights

Topic separability alone does not guarantee that there will be a unique β that is consistent with all the observations X. This is illustrated in Fig. 2 [4]. Therefore, in an effort to develop provably consistent topic estimation algorithms, a number of different conditions have been imposed on the topic mixing weights θ in the literature [3], [5], [7], [9], [15].
Complementing the work in [4], which identifies necessary and sufficient conditions for consistent detection of novel words, in this paper we identify necessary and sufficient conditions for consistent estimation of a separable topic matrix. Our necessity results are information-theoretic and algorithm-independent in nature, meaning that they are independent of any specific statistics of the observations and the algorithms used. The novel words and the topics can only be identified up to a permutation, and this is accounted for in our results.

Let a := E(θ_m) and R := E(θ_m θ_m^⊤) be the K × 1 expectation vector and the K × K correlation matrix of the weight prior Pr(α). Without loss of generality, we can assume that the elements of a are strictly positive, since otherwise some topic(s) would not appear in the corpus. A key quantity is ¯R := diag(a)^{-1} R diag(a)^{-1}, which may be viewed as a "normalized" second-moment matrix of the weight vector. The following conditions are central to our results.

Condition 1. (Simplicial Condition) A matrix B is (row-wise) γ_s-simplicial if every row-vector of B is at a Euclidean distance of at least γ_s > 0 from the convex hull of the remaining row-vectors. A topic model is γ_s-simplicial if its normalized second-moment ¯R is γ_s-simplicial.

Condition 2. (Affine-Independence) A matrix B is (row-wise) γ_a-affine-independent if min_λ ||Σ_{k=1}^K λ_k B^k||_2 / ||λ||_2 ≥ γ_a > 0, where B^k is the k-th row of B and the minimum is over all λ ∈ R^K such that λ ≠ 0 and Σ_{k=1}^K λ_k = 0. A topic model is γ_a-affine-independent if its normalized second-moment ¯R is γ_a-affine-independent.

Here, γ_s and γ_a are called the simplicial and affine-independence constants respectively. They are condition numbers which measure the degree to which the conditions that they are respectively associated with hold.
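Condition 2 can be evaluated numerically: γ_a equals the smallest singular value of B^⊤ restricted to the zero-sum subspace {λ : Σ_k λ_k = 0}. A sketch, with illustrative test matrices; the second matrix has a row equal to a convex combination of the others, so its constant vanishes:

```python
import numpy as np

def affine_independence_constant(B):
    """gamma_a = min ||sum_k lambda_k B^k||_2 / ||lambda||_2
    over lambda != 0 with sum(lambda) = 0 (Condition 2)."""
    B = np.asarray(B, dtype=float)
    K = B.shape[0]
    # Columns of D span the zero-sum subspace; orthonormalize via QR.
    D = np.eye(K)[:, :-1] - np.eye(K)[:, 1:]       # K x (K-1), full column rank
    V, _ = np.linalg.qr(D)                         # orthonormal basis, K x (K-1)
    # For lambda = V @ mu with ||mu|| = 1, ||B^T lambda|| = ||(B^T V) mu||.
    return np.linalg.svd(B.T @ V, compute_uv=False).min()

print(round(float(affine_independence_constant(np.eye(3))), 6))          # -> 1.0
R_bad = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]])    # row 3 = 0.5*row1 + 0.5*row2
print(affine_independence_constant(R_bad) < 1e-8)                        # -> True
```

The same routine applied to an estimate of ¯R gives a rough numerical check of how comfortably a given topic prior satisfies Condition 2.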
The larger these condition numbers are, the easier it is to estimate the topic matrix. Going forward, we will say that a matrix is simplicial (resp. affine-independent) if it is γ_s-simplicial (resp. γ_a-affine-independent) for some γ_s > 0 (resp. γ_a > 0). The simplicial condition was first proposed in [9] and then further investigated in [4]. This paper is the first to identify affine-independence as both necessary and sufficient for consistent separable topic estimation. Before we discuss their geometric implications, we point out that affine-independence is stronger than the simplicial condition:

Proposition 1. ¯R is γ_a-affine-independent ⇒ ¯R is at least γ_a-simplicial. The reverse implication is false in general.

The Simplicial Condition is both Necessary and Sufficient for Novel Word Detection: We first focus on detecting all the novel words of the K distinct topics. For this task, the simplicial condition is an algorithm-independent, information-theoretic necessary condition. Formally,

Lemma 1. (Simplicial Condition is Necessary for Novel Word Detection [4, Lemma 1]) Let β be separable and W > K. If there exists an algorithm that can consistently identify all novel words of all K topics from X, then ¯R is simplicial.

The key insight behind this result is that when ¯R is non-simplicial, we can construct two distinct separable topic matrices with different sets of novel words which induce the same distribution on the empirical observations X. Geometrically, the simplicial condition guarantees that the K rows of ¯R will be extreme points of the convex hull that they themselves form. Therefore, if ¯R is not simplicial, there will exist at least one redundant topic which is just a convex combination of the other topics. It turns out that ¯R being simplicial is also sufficient for consistent novel word detection. This is a direct consequence of the consistency guarantees of our approach as outlined in Theorem 3.

Fig. 2. Example showing that topic separability alone does not guarantee a unique solution to the problem of estimating β from X. With weight matrix θ whose rows are θ^1, θ^2, and 0.5θ^1 + 0.5θ^2,

  β^{(1)} θ = [1 0 0; 0 1 0; 0 0 1; 0 0 1; ...] θ = [1 0 0; 0 1 0; 0 0 1; 0.5 0.5 0; ...] θ = β^{(2)} θ,

i.e., the document distribution matrix A = β^{(1)}θ = β^{(2)}θ is consistent with two different topic matrices β^{(1)} and β^{(2)} that are both separable.

Affine-Independence is Necessary and Sufficient for Separable Topic Estimation: We now focus on estimating a separable topic matrix β, which is a stronger requirement than detecting novel words. It naturally requires conditions that are stronger than the simplicial condition. Affine-independence turns out to be an algorithm-independent, information-theoretic necessary condition. Formally,

Lemma 2. (Affine-Independence is Necessary for Separable Topic Estimation) Let β be separable with W ≥ 2 + K. If there exists an algorithm that can consistently estimate β from X, then its normalized second-moment ¯R is affine-independent.

Similar to Lemma 1, if ¯R is not affine-independent, we can construct two distinct separable topic matrices that induce the same distribution on the observations, which makes consistent topic estimation impossible. Geometrically, every point in a convex set can be decomposed uniquely as a convex combination of its extreme points if, and only if, the extreme points are affinely independent. Hence, if ¯R is not affine-independent, a non-novel word can be assigned to different subsets of topics.
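The uniqueness argument can be made concrete. When the extreme points are affinely independent, the convex weights of an interior point are recovered exactly by least squares on the linear system augmented with the sum-to-one constraint; on exact, noise-free data nonnegativity need not be enforced because the true weights are already nonnegative. The numbers below are illustrative:

```python
import numpy as np

# Three affinely independent extreme points (rows).
extreme_pts = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
c_true = np.array([0.5, 0.3, 0.2])          # convex weights of an interior point
p = c_true @ extreme_pts                    # the interior point itself

# Solve  c^T extreme_pts = p  subject to  sum(c) = 1  by stacking the constraint.
K = extreme_pts.shape[0]
lhs = np.vstack([extreme_pts.T, np.ones((1, K))])   # (d+1) x K, full column rank
rhs = np.append(p, 1.0)
c_hat, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
print(np.round(c_hat, 6))                   # -> [0.5 0.3 0.2]
```

If a fourth row equal to an affine combination of the others were added, the stacked system would become rank-deficient and the decomposition would no longer be unique, which is exactly the failure mode described above.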
The sufficiency of the affine-independence condition for separable topic estimation is again a direct consequence of the consistency guarantees of our approach, as in Theorems 3 and 4. We note that since affine-independence implies the simplicial condition (Proposition 1), affine-independence is sufficient for novel word detection as well.

Connection to Other Conditions on the Mixing Weights: We briefly discuss other conditions on the mixing weights θ that have been exploited in the literature. In [7], [15], R (equivalently ¯R) is assumed to have full rank (with minimum eigenvalue γ_r > 0). In [3], ¯R is assumed to be diagonal-dominant, i.e., ∀ i, j, i ≠ j, ¯R_{i,i} − ¯R_{i,j} ≥ γ_d > 0. Both are sufficient conditions for detecting all the novel words of all distinct topics. The constants γ_r and γ_d are condition numbers which measure the degree to which the full-rank and diagonal-dominance conditions hold respectively. They are counterparts of γ_s and γ_a and, like them, the larger they are, the easier it is to consistently detect the novel words and estimate β. The relationships between these conditions are summarized in Proposition 2 and illustrated in Fig. 3.

Fig. 3. Relationships between the Simplicial, Affine-Independence, Full-Rank, and Diagonal-Dominance conditions on the normalized second-moment ¯R.

Proposition 2. Let ¯R be the normalized second-moment of the topic prior. Then,
1) ¯R is full rank with minimum eigenvalue γ_r ⇒ ¯R is at least γ_r-affine-independent ⇒ ¯R is at least γ_r-simplicial.
2) ¯R is γ_d-diagonal-dominant ⇒ ¯R is at least γ_d-simplicial.
3) ¯R being diagonal-dominant neither implies nor is implied by ¯R being affine-independent (or full-rank).

We note that in our earlier work [5], the provable guarantees for estimating the separable topic matrix require ¯R to have full rank.
The analysis in this paper provably extends the guarantees to the affine-independence condition.

C. Geometric Implications and Random Projections Based Algorithm

We now demonstrate the geometric implications of topic separability combined with the simplicial/affine-independence condition on the topic mixing weights. To highlight the key ideas we focus on the ideal case where N = ∞. Then, the empirical document word-frequency matrix X = A = βθ.

Novel Words are Extreme Points: To expose the underlying geometry, we normalize the rows of A and θ to obtain row-stochastic matrices ¯A := diag(A1)^{-1}A and ¯θ := diag(θ1)^{-1}θ. Then, since A = βθ, we have ¯A = ¯β¯θ, where ¯β := diag(A1)^{-1} β diag(θ1) is a row-normalized "topic matrix" which is both row-stochastic and separable with the same sets of novel words as β. Now consider the row vectors of ¯A and ¯θ. First, it can be shown that if ¯R is simplicial (cf. Condition 1) then, with high probability, no row of ¯θ will be in the convex hull of the others (see Appendix D). Next, the separability property ensures that if w is a novel word of topic k, then ¯β_{wk} = 1 and ¯β_{wj} = 0 ∀ j ≠ k, so that ¯A^w = ¯θ^k. Revisiting the example in Fig. 1, the rows of ¯A which correspond to novel words, e.g., words 1 through 5, are all row-vectors of ¯θ and together form a convex hull of K extreme points. For example, ¯A^1 = ¯A^2 = ¯θ^1 and ¯A^3 = ¯A^4 = ¯θ^2. If, however, w is a non-novel word, then ¯A^w = Σ_k ¯β_{wk} ¯θ^k lives inside the convex hull of the rows of ¯θ. In Fig. 1, row ¯A^6, which corresponds to non-novel word 6, is inside the convex hull of ¯θ^1, ¯θ^2, ¯θ^3. In summary, the novel words can be detected as extreme points of the set of all row-vectors of ¯A. Also, multiple novel words of the same topic correspond to the same extreme point (e.g., ¯A^1 = ¯A^2 = ¯θ^1). Formally,

Lemma 3.
Let R̄ be γ_s-simplicial and β be separable. Then, with probability at least 1 − 2K exp(−c₁M) − exp(−c₂M), the i-th row of Ā is an extreme point of the convex hull spanned by all the rows of Ā if, and only if, word i is novel. Here the constants are c₁ := γ_s² a_min⁴ / (4λ_max) and c₂ := γ_s⁴ a_min⁴ / (2λ_max²). The model parameters are defined as follows: a_min is the minimum element of a and λ_max is the maximum singular value of R̄.

To see how identifying novel words can help us estimate β, recall that the row vectors of Ā corresponding to novel words coincide with the rows of θ̄. Thus θ̄ is known once one novel word for each topic is known. Also, for all words w, Ā_w = Σ_k β̄_{wk} θ̄_k. Thus, if we can uniquely decompose Ā_w as a convex combination of the extreme points, then the coefficients of the decomposition will give us the w-th row of β̄. A unique decomposition exists with high probability when R̄ is affine-independent and can be found by solving a constrained linear regression problem. This gives us β̄. Finally, noting that diag(A1)β̄ = β diag(θ1), β can be recovered by suitably renormalizing the rows and then the columns of β̄. To sum up,

Lemma 4. Let A and one novel word per distinct topic be given. If R̄ is γ_a-affine-independent, then, with probability at least 1 − 2K exp(−c₁M) − exp(−c₂M), β can be recovered uniquely via constrained linear regression. Here the constants are c₁ := γ_a² a_min⁴ / (4λ_max) and c₂ := γ_a⁴ a_min⁴ / (2λ_max²). The model parameters are defined as follows: a_min is the minimum element of a and λ_max is the maximum singular value of R̄.

Lemmas 3 and 4 together provide a geometric approach for learning β from A (equivalently Ā): (1) Find the extreme points among the rows of Ā. Cluster the rows of Ā that correspond to the same extreme point into the same group.
(2) Express the remaining rows of Ā as convex combinations of the K distinct extreme points. (3) Renormalize β̄ to obtain β.

Detecting Extreme Points using Random Projections: A key contribution of our approach is an efficient random-projections-based algorithm to detect novel words as extreme points. The idea is illustrated in Fig. 1: if we project every point of a convex body onto an isotropically distributed random direction d, the maximum (or minimum) projection value must correspond to one of the extreme points with probability 1. On the other hand, the non-novel words will not attain the maximum projection value along any random direction. Therefore, by repeatedly projecting all the points onto a few isotropically distributed random directions, we can detect all the extreme points with very high probability as the number of random directions increases. An explicit bound on the number of projections needed appears in Theorem 3.

Finite N in Practice: The geometric intuition discussed above was based on the row vectors of Ā. When N = ∞, Ā = X̄, the matrix of row-normalized empirical word frequencies of all documents. If N is finite but very large, Ā can be well approximated by X̄ thanks to the law of large numbers. However, in real-world text corpora, N ≪ W (e.g., N = 298 while W = 14,943 in the NYT dataset). Therefore, the row vectors of X̄ are significantly perturbed away from the ideal rows of Ā as illustrated in Fig. 1. We discuss the effect of small N and how we address the accompanying issues next.

IV. TOPIC GEOMETRY WITH FINITE SAMPLES: WORD CO-OCCURRENCE MATRIX REPRESENTATION, SOLID ANGLE, AND RANDOM PROJECTIONS BASED APPROACH

The extreme point geometry sketched in Sec. III-C is perturbed when N is small, as highlighted in Fig. 1.
Specifically, the rows of the empirical word-frequency matrix X deviate from the rows of A. This creates several problems: (1) points inside the convex hull corresponding to non-novel words may become "outlier" extreme points (e.g., X̄₆ in Fig. 1); (2) some extreme points that correspond to novel words may no longer be extreme (e.g., X̄₃ in Fig. 1); (3) multiple novel words corresponding to the same extreme point may become multiple distinct extreme points (e.g., X̄₁ and X̄₂ in Fig. 1). Unfortunately, these issues do not vanish as M increases with N fixed (a regime which captures the characteristics of typical benchmark datasets), because the dimensionality of the rows (equal to M) also increases. There is no "averaging" effect to smooth out the sampling noise.

Our solution is to seek a new representation, a statistic of X, which can not only smooth out the sampling noise of individual documents, but also preserve the extreme point geometry induced by the separability and affine-independence conditions. In addition, we also develop an extreme point robustness measure that naturally arises within our random-projections-based framework. This robustness measure can be used to detect and exclude the "outlier" extreme points.

A. Normalized Word Co-occurrence Matrix Representation

We construct a suitably normalized word co-occurrence matrix from X as our new representation. The co-occurrence matrix converges almost surely to an ideal statistic as M → ∞ for any fixed N ≥ 2. Simultaneously, in the asymptotic limit, the original novel words continue to correspond to extreme points in the new representation and the overall extreme point geometry is preserved. The new representation is (conceptually) constructed as follows.
First, randomly divide the words in each document into two equal-sized independent halves and obtain two W × M empirical word-frequency matrices X and X′, each containing N/2 of the words of each document. Then normalize their rows as in Sec. III-C to obtain X̄ and X̄′, which are row-stochastic. The empirical word co-occurrence matrix of size W × W is then given by

Ê := M X̄′ X̄ᵀ   (1)

We note that in our random-projections-based approach, Ê is not explicitly constructed by multiplying X̄′ and X̄. Instead, we keep X̄′ and X̄ and exploit their sparsity properties to reduce the computational complexity of all subsequent processing.

Asymptotic Consistency: The first nice property of the word co-occurrence representation is its asymptotic consistency when N is fixed. As the number of documents M → ∞, the empirical Ê converges, almost surely, to an ideal word co-occurrence matrix E of size W × W. Formally,

Lemma 5. ([32, Lemma 2]) Let Ê be the empirical word co-occurrence matrix defined in Eq. (1). Then,

Ê → β̄ R̄ β̄ᵀ =: E almost surely as M → ∞   (2)

where β̄ := diag⁻¹(βa) β diag(a) and R̄ := diag⁻¹(a) R diag⁻¹(a). Furthermore, if η := min_{1≤i≤W} (βa)_i > 0, then Pr(‖Ê − E‖_∞ ≥ ε) ≤ 8W² exp(−ε² η⁴ M N / 20).

Here R̄ is the same normalized second moment of the topic prior as defined in Sec. III and β̄ is a row-normalized version of β. We note the abuse of notation for β̄, which was defined differently in Sec. III-C. It can be shown that the β̄ defined in Lemma 5 is the limit of the one defined in Sec. III-C as M → ∞. The convergence result in Lemma 5 shows that the word co-occurrence representation E can be consistently estimated by Ê as M → ∞, with the deviation vanishing exponentially in M, which is large in typical benchmark datasets.
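As a concrete illustration, the construction in Eq. (1) can be sketched in a few lines (a toy example with hypothetical random counts; numpy is assumed, and for clarity the dense product is formed explicitly even though, as noted above, our algorithm never constructs Ê explicitly):

```python
import numpy as np

def cooccurrence(X_half1, X_half2):
    """Empirical word co-occurrence matrix E_hat = M * Xbar' Xbar^T (Eq. 1).
    X_half1, X_half2: W x M word-count matrices from the two independent
    halves of each document (columns are documents)."""
    Xbar = X_half1 / X_half1.sum(axis=1, keepdims=True)    # row-normalize
    Xbar2 = X_half2 / X_half2.sum(axis=1, keepdims=True)
    M = X_half1.shape[1]
    return M * (Xbar2 @ Xbar.T)                            # W x W

rng = np.random.default_rng(1)
W, M = 6, 500
X1 = rng.integers(1, 5, size=(W, M)).astype(float)  # toy counts (assumption)
X2 = rng.integers(1, 5, size=(W, M)).astype(float)
E_hat = cooccurrence(X1, X2)
print(E_hat.shape)  # (6, 6)
```

In practice X̄ and X̄′ would be kept as sparse matrices and only products of the form M X̄′(X̄ᵀd) would be computed.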
Novel Words are Extreme Points: Another reason for using this word co-occurrence representation is that it preserves the extreme point geometry. Consider the ideal word co-occurrence matrix E = β̄(R̄β̄ᵀ). It is straightforward to show that if β̄ is separable and R̄ is simplicial, then (R̄β̄ᵀ) is also simplicial. Using these facts it is possible to establish the following counterpart of Lemma 3 for E:

Lemma 6. (Novel Words are Extreme Points [5, Lemma 1]) Let R̄ be simplicial and β be separable. Then, a word i is novel if, and only if, the i-th row of E is an extreme point of the convex hull spanned by all the rows of E.

In other words, the novel words correspond to the extreme points of all the row vectors of the ideal word co-occurrence matrix E. Consider the example in Fig. 4, which is based on the same topic matrix β as in Fig. 1. Here, E₁ = E₂, E₃ = E₄, and E₅ are the K = 3 distinct extreme points of all row vectors of E, and E₆, which corresponds to a non-novel word, is inside the convex hull. Once the novel words are detected as extreme points, we can follow the same procedure as in Lemma 4 and express each row E_w of E as a unique convex combination of the K extreme rows of E, or equivalently the rows of (R̄β̄ᵀ). The weights of the convex combination are the β̄_{wk}'s. We can then apply the same row and column renormalization to obtain β. The following result is the counterpart of Lemma 4 for E:

Lemma 7. Let E and one novel word for each distinct topic be given. If R̄ is affine-independent, then β can be recovered uniquely via constrained linear regression.

One can follow the same steps as in the proof of Lemma 4. The only additional step is to check that R̄β̄ᵀ = [R̄, R̄B] is affine-independent if R̄ is affine-independent.
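The extreme-point characterization in Lemma 6 can be checked numerically via the random-projection idea of Sec. III-C: an interior row never attains the maximum (or minimum) projection along an isotropic random direction. A minimal sketch on toy data (numpy assumed; the function name and toy points are illustrative, not from the paper):

```python
import numpy as np

def extreme_points_by_projection(points, num_dirs=50, seed=0):
    """Detect extreme points of the convex hull of `points` (rows) by
    projecting onto isotropic random directions and keeping the rows
    that attain the max or min projection along some direction."""
    rng = np.random.default_rng(seed)
    extreme = set()
    for _ in range(num_dirs):
        d = rng.standard_normal(points.shape[1])  # spherical Gaussian direction
        proj = points @ d
        extreme.add(int(np.argmax(proj)))  # a maximizer is extreme w.p. 1
        extreme.add(int(np.argmin(proj)))
    return sorted(extreme)

# Toy example: rows 0-2 are vertices of a triangle; row 3 is an interior
# point (a convex combination), mimicking a non-novel word.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.3, 0.3]])
print(extreme_points_by_projection(pts))  # [0, 1, 2]; row 3 is never selected
```

With enough directions, every vertex is found with high probability, which is the content of the projection bound in Theorem 3.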
We note that the finite sampling noise perturbation Ê − E is still not 0, but it vanishes as M → ∞ (in contrast to the X̄ representation in Sec. III-C). However, there is still a possibility of observing "outlier" extreme points if a non-novel word lies on a facet of the convex hull of the rows of E. We next introduce an extreme point robustness measure based on a certain solid angle that naturally arises in our random-projections-based approach, and discuss how it can be used to detect and distinguish between "true" novel words and such "outlier" extreme points.

B. Solid Angle Extreme Point Robustness Measure

To handle the impact of a small but nonzero perturbation ‖Ê − E‖_∞, we develop an extreme point "robustness" measure. This is necessary not only for applying our approach to real-world data but also to establish finite sample complexity bounds. Intuitively, a robustness measure should be able to distinguish between the "true" extreme points (row vectors that are novel words) and the "outlier" extreme points (row vectors of non-novel words that become extreme points due to the nonzero perturbation). Towards this goal, we leverage a key geometric quantity, namely, the normalized solid angle subtended by the convex hull of the rows of E at an extreme point. To visualize this quantity, we revisit our running example in Fig. 4 and indicate the solid angles attached to each extreme point by the shaded regions. It turns out that this geometric quantity naturally arises in the context of the random projections discussed earlier. To see this connection, observe in Fig. 4 that the shaded region attached to any extreme point

Fig. 4. An example of a separable topic matrix β (left) and the underlying geometric structure (right) in the word co-occurrence representation. Note: the word ordering is only for visualization and has no bearing on separability.
The example topic matrix β is the same as in Fig. 1. Solid circles represent the rows of E. The shaded regions depict the solid angles subtended by each extreme point. d₁, d₂, d₃ are isotropic random directions along which each extreme point attains the maximum projection value; they can be used to estimate the solid angles.

coincides precisely with the set of directions along which its projection is larger (taking sign into account) than that of any other point (whether extreme or not). For example, in Fig. 4 the projection of E₁ = E₂ along d₁ is larger than that of any other point. Thus, the solid angle attached to a point E_i (whether extreme or not) can be formally defined as the set of directions {d : ∀j such that E_j ≠ E_i, ⟨E_i, d⟩ > ⟨E_j, d⟩}. This set is nonempty only for extreme points.

The solid angle defined above is a set. To derive a scalar robustness measure from this set and tie it to the idea of random projections, we adopt a statistical perspective and define the normalized solid angle of a point as the probability that the point will have the maximum projection value along an isotropically distributed random direction. Concretely, for the i-th word (row vector), the normalized solid angle q_i is defined as

q_i := Pr(∀j : E_j ≠ E_i, ⟨E_i, d⟩ > ⟨E_j, d⟩)   (3)

where d is drawn from an isotropic distribution in ℝ^W such as the spherical Gaussian. The condition E_j ≠ E_i in Eq. (3) is introduced to exclude the multiple novel words of the same topic that correspond to the same extreme point. For instance, in Fig. 4, E₁ = E₂; hence, for q₁, j = 2 is excluded. To make it practical to handle finite sample estimation noise, we replace the condition E_j ≠ E_i by the condition ‖E_i − E_j‖ ≥ ζ for some suitably defined ζ. As illustrated in Fig. 4, the solid angles of all the extreme points are strictly positive given that R̄ is γ_s-simplicial.
On the other hand, for a non-novel word i, the corresponding solid angle q_i is zero by definition. Hence the extreme point geometry in Lemma 6 can be re-expressed in terms of solid angles as follows:

Lemma 8. (Novel Words have Positive Solid Angles) Let R̄ be simplicial and β be separable. Then, word i is a novel word if, and only if, q_i > 0.

We denote the smallest solid angle among the K distinct extreme points by q_∧ > 0. This is a robust condition number of the convex hull formed by the rows of E and is related to the simplicial constant γ_s of R̄. In a real-world dataset we have access to only an empirical estimate Ê of the ideal word co-occurrence matrix E. If we replace E with Ê, then the resulting empirical solid angle estimate q̂_i will be very close to the ideal q_i if Ê is close enough to E. Then, the solid angles of "outlier" extreme points will be close to 0 while they will be bounded away from zero for the "true" extreme points. One can then hope to correctly identify all K extreme points by rank-ordering all empirical solid angle estimates and selecting the K distinct row vectors that have the largest solid angles. This forms the basis of our proposed algorithm. The problem now boils down to efficiently estimating the solid angles and establishing the asymptotic convergence of the estimates as M → ∞. We next discuss how random projections can be used to achieve these goals.

C. Efficient Solid Angle Estimation via Random Projections

The definition of the normalized solid angle in Eq. (3) motivates an efficient algorithm based on random projections to estimate it. For convenience, we first rewrite Eq. (3) as

q_i = E[ I{∀j : ‖E_j − E_i‖ ≥ ζ, E_i d ≥ E_j d} ]   (4)

and then propose to estimate it by

q̂_i = (1/P) Σ_{r=1}^{P} I(∀j : Ê_{i,i} + Ê_{j,j} − 2Ê_{i,j} ≥ ζ/2, Ê_i d_r > Ê_j d_r)   (5)

where d₁, . . .
, d_P ∈ ℝ^{W×1} are P iid directions drawn from an isotropic distribution in ℝ^W. Algorithmically, by Eq. (5), we approximate the solid angle q_i of the i-th word (row vector) by first projecting all the row vectors onto P iid isotropic random directions and then calculating the fraction of times each row vector achieves the maximum projection value. It turns out that the condition Ê_{i,i} + Ê_{j,j} − 2Ê_{i,j} ≥ ζ/2 is equivalent to ‖E_i − E_j‖ ≥ ζ in terms of its ability to exclude multiple novel words from the same topic and is adopted for its simplicity.² This procedure of taking random projections followed by counting the number of times a word is a maximizer via Eq. (5) provides a consistent estimate of the solid angle in Eq. (3) as M → ∞ and the number of projections P increases. The high-level idea is simple: as P increases, the empirical average in Eq. (5) converges to the corresponding expectation. Simultaneously, as M increases, Ê → E almost surely. Overall, the approximation q̂_i proposed in Eq. (5) using random projections converges to q_i.

This random-projections-based approach is also computationally efficient for the following reasons. First, it enables us to avoid the explicit construction of the W × W dimensional matrix Ê: recall that each column of X and X′ has no more than N ≪ W nonzero entries. Hence X and X′ are both sparse. Since Êd = M X̄′(X̄ᵀd), the projection can be calculated using two sparse matrix-vector multiplications. Second, it turns out that the number of projections P needed to guarantee consistency is small. In fact, in Theorem 3 we provide a sufficient upper bound for P which is a polynomial function of log(W), log(1/δ), and other model parameters, where δ is the probability that the algorithm fails to detect all the distinct novel words.
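A minimal sketch of the solid angle estimator in Eq. (5) on toy data (numpy assumed; for simplicity the near-duplicate-row tolerance test with ζ is omitted, so the sketch assumes all rows are distinct):

```python
import numpy as np

def solid_angle_estimates(E_hat, P=2000, seed=0):
    """Estimate normalized solid angles q_i (Eq. 5, duplicate-row test
    omitted): the fraction of isotropic random directions along which
    row i attains the maximum projection value."""
    rng = np.random.default_rng(seed)
    q = np.zeros(E_hat.shape[0])
    D = rng.standard_normal((E_hat.shape[1], P))  # P isotropic directions
    V = E_hat @ D                                 # all projections at once
    for i in np.argmax(V, axis=0):                # winner per direction
        q[i] += 1.0 / P
    return q

# Toy: rows 0-2 are extreme; row 3 is interior (a convex combination),
# mimicking a non-novel word.
E = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [0.4, 0.3, 0.3]])
q = solid_angle_estimates(E)
print(q)  # the interior row 3 gets zero estimated solid angle
```

The estimates sum to 1 by construction, and only the extreme rows receive positive mass, matching Lemma 8.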
Parallelization, Distributed and Online Settings: Another advantage of the proposed random-projections-based approach is that it can be parallelized and is naturally amenable to online or distributed settings. This is based on the observation that each projection has an additive structure:

Êd_r = M X̄′X̄ᵀd_r = M Σ_{m=1}^{M} X̄′_m X̄_mᵀ d_r

where X̄_m denotes the m-th column of X̄. The P projections can also be computed independently. Therefore,
• In a distributed setting in which the documents are stored on distributed servers, we can first share the same random directions across the servers and then aggregate the projection values. The communication cost is only the "partial" projection values and is therefore insignificant [5]; it does not scale as the number of observations N, M increases.
• In an online setting in which the documents are streamed in an online fashion [20], we only need to keep all the projection values and update them (and hence the empirical solid angle estimates) when new documents arrive.

The additive and independent structure guarantees that the statistical efficiency of these variations is the same as that of the centralized "batch" implementation. For the rest of this paper, we only focus on the centralized version.

Outline of Overall Approach: Our overall approach can be summarized as follows. (1) Estimate the empirical solid angles using P iid isotropic random directions as in Eq. (5). (2) Select the K words with distinct word co-occurrence patterns (rows) that have the largest empirical solid angles. (3) Estimate the topic matrix using constrained linear regression as in Lemma 4. We will discuss the details of our overall approach in the next section and establish guarantees for its computational and statistical efficiency.

² We abuse the symbol ζ by using it to indicate different thresholds in these conditions.

V.
ALGORITHM AND ANALYSIS

Algorithm 1 describes the main steps of our overall random-projections-based algorithm, which we call RP. The two main steps, novel word detection and topic matrix estimation, are outlined in Algorithms 2 and 3, respectively. Algorithm 2 outlines the random projection and rank-ordering steps. Algorithm 3 describes the constrained linear regression and the renormalization steps in a combined way.

Algorithm 1 RP
Input: Text documents X̄, X̄′ (W × M); number of topics K; number of iid random projections P; tolerance parameters ζ, ε > 0.
Output: Estimate of the topic matrix β̂ (W × K).
1: Set of novel words I ← NovelWordDetect(X̄, X̄′, K, P, ζ)
2: β̂ ← EstimateTopics(I, X̄, X̄′, ε)

Computational Efficiency: We first summarize the computational efficiency of Algorithm 1:

Theorem 2. Let the number of novel words for each topic be a constant relative to M, W, N. Then, the running time of Algorithm 1 is O(MNP + WP + WK³).

This efficiency is achieved by exploiting the sparsity of X and the property that there are only a small number of novel words in a typical vocabulary. A detailed analysis of the computational complexity is presented in the appendix. Here we point out that in order to upper bound the computation time of the linear regression in Algorithm 3 we used O(WK³) for W matrix inversions, one for each of the words in the vocabulary. In practice, a gradient descent implementation can be used for the constrained linear regression, which is

Algorithm 2 NovelWordDetect (via Random Projections)
Input: X̄, X̄′; number of topics K; number of projections P; tolerance ζ.
Output: The set of all novel words of the K distinct topics, I.
1: q̂_i ← 0, ∀i = 1, . . . , W; Ê ← M X̄′X̄ᵀ
2: for all r = 1, . . . , P do
3: Sample d_r ∈ ℝ^W from an isotropic prior.
4: v ← M X̄′(X̄ᵀd_r)
5: i* ← arg max_{1≤i≤W} v_i; q̂_{i*} ← q̂_{i*} + 1/P
6: Ĵ_{i*} ← {j : Ê_{i*,i*} + Ê_{j,j} − 2Ê_{i*,j} ≥ ζ/2}
7: for all k ∈ Ĵ_{i*}ᶜ do
8: Ĵ_k ← {j : Ê_{k,k} + Ê_{j,j} − 2Ê_{k,j} ≥ ζ/2}
9: if ∀j ∈ Ĵ_k : v_k > v_j then
10: q̂_k ← q̂_k + 1/P
11: end if
12: end for
13: end for
14: I ← ∅, k ← 0, j ← 1
15: while k < K do
16: i ← index of the j-th largest value of {q̂_1, . . . , q̂_W}
17: if ∀p ∈ I : Ê_{p,p} + Ê_{i,i} − 2Ê_{i,p} ≥ ζ/2 then
18: I ← I ∪ {i}; k ← k + 1
19: end if
20: j ← j + 1
21: end while
22: Return I.

Algorithm 3 EstimateTopics
Input: I = {i₁, . . . , i_K}, the set of novel words, one for each of the K topics; Ê; precision parameter ε.
Output: β̂, the estimate of the β matrix.
1: Ê*_w = [Ê_{w,i₁}, . . . , Ê_{w,i_K}]
2: Y = (Ê*ᵀ_{i₁}, . . . , Ê*ᵀ_{i_K})ᵀ
3: for all i = 1, . . . , W do
4: Solve b* := arg min_b ‖Ê*_i − bY‖²
5: subject to b_j ≥ 0, Σ_{j=1}^{K} b_j = 1
6: using precision ε for the stopping criterion.
7: β̂_i ← ((1/M) X_i 1) b*
8: end for
9: β̂ ← column-normalize β̂

much more efficient. We also note that these W optimization problems are decoupled given the set of detected novel words. Therefore, they can be parallelized in a straightforward manner [5].

Asymptotic Consistency and Statistical Efficiency: We now summarize the asymptotic consistency and sample complexity bounds for Algorithm 1. The analysis is a combination of the consistency of the novel word detection step (Algorithm 2) and the topic estimation step (Algorithm 3). We state the results for both of these steps. First, for detecting all the novel words of the K distinct topics, we have the following result:

Theorem 3. Let the topic matrix β be separable and R̄ be γ-simplicial.
If the projection directions are iid sampled from any isotropic distribution, then Algorithm 2 can identify all the novel words of the K distinct topics as M, P → ∞. Furthermore, ∀δ > 0, if

M ≥ 20 log(2W/δ) / (N ρ² η⁴)  and  P ≥ 8 log(2W/δ) / q_∧²   (6)

then Algorithm 2 fails with probability at most δ. The model parameters are defined as follows: ρ = min{d/8, π d₂ q_∧ / (4W^{1.5})}, where d := (1 − b)²γ²/λ_max, d₂ := (1 − b)γ, λ_max is the maximum eigenvalue of R̄, b = max_{j∈C₀,k} β̄_{j,k}, and C₀ is the set of non-novel words. Finally, q_∧ is the minimum solid angle of the extreme points of the convex hull of the rows of E.

The detailed proof is presented in the appendix. The results in Eq. (6) provide a sufficient finite sample complexity bound for novel word detection. The bound is polynomial with respect to M, W, K, N, log(1/δ), and other model parameters. The number of projections P that impacts the computational complexity scales as log(W)/q_∧² in this sufficient bound, where q_∧ can be upper bounded by 1/K. In practice, we have found that setting P = O(K) is a good choice [5]. We note that the result in Theorem 3 only requires the simplicial condition, which is the minimum condition required for consistent novel word detection (Lemma 1). The theorem continues to hold if the topic prior R̄ satisfies stronger conditions such as affine-independence. We also point out that our proof in this paper holds for any isotropic distribution on the random projection directions d₁, . . . , d_P. The previous result in [5], however, only applies to some specific isotropic distributions such as the spherical Gaussian or the uniform distribution in a unit ball. In practice, we use the spherical Gaussian since sampling from such a prior is simple and requires only O(W) time for generating each random direction.
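As noted after Theorem 2, the constrained regression in Algorithm 3 (lines 4-6) can be implemented by gradient descent. A minimal sketch using projected gradient descent onto the probability simplex (numpy assumed; the step size, iteration count, and projection helper are illustrative choices, not the paper's stopping criterion):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def simplex_regression(y, Y, iters=2000):
    """Solve min_b ||y - bY||^2 s.t. b >= 0, sum(b) = 1 (the constrained
    regression of Algorithm 3) by projected gradient descent."""
    K = Y.shape[0]
    b = np.full(K, 1.0 / K)
    lr = 0.5 / np.linalg.norm(Y @ Y.T, 2)      # step size from the smoothness
    for _ in range(iters):
        grad = 2.0 * (b @ Y - y) @ Y.T
        b = project_simplex(b - lr * grad)
    return b

# Toy: y is a known convex combination of the rows of Y.
Y = np.eye(3)
y = np.array([0.6, 0.3, 0.1])
b = simplex_regression(y, Y)
print(np.round(b, 3))  # [0.6  0.3  0.1]
```

The W such problems (one per word) are decoupled, which is what makes the straightforward parallelization mentioned above possible.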
Next, given the successful detection of the set of novel words for all topics, we have the following result for the accurate estimation of the separable topic matrix β:

Theorem 4. Let the topic matrix β be separable and R̄ be γ_a-affine-independent. Given the successful detection of the novel words for all K distinct topics, the output β̂ of Algorithm 3 converges to β element-wise in probability (up to a column permutation). Specifically, if

M ≥ 2560 W² K log(W⁴K/δ) / (N γ_a² a_min² η⁴ ε²)   (7)

then, ∀i, k, β̂_{i,k} will be ε-close to β_{i,k} with probability at least 1 − δ, for any 0 < ε < 1. Here η is the same as in Theorem 3 and a_min is the minimum value in a.

We note that the sufficient sample complexity bound in Eq. (7) is again polynomial in terms of all the model parameters. Here we only require R̄ to be affine-independent. Combining Theorem 3 and Theorem 4 gives the consistency and sample complexity bounds of our overall approach in Algorithm 1.

VI. EXPERIMENTAL RESULTS

In this section, we present experimental results on both synthetic and real-world datasets. We report different performance measures that have been commonly used in the topic modeling literature. When the ground truth is available (Sec. VI-A), we use the ℓ₁ reconstruction error between the ground truth topics and the estimates after proper topic alignment. For the real-world text corpus in Sec. VI-B, we report the held-out probability, which is a standard measure used in the topic modeling literature. We also qualitatively (semantically) compare the topics extracted by the different approaches using the top probable words for each topic.

A. Semi-synthetic text corpus

In order to validate our proposed algorithm, we generate "semi-synthetic" text corpora by sampling from a synthetic, yet realistic, ground truth topic model.
To ensure that the semi-synthetic data is similar to real-world data in terms of dimensionality, sparsity, and other characteristics, we use the following generative procedure adapted from [5], [7]. We first train an LDA model (with K = 100) on a real-world dataset using a standard Gibbs sampling method with default parameters (as described in [11], [33]) to obtain a topic matrix β₀ of size W × K. The real-world dataset that we use to generate our synthetic data is derived from a New York Times (NYT) articles dataset [8]. The original vocabulary is first pruned based on document frequencies. Specifically, as is standard practice, only words that appear in more than 500 documents are retained. Thereafter, again as per standard practice, the words in the so-called stop-word list are deleted as recommended in [34]. After these steps, M = 300,000, W = 14,943, and the average document length is N = 298. We then generate semi-synthetic datasets, for various values of M, by fixing N = 300 and using β₀ and a Dirichlet topic prior. As suggested in [11] and used in [5], [7], we use symmetric hyper-parameters (0.03) for the Dirichlet topic prior.

The W × K topic matrix β₀ may not be separable. To enforce separability, we create a new separable (W + K) × K dimensional topic matrix β_sep by inserting K synthetic novel words (one per topic) having suitable probabilities in each topic. Specifically, β_sep is constructed by transforming β₀ as follows. First, for each synthetic novel word in β_sep, the value of the sole nonzero entry in its row is set to the probability of the most probable word in the topic (column) of β₀ for which it is a novel word. Then the resulting (W + K) × K dimensional nonnegative matrix is renormalized column-wise to make it column-stochastic.
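The construction of β_sep described above can be sketched as follows (numpy assumed; the toy β₀ is random rather than LDA-trained, and the helper name is illustrative):

```python
import numpy as np

def make_separable(beta0):
    """Insert one synthetic novel word per topic into a W x K topic matrix:
    each novel word's probability is set to that of its topic's most
    probable word, then columns are renormalized (the beta_sep construction)."""
    novel = np.diag(beta0.max(axis=0))      # K novel-word rows, one per topic
    beta_sep = np.vstack([novel, beta0])    # (W + K) x K nonnegative matrix
    return beta_sep / beta_sep.sum(axis=0, keepdims=True)  # column-stochastic

rng = np.random.default_rng(0)
beta0 = rng.random((10, 3))
beta0 /= beta0.sum(axis=0, keepdims=True)   # toy column-stochastic topics
beta_sep = make_separable(beta0)
print(beta_sep.shape)  # (13, 3)
```

Each of the first K rows of the result has exactly one nonzero entry, so the matrix is separable by construction.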
Finally, we generate semi-synthetic datasets, for various values of M, by fixing N = 300 and using β_sep and the same symmetric Dirichlet topic prior used for β₀. We use the name Semi-Syn to refer to datasets that are generated using β₀ and the name Semi-Syn+Novel for datasets generated using β_sep.

In our proposed random-projections-based algorithm, which we call RP, we set P = 150 × K, ζ = 0.05, and ε = 10⁻⁴. We compare RP against the provably efficient algorithm RecoverL2 in [7] and the standard Gibbs-sampling-based LDA algorithm (denoted by Gibbs) in [11], [33]. In order to measure the performance of the different algorithms in our experiments based on semi-synthetic data, we compute the ℓ₁ norm of the reconstruction error between β̂ and β. Since all column permutations of a given topic matrix correspond to the same topic model (for a corresponding permutation of the topic mixing weights), we use a bipartite graph matching algorithm to optimally match the columns of β̂ with those of β (based on minimizing the sum of ℓ₁ distances between all pairs of matching columns) before computing the ℓ₁ norm of the reconstruction error between β̂ and β.

The results on both Semi-Syn+Novel NYT and Semi-Syn NYT are summarized in Fig. 5 for all three algorithms for various choices of the number of documents M. We note that in these figures the ℓ₁ norm of the error has been normalized by the number of topics (K = 100).

Fig. 5. ℓ₁ norm of the error in estimating the topic matrix β for various M (K = 100): (Top) Semi-Syn+Novel NYT; (Bottom) Semi-Syn NYT.
RP is the proposed algorithm, RecoverL2 is a provably efficient algorithm from [7], and Gibbs is the Gibbs sampling approximation algorithm in [11]. In RP, P = 150K, ζ = 0.05, and ε = 10⁻⁴.

As Fig. 5 shows, when the separability condition is strictly satisfied (Semi-Syn+Novel), the reconstruction error of RP converges to 0 as M becomes large and outperforms the approximation-based Gibbs. When the separability condition is not strictly satisfied (Semi-Syn), the reconstruction error of RP is comparable to that of Gibbs (a practical benchmark).

Solid Angle and Model Selection: In our proposed algorithm RP, the number of topics K (the model order) needs to be specified. When K is unavailable, it needs to be estimated from the data. Although not the focus of this work, Algorithm 2, which identifies novel words by sorting and clustering the estimated solid angles of words, can be suitably modified to estimate K.

Indeed, in the ideal scenario where there is no sampling noise (M = ∞, Ê = E, and ∀i, q̂_i = q_i), only novel words have positive solid angles (q̂_i's) and the rows of Ê corresponding to the novel words of the same topic are identical, i.e., the distance between the rows is zero or, equivalently, they are within a neighborhood of size zero of each other. Thus, the number of distinct neighborhoods of size zero among the nonzero solid angle words equals K. In the nonideal case, M is finite. If M is sufficiently large, one can expect that the estimated solid angles of non-novel words will not all be zero. They are, however, likely to be much smaller than those of novel words. Thus, to reliably estimate K, one should not only exclude words with exactly zero solid angle estimates, but retain only those above some nonzero threshold.
When M is finite, the rows of Ê corresponding to the novel words of the same topic are unlikely to be identical, but if M is sufficiently large they are likely to be close to each other. Thus, if the threshold ζ in Algorithm 2, which determines the size of the neighborhood for clustering all novel words belonging to the same topic, is made sufficiently small, then each neighborhood will contain only novel words belonging to the same topic. With the two modifications discussed above, the number of distinct neighborhoods of a suitably nonzero size (determined by ζ > 0) among the words whose solid-angle estimates are larger than some threshold τ > 0 provides an estimate of K. The values of τ and ζ should, in principle, decrease to zero as M increases to infinity. Leaving the task of unraveling the dependence of τ and ζ on M to future work, here we only provide a brief empirical validation on both the Semi-Syn+Novel and Semi-Syn NYT datasets. We set M = 2,000,000 so that the reconstruction error has essentially converged (see Fig. 5) and consider different choices of the threshold ζ. We run Algorithm 2 with K = 100, P = 150 × K, and a new line of code, 16': (if {q̂_i = 0}, break);, inserted between lines 16 and 17 (this corresponds to τ = 0). The input hyperparameter K = 100 is not the actual number of estimated topics; it should be interpreted as an upper bound on the number of topics. The value of (little) k when Algorithm 2 terminates (see lines 14–21) provides an estimate of the number of topics. Figure 6 illustrates how the solid angles of all words, sorted in descending order, decay for different choices of ζ and how they can be used to detect the novel words and estimate the value of K.
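The two-threshold rule just described can be sketched as follows. This is an illustrative reimplementation, not Algorithm 2 itself: `q_hat` (estimated solid angles) and `E_hat` (estimated co-occurrence rows) are hypothetical inputs, and the greedy single-pass clustering is our simplification:

```python
import numpy as np

def estimate_num_topics(q_hat, E_hat, tau, zeta):
    """Estimate K: keep words with solid angle above tau, then count
    distinct zeta-neighborhoods among their rows of E_hat.

    q_hat : (W,) estimated solid angles.
    E_hat : (W, W) estimated normalized co-occurrence matrix.
    """
    candidates = [w for w in range(len(q_hat)) if q_hat[w] > tau]
    centers = []  # one representative row per detected topic
    for w in candidates:
        row = E_hat[w]
        # open a new cluster iff the row is not within zeta of any center
        if all(np.linalg.norm(row - c) > zeta for c in centers):
            centers.append(row)
    return len(centers)
```

Consistently with the discussion above, an overly large ζ merges clusters of distinct topics (underestimating K), while an overly small τ admits non-novel words.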
We note that on both semi-synthetic datasets, for a wide range of values of ζ (0.1–5), the modified Algorithm 2 correctly estimates the value of K as 100. When ζ is large (e.g., ζ = 10 in Fig. 6), many interior points are declared as novel words and multiple ideal novel words are grouped into one cluster. This causes K to be underestimated (46 and 41 in Fig. 6).

B. Real-world data

We now describe results on the actual real-world NYT dataset that was used in Sec. VI-A to construct the semi-synthetic datasets. Since ground-truth topics are unavailable, we measure performance using the so-called predictive held-out log-probability. This is a standard measure which is typically used to evaluate how well a learned topic model fits real-world data. To calculate it for each of the three topic estimation methods (Gibbs [11], [33], RecoverL2 [7], and RP), we first randomly select 60,000 documents to test the goodness of fit and use the remaining 240,000 documents to produce an estimate β̂ of the topic matrix. Next, we assume a Dirichlet prior on the topics and estimate its concentration hyperparameter α. In Gibbs, this estimate α̂ is a byproduct of the algorithm. In RecoverL2 and RP it can be estimated

[Figure: solid angles of all words, sorted in descending order, for ζ ∈ {0.1, 1, 5, 10}; boxed values mark the resulting estimates of K.]
Fig. 6. Solid angles (in descending order) of all 14943 + 100 words in the Semi-Syn+Novel NYT dataset (left) and all 14943 words in the Semi-Syn NYT dataset (right) estimated (for different values of ζ) by Algorithm 2 with K = 100, P = 150 × K, M = 2,000,000, and a new line of code, 16': (if {q̂_i = 0}, break);, inserted between lines 16 and 17.
The values of j and (little) k when Algorithm 2 terminates are indicated, respectively, by the position of the vertical dashed line and the rectangular box next to it for different ζ.

from β̂ and X. We then calculate the probability of observing the test documents given the learned topic model β̂ and α̂: log Pr(X_test | β̂, α̂). Since an exact evaluation of this predictive log-likelihood is intractable in general, we calculate it using the MCMC-based approximation proposed in [19], which is now a standard approximation tool [33]. For RP, we use P = 150 × K, ζ = 0.05, and ε = 10^−4 as in Sec. VI-A. We report the held-out log-probability, normalized by the total number of words in the test documents, averaged across 5 training/testing splits. The results are summarized in Table I.

TABLE I
NORMALIZED HELD-OUT LOG-PROBABILITY OF RP, RECOVERL2, AND GIBBS SAMPLING ON NYT TEST DATA. THE MEANS ± STD'S ARE CALCULATED FROM 5 DIFFERENT RANDOM TRAINING-TESTING SPLITS.

K      RecoverL2       Gibbs           RP
50     -8.22 ± 0.56    -7.42 ± 0.45    -8.54 ± 0.52
100    -7.63 ± 0.52    -7.50 ± 0.47    -7.45 ± 0.51
150    -8.03 ± 0.38    -7.31 ± 0.41    -7.84 ± 0.48
200    -7.85 ± 0.40    -7.34 ± 0.44    -7.69 ± 0.42

As shown in Table I, Gibbs has the best descriptive power for new documents. RP and RecoverL2 have similar, but somewhat lower, values than Gibbs. This may be attributed to missing novel words that appear only in the test set and are crucial to the success of RecoverL2 and RP. Specifically, in real-world examples there is a model mismatch, as a result of which the data likelihoods of RP and RecoverL2 suffer. Finally, we qualitatively assess the topics produced by our RP algorithm.
We show some example topics extracted by RP trained on the entire NYT dataset of M = 300,000 documents in Table II.³ For each topic, its most frequent words are listed. As can be seen, the estimated topics do form recognizable themes that can be assigned meaningful labels. The full list of all K = 100 topics estimated on the NYT dataset can be found in [3].

³ The zzz prefix in the NYT vocabulary is used to annotate certain special named entities. For example, zzz_nfl annotates NFL.

TABLE II
EXAMPLES OF TOPICS ESTIMATED BY RP ON NYT

Topic label    Words in decreasing order of estimated probabilities
"weather"      weather wind air storm rain cold
"feeling"      feeling sense love character heart emotion
"election"     election zzz_florida ballot vote zzz_al_gore recount
"game"         yard game team season play zzz_nfl

VII. CONCLUSION AND DISCUSSION

This paper proposed a provably consistent and efficient algorithm for topic discovery. We considered a natural structural property of the topic matrix, topic separability, and exploited its geometric implications. We established necessary and sufficient conditions that guarantee consistent novel-word detection as well as separable topic estimation. We then proposed a random-projections-based algorithm that has not only provably polynomial statistical and computational complexity but also state-of-the-art performance on semi-synthetic and real-world datasets.

While we focused on the standard centralized batch implementation in this paper, it turns out that our random-projections-based scheme is naturally amenable to an efficient distributed implementation, which is of interest when the documents are stored on a network of distributed servers.
This is because the i.i.d. isotropic projection directions can be precomputed and shared across document servers, and the counts, projections, and co-occurrence matrix computations have an additive structure which allows partial computations to be performed locally at each document server and then aggregated at a fusion center with only a small communication cost. It turns out that the distributed implementation can provably match the polynomial computational and statistical efficiency guarantees of its centralized counterpart. As a consequence, it provides a provably efficient alternative for the distributed topic estimation problem, which has been tackled using variations of MCMC or Variational Bayes in the literature [20], [35]–[37]. This is appealing for modern web-scale databases, e.g., those generated by Twitter streaming. A comprehensive theoretical and empirical investigation of the distributed variation of our algorithm can be found in [5].

Separability of general measures: We defined and studied the notion of separability for a W × K topic matrix β, which is a finite collection of K probability distributions over a finite set (of size W). It turns out that we can extend the notion of separability to a finite collection of measures over a measurable space. This necessitates a small technical modification of the definition of separability to accommodate the possibility of only having "novel subsets" of zero measure. We also show that our generalized definition of separability is equivalent to the so-called irreducibility property of a finite collection of measures, which has recently been studied in the context of mixture models to establish conditions for the identifiability of the mixing components [38], [39]. Consider a collection of K measures ν_1, ..., ν_K over a measurable space (X, F), where X is a set and F is a σ-algebra over X.
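The additive structure can be illustrated with a small sketch. This is our toy illustration, not the implementation from [5]; the server partitioning, variable names, and the use of a single matrix for both document halves are our simplifying assumptions. Each server projects its local documents onto the shared directions and ships only a W × P partial sum to the fusion center, which adds them:

```python
import numpy as np

rng = np.random.default_rng(0)
W, P = 20, 5                      # vocabulary size, number of projections
D = rng.standard_normal((W, P))   # shared i.i.d. isotropic directions

def local_stats(X1, X2, D):
    """Partial projection sums for one server's local documents.

    X1, X2 : (W, M_s) normalized halves of the local documents.
    Returns the (W, P) matrix sum_m x2_m (x1_m^T D), which is additive
    across servers because the sum over documents distributes over them.
    """
    return X2 @ (X1.T @ D)

# three servers, each holding a local batch of documents (columns)
batches = [rng.dirichlet(np.ones(W), size=m).T for m in (7, 4, 9)]
# fusion center: add the partial sums from each server
V_dist = sum(local_stats(X, X, D) for X in batches)
# centralized computation on the pooled corpus gives the same result
X_all = np.hstack(batches)
V_cent = local_stats(X_all, X_all, D)
assert np.allclose(V_dist, V_cent)
```

The communication cost per server is one W × P matrix, independent of the number of local documents, which is what makes the scheme attractive at web scale.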
We define the generalized notion of separability for measures as follows.

Definition 2 (Separability). A collection of K measures ν_1, ..., ν_K over a measurable space (X, F) is separable if for all k = 1, ..., K,

    inf_{A ∈ F : ν_k(A) > 0}  max_{j ≠ k}  ν_j(A) / ν_k(A) = 0.    (8)

Separability requires that for each measure ν_k there exists a sequence of measurable sets A_n^(k), of nonzero measure with respect to ν_k, such that, for all j ≠ k, the ratios ν_j(A_n^(k)) / ν_k(A_n^(k)) vanish asymptotically. Intuitively, this means that for each measure there exists a sequence of nonzero-measure measurable subsets that are asymptotically "novel" for that measure. When X is a finite set, as in topic modeling, this reduces to the existence of novel words as in Definition 1, and the A_n^(k) are simply the sets of novel words for topic k. The separability property just defined is equivalent to the so-called irreducibility property. Informally, a collection of measures is irreducible if only nonnegative linear combinations of them can produce a measure. Formally,

Definition 3 (Irreducibility). A collection of K measures ν_1, ..., ν_K over a measurable space (X, F) is irreducible if the following condition holds: if Σ_{k=1}^K c_k ν_k(A) ≥ 0 for all A ∈ F, then c_k ≥ 0 for all k = 1, ..., K.

For a collection of nonzero measures,⁴ these two properties are equivalent. Formally,

Lemma 9. A collection of nonzero measures ν_1, ..., ν_K over a measurable space (X, F) is irreducible if and only if it is separable. In particular, a topic matrix β is irreducible if and only if it is separable.

The proof appears in Appendix M.
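In the finite case, Definition 2 reduces to checking for novel words: topic k has a novel word w exactly when β_wk > 0 and β_wj = 0 for all j ≠ k. A minimal sketch of this check (our illustration; the tolerance parameter is an assumption for handling numerical noise):

```python
import numpy as np

def is_separable(beta, tol=0.0):
    """Check (finite) separability of a W x K topic matrix: every
    topic k must have a novel word, i.e., a word whose probability is
    positive under topic k and (up to tol) zero under every other topic."""
    W, K = beta.shape
    for k in range(K):
        others = np.delete(beta, k, axis=1)
        has_novel = np.any((beta[:, k] > tol) & np.all(others <= tol, axis=1))
        if not has_novel:
            return False
    return True
```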
Topic models like LDA discussed in this paper belong to the larger family of Mixed Membership Latent Variable Models [13], which have been successfully employed in a variety of problems that include text analysis, genetic analysis, network community detection, and ranking and preference discovery.

⁴ A measure ν is nonzero if there exists at least one measurable set A for which ν(A) > 0.

The structure-leveraging approach proposed in this paper can potentially be extended to this larger family of models. Some initial steps in this direction for rank and preference data are explored in [32]. Finally, throughout this paper the topic matrix is assumed to be separable. While exact separability may be an idealization, as shown in [16], approximate separability is both theoretically inevitable and practically encountered when W ≫ K. Extending the results of this work to approximately separable topic matrices is an interesting direction for future work. Some steps in this direction are explored in [40] in the context of learning mixed membership Mallows models for rankings.

ACKNOWLEDGMENT

This article is based upon work supported by the U.S. AFOSR under award number #FA9550-10-1-0458 (subaward #A1795) and the U.S. NSF under award numbers #1527618 and #1218992. The views and conclusions contained in this article are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the agencies.

APPENDIX

A. Proof of Lemma 1

Proof. The proof is by contradiction. We will show that if R̄ is non-simplicial, we can construct two topic matrices β^(1) and β^(2) whose sets of novel words are not identical and yet X has the same distribution under both models. The difference between the constructed β^(1) and β^(2) is not a result of a column permutation.
This will imply the impossibility of consistent novel-word detection. Suppose R̄ is non-simplicial. Then we can assume, without loss of generality, that its first row lies within the convex hull of the remaining rows, i.e., R̄_1 = Σ_{j=2}^K c_j R̄_j, where R̄_j denotes the j-th row of R̄ and c_2, ..., c_K ≥ 0, Σ_{j=2}^K c_j = 1, are convex combination weights. Compactly, e^⊤ R̄ e = 0, where e := [−1, c_2, ..., c_K]^⊤. Recalling that R̄ = diag(a)^{-1} R diag(a)^{-1}, where a is a positive vector and R = E(θ_m θ_m^⊤) by definition, we have

    0 = e^⊤ R̄ e = (diag(a)^{-1} e)^⊤ E(θ_m θ_m^⊤)(diag(a)^{-1} e) = E(‖θ_m^⊤ diag(a)^{-1} e‖_2^2),

which implies that θ_m^⊤ diag(a)^{-1} e = 0 almost surely. From this it follows that if we define two nonnegative row vectors b_1 := b [a_1^{-1}, 0, ..., 0] and b_2 := b [(1−α) a_1^{-1}, α c_2 a_2^{-1}, ..., α c_K a_K^{-1}], where b > 0 and 0 < α < 1 are constants, then b_1 θ_m = b_2 θ_m almost surely for any distribution on θ_m. Now we construct two separable topic matrices β^(1) and β^(2) as follows. Let b_1 be the first row and b_2 the second row of β^(1). Let b_2 be the first row and b_1 the second row of β^(2). Let B ∈ R^{(W−2)×K} be a valid separable topic matrix. Set the remaining (W − 2) rows of both β^(1) and β^(2) to be B(I_K − diag(b_1 + b_2)). We can choose b small enough to ensure that each element of (b_1 + b_2) is strictly less than 1. This ensures that β^(1) and β^(2) are column-stochastic and therefore valid separable topic matrices. Observe that b_2 has at least two nonzero components. Thus, word 1 is novel for β^(1) but non-novel for β^(2). By construction, β^(1) θ = β^(2) θ almost surely, i.e., the distribution of X conditioned on θ is the same for both models. Marginalizing over θ, the distribution of X under each topic matrix is the same.
Thus no algorithm can consistently distinguish between β^(1) and β^(2) based on X.

B. Proof of Lemma 2

Proof. The proof is by contradiction. Suppose that R̄ is not affine-independent. Then there exists a λ ≠ 0 with 1^⊤ λ = 0 such that λ^⊤ R̄ = 0, so that λ^⊤ R̄ λ = 0. Recalling that R̄ = diag(a)^{-1} R diag(a)^{-1}, we have

    0 = λ^⊤ R̄ λ = (diag(a)^{-1} λ)^⊤ E(θ_m θ_m^⊤)(diag(a)^{-1} λ) = E(‖θ_m^⊤ diag(a)^{-1} λ‖²),

which implies that θ_m^⊤ diag(a)^{-1} λ = 0 almost surely. Since λ ≠ 0, we can assume, without loss of generality, that the first t elements λ_1, ..., λ_t > 0, the next s elements λ_{t+1}, ..., λ_{t+s} < 0, and the remaining elements are 0, for some s, t with s > 0, t > 0, s + t ≤ K. Therefore, if we define two nonnegative and nonzero row vectors b_1 := b [λ_1 a_1^{-1}, ..., λ_t a_t^{-1}, 0, ..., 0] and b_2 := −b [0, ..., 0, λ_{t+1} a_{t+1}^{-1}, ..., λ_{t+s} a_{t+s}^{-1}, 0, ..., 0], where b > 0 is a constant, then b_1 θ_m = b_2 θ_m almost surely. Now we construct two topic matrices β^(1) and β^(2) as follows. Let b_1 be the first row and b_2 the second row of β^(1). Let b_2 be the first row and b_1 the second row of β^(2). Let B ∈ R^{(W−2)×K} be a valid topic matrix and assume that it is separable. Set the remaining (W − 2) rows of both β^(1) and β^(2) to be B(I_K − diag(b_1 + b_2)). We can choose b small enough to ensure that each element of (b_1 + b_2) is strictly less than 1. This ensures that β^(1) and β^(2) are column-stochastic and therefore valid topic matrices. We note that the supports of b_1 and b_2 are disjoint and both are nonempty; they appear in distinct topics. By construction, β^(1) θ = β^(2) θ almost surely, so the distribution of the observation X conditioned on θ is the same for both models. Marginalizing over θ, the distributions of X under the two topic matrices are the same.
Thus no algorithm can distinguish between β^(1) and β^(2) based on X.

C. Proof of Proposition 1 and Proposition 2

Propositions 1 and 2 summarize the relationships between the full-rank, affine-independence, simplicial, and diagonal-dominance conditions. Here we consider each pairwise implication separately.

(1) R̄ is γ_a-affine-independent ⇒ R̄ is at least γ_a-simplicial.

Proof. By the definition of affine independence, ‖Σ_{k=1}^K λ_k R̄_k‖_2 ≥ γ_a ‖λ‖_2 > 0 for all λ ∈ R^K such that Σ_{k=1}^K λ_k = 0 and λ ≠ 0. If for each i ∈ [K] we set λ_i = 1 and choose λ_k ≤ 0 for all k ≠ i, then (i) ‖λ‖_2 ≥ 1, (ii) {−λ_k, k ≠ i} are convex weights, i.e., they are nonnegative and sum to 1, and (iii) Σ_{k=1}^K λ_k R̄_k = R̄_i − Σ_{k≠i} (−λ_k) R̄_k. Therefore, for all i ∈ [K], ‖R̄_i − Σ_{k≠i} (−λ_k) R̄_k‖_2 ≥ γ_a > 0, which proves that R̄ is at least γ_a-simplicial.

For the reverse implication, consider

    R̄ = [ 1    0    0.5  0.5
           0    1    0.5  0.5
           0.5  0.5  1    0
           0.5  0.5  0    1 ].

It is simplicial but not affine-independent (the (1, 1, −1, −1) combination of its 4 rows is 0).

(2) R̄ is full rank with minimum eigenvalue γ_r ⇒ R̄ is at least γ_r-affine-independent.

Proof. The Rayleigh-quotient characterization of the minimum eigenvalue of a symmetric positive-definite matrix R̄ gives min_{λ≠0} ‖λ^⊤ R̄‖_2 / ‖λ‖_2 = γ_r > 0. Therefore, min_{λ≠0, 1^⊤λ=0} ‖λ^⊤ R̄‖_2 / ‖λ‖_2 ≥ γ_r > 0. One can construct examples that contradict the reverse implication:

    R̄ = [ 1  0  1
           0  1  1
           1  1  2 ],

which is affine-independent but not linearly independent.

(3) R̄ is γ_d-diagonal-dominant ⇒ R̄ is at least γ_d-simplicial.

Proof. Noting that R̄_{i,i} − R̄_{i,j} ≥ γ_d > 0 for all i ≠ j, the distance of the first row R̄_1 of R̄ to any convex combination Σ_{j=2}^K c_j R̄_j of the remaining rows, where c_2, . . .
, c_K are convex combination weights, can be lower bounded by

    ‖R̄_1 − Σ_{j=2}^K c_j R̄_j‖_2 ≥ |R̄_{1,1} − Σ_{j=2}^K c_j R̄_{j,1}| = |Σ_{j=2}^K c_j (R̄_{1,1} − R̄_{j,1})| ≥ γ_d > 0.

Therefore, R̄ is at least γ_d-simplicial. It is straightforward to construct examples that contradict the reverse implication:

    R̄ = [ 1  0  1
           0  1  1
           1  1  2 ],

which is affine-independent, hence simplicial, but not diagonal-dominant.

(4) R̄ being diagonal-dominant neither implies nor is implied by R̄ being affine-independent.

Proof. Consider the two examples above: the 3 × 3 matrix is affine-independent but not diagonal-dominant, while the 4 × 4 matrix is diagonal-dominant but not affine-independent. Together they establish both sides of this assertion.

D. Proof of Lemma 3

Proof. Recall that Ā = β̄ θ̄, where Ā and θ̄ are the row-normalized versions of A and θ, and β̄ := diag(A1)^{-1} β diag(θ1). β̄ is row-stochastic and is separable if β is separable. If w is a novel word of topic k, then β̄_{wk} = 1 and β̄_{wj} = 0 for all j ≠ k, and hence Ā_w = θ̄_k. If w is a non-novel word, Ā_w = Σ_k β̄_{wk} θ̄_k is a convex combination of the rows of θ̄. We next prove that if R̄ is γ_s-simplicial for some constant γ_s > 0, then the random matrix θ̄ is also simplicial with high probability, i.e., for any c ∈ R^K such that c_k = 1, c_j ≤ 0 for j ≠ k, Σ_{j≠k} −c_j = 1, k ∈ [K], the M-dimensional vector c^⊤ θ̄ is not all-zero with high probability. In other words, we need to show that the maximum absolute value of the M entries of c^⊤ θ̄ is strictly positive.
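The two counterexample matrices above can be checked numerically; the following is a quick sanity check of ours, not part of the original proofs:

```python
import numpy as np

R4 = np.array([[1.0, 0.0, 0.5, 0.5],
               [0.0, 1.0, 0.5, 0.5],
               [0.5, 0.5, 1.0, 0.0],
               [0.5, 0.5, 0.0, 1.0]])
# (1, 1, -1, -1) sums to zero and annihilates the rows, so R4 is
# affine-DEPENDENT even though it is simplicial.
lam = np.array([1, 1, -1, -1])
assert lam.sum() == 0 and np.allclose(lam @ R4, 0)
# Yet R4 is diagonal-dominant: R4[i,i] - R4[i,j] >= 0.5 for j != i.
off = R4.max(axis=1, where=~np.eye(4, dtype=bool), initial=-np.inf)
assert np.all(np.diag(R4) - off >= 0.5)

R3 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0],
               [1.0, 1.0, 2.0]])
# Rank 2, so the rows are not linearly independent ...
assert np.linalg.matrix_rank(R3) == 2
# ... yet R3 is affine-independent: the only annihilating row
# combination (up to scaling) is (1, 1, -1), which does not sum to 0.
assert np.allclose(np.array([1, 1, -1]) @ R3, 0)
assert not np.isclose(np.array([1, 1, -1]).sum(), 0)
```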
Noting that the m-th entry of c^⊤ θ̄ (scaled by M) is

    M c^⊤ θ̄_m = c^⊤ diag(a)^{-1} θ_m + c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m,

its absolute value can be lower bounded as follows:

    |M c^⊤ θ̄_m| ≥ |c^⊤ diag(a)^{-1} θ_m| − |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m|.   (9)

The key ideas are: (i) as M increases, the second term in Eq. (9) converges to 0, and (ii) the maximum of the first term in Eq. (9) over m = 1, ..., M is strictly above zero with high probability.

For (i), recall that a = E(θ_m) and 0 ≤ θ_{mk} ≤ 1. By Hoeffding's lemma, for all t > 0,

    Pr(‖(1/M) Σ_d θ_d − a‖_∞ ≥ t) ≤ 2K exp(−2Mt²).

Also note that for all 0 < ε < 1,

    ‖(1/M) Σ_d θ_d − a‖_∞ ≤ ε a_min² / 2
      ⇒ ‖diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}‖_∞ ≤ ε
      ⇒ |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| ≤ ε,

where a_min is the minimum entry of a. The last inequality holds since Σ_{k=1}^K θ_{mk} = 1. In sum, we have

    Pr(|c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| > ε) ≤ 2K exp(−M ε² a_min⁴ / 2).   (10)

For (ii), recall that R̄ is γ_s-simplicial and ‖c^⊤ R̄‖ ≥ γ_s. Therefore, c^⊤ R̄ c = c^⊤ R̄ R̄^† R̄ c ≥ γ_s² / λ_max, where λ_max is the maximum singular value of R̄. Noting that R̄ = diag(a)^{-1} E(θ_m θ_m^⊤) diag(a)^{-1}, we get

    E(|c^⊤ diag(a)^{-1} θ_m|²) ≥ γ_s² / λ_max.   (11)

For convenience, let x_m := |c^⊤ diag(a)^{-1} θ_m|² ≤ 1/a_min². Then, by Hoeffding's lemma,

    Pr(E(x_m) − (1/M) Σ_{m=1}^M x_m ≥ γ_s² / (2λ_max)) ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)).

Combining with Eq. (11), we get

    Pr((1/M) Σ_{m=1}^M x_m ≤ γ_s² / (2λ_max)) ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)).

Hence

    Pr(max_{m=1,...,M} x_m ≤ γ_s² / (2λ_max)) ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)),   (12)

i.e., the maximum absolute value of the first term in Eq. (9) is greater than γ_s / √(2λ_max) with high probability. To sum up, if we set ε = γ_s / √(2λ_max) in Eq.
(10), we get

    Pr(max_{m=1,...,M} |c^⊤ θ̄_m| = 0)
      ≤ Pr(max_{m=1,...,M} x_m ≤ γ_s² / (2λ_max)) + Pr(|c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| > ε)
      ≤ exp(−M γ_s⁴ a_min⁴ / (2λ_max²)) + 2K exp(−M γ_s² a_min⁴ / (4λ_max)).

To summarize, the probability that θ̄ is not simplicial is at most exp(−M γ_s⁴ a_min⁴ / (2λ_max²)) + 2K exp(−M γ_s² a_min⁴ / (4λ_max)), which converges to 0 exponentially fast as M → ∞. Therefore, with high probability, all the row vectors of θ̄ are extreme points of the convex hull they form. This concludes our proof.

E. Proof of Lemma 4

Proof. We first show that if R̄ is γ_a-affine-independent, then θ̄ is also affine-independent with high probability, i.e., for all c ∈ R^K such that c ≠ 0 and Σ_k c_k = 0, c^⊤ θ̄ is not the all-zero vector with high probability. Our proof is similar to that of Lemma 3. We first rewrite the m-th entry of c^⊤ θ̄ (with some scaling) as

    M c^⊤ θ̄_m = c^⊤ diag(a)^{-1} θ_m + c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m

and lower bound its absolute value by

    |M c^⊤ θ̄_m| ≥ |c^⊤ diag(a)^{-1} θ_m| − |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m|.   (13)

We will then show that: (i) as M increases, the second term in Eq. (13) converges to 0, and (ii) the maximum of the first term in Eq. (13) among the M i.i.d. samples is strictly above zero with high probability.

For (i), by the Cauchy-Schwarz inequality,

    |c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m|
      ≤ ‖c‖_2 ‖(diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m‖_2
      ≤ ‖c‖_2 ‖diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}‖_∞.

Here the last inequality holds since θ_{mk} ≤ 1 and Σ_k θ_{mk} = 1. Similarly to Eq. (10), we have

    Pr(|c^⊤ (diag((1/M) Σ_d θ_d)^{-1} − diag(a)^{-1}) θ_m| ≥ ‖c‖_2 ε) ≤ 2K exp(−M ε² a_min⁴ / 4)   (14)

for any 0 < ε < 1, where a_min is the minimum entry of a.

For (ii), recall that by definition ‖c^⊤ R̄‖_2 ≥ γ_a ‖c‖_2. Hence c^⊤ R̄ c ≥ γ_a² ‖c‖_2² / λ_max.
Therefore, by the construction of R̄, we have

    E(|c^⊤ diag(a)^{-1} θ_m|² / ‖c‖_2²) ≥ γ_a² / λ_max.   (15)

For convenience, let x_m := |c^⊤ diag(a)^{-1} θ_m|² / ‖c‖_2² ≤ 1/a_min². Following the same procedure as in Eq. (12), we have

    Pr(max_{m=1,...,M} x_m ≤ γ_a² / (2λ_max)) ≤ exp(−M γ_a⁴ a_min⁴ / (2λ_max²)).   (16)

Therefore, if we set ε = γ_a / √(2λ_max) in Eq. (14), we get

    Pr(max_{m=1,...,M} |c^⊤ θ̄_m| = 0) ≤ exp(−M γ_a⁴ a_min⁴ / (2λ_max²)) + 2K exp(−M γ_a² a_min⁴ / (4λ_max)).

In summary, if R̄ is γ_a-affine-independent, then θ̄ is also affine-independent with high probability.

Now we turn to the proof of Lemma 4. By Lemma 3, detecting K distinct novel words for the K topics is equivalent to knowing θ̄ up to a row permutation. Noting that Ā_w = Σ_k β̄_{wk} θ̄_k, it follows that β̄_{wk}, k = 1, ..., K, is an optimal solution of the following constrained optimization problem:

    min ‖Ā_w − Σ_{k=1}^K b_k θ̄_k‖_2   s.t.   b_k ≥ 0,  Σ_{k=1}^K b_k = 1.

Since θ̄ is affine-independent with high probability, this optimal solution is unique with high probability. If this were not true, there would exist two distinct solutions b¹_1, ..., b¹_K and b²_1, ..., b²_K such that Ā_w = Σ_{k=1}^K b¹_k θ̄_k = Σ_{k=1}^K b²_k θ̄_k with Σ_k b¹_k = Σ_k b²_k = 1. We would then obtain

    Σ_{k=1}^K (b¹_k − b²_k) θ̄_k = 0,

where the coefficients b¹_k − b²_k are not all zero and Σ_k (b¹_k − b²_k) = 0. This would contradict the definition of affine independence.

Finally, we check the renormalization steps. Recall that since diag(A1) β̄ = β diag(θ1), diag(A1) can be directly obtained from the observations. So we can first renormalize the rows of β̄. Removing diag(θ1) is then simply a column renormalization operation (recall that β is column-stochastic). It is not necessary to know the exact value of diag(θ1).
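The constrained linear regression above, a least-squares fit over the probability simplex, can be sketched with a generic solver. This is our illustration under stated assumptions: SciPy's SLSQP is just one convenient choice of solver, and the variable names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def simplex_regression(A_w, Theta):
    """Solve  min_b || A_w - b @ Theta ||_2^2  s.t.  b >= 0, sum(b) = 1.

    A_w   : (M,) a row of the normalized document matrix.
    Theta : (K, M) rows corresponding to the K detected novel words.
    """
    K = Theta.shape[0]
    res = minimize(
        lambda b: np.sum((A_w - b @ Theta) ** 2),
        x0=np.full(K, 1.0 / K),                # start at the simplex center
        bounds=[(0.0, 1.0)] * K,               # b_k >= 0
        constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```

When the rows of θ̄ are affine-independent, the minimizer is unique and recovers the row β̄_w of the normalized topic matrix.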
To sum up, by solving a constrained linear regression followed by suitable row renormalization, we can obtain a unique solution, which is the ground-truth topic matrix. This concludes the proof of Lemma 4.

F. Proof of Lemma 5

Lemma 5 establishes the second-order co-occurrence estimator in Eq. (1). We first provide a generic method for establishing an explicit convergence bound for a function ψ(X) of d random variables X_1, ..., X_d, and then apply it to establish Lemma 5.

Proposition 3. Let X = [X_1, ..., X_d] be d random variables and a = [a_1, ..., a_d] be positive constants. Let E := ∪_{i∈I} {|X_i − a_i| ≥ δ_i} for some constants δ_i > 0, and let ψ(X) be continuously differentiable on C := E^c. If Pr(|X_i − a_i| ≥ ε) ≤ f_i(ε), i = 1, ..., d, are the individual convergence rates and max_{X∈C} |∂_i ψ(X)| ≤ C_i, then

    Pr(|ψ(X) − ψ(a)| ≥ ε) ≤ Σ_{i∈I} f_i(δ_i) + Σ_{i=1}^d f_i(ε / (d C_i)).

Proof. Since ψ(X) is continuously differentiable on C, for all X ∈ C there exists λ ∈ (0, 1) such that

    ψ(X) − ψ(a) = ∇^⊤ ψ((1 − λ)a + λX) · (X − a).

Therefore,

    Pr(|ψ(X) − ψ(a)| ≥ ε)
      ≤ Pr(X ∈ E) + Pr(Σ_{i=1}^d |∂_i ψ((1 − λ)a + λX)| |X_i − a_i| ≥ ε | X ∈ C)
      ≤ Σ_{i∈I} Pr(|X_i − a_i| ≥ δ_i) + Σ_{i=1}^d Pr(max_{x∈C} |∂_i ψ(x)| |X_i − a_i| ≥ ε/d)
      ≤ Σ_{i∈I} f_i(δ_i) + Σ_{i=1}^d f_i(ε / (d C_i)).

Now we turn to the proof of Lemma 5. Recall that X̄ and X̄′ are obtained from X by first splitting each document's words into two independent halves and then rescaling the rows to make them row-stochastic; hence X̄ = diag^{-1}(X1) X. Also recall that β̄ = diag^{-1}(βa) β diag(a), R̄ = diag^{-1}(a) R diag^{-1}(a), and β̄ is row-stochastic.
For any 1 ≤ i, j ≤ W,

    Ê_{i,j} = [ (1/M) Σ_{m=1}^M X′_{i,m} X_{j,m} ] / [ ((1/M) Σ_{m=1}^M X′_{i,m}) ((1/M) Σ_{m=1}^M X_{j,m}) ].

Writing the numerator and denominators in terms of word indicators,

    F_{i,j}(M, N) := (1/(MN²)) Σ_{m=1}^M Σ_{n=1}^N Σ_{n′=1}^N I(w_{m,n} = i) I(w′_{m,n′} = j),
    G_i(M, N) := (1/(MN)) Σ_{m,n} I(w_{m,n} = i),    H_j(M, N) := (1/(MN)) Σ_{m,n′} I(w′_{m,n′} = j),

so that Ê_{i,j} = F_{i,j}(M, N) / (G_i(M, N) H_j(M, N)). From the strong law of large numbers and the generative topic modeling procedure,

    F_{i,j}(M, N) → E(I(w_{m,n} = i) I(w′_{m,n′} = j)) = (β R β^⊤)_{i,j} =: p_{i,j}   a.s.,
    G_i(M, N) → E(I(w_{m,n} = i)) = (βa)_i =: p_i   a.s.,
    H_j(M, N) → E(I(w′_{m,n′} = j)) = (βa)_j =: p_j   a.s.,

and (β R β^⊤)_{i,j} / ((βa)_i (βa)_j) = E_{i,j} by definition. Using McDiarmid's inequality, we obtain

    Pr(|F_{i,j} − p_{i,j}| ≥ ε) ≤ 2 exp(−ε² M N),
    Pr(|G_i − p_i| ≥ ε) ≤ 2 exp(−2ε² M N),
    Pr(|H_j − p_j| ≥ ε) ≤ 2 exp(−2ε² M N).

In order to bound Pr(|F_{i,j}/(G_i H_j) − p_{i,j}/(p_i p_j)| ≥ ε), we apply Proposition 3. Let ψ(x_1, x_2, x_3) = x_1/(x_2 x_3) with x_1, x_2, x_3 > 0, and a_1 = p_{i,j}, a_2 = p_i, a_3 = p_j. Let I = {2, 3}, δ_2 = γ p_i, and δ_3 = γ p_j. Then |∂_1 ψ| = 1/(x_2 x_3), |∂_2 ψ| = x_1/(x_2² x_3), and |∂_3 ψ| = x_1/(x_2 x_3²). With x_1 = F_{i,j}, x_2 = G_i, and x_3 = H_j, we have F_{i,j} ≤ G_i and F_{i,j} ≤ H_j. Then note that

    C_1 = max_C |∂_1 ψ| = max_C 1/(G_i H_j) ≤ 1/((1−γ)² p_i p_j),
    C_2 = max_C |∂_2 ψ| = max_C F_{i,j}/(G_i² H_j) ≤ max_C 1/(G_i H_j) ≤ 1/((1−γ)² p_i p_j),
    C_3 = max_C |∂_3 ψ| = max_C F_{i,j}/(G_i H_j²) ≤ max_C 1/(G_i H_j) ≤ 1/((1−γ)² p_i p_j).

By applying Proposition 3, we get

    Pr(|F_{i,j}/(G_i H_j) − p_{i,j}/(p_i p_j)| ≥ ε)
      ≤ exp(−2γ² p_i² M N) + exp(−2γ² p_j² M N) + 2 exp(−ε² (1−γ)⁴ (p_i p_j)² M N / 9) + 4 exp(−2ε² (1−γ)⁴ (p_i p_j)² M N / 9)
      ≤ 2 exp(−2γ² η² M N) + 6 exp(−ε² (1−γ)⁴ η⁴ M N / 9),

where η = min_{1≤i≤W} p_i.
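The estimator Ê_{i,j} = F_{i,j}/(G_i H_j) can be written directly in terms of the two halves' count matrices. The sketch below is ours, under the assumptions that `X1` and `X2` are W × M word-count matrices for the two halves, each document contributes the same number of words per half, and every word occurs at least once (so no division by zero):

```python
import numpy as np

def cooccurrence_estimate(X1, X2):
    """E_hat[i, j] = F_ij / (G_i * H_j), where F is the empirical joint
    frequency of word i in the first half and word j in the second half
    of the same document, and G, H are the empirical marginal word
    frequencies in each half.  X1, X2 : (W, M) word-count matrices."""
    M = X1.shape[1]
    N1 = X1.sum(axis=0)[0]            # words per document in half 1
    N2 = X2.sum(axis=0)[0]            # words per document in half 2
    F = (X1 @ X2.T) / (M * N1 * N2)   # joint frequencies, -> p_ij
    G = X1.sum(axis=1) / (M * N1)     # marginals in half 1, -> p_i
    H = X2.sum(axis=1) / (M * N2)     # marginals in half 2, -> p_j
    return F / np.outer(G, H)
```

In the degenerate case where every word is equally frequent and the halves are independent, the estimate is the all-ones matrix, as expected from Ê_{i,j} = p_{i,j}/(p_i p_j).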
There are many strategies for optimizing the free parameter γ. We set 2γ² = (1−γ)⁴/9 and solve for γ to obtain

    Pr(|F_{i,j}/(G_i H_j) − p_{i,j}/(p_i p_j)| ≥ ε) ≤ 8 exp(−ε² η⁴ M N / 20).

Finally, by applying the union bound to the W² entries of Ê, we obtain the claimed result.

G. Proof of Lemma 6

Proof. We first show that when R̄ is γ_s-simplicial and β is separable, Y = R̄ β̄^⊤ is at least γ_s-simplicial. Without loss of generality, we assume that words 1, ..., K are the novel words for topics 1 to K. By definition, β̄^⊤ = [I_K, B], hence Y = R̄ β̄^⊤ = [R̄, R̄B]. Therefore, for convex combination weights c_2, ..., c_K ≥ 0 such that Σ_{j=2}^K c_j = 1,

    ‖Y_1 − Σ_{j=2}^K c_j Y_j‖ ≥ ‖R̄_1 − Σ_{j=2}^K c_j R̄_j‖ ≥ γ_s > 0.

Therefore, the first row vector Y_1 is at least γ_s distant from the convex hull of the remaining rows. Similarly, any row of Y is at least γ_s distant from the convex hull of the remaining rows; hence Y is at least γ_s-simplicial. The rest of the proof is exactly the same as that for Lemma 3.

H. Proof of Lemma 7

Proof. We first show that when R̄ is γ_a-affine-independent and β is separable, Y = R̄ β̄^⊤ is at least γ_a-affine-independent. As in the proof of Lemma 6, we assume that words 1, ..., K are the novel words for topics 1 to K. By definition, β̄^⊤ = [I_K, B], hence Y = R̄ β̄^⊤ = [R̄, R̄B]. For all λ ∈ R^K such that λ ≠ 0 and Σ_{k=1}^K λ_k = 0,

    ‖Σ_{k=1}^K λ_k Y_k‖_2 / ‖λ‖_2 ≥ ‖Σ_{k=1}^K λ_k R̄_k‖_2 / ‖λ‖_2 ≥ γ_a.

Hence Y is affine-independent. The rest of the proof is exactly the same as that for Lemma 4. We note that once the novel words for the K topics are detected, we can use only the corresponding columns of E for the linear regression. Formally, let E* be the W × K matrix formed by the columns of E that correspond to K distinct novel words. Then E* = β̄ R̄.
The rest of the proof is again the same as that for Lemma 4.

I. Proof of Lemma 8

Proof. We first check that if $q_w > 0$, then $w$ must be a novel word. Without loss of generality, let words $1, \ldots, K$ be novel words of the $K$ distinct topics. For all $w$, $E_w = \sum_k \bar{\beta}_{wk} E_k$. Hence, for all $d \in \mathbb{R}^W$,
$$
\langle E_w, d \rangle = \sum_k \bar{\beta}_{wk} \langle E_k, d \rangle \le \max_k \langle E_k, d \rangle,
$$
and the last inequality holds with equality if, and only if, there exists some $k$ such that $\bar{\beta}_{wk} = 1$, which implies that $w$ is a novel word.

We then show that for a novel word $w$, $q_w > 0$. We need to show that for each topic $k$, when $d$ is sampled from an isotropic distribution in $\mathbb{R}^W$, there exists a set of directions $d$ of nonzero probability such that $\langle E_k, d \rangle > \langle E_l, d \rangle$ for $l = 1, \ldots, K$, $l \ne k$. First, one can check by definition that $Y = (E_1^\top, \ldots, E_K^\top)^\top = \bar{R}\bar{\beta}^\top$ is at least $\gamma_s$-simplicial if $\bar{R}$ is $\gamma_s$-simplicial. Let $E_1^*$ be the projection of $E_1$ onto the simplex formed by the remaining row vectors $E_2, \ldots, E_K$. By the orthogonality principle, $\langle E_1 - E_1^*, E_k - E_1^* \rangle \le 0$ for $k = 2, \ldots, K$. Therefore, for $d_1 = E_1^\top - E_1^{*\top}$,
$$
E_1 d_1 - E_k d_1 = \|d_1\|^2 - (E_k - E_1^*) d_1 \ge \gamma_s^2 > 0.
$$
By the continuity of the inner product, there exists a neighborhood on the unit sphere around $d_1 / \|d_1\|_2$ over which $E_1$ attains the maximum projection value. This concludes our proof.

J. Proof of Theorem 2

Proof. We first consider the random projection steps (steps 3 to 12 in Alg. 2). For a projection along direction $d_r$, we first calculate the projection values $v = \bar{X}' \bar{X}^\top d_r$, find the maximizer index $i^*$ and the corresponding set $\hat{J}_{i^*}$, and then evaluate $\mathbb{I}(\forall j \in \hat{J}_w,\ v_w > v_j)$ for all the words $w$ in $\hat{J}_{i^*}^c = \{1, \ldots, W\} \setminus \hat{J}_{i^*}$. (I) The set $\hat{J}_{i^*}^c$ has up to $|\mathcal{C}_k|$ elements asymptotically, where $k$ is the topic associated with word $i^*$.
This is considered a small constant, $O(1)$. (II) Note that $\widehat{E} d_r = M \bar{X}'(\bar{X}^\top d_r)$ and each column of $\bar{X}$ has at most $N \ll W$ nonzero entries. Calculating the $W \times 1$ projection value vector $v$ therefore requires two sparse matrix–vector multiplications and takes $O(MN)$ time. Finding the maximum requires $O(W)$ running time. (III) To evaluate one set $\hat{J}_i \leftarrow \{ j : \widehat{E}_{i,i} + \widehat{E}_{j,j} - 2\widehat{E}_{i,j} \ge \zeta/2 \}$, we need to calculate $\widehat{E}_{i,j}$ for $j = 1, \ldots, W$. This can be viewed as projecting $\widehat{E}$ along $d = \mathbf{e}_i$ and takes $O(MN)$ time. We also note that the diagonal entries $\widehat{E}_{w,w}$, $w = 1, \ldots, W$, can be calculated once using $O(W)$ time. To sum up, these steps take $O(MNP + WP)$ running time.

We then consider the detecting and clustering steps (steps 14 to 21 in Alg. 2). All the conditions in step 17 have been calculated in the previous steps, and since the number of novel words per topic is a small constant, this step requires $O(K^2)$ running time.

We last consider the topic estimation steps in Algorithm 3. Here all the corresponding inputs for the linear regression have already been computed in the projection step. Each linear regression has $K$ variables, and we upper bound its running time by $O(K^3)$. Calculating the row-normalization factors $\frac{1}{M}\mathbf{X}\mathbf{1}$ requires $O(MN)$ time. The row and column re-normalizations each require at most $O(WK)$ running time. Overall, we need $O(WK^3 + MN)$ running time.

The other steps are also efficient. Splitting each document into two independent halves takes time linear in $N$ for each document, since we can achieve it using a random permutation over $N$ items. Generating each random direction $d_r$ requires $O(W)$ time if we use the spherical Gaussian prior.
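To make the running-time claims concrete, the per-direction work of steps 3 to 12 reduces to two sparse matrix–vector products, and each set $\hat{J}_i$ is a thresholded row of the distance statistic. The sketch below is our own illustrative code (assuming `scipy.sparse` inputs), not the paper's reference implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

def projection_values(Xbar, Xbar_p, d):
    """Projection of the (scaled) co-occurrence matrix along direction d
    without materializing the W x W matrix: Xbar_p @ (Xbar.T @ d) costs
    O(M N) when each column of the (W, M) sparse matrices has at most
    N << W nonzero entries."""
    return Xbar_p @ (Xbar.T @ d)

def J_set(E_hat, i, zeta):
    """J_i = { j : E_ii + E_jj - 2 E_ij >= zeta / 2 }.  The row E_hat[i]
    can itself be obtained by projecting along the basis vector e_i."""
    diag = np.diag(E_hat)                       # computed once, O(W)
    stat = E_hat[i, i] + diag - 2.0 * E_hat[i]  # distance statistic
    return np.flatnonzero(stat >= zeta / 2.0)
```

Given the projection values, the maximizer $i^*$ is found with a single $O(W)$ pass, e.g. `i_star = int(np.argmax(projection_values(Xbar, Xbar_p, d)))`.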
While we could directly sort the empirical estimated solid angles (in $O(W \log W)$ time), we only search for the words with the largest solid angles, whose number is a constant with respect to $W$; therefore this takes only $O(W)$ time.

K. Proof of Theorem 3

We focus on the case where the random projection directions are sampled from an arbitrary isotropic distribution. Our proof is not tied to the specific form of the distribution, only to its isotropic nature. We first provide some useful propositions. We denote by $\mathcal{C}_k$ the set of all novel words of topic $k$, for $k \in [K]$, and by $\mathcal{C}_0$ the set of all non-novel words. We first show:

Proposition 4. Let $E_i$ be the $i$-th row of $E$. Suppose $\beta$ is separable and $\bar{R}$ is $\gamma_s$-simplicial. Then, for all $k \in [K]$:
- if $i \in \mathcal{C}_k$ and $j \in \mathcal{C}_k$, then $\|E_i - E_j\| = 0$ and $E_{i,i} - 2E_{i,j} + E_{j,j} = 0$;
- if $i \in \mathcal{C}_k$ and $j \notin \mathcal{C}_k$, then $\|E_i - E_j\| \ge (1-b)\gamma_s$ and $E_{i,i} - 2E_{i,j} + E_{j,j} \ge (1-b)^2 \gamma_s^2 / \lambda_{\max}$;

where $b = \max_{j \in \mathcal{C}_0,\, l} \bar{\beta}_{j,l}$ and $\lambda_{\max} > 0$ is the maximum eigenvalue of $\bar{R}$.

Proof. We focus on the case $k = 1$, since the proofs for the other values of $k$ are analogous. Let $\bar{\beta}_i$ be the $i$-th row vector of the matrix $\bar{\beta}$. To show the above results, recall that $E = \bar{\beta}\bar{R}\bar{\beta}^\top$. Then
$$
\|E_i - E_j\| = \|(\bar{\beta}_i - \bar{\beta}_j)\bar{R}\bar{\beta}^\top\|, \qquad
E_{i,i} - 2E_{i,j} + E_{j,j} = (\bar{\beta}_i - \bar{\beta}_j)\bar{R}(\bar{\beta}_i - \bar{\beta}_j)^\top.
$$
It is clear that when $i, j \in \mathcal{C}_1$, i.e., they are both novel words of the same topic, $\bar{\beta}_i = \bar{\beta}_j = \mathbf{e}_1$. Hence $\|E_i - E_j\| = 0$ and $E_{i,i} - 2E_{i,j} + E_{j,j} = 0$. When $i \in \mathcal{C}_1$ and $j \notin \mathcal{C}_1$, we have $\bar{\beta}_i = [1, 0, \ldots, 0]$ and $\bar{\beta}_j = [\bar{\beta}_{j,1}, \bar{\beta}_{j,2}, \ldots, \bar{\beta}_{j,K}]$ with $\bar{\beta}_{j,1} < 1$. Then
$$
\bar{\beta}_i - \bar{\beta}_j = [1 - \bar{\beta}_{j,1}, -\bar{\beta}_{j,2}, \ldots, -\bar{\beta}_{j,K}] = (1 - \bar{\beta}_{j,1})[1, -c_2, \ldots, -c_K] := (1 - \bar{\beta}_{j,1})\,\mathbf{e}^\top,
$$
with $\sum_{l=2}^{K} c_l = 1$.
Therefore, defining $Y := \bar{R}\bar{\beta}^\top$, we get
$$
\|E_i - E_j\|_2 = (1 - \bar{\beta}_{j,1}) \Big\| Y_1 - \sum_{l=2}^{K} c_l Y_l \Big\|_2.
$$
Noting that $Y$ is at least $\gamma_s$-simplicial, we have $\|E_i - E_j\|_2 \ge (1-b)\gamma_s$, where $b = \max_{j \in \mathcal{C}_0,\, k} \bar{\beta}_{j,k} < 1$. Similarly, note that $\|\mathbf{e}^\top \bar{R}\| \ge \gamma_s$, and let $\bar{R} = U\Sigma U^\top$ be its singular value decomposition. If $\lambda_{\max}$ is the maximum eigenvalue of $\bar{R}$, then we have
$$
E_{i,i} - 2E_{i,j} + E_{j,j} = (1 - \bar{\beta}_{j,1})^2 \, (\mathbf{e}^\top \bar{R})\, U \Sigma^{-1} U^\top (\mathbf{e}^\top \bar{R})^\top \ge (1-b)^2 \gamma_s^2 / \lambda_{\max}.
$$
The inequality in the last step follows from the observation that $\mathbf{e}^\top \bar{R}$ lies within the column space spanned by $U$.

The results in Proposition 4 provide two statistics for identifying novel words of the same topic: $\|E_i - E_j\|$ and $E_{i,i} - 2E_{i,j} + E_{j,j}$. While the first is straightforward, the latter is more efficient to calculate in practice, with better computational complexity. Specifically, its empirical version, the set $\mathcal{J}_i$ in Algorithm 2,
$$
\mathcal{J}_i = \{ j : \widehat{E}_{i,i} - \widehat{E}_{i,j} - \widehat{E}_{j,i} + \widehat{E}_{j,j} \ge d/2 \},
$$
can be used to discover the set of novel words of the same topic asymptotically. Formally:

Proposition 5. If $\|\widehat{E} - E\|_\infty \le (1-b)^2 \gamma_s^2 / (8\lambda_{\max})$, then
1) for a novel word $i \in \mathcal{C}_k$, $\mathcal{J}_i = \mathcal{C}_k^c$;
2) for a non-novel word $j \in \mathcal{C}_0$, $\mathcal{J}_j \supset \mathcal{C}_0^c$.

Now we show that Algorithm 2 can detect all the novel words of the $K$ distinct topics consistently. As illustrated in Lemma 8, we detect the novel words by ranking the solid angles $q_i$. We denote the minimum solid angle of the $K$ extreme points by $q_\wedge$. Our proof shows that the estimated solid angle in Eq. (5),
$$
\hat{p}_i = \frac{1}{P} \sum_{r=1}^{P} \mathbb{I}\{ \forall j \in \mathcal{J}_i,\ \widehat{E}_j d_r \le \widehat{E}_i d_r \}, \tag{17}
$$
converges to the ideal solid angle
$$
q_i = \Pr\{ \forall j \in \mathcal{S}(i),\ (E_i - E_j) d \ge 0 \} \tag{18}
$$
as $M, P \to \infty$, where $d_1, \ldots, d_P$ are i.i.d. directions drawn from an isotropic distribution. For a novel word $i \in \mathcal{C}_k$, $k = 1, \ldots$
$, K$, let $\mathcal{S}(i) = \mathcal{C}_k^c$, and for a non-novel word $i \in \mathcal{C}_0$, let $\mathcal{S}(i) = \mathcal{C}_0^c$. To show the convergence of $\hat{p}_i$ to $q_i$, we consider an intermediate quantity,
$$
p_i(\widehat{E}) = \Pr\{ \forall j \in \mathcal{J}_i,\ (\widehat{E}_i - \widehat{E}_j) d \ge 0 \}.
$$
First, by Hoeffding's lemma, we have the following result.

Proposition 6. For all $t \ge 0$ and all $i$,
$$
\Pr\{ |\hat{p}_i - p_i(\widehat{E})| \ge t \} \le 2\exp(-2Pt^2). \tag{19}
$$

Next we show the convergence of $p_i(\widehat{E})$ to the solid angle $q_i$:

Proposition 7. Consider the case when $\|\widehat{E} - E\|_\infty \le d/8$ and $\bar{R}$ is $\gamma_s$-simplicial. If $i$ is a novel word, then
$$
q_i - p_i(\widehat{E}) \le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty.
$$
Similarly, if $j$ is a non-novel word, we have
$$
p_j(\widehat{E}) - q_j \le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty,
$$
where $d_2 \triangleq (1-b)\gamma_s$ and $d \triangleq (1-b)^2 \gamma_s^2 / \lambda_{\max}$.

Proof. First note that, by the definition of $\mathcal{J}_i$ and Proposition 4, if $\|\widehat{E} - E\|_\infty \le d/8$, then for a novel word $i \in \mathcal{C}_k$, $\mathcal{J}_i = \mathcal{S}(i)$, and for a non-novel word $i \in \mathcal{C}_0$, $\mathcal{J}_i \supseteq \mathcal{S}(i)$. For convenience, let
$$
A_j = \{ d : (\widehat{E}_i - \widehat{E}_j) d \ge 0 \}, \quad A = \bigcap_{j \in \mathcal{J}_i} A_j, \qquad
B_j = \{ d : (E_i - E_j) d \ge 0 \}, \quad B = \bigcap_{j \in \mathcal{S}(i)} B_j.
$$
For $i$ a novel word, we consider
$$
q_i - p_i(\widehat{E}) = \Pr\{B\} - \Pr\{A\} \le \Pr\{B \cap A^c\}.
$$
Noting that $\mathcal{J}_i = \mathcal{S}(i)$ when $\|\widehat{E} - E\|_\infty \le d/8$,
$$
\Pr\{B \cap A^c\}
= \Pr\Big\{ B \cap \Big( \bigcup_{j \in \mathcal{S}(i)} A_j^c \Big) \Big\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\Big\{ \Big( \bigcap_{l \in \mathcal{S}(i)} B_l \Big) \cap A_j^c \Big\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\{ B_j \cap A_j^c \}
= \sum_{j \in \mathcal{S}(i)} \Pr\{ (\widehat{E}_i - \widehat{E}_j) d < 0 \ \text{and}\ (E_i - E_j) d \ge 0 \}
= \sum_{j \in \mathcal{S}(i)} \frac{\phi_j}{2\pi},
$$
where $\phi_j$ is the angle between $e_j = E_i - E_j$ and $\hat{e}_j = \widehat{E}_i - \widehat{E}_j$, for any isotropic distribution on $d$. Noting that $\phi \le \tan(\phi)$,
$$
\Pr\{B \cap A^c\} \le \sum_{j \in \mathcal{S}(i)} \frac{\tan(\phi_j)}{2\pi}
\le \sum_{j \in \mathcal{S}(i)} \frac{1}{2\pi} \frac{\|\hat{e}_j - e_j\|_2}{\|e_j\|_2}
\le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty,
$$
where the last inequality follows from the relationship between the $\ell_\infty$ and $\ell_2$ norms, and the fact that for $j \in \mathcal{S}(i)$, $\|e_j\|_2 = \|E_i - E_j\|_2 \ge d_2 \triangleq (1-b)\gamma_s$.
Therefore, for a novel word $i$, we have
$$
q_i - p_i(\widehat{E}) \le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty.
$$
Similarly, for a non-novel word $i \in \mathcal{C}_0$, $\mathcal{J}_i \supseteq \mathcal{S}(i)$, and
$$
p_i(\widehat{E}) - q_i = \Pr\{A\} - \Pr\{B\} = \Pr\{A \cap B^c\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\Big\{ \Big( \bigcap_{l \in \mathcal{J}_i} A_l \Big) \cap B_j^c \Big\}
\le \sum_{j \in \mathcal{S}(i)} \Pr\{ A_j \cap B_j^c \}
\le \frac{W\sqrt{W}}{\pi d_2} \|\widehat{E} - E\|_\infty.
$$
A direct implication of Proposition 7 is:

Proposition 8. For all $\epsilon > 0$, let $\rho = \min\{ d/8,\ \pi d_2 \epsilon / W^{1.5} \}$. If $\|\widehat{E} - E\|_\infty \le \rho$, then $q_i - p_i(\widehat{E}) \le \epsilon$ for a novel word $i$, and $p_j(\widehat{E}) - q_j \le \epsilon$ for a non-novel word $j$.

We now prove Theorem 3. In order to correctly detect all the novel words of the $K$ distinct topics, we decompose the error event into the union of the following two types:
1) Sorting error, i.e., $\exists i \in \bigcup_{k=1}^{K} \mathcal{C}_k$ and $\exists j \in \mathcal{C}_0$ such that $\hat{p}_i < \hat{p}_j$. This event is denoted by $A_{i,j}$; let $A = \bigcup A_{i,j}$.
2) Clustering error, i.e., $\exists k$ and $\exists i, j \in \mathcal{C}_k$ such that $i \notin \mathcal{J}_j$. This event is denoted by $B_{i,j}$; let $B = \bigcup B_{i,j}$.
We point out that the events $A, B$ here are different from the notation used in Proposition 7. Following Proposition 8, we define $\rho = \min\{ d/8,\ \pi d_2 q_\wedge / (4 W^{1.5}) \}$ and the event $C = \{ \|E - \widehat{E}\|_\infty \ge \rho \}$. We note that $B \subset C$. Therefore,
$$
P_e = \Pr\{A \cup B\} \le \Pr\{A \cap C^c\} + \Pr\{C\}
\le \sum_{i\ \mathrm{novel},\ j\ \mathrm{non\text{-}novel}} \Pr\{A_{i,j} \cap C^c\} + \Pr\{C\}
\le \sum_{i,j} \Pr\big( \{\hat{p}_i - \hat{p}_j < 0\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big) + \Pr\big( \|\widehat{E} - E\|_\infty > \rho \big).
$$
The second term can be bounded by Lemma 5. Now we focus on the first term.
Note that
$$
\hat{p}_i - \hat{p}_j = \{\hat{p}_i - p_i(\widehat{E})\} + \{p_i(\widehat{E}) - q_i\} + \{p_j(\widehat{E}) - \hat{p}_j\} + \{q_j - p_j(\widehat{E})\} + q_i - q_j,
$$
and use the fact that $q_i - q_j \ge q_\wedge$. Then
$$
\Pr\big( \{\hat{p}_i < \hat{p}_j\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
\le \Pr\big( p_i(\widehat{E}) - \hat{p}_i \ge q_\wedge/4 \big) + \Pr\big( \hat{p}_j - p_j(\widehat{E}) \ge q_\wedge/4 \big)
$$
$$
\quad + \Pr\big( \{q_i - p_i(\widehat{E}) \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
+ \Pr\big( \{p_j(\widehat{E}) - q_j \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
$$
$$
\le 2\exp(-P q_\wedge^2 / 8)
+ \Pr\big( \{q_i - p_i(\widehat{E}) \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big)
+ \Pr\big( \{p_j(\widehat{E}) - q_j \ge q_\wedge/4\} \cap \{\|\widehat{E} - E\|_\infty \le \rho\} \big).
$$
The last inequality is by Proposition 6. By Proposition 8, the last two terms are $0$. Therefore, applying Lemma 5, we obtain
$$
P_e \le 2W^2 \exp(-P q_\wedge^2 / 8) + 8W^2 \exp(-\rho^2 \eta^4 MN / 20).
$$
This concludes the proof of Theorem 3.

L. Proof of Theorem 4

Without loss of generality, let $1, \ldots, K$ be the novel words of topics $1$ to $K$. We first consider the solution of the constrained linear regression. To simplify the notation, we let $E_i = [E_{i,1}, \ldots, E_{i,K}]$ denote the first $K$ entries of a row vector, without the superscripts used in Algorithm 3.

Proposition 9. Let $\bar{R}$ be $\gamma_a$-affine-independent. The solution to the following optimization problem,
$$
\hat{b}^* = \arg\min_{b_j \ge 0,\ \sum b_j = 1} \Big\| \widehat{E}_i - \sum_{j=1}^{K} b_j \widehat{E}_j \Big\|,
$$
converges to the $i$-th row of $\bar{\beta}$, $\bar{\beta}_i$, as $M \to \infty$. Moreover,
$$
\Pr\big( \|\hat{b}^* - \bar{\beta}_i\|_\infty \ge \epsilon \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{320K} \Big),
$$
where $\eta$ is defined as in Lemma 5.

Proof. We note that $\bar{\beta}_i$ is the optimal solution to the following problem with the ideal word co-occurrence statistics:
$$
b^* = \arg\min_{b_j \ge 0,\ \sum b_j = 1} \Big\| E_i - \sum_{j=1}^{K} b_j E_j \Big\|.
$$
Define $f(E, b) = \| E_i - \sum_{j=1}^{K} b_j E_j \|$ and note that $f(E, b^*) = 0$. Let $Y = [E_1^\top, \ldots, E_K^\top]^\top$.
Then,
$$
f(E, b) - f(E, b^*) = \Big\| E_i - \sum_{j=1}^{K} b_j E_j \Big\| - 0 = \Big\| \sum_{j=1}^{K} (b_j - b^*_j) E_j \Big\| = \sqrt{(b - b^*) Y Y^\top (b - b^*)^\top} \ge \|b - b^*\|\, \gamma_a.
$$
The last inequality holds by the definition of affine-independence. Next, note that
$$
| f(E, b) - f(\widehat{E}, b) | \le \Big\| E_i - \widehat{E}_i + \sum_j b_j (\widehat{E}_j - E_j) \Big\| \le \| E_i - \widehat{E}_i \| + \sum_j b_j \| \widehat{E}_j - E_j \| \le 2 \max_w \| \widehat{E}_w - E_w \|.
$$
Combining the above inequalities, we obtain
$$
\| \hat{b}^* - b^* \| \le \frac{1}{\gamma_a} \{ f(E, \hat{b}^*) - f(E, b^*) \}
= \frac{1}{\gamma_a} \{ f(E, \hat{b}^*) - f(\widehat{E}, \hat{b}^*) + f(\widehat{E}, \hat{b}^*) - f(\widehat{E}, b^*) + f(\widehat{E}, b^*) - f(E, b^*) \}
$$
$$
\le \frac{1}{\gamma_a} \{ f(E, \hat{b}^*) - f(\widehat{E}, \hat{b}^*) + f(\widehat{E}, b^*) - f(E, b^*) \}
\le \frac{4 K^{0.5}}{\gamma_a} \| \widehat{E} - E \|_\infty,
$$
where the last term converges to $0$ almost surely. The convergence rate follows directly from Lemma 5.

We next consider the row renormalization. Let $\hat{b}^*(i)$ be the optimal solution in Proposition 9 for the $i$-th word, and consider
$$
\widehat{B}_i := \hat{b}^*(i)^\top \Big( \frac{1}{M} \mathbf{X} \mathbf{1}_{M \times 1} \Big)_i \to \beta_i \,\mathrm{diag}(a). \tag{20}
$$
To obtain the convergence rate of the above equation, it is straightforward to apply the result of Lemma 5.

Proposition 10. For the row-scaled estimate $\widehat{B}_i$ as in Eq. (20), we have
$$
\Pr\big( | \widehat{B}_{i,k} - \beta_{i,k} a_k | \ge \epsilon \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280K} \Big).
$$

Proof. By Proposition 9, we have
$$
\Pr\big( | \hat{b}^*(i)_k - \bar{\beta}_{i,k} | \ge \epsilon/2 \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280K} \Big).
$$
Recall that in Lemma 5, by McDiarmid's inequality, we have
$$
\Pr\Big( \Big| \frac{1}{M} (\mathbf{X} \mathbf{1}_{M \times 1})_i - (\beta a)_i \Big| \ge \epsilon/2 \Big) \le \exp(-\epsilon^2 MN / 2).
$$
Therefore,
$$
\Pr\big( | \widehat{B}_{i,k} - \beta_{i,k} a_k | \ge \epsilon \big) \le 8W^2 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280K} \Big) + \exp(-\epsilon^2 MN / 2),
$$
where the second term is dominated by the first.

Finally, we consider the column normalization step to remove the effect of $\mathrm{diag}(a)$:
$$
\hat{\beta}_{i,k} := \widehat{B}_{i,k} \Big/ \sum_{w=1}^{W} \widehat{B}_{w,k}, \tag{21}
$$
and $\sum_{w=1}^{W} \widehat{B}_{w,k} \to a_k$ for $k = 1, \ldots, K$.
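For concreteness, the simplex-constrained regression of Proposition 9 and the column renormalization of Eq. (21) can be sketched in numpy as follows. The choice of projected gradient descent with a Euclidean simplex projection is our own illustrative solver; the analysis above only requires that the constrained least-squares problem be solved exactly:

```python
import numpy as np

def project_simplex(b):
    """Euclidean projection onto the probability simplex
    { b : b >= 0, sum(b) = 1 }."""
    u = np.sort(b)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, b.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(b - theta, 0.0)

def simplex_regression(Ei, Y, iters=2000):
    """Minimize || Ei - b @ Y || over the simplex by projected gradient.
    Y : (K, K) matrix whose rows are E_1, ..., E_K;  Ei : (K,) row E_i."""
    K = Y.shape[0]
    lr = 1.0 / np.linalg.norm(Y @ Y.T, 2)     # step = 1 / Lipschitz const.
    b = np.full(K, 1.0 / K)                   # start at the barycenter
    for _ in range(iters):
        grad = (b @ Y - Ei) @ Y.T             # gradient of 0.5 * ||.||^2
        b = project_simplex(b - lr * grad)
    return b

def renormalize(B_hat):
    """Column normalization of Eq. (21): divide each column by its sum
    to remove the effect of diag(a)."""
    return B_hat / B_hat.sum(axis=0, keepdims=True)
```

Since the objective is a convex quadratic over a compact convex set, this projected-gradient sketch converges to the constrained minimizer $\hat{b}^*$ studied above.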
A worst-case analysis of its convergence is
$$
\Pr\Big( \Big| \sum_{w=1}^{W} \widehat{B}_{w,k} - a_k \Big| > \epsilon \Big)
\le W \Pr\big( | \widehat{B}_{i,k} - \beta_{i,k} a_k | \ge \epsilon/W \big)
\le 8W^3 \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 \eta^4}{1280 W^2 K} \Big).
$$
Combining all the results above, we can show that, for all $i = 1, \ldots, W$ and $k = 1, \ldots, K$,
$$
\Pr\big( | \hat{\beta}_{i,k} - \beta_{i,k} | > \epsilon \big) \le 8W^4 K \exp\Big( -\frac{\epsilon^2 MN \gamma_a^2 a_{\min}^2 \eta^4}{2560 W^2 K} \Big),
$$
where $a_{\min} > 0$ is the minimum entry of $a$. This concludes the proof of Theorem 4.

M. Proof of Lemma 9

Proof. We first show that irreducibility implies separability or, equivalently, that if the collection is not separable, then it is not irreducible. Suppose that $\{\nu_1, \ldots, \nu_K\}$ is not separable. Then there exist some $k \in [K]$ and a $\delta > 0$ such that
$$
\inf_{A :\ \nu_k(A) > 0}\ \max_{j : j \ne k} \frac{\nu_j(A)}{\nu_k(A)} = \delta > 0.
$$
Then, for all $A \in \mathcal{F}$ with $\nu_k(A) > 0$, $\max_{j : j \ne k} \nu_j(A)/\nu_k(A) \ge \delta$. This implies that, for all $A \in \mathcal{F}$ with $\nu_k(A) > 0$,
$$
\sum_{j : j \ne k} \nu_j(A) - \delta \nu_k(A) \ge 0.
$$
On the other hand, for all $A \in \mathcal{F}$ with $\nu_k(A) = 0$, we have
$$
\sum_{j : j \ne k} \nu_j(A) - \delta \nu_k(A) = \sum_{j : j \ne k} \nu_j(A) \ge 0.
$$
Thus the linear combination $\sum_{j \ne k} \nu_j - \delta \nu_k$, which has one strictly negative coefficient $-\delta$, is nonnegative over all measurable $A$. This implies that the collection of measures $\{\nu_1, \ldots, \nu_K\}$ is not irreducible.

We next show that separability implies irreducibility. If the collection of measures $\{\nu_1, \ldots, \nu_K\}$ is separable, then by the definition of separability, for every $k$ there exist $A_n^{(k)} \in \mathcal{F}$, $n = 1, 2, \ldots$, such that $\nu_k(A_n^{(k)}) > 0$ and, for all $j \ne k$, $\nu_j(A_n^{(k)}) / \nu_k(A_n^{(k)}) \to 0$ as $n \to \infty$. Now consider any linear combination of measures $\sum_{i=1}^{K} c_i \nu_i$ that is nonnegative over all measurable sets, i.e., $\sum_{i=1}^{K} c_i \nu_i(A) \ge 0$ for all $A \in \mathcal{F}$. Then for all $k = 1, \ldots$
$, K$ and all $n \ge 1$, we have
$$
\sum_{i=1}^{K} c_i \nu_i(A_n^{(k)}) \ge 0
\;\Rightarrow\; \nu_k(A_n^{(k)}) \Big( c_k + \sum_{j \ne k} c_j \frac{\nu_j(A_n^{(k)})}{\nu_k(A_n^{(k)})} \Big) \ge 0
\;\Rightarrow\; c_k \ge -\sum_{j \ne k} c_j \frac{\nu_j(A_n^{(k)})}{\nu_k(A_n^{(k)})} \to 0 \quad \text{as } n \to \infty.
$$
Therefore $c_k \ge 0$ for all $k$, and the collection of measures is irreducible.

REFERENCES

[1] D. Blei, "Probabilistic topic models," Commun. of the ACM, vol. 55, no. 4, pp. 77–84, 2012.
[2] W. Ding, M. Rohban, P. Ishwar, and V. Saligrama, "A new geometric approach to latent topic modeling and discovery," in IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013.
[3] W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, "Topic discovery through data dependent and random projections," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, USA, Jun. 2013.
[4] W. Ding, P. Ishwar, M. H. Rohban, and V. Saligrama, "Necessary and sufficient conditions for novel word detection in separable topic models," in Advances in Neural Information Processing Systems (NIPS), Workshop on Topic Models: Computation, Application, Lake Tahoe, NV, USA, Dec. 2013.
[5] W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, "Efficient distributed topic modeling with provable guarantees," in Proc. of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, Apr. 2014.
[6] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003.
[7] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, "A practical algorithm for topic modeling with provable guarantees," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, USA, Jun. 2013.
[8] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[9] S. Arora, R.
Ge, and A. Moitra, "Learning topic models – going beyond SVD," in Proc. of the IEEE 53rd Annual Symposium on Foundations of Computer Science, New Brunswick, NJ, USA, Oct. 2012.
[10] D. Sontag and D. Roy, "Complexity of inference in latent Dirichlet allocation," in NIPS, 2011, pp. 1008–1016.
[11] T. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, pp. 5228–5235, 2004.
[12] M. Wainwright and M. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[13] E. Airoldi, D. Blei, E. Erosheva, and S. Fienberg, Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC, 2014.
[14] A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade, "A tensor approach to learning mixed membership community models," J. Mach. Learn. Res., vol. 15, no. 1, pp. 2239–2312, 2014.
[15] A. Kumar, V. Sindhwani, and P. Kambadur, "Fast conical hull algorithms for near-separable non-negative matrix factorization," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, Jun. 2013.
[16] W. Ding, P. Ishwar, and V. Saligrama, "Most large topic models are approximately separable," in ITA, 2015.
[17] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57.
[18] D. Blei and J. Lafferty, "A correlated topic model of science," The Annals of Applied Statistics, vol. 1, no. 1, pp. 17–35, 2007.
[19] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno, "Evaluation methods for topic models," in Proc. of the 26th International Conference on Machine Learning, Montreal, Canada, Jun. 2009.
[20] M. Hoffman, F. R. Bach, and D. M.
Blei, "Online learning for latent Dirichlet allocation," in Advances in Neural Information Processing Systems, 2010, pp. 856–864.
[21] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.
[22] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.
[23] B. Recht, C. Re, J. Tropp, and V. Bittorf, "Factoring nonnegative matrices with linear programs," in Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, Dec. 2012, pp. 1223–1231.
[24] S. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM J. on Optimization, vol. 20, no. 3, pp. 1364–1377, Oct. 2009.
[25] A. Anandkumar, D. Hsu, A. Javanmard, and S. Kakade, "Learning linear Bayesian networks with latent variables," in Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, Jun. 2013.
[26] P. Awasthi and A. Risteski, "On some provably correct cases of variational inference for topic models," arXiv:1503.06567 [cs.LG], 2015.
[27] T. Bansal, C. Bhattacharyya, and R. Kannan, "A provable SVD-based algorithm for learning topics in dominant admixture corpus," in Advances in Neural Information Processing Systems, 2014, pp. 1997–2005.
[28] J. Boardman, "Automating spectral unmixing of AVIRIS data using convex geometry concepts," in Proc. Ann. JPL Airborne Geoscience Workshop, 1993, p. 1114.
[29] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press, 2004, pp. 1141–1148.
[30] N. Gillis and S. A. Vavasis, "Fast and robust recursive algorithms for separable nonnegative matrix factorization," IEEE Trans.
on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 698–714, 2014.
[31] V. Farias, S. Jagabathula, and D. Shah, "A data-driven approach to modeling choice," in Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. 2009.
[32] W. Ding, P. Ishwar, and V. Saligrama, "A topic modeling approach to ranking," in Proc. of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, May 2015.
[33] A. McCallum, "MALLET: A machine learning for language toolkit," 2002, http://mallet.cs.umass.edu.
[34] D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004.
[35] L. Yao, D. Mimno, and A. McCallum, "Efficient methods for topic model inference on streaming document collections," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 937–946.
[36] D. Newman, A. Asuncion, P. Smyth, and M. Welling, "Distributed algorithms for topic models," The Journal of Machine Learning Research, vol. 10, pp. 1801–1828, 2009.
[37] A. Asuncion, P. Smyth, and M. Welling, "Asynchronous distributed learning of topic models," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 81–88.
[38] G. Blanchard and C. Scott, "Decontamination of mutually contaminated models," in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014, pp. 1–9.
[39] C. Scott, "A rate of convergence for mixture proportion estimation, with application to learning from noisy labels," in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015, pp. 838–846.
[40] W. Ding, P. Ishwar, and V.
Saligrama, "Learning mixed membership Mallows model from pairwise comparisons," arXiv:1504.00757 [cs.LG], 2015.
