Model-Parallel Inference for Big Topic Models
Xun Zheng*, Jin Kyu Kim*, Qirong Ho†, Eric P. Xing*

Abstract

In real-world industrial applications of topic modeling, the ability to capture a gigantic conceptual space by learning an ultra-high-dimensional topical representation, i.e., the so-called "big model", is becoming the next desideratum after the enthusiasm for "big data", especially for fine-grained downstream tasks such as online advertising, where good performance is usually achieved by regression-based predictors built on millions if not billions of input features. The conventional data-parallel approach for training gigantic topic models turns out to be rather inefficient in utilizing the power of parallelism, due to its heavy dependency on a centralized image of the "model". Big model size also poses a storage challenge, since the feasible model size is bounded by the smallest RAM among the nodes. To address these issues, we explore another type of parallelism, namely model-parallelism, which enables training of disjoint blocks of a big topic model in parallel. By integrating data-parallelism with model-parallelism, we show that dependencies between distributed elements can be handled seamlessly, achieving not only faster convergence but also the ability to tackle significantly bigger model sizes. We describe an architecture for model-parallel inference of LDA, and present a variant of the collapsed Gibbs sampling algorithm tailored for it. Experimental results demonstrate the ability of this system to handle topic modeling with an unprecedented 200 billion model variables on a low-end cluster with very limited computational resources and bandwidth.
Keywords: Machine Learning, Topic Models, Large-scale Systems, Distributed ML

1 Introduction

Recent advances in storage and network technology have brought a new era of "big data", where efficient processing and distillation of massive datasets has become one of the major pursuits in the field of machine learning (ML). Numerous algorithms and systems have been proposed to scale up ML for various tasks. However, many of the existing systems positioned for Big ML, such as MapReduce [5] or Spark [21], resort to data-parallelism, based on the assumption that the tasks associated with partitions of the data are independent, and/or pose only mild reliance on synchronization. Such assumptions are indeed valid for a majority of "data processing" tasks, such as keyword extraction from huge log files or conventional database operations that typically sweep the data only once. Different from traditional data processing, many machine learning algorithms are not well suited for data-parallelism, due to the coupling of distributed elements through "shared states" such as model parameters, latent variables, or other intermediate states; for simplicity, we refer to such entities as the "model" underlying the data. A clear dichotomy between data (which is conditionally independent and persistent throughout the process of training) and model (which is internally coupled, and is transient before converging to an optimum), together with the need for an iterative procedure to learn the model from the data, is the hallmark of machine learning programs.
For example, in the LDA topic model [3], the model to be extracted from the data consists of a large collection of subspace bases, i.e., latent topic vectors, which are shared among all the documents; for each of these bases, the elements are coupled by normality and non-negativity constraints; and estimators of all such bases have no closed form and must be approximated through iterative procedures. All of these render any trivial parallel treatment of model and data elements impossible.

* School of Computer Science, Carnegie Mellon University, USA 15213. Email: {xunzheng,jinkyuk,epxing}@cs.cmu.edu
† Institute for Infocomm Research, A*STAR, Singapore 138632. Email: hoqirong@gmail.com

To achieve efficient distributed topic modeling under dependency constraints via a data-parallel scheme, the following approaches have been commonly considered. 1) Exploiting approximate independencies among sub-tasks: for example, in [15, 22] the variational inference algorithm is decomposed into independent subtasks and parallelized. This strategy in fact amounts to a Bulk Synchronous Parallel (BSP) computation model, whose drawback is evident: the many synchronization barriers needed to ensure logical correctness can result in a large number of idle cycles, making high scalability hard to achieve. 2) Fine-grained locking: one can employ various locking mechanisms on shared variables to prevent the readers-writers problem. However, this is only viable in shared-memory settings. An early version of the GraphLab [14] engine implemented a sophisticated fine-grained locking mechanism in distributed settings, though at the expense of encoding the model into a graph. 3) Brute-force parallelization: in this case, no specific action is taken to prevent errors from being introduced by asynchronous updates.
Some early attempts [16, 2], the current state-of-the-art distributed LDA inference method Yahoo!LDA [1], and recent advances in parameter servers [9, 12] can be viewed as instantiations of this mechanism to some extent. The major problem with this approach is that there is little guarantee on the correctness of the inference procedure. Although recent studies [17, 10] have shown some justifications for this approach, for now the theory only supports simple models (e.g., Gaussians [10]) or requires certain assumptions to hold (e.g., that updates do not overlap too much [17]). Empirically, as we show later, the convergence speed of such error-prone parallelization can be improved significantly if one can eliminate the parallelization error.

Apart from the dependency issue, large-scale topic modeling also poses the challenge of how to accommodate and handle gigantic model sizes, which has received less attention in the literature. Unlike in academic convention, industry-scale applications of topic modeling, for instance online advertising, typically go beyond extracting topics for human interpretation or visualization, and require ultra-high vocabulary sizes and topic dimensions. However, most data-parallel schemes implicitly assume that an image of all shared model states is readily available in each worker process, since it can be extremely expensive to fetch them from remote processes during iterative training steps. This assumption of having a local copy of the model breaks down when facing big model problems. For example, modeling a corpus with a vocabulary of 10^7 terms in a 10^5-dimensional latent space would require 10^12 model variables to be estimated. In real-world applications this is not uncommon, considering feature augmentation (e.g., taking word combinations) and the large conceptual space behind the text.
Since the raw model may already take terabytes of storage, an unstructured data-parallel approach is unlikely to be applicable.

To address these issues, we take advantage of a different type of parallelization mechanism, namely model-parallelism, to complement data-parallelism. Originating from a machine learning perspective, model-parallelism addresses the above problems by carefully scheduling updates based on the dependencies among model states induced by the inference algorithm. Specifically, we make use of the fact that in Gibbs sampling for LDA, shared state access is limited to a small subset of the entire model during the computation of an update from a data sample. In other words, if the subsets are small enough, it is possible to find a class of disjoint subsets whose updates are completely independent of each other. Under the i.i.d. assumption, parallelizing over the disjoint blocks produces exactly the same result as serial execution. Thus model-parallel inference not only preserves inference quality, but also reduces the memory requirement by partitioning both the data and the model space. In fact, we demonstrate the ability to handle topic modeling with 200 billion model variables on a cluster of 64 low-end machines. We also note that model-parallelism is suitable not only for LDA but also for many more machine learning programs. Primitives for more general model-parallelism can be found in [11].

Related works: Various methods have been proposed to enable large-scale inference for topic models. In [22], a MapReduce-based parallelization is presented, making use of the independent tasks in the variational inference algorithm for LDA. The current state-of-the-art distributed inference for LDA [1] resorts to fast background synchronization of the model.
A rotation-scheduling idea has been studied in [19]; however, we target distributed settings, where things become more challenging due to network latency and a smaller degree of parallelism (compared to GPUs). The GraphLab [14, 7] LDA application can be seen as a special case of model-parallelism, where by definition of the graph only non-overlapping subgraphs (i.e., documents and words) are processed simultaneously. A recent study on streaming variational Bayes [4] also proposed a distributed inference algorithm, however specialized to the single-pass scenario.

Here is an outline of the rest of the paper: we begin with a brief introduction to the LDA model and the collapsed Gibbs sampling algorithm in Section 2. In Section 3 we present the big picture and motivation of model-parallel inference for LDA, whose technical details are shown in Section 4. Distributed experiments are conducted in Section 5, and finally Section 6 concludes.

2 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) [3] is a hierarchical Bayesian topic model that learns a low-dimensional representation for a high-dimensional corpus. Because of its ability to capture the latent semantics underlying text, it is widely applied to various real-world tasks such as online advertising and personal recommendation. In recent years, with increasing amounts of data, the need for a larger conceptual space has also been emphasized, posing a challenge in large-scale inference for LDA. In this section, we briefly overview the LDA model definition and its inference using collapsed Gibbs sampling.

2.1 The Model

LDA considers each document as an admixture of K topics, where each topic φ_k is a multinomial distribution over a vocabulary of V words. For each document d, a topic proportion vector θ_d is drawn from Dirichlet(α).
Then for each token n in the document, a topic assignment z_dn is drawn from Multinomial(θ_d), and a word w_dn is drawn from Multinomial(φ_{z_dn}). In fully Bayesian LDA, topics are themselves random samples drawn from a Dirichlet prior, φ_k ~ Dirichlet(β). Given the corpus W = {w_d}_{d=1..D}, where w_d = {w_dn}_{n=1..N_d}, LDA infers the posterior p(Φ, Θ, Z | W). However, exact inference is intractable due to the normalization term, hence approximation methods such as Gibbs sampling come into play. The mixing rate can be further accelerated by integrating out the intermediate Dirichlet variables (Φ, Θ) analytically, yielding the collapsed Gibbs sampling algorithm [8]:

  p(z_dn = k | Z_¬dn) ∝ (C^k_{d,¬n} + α_k) (C^t_{k,¬n} + β_t) / (C_{k,¬n} + Σ_t β_t),   (1)

where t is the word that the current token w_dn maps to; C^k_d is the number of tokens assigned to topic k in document d; C^t_k is the number of times term t has been assigned to topic k; C_k is the total number of tokens assigned to topic k; and the subscript ¬n denotes counts excluding the n-th token.

2.2 Sparse Sampling

The O(K) time complexity of each topic assignment in collapsed Gibbs sampling still leaves room for improvement. In [20], a sub-linear-complexity sampling algorithm that makes use of sparsity is introduced. The motivation comes from the observation that the counts {C^k_d} and {C^t_k} are sparse: only a few out of K entries are non-zero in any document or term. As we shall discuss later, a similar idea is also applicable to model-parallel inference. The fast sampling algorithm starts by decomposing the conditional (1) in the following way:

  p(z_dn = k | Z_¬dn) ∝ A_k + B_k + C_k,   (2)

where

  A_k = α_k β_t / (C_{k,¬n} + Σ_t β_t),
  B_k = C^k_{d,¬n} β_t / (C_{k,¬n} + Σ_t β_t),
  C_k = (α_k + C^k_{d,¬n}) C^t_{k,¬n} / (C_{k,¬n} + Σ_t β_t).

Algorithm 1 Scheduler
1: Initialization: construct the task pool.
2: while true do
3:   dispatch tasks to workers;
4:   rotate tasks;
5: end while

Since A_k is dense, Σ_k A_k can be precomputed in O(K) time and maintained in O(1) time; Σ_k B_k can be precomputed in O(K_d) time, where K_d is the average number of non-zero entries in {C^k_d}; the fractional term in C_k can be precomputed in O(K) time, and Σ_k C_k can be constructed in O(K_t) time by taking advantage of the sparsity of {C^t_k}, whose expected number of non-zero entries is K_t. Note that not only does the construction of the conditional distribution take sub-linear time in K; the sampling procedure can also benefit significantly from the sparsity of {C^t_k} and {C^k_d}, since the B and C buckets contain most of the probability mass. The overall time complexity for sampling one topic assignment z_dn is therefore O(K_d + K_t).

3 Model-Parallel Inference

Data-parallel inference for LDA typically distributes different sets of documents to the workers to perform Gibbs sampling, while sharing a central model across all of them via some synchronization scheme. Albeit powerful in handling large amounts of data, it introduces two potential issues that are less recognized in the literature. 1) It fails to handle huge models. In data-parallel inference, it is natural to assume that an entire copy of the model is available in each worker throughout the inference procedure. However, as mentioned earlier, the need for big models breaks down this assumption: a model with billions of variables can easily exceed any reasonable RAM size these days. 2) It cannot control inconsistency in the shared model. Most data-parallel inference trades correctness for performance. For example, in [1] the shared model is updated by a separate thread cycling over the local model, hoping that the inconsistency does not affect the algorithm by much.
However, this strategy relies heavily on the network condition, as we show in Section 5: for low-bandwidth networks, the effect of inconsistency becomes evident, since the algorithm proceeds without noticing the slow synchronization in the background.

3.1 Dynamic Model Partitioning

A monolithic treatment of the shared model in data-parallel inference often fails to address the "big model" problem, which can arise when 1) a massive number of model variables or parameters are introduced by the statistical model; or 2) huge additional shared data structures are required to assist the inference algorithm. In either case, having a complete copy of the model in every worker is a potential danger, not only because loading the model may fail in the first place, but also because adding computing nodes will not reduce the memory consumption of individual workers.

Our solution to this issue is to partition the shared model into disjoint blocks. This is motivated by the fact that each step of Gibbs sampling only changes a small subset of the entire statistics {C^t_k}, hence a certain degree of parallelization can be achieved on the model side, in addition to the data side. Specifically, since Gibbs sampling steps for two distinct words are nearly independent¹, the V × K word-topic count matrix can be effectively partitioned by words. The outcome of model partitioning is straightforward: it reduces the model size on each worker, and also achieves scalability on the model by allowing more nodes to share the burden. Note that dynamic model partitioning is a complement to data partitioning rather than a replacement. Instead of static placement, it provides more flexibility to the algorithm and ensures that each worker works on the complete set of model blocks over the course of inference, rather than only a subset of them.
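As a concrete illustration, the word-blocking and block rotation just described can be sketched in a few lines of Python. This is a minimal sketch under our own naming; the round-robin block construction is an arbitrary choice, since any disjoint cover of the vocabulary works:

```python
def make_blocks(V, M):
    """Partition the vocabulary {0, ..., V-1} into M disjoint word blocks.

    Each block indexes a disjoint set of rows of the V x K word-topic
    count matrix {C^t_k}, so workers holding different blocks never
    write to the same counts.
    """
    return [list(range(m, V, M)) for m in range(M)]  # round-robin split

def schedule(M):
    """Block held by worker m in round r: rotate by one each round."""
    return [[(m + r) % M for m in range(M)] for r in range(M)]

blocks = make_blocks(V=10, M=3)
sched = schedule(3)
# Disjoint cover: every word belongs to exactly one block.
assert sorted(w for b in blocks for w in b) == list(range(10))
# Over M rounds, every worker visits every block exactly once,
# so each worker touches the complete model once per iteration.
assert all(sorted(sched[r][m] for r in range(3)) == [0, 1, 2]
           for m in range(3))
```

Within any single round the assignments are a permutation of the blocks, which is what makes the per-round updates conflict-free.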
In model-parallel LDA, dynamic partitioning of the model is realized by a scheduler component, as described in Algorithm 1. Specifically, it first divides the V words into M disjoint blocks {V_1, V_2, ..., V_M}.

¹ We discuss the dependency on {C_k} later.

Algorithm 2 Worker
1: while not converged do
2:   receive tasks from scheduler;
3:   request model blocks from kv-store;
4:   Gibbs sampling using (3);
5:   commit new model blocks to kv-store;
6: end while

Each block V_m is assigned to the corresponding worker m as its initial set of tasks. Therefore, worker m only samples tokens z_dn such that w_dn ∈ V_m. Once all the workers have finished sampling their own blocks, the scheduler rotates the blocks to different workers for another round (sub-iteration) of sampling: worker m acquires the block V_{m'}, where m' = (m + 1) mod M. After M rounds of sampling, all topic assignments Z will have been sampled exactly once. This amounts to an iteration over the data, and we repeat the process until convergence.

3.2 On-demand Communication

Synchronization of the shared model is another major issue in data-parallel inference. Existing methods such as [1] have mainly focused on efficient maintenance of the word-topic count matrix {C^t_k}, for example using an asynchronous key-value store to frequently incorporate and distribute updates committed by the workers. However, best-effort synchronization only guarantees eventual consistency, hence the workers may construct incorrect distributions from stale statistics. The effect of staleness becomes even more evident with low network bandwidth, which is common in low-end clusters and budget cloud services. If the shared states cannot be synchronized in time even though the network is saturated, the parallelization error will only grow as the inference algorithm proceeds.
Based on dynamic model partitioning, we can avoid such issues by carefully managing communication between workers. To achieve this, we introduce a key-value store that holds the global model {C^t_k}. Note that, unlike a "parameter server" [1], the purpose of this component is mainly distributed in-memory storage: thanks to dynamic model partitioning, frequent background asynchronous communication is no longer required. In practice, a simple distributed hash table implementation suffices. Given the dynamic model partitioning strategy, on-demand communication between workers and the key-value store follows the procedure described in Algorithm 2. At the beginning of each round, after receiving the task list, each worker starts requesting its model blocks from the key-value store. Similarly, after finishing its tasks, each worker commits the changes in its local model blocks, thereby updating the global model. This process can be further accelerated by overlapping the sampling procedure and communication, i.e., sending/receiving model blocks asynchronously. Again, since the model blocks are non-overlapping, there is no synchronization issue on the key-value store. Moreover, the amount of communication is reduced significantly compared to the frequent-synchronization approach.

By combining dynamic model partitioning and on-demand communication via the key-value store, variable dependency between workers is also reduced. This not only eliminates the need for frequently synchronized shared states as in data-parallel inference, but also results in faster sampler convergence per token processed. In fact, as we show later, our method requires far fewer iterations to converge than others, while having similar per-iteration time complexity.

3.3 Non-separable Dependency

So far we have deliberately omitted another source of dependency: the global topic count vector {C_k}.
It is impossible to divide {C_k} into disjoint blocks, since it is required when sampling every token. However, noticing that its values are relatively large (since C_k = Σ_t C^t_k) and that it only appears in the denominator, small-magnitude changes will not affect the final distribution much. This calls for a much more relaxed level of consistency. Therefore, we synchronize {C_k} across the workers at the beginning of each round through the key-value store. This is highly efficient, since every worker only needs to send/receive a vector of size K to/from the key-value store. During a round, workers are not aware of the changes in {C_k} made by other workers, which causes some error in the distribution to sample from. This is in some sense similar to the idea used in [16], where the entire model is allowed to go out of sync during an iteration. However, we only relax the consistency requirement on {C_k}, while the major element of the model, {C^t_k}, is maintained without any error. As we show in Section 5, due to the small amount of change relative to the actual values, the resulting error is empirically negligible.

Figure 1: A high-level view of the architecture for model-parallel inference of LDA. The scheduler coordinates the workers, each holding a data partition and a model partition, and the workers exchange model blocks through the key-value store.

To sum up, combining dynamic model partitioning and on-demand communication not only reduces the memory load of each worker, but also avoids most of the parallelization error. A special protocol is introduced to address the non-separable dependency on {C_k} without sacrificing inference quality. As we show later, compared to a data-parallel method [1], model-parallel inference takes an order of magnitude less time to converge.
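A minimal sketch of a worker's round under this scheme, combining Algorithm 2 with the once-per-round synchronization of {C_k}, might look as follows. The class and function names are ours, and a plain dict stands in for the distributed hash table; the paper's actual implementation is a C++11 system:

```python
class KVStore:
    """Toy stand-in for the distributed key-value store."""
    def __init__(self):
        self.blocks = {}        # block id -> word-topic counts {C^t_k}
        self.topic_totals = []  # global topic counts {C_k}

    def get(self, bid):
        return self.blocks[bid]

    def commit(self, bid, counts):
        # Blocks handed out within a round are disjoint across workers,
        # so overwriting needs no locking.
        self.blocks[bid] = counts

def worker_round(kv, bid, gibbs_fn):
    """One round of Algorithm 2 for a single worker."""
    Ck = list(kv.topic_totals)     # lazy sync of {C_k}: once per round
    counts = kv.get(bid)           # request model block on demand
    counts = gibbs_fn(counts, Ck)  # sample local tokens whose word is in the block
    kv.commit(bid, counts)         # commit the updated block
```

Because the rotation guarantees that no two workers hold the same block in a round, `commit` never races with another writer, which is why no locking appears in the sketch; only the stale local copy of {C_k} deviates from the truth during a round.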
4 Implementation Details

In this section we provide some technical details about implementing model-parallel inference for LDA.

4.1 Overall Architecture

The complete design of the system components is illustrated in Figure 1. We partition both the data and the model so that each partition fits in a single machine's memory. The scheduler directly communicates with the workers to 1) generate and assign tasks and 2) coordinate model partitions between workers. It also maintains a special communication channel with the key-value store to handle the non-separable dependency on {C_k}. As mentioned, model blocks are communicated via a distributed key-value store in a managed fashion, rather than by busy synchronization. This significantly reduces the amount of communication and hence lowers the requirement on network bandwidth.

4.2 Fast Sampling on an Inverted Index

Upon receiving the task list and the necessary model blocks, the main job left for the worker is to perform Gibbs sampling on the local data. Because of the scheduling constraint, only tokens mapped to the words in the current task list can be sampled in a given round. The traditional bag-of-words representation of documents turns out to be rather inefficient in this case: to determine the set of tokens to be sampled in a round, a sequential pass over the dataset as well as repeated comparisons between the task list and each token are required. This is in fact a classic problem in search engines, where the typical solution is to represent the documents in an inverted index, instead of a forward index (i.e., bag-of-words). With the inverted index created, for each worker m, each record indexed by word t represents all the topic assignments z_dn such that w_dn = t and d ∈ D_m. By doing so we completely eliminate the repeated comparisons between the two sets.
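A minimal sketch of the inverted-index construction, with token positions standing in for the topic-assignment records z_dn (names are ours):

```python
from collections import defaultdict

def invert(docs):
    """Build an inverted index: word t -> list of (doc d, position n).

    Each entry locates a token with w_dn = t, so a worker holding the
    model block for a set of words can enumerate exactly the tokens it
    may sample this round, without scanning the whole corpus or
    comparing every token against the task list.
    """
    index = defaultdict(list)
    for d, doc in enumerate(docs):
        for n, t in enumerate(doc):
            index[t].append((d, n))
    return index

docs = [["cat", "dog", "cat"], ["dog", "fish"]]
idx = invert(docs)
assert idx["cat"] == [(0, 0), (0, 2)]
assert idx["dog"] == [(0, 1), (1, 0)]
```

Given a task list of words, the worker's tokens for the round are just the concatenation of the corresponding index entries.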
In addition, similar to the idea in [20], we can take advantage of sparsity as well. We first note that the decomposition (2) is not optimal for sampling on an inverted index. To see why, note that a great proportion of the efficiency of the sparse sampling algorithm [20] comes from precomputing Σ_k B_k for each document. Once cached, changes to Σ_k B_k within a document only require O(1) time for incremental updates. The caching effect is maximized when the tokens in each document are sampled sequentially in one pass. However, for sampling on an inverted index this is no longer the case: Σ_k B_k would be frequently recomputed, since typically only a few tokens in a document represent a specific word. Instead, a different decomposition can be used to maximize the caching effect:

  p(z_dn = k | Z_¬dn) ∝ X_k + Y_k,   (3)

where

  X_k = α_k (C^t_{k,¬n} + β_t) / (C_{k,¬n} + Σ_t β_t),
  Y_k = C^k_{d,¬n} (C^t_{k,¬n} + β_t) / (C_{k,¬n} + Σ_t β_t).

The first probability bucket Σ_k X_k can be precomputed for every word (i.e., task) in O(K) time, with O(1) maintenance cost for subsequent updates. It is cached for every token associated with the word t in the local partition. Also note that the fractional terms in X_k and Y_k are identical, thus the coefficients of {Y_k} can be precomputed along with Σ_k X_k at no additional cost. To get Σ_k Y_k, we make use of the sparsity of C^k_d, which requires O(K_d) time. Note that due to the dense fractional term and the more evenly split probability mass, this algorithm is not as efficient as the sparse sampler in [20]. However, as stated above, it makes full use of the inverted index structure that is required by model-parallel inference. In Section 5, we show that the disadvantage of the non-optimal sampling algorithm is outweighed as the benefit of model-parallelism becomes salient.
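To make the decomposition (3) concrete, here is a sketch of drawing one topic under it. It assumes a symmetric prior (scalar α_k = α and β_t = β) and counts that already exclude the current token; the function and argument names are ours:

```python
import random

def sample_topic(K, alpha, beta, beta_sum, Ct, Cd, C):
    """One draw from decomposition (3).

    Ct: word-topic counts C^t_k (length K); Cd: sparse doc-topic
    counts {k: C^k_d}; C: global topic counts C_k; beta_sum = sum_t beta_t.
    """
    # Shared coefficient of both buckets: (C^t_k + beta_t) / (C_k + sum_t beta_t).
    coef = [(Ct[k] + beta) / (C[k] + beta_sum) for k in range(K)]
    # Dense bucket X: in the real sampler this is cached per word,
    # built in O(K) and maintained in O(1).
    X = [alpha * coef[k] for k in range(K)]
    # Sparse bucket Y: touches only the K_d non-zero doc-topic counts.
    Y = {k: cnt * coef[k] for k, cnt in Cd.items()}
    mass = sum(X) + sum(Y.values())
    u = random.random() * mass
    for k, y in Y.items():          # check the sparse bucket first
        if u < y:
            return k
        u -= y
    for k in range(K):              # fall through to the dense bucket
        if u < X[k]:
            return k
        u -= X[k]
    return K - 1                    # guard against floating-point drift

k = sample_topic(K=3, alpha=0.1, beta=0.01, beta_sum=0.03,
                 Ct=[5, 1, 0], Cd={0: 2}, C=[50, 30, 20])
assert 0 <= k < 3
```

Since the coefficient list is exactly the cached per-word quantity, only the Y bucket needs to be rebuilt per token, which is the O(K_d) cost quoted above.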
5 Experiments

In this section we quantitatively evaluate the proposed model-parallel inference for LDA. Our chosen baseline is Yahoo!LDA [1], a popular, publicly available distributed implementation of the sparse Gibbs sampler [20]. Another notable baseline is Google's PLDA+ [13], which has token sampling throughput similar to Yahoo!LDA: roughly, both process 20K tokens per compute core per second on a medium-sized cluster with 10-100 machines. Since the sampling throughputs of Yahoo!LDA and Google PLDA+ are similar, we only compare to Yahoo!LDA. In our experiments, we show that our method, while having sampling throughput similar to Yahoo!LDA (and PLDA+), converges significantly faster per iteration, because our careful, word-partitioned model-parallel design significantly reduces synchronization errors in the word-topic table (as in Figure 3). We also attempted to compare with the topic modeling toolkit in GraphLab [7]; however, in all of the experiments it failed to initialize due to excessive memory consumption. This is understandable, since it is an application built on top of a general-purpose system rather than a performance-driven instantiation of the algorithm; hence we omit its results hereinafter.

Experiment settings: To demonstrate effectiveness across different hardware settings, we conduct experiments in two disparate settings [6]: a high-end cluster with 64-core machines, and a low-end cluster equipped with 2-core machines. Specifically, the high-end cluster contains 10 machines connected via a 40Gbps Ethernet network interface, each node equipped with quad-socket 16-core AMD Opteron 6272 processors (2.1GHz) and 128GB RAM. The low-end cluster consists of 128 machines connected via 1Gbps Ethernet, with dual-socket AMD Opteron 252 processors (2.6GHz) and 8GB RAM in each machine.
Figure 2: Convergence speed: (a) log-likelihood per iteration and (b) log-likelihood over elapsed time, for model-parallel inference and Yahoo!LDA with K = 1000 and K = 5000. Model-parallel inference exhibits a sharper move toward higher likelihood.

The model-parallel inference is fully implemented in C++11. Note that although the high-end machines are NUMA nodes, for fair comparison we do not include any optimization for the NUMA architecture.

Dataset: We use Pubmed² and the 3.9M-document English Wikipedia abstracts³ as our datasets. We further construct an augmented corpus by extracting bigrams (2 consecutive tokens) from the Wikipedia corpus. Pubmed contains 8.2M documents, V = 141,043 words and about 737.9M tokens. The original Wikipedia dataset consists of V = 2.5M unique words and 179M tokens, while the bigram corpus has V = 21.8M unique phrases and 79M occurrences of these phrases. We note that the bigram vocabulary of 21.8M is almost an order of magnitude larger than that of [1], and clearly demonstrates our scalability to very large model sizes. Our experiments used numbers of topics from K = 1000 up to K = 10000, which results in extremely large word-topic tables: 12.5B elements in the unigram case, and 218B elements in the bigram case. The model size is about 60 times larger than the recent result in [1].

² https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
³ http://wiki.dbpedia.org/Downloads39#extended-abstracts

Evaluation: We choose training log-likelihood as our surrogate measure of convergence because 1) LDA Gibbs samplers tend to converge to (one of many possible) local optima in the space of possible word-topic and doc-topic tables, and 2) the progress of the sampler toward a local optimum correlates well with the rise, and then plateauing, of the training log-likelihood measured on the latest sample.
Since the LDA Gibbs sampler is unlikely to leave a local optimum once it has reached one, the algorithm can be safely terminated once the log-likelihood plateaus. One might ask why we did not employ test-data perplexity as our surrogate, as many practitioners do. We caution that this metric is in fact improper for evaluating competing inference systems (on the same model), though suitable for evaluating the goodness of different model designs (using the same inference system). For instance, it can be used to evaluate different flavors of LDA, or alternative models, in terms of how well they capture training data characteristics and generalize to new data. However, our focus is on inference quality and efficiency on the same model, not the goodness of competing models; all systems/algorithms we tested perform inference for the standard LDA model, and therefore (differences in) model generalization is not the issue under investigation. Moreover, because an inference algorithm learns model parameters and variables only from the training data, it is only appropriate to track its convergence as a function of the training data. Using test-data perplexity introduces additional confounding factors, particularly how well each training-data local optimum generalizes to the test data; this is a confounding factor because sampling algorithms are not designed to control which training optimum they eventually reach. The point is that training-data log-likelihood controls for external factors better than test-data perplexity, in the context of measuring inference speed and accuracy.

5.1 Convergence

We first compare the convergence speed of the different methods on the high-end cluster, using the Pubmed dataset with 1000 and 5000 topics. Figure 2(a) shows the log-likelihood at each iteration. We observe that model-parallel inference achieves greater per-iteration progress than the data-parallel approach.
In other words, our method requires far fewer iterations to reach a given likelihood. Figure 2(b) shows the log-likelihood trend in terms of elapsed time. We observe similar trends as in the per-iteration plot. This again shows the effect of sampling from correct distributions: dynamic model partitioning seamlessly handles the dependency on the model, whereas the data-parallel approach suffers slow convergence, especially at the beginning, due to drastic changes of the model copies on worker nodes.

² https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
³ http://wiki.dbpedia.org/Downloads39#extended-abstracts

Figure 3: The error ∆_{r,i} at each iteration, with each round viewed as 1/M progress of an iteration, for K = 1000 and K = 5000. The error is almost 0 (the minimum) everywhere.

Table 1: Time to converge for different model sizes with 64 low-end machines. Model-parallel inference not only handles larger models, but also converges faster.

Corpus           Wiki-unigram         Wiki-bigram
K                5000      10000      5000      10000
Model-Parallel   2.3 hr    5.0 hr     8.9 hr    > 12 hr*
Yahoo!LDA [1]    11.8 hr   N/A        N/A       N/A
(*terminated by the cluster)

We also show the effect of lazy synchronization of {C_k}, which relaxes strict consistency between workers. As mentioned in Section 3, {C_k} is only synchronized at the beginning of each round, and is therefore free to go out of sync during the rest of the round. We relax the consistency requirement based on the intuition that minor changes in huge counts will not affect the overall result much. We now show that the induced error is almost negligible in practice. As a proxy for the error made in each round, we can measure the difference between the true T = {C_k} and its local copy T̃_m on worker m at the end of each round.
Specifically, we define the error at each round r and iteration i to be

∆_{r,i} = (1 / (M N)) ∑_{m=1}^{M} ‖T − T̃_m‖_1,

where N = ∑_k C_k is the total number of tokens in the corpus. In other words, we compute the normalized ℓ1-distance between each worker's local copy T̃_m and the true value T, and then average this amount over all workers. As a result, ∆ must lie in [0, 2], where 0 denotes no error. Figure 3 shows the error collected on the high-end cluster using the Pubmed dataset. We observe that the error immediately drops to 0 and stays close to it during the rest of the inference procedure. This demonstrates that our method exhibits very small parallelization error and hence faster convergence.

5.2 Model Size

We demonstrate our ability to handle big models in Table 1. Yahoo!LDA starts to fail at the problem size of 2.5M vocabulary and 10000 topics, because the local copy of the model no longer fits into memory, even though it only stores keys that appear in the local subset of the data. In contrast, by sharding the model into blocks, our method effectively handles bigger models. As shown in the table, the model-parallel approach is able to perform inference on all configurations of model size, including the biggest one using the bigram dataset with 10000 topics, demonstrating our ability to handle a model with over 200 billion variables on only a low-end cluster. In addition, we observe faster convergence in the small-model setting compared to Yahoo!LDA, indicating that model-parallelism is effective not only for big models but also for moderate-sized model problems. All of these results clearly demonstrate the effectiveness of the dynamic model partitioning strategy.
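The disjoint-block idea behind this partitioning can be illustrated with a toy rotation schedule. This is a sketch of the concept only, not our actual scheduler; the function name and the specific (m + r) mod M rotation rule are assumptions for exposition:

```python
# Toy illustration of model-parallel block scheduling (an assumption-laden
# sketch, not the paper's scheduler). The word-topic table is split into M
# disjoint vocabulary blocks; in round r, worker m works on block
# (m + r) % M, so no two workers touch the same block concurrently, and
# after M rounds (one full iteration) every worker has visited every block.

def rotation_schedule(num_workers, num_rounds=None):
    """Yield, per round, the block index assigned to each worker."""
    M = num_workers
    for r in range(M if num_rounds is None else num_rounds):
        yield [(m + r) % M for m in range(M)]

# Within any round the assignment is a permutation: blocks are disjoint.
for blocks in rotation_schedule(4):
    assert len(set(blocks)) == 4
```

Because each round touches disjoint model blocks, workers need no fine-grained locking on the word-topic table; only the shared topic counts {C_k} need the per-round synchronization described above.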
Figure 4: (a) Memory usage per machine as a function of the number of machines used. Our method follows a 1/M trend, indicating efficient partitioning of both data and model across machines. (b) Speedup in terms of time to reach a log-likelihood of −2.7 × 10^9 on different numbers of machines. Our method achieves good scalability, whereas Yahoo!LDA fails to utilize additional computing resources.

5.3 Scalability

In Figure 4(a), we show the total memory footprint of each worker as the number of computing nodes increases, using the unigram dataset with K = 5000 topics. In the ideal case, as the number of machines doubles, the memory consumption should be halved. We observe that model-parallel inference achieves nearly ideal scalability over machines. Although it starts with a higher memory footprint, it closely follows a 1/M trend and drops to a much lower level, indicating that the dynamic model partitioning scheme effectively makes use of the added memory without unnecessary duplication. In contrast, Yahoo!LDA's per-machine memory usage is almost constant, because its data-parallel strategy requires most of the word-topic table {C^t_k} to be stored on each machine; adding machines therefore will not solve big-model problems.

We also show convergence speedup as a function of the number of machines. In Figure 4(b), we show the speedup in terms of convergence time on different numbers of machines for a fixed model size (the unigram dataset with 5000 topics). Interestingly, we observe that Yahoo!LDA performs worse beyond 32 machines.
The reason is network congestion in the low-end cluster: since the models are frequently synchronized between every pair of nodes, network traffic grows as O(M^2). Thus parameters are more likely to be out of date as the number of nodes increases under low bandwidth, which introduces more error into the overall procedure. By contrast, the curve for model-parallel inference follows the ideal speedup trend closely. This shows that model-parallel inference effectively utilizes additional computational resources without significant overhead. Unlike full O(M^2) connections, the on-demand communication strategy in model-parallel inference greatly reduces traffic through managed synchronization, while still guaranteeing model correctness. This demonstrates the ability of model-parallel inference to handle large-scale inference problems on low-end clusters.

6 Conclusion

In this paper, we presented a model-parallel inference method for LDA, motivated by the pitfalls of data-parallelism in distributed inference. We proposed a system that implements model-parallelism on top of data-parallelism, and showed empirical results demonstrating improved time and memory efficiency over other approaches. In a word, model-parallelism not only eliminates dependencies between inference processes but also brings the capability of handling big models. Therefore, without drastic changes to the algorithm itself, e.g., using a crafted Metropolis-Hastings step to speed up the sampler, we can already improve the algorithm significantly just by careful arrangement of model blocks. We expect the idea can be applied to a broader class of models as well, to scale up without sophisticated algorithmic tweaks. A first attempt at generalized model-parallelism can be found in [11], which nonetheless deserves further investigation.
We are also interested in employing model-parallelism in more challenging tasks, for example Bayesian nonparametric models such as the Hierarchical Dirichlet Process (HDP) [18] and regularized Bayesian models such as MedLDA [23].

References

[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM), 2013.
[2] A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In NIPS, 2008.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In NIPS, 2013.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[6] G. Gibson, G. Grider, A. Jacobson, and W. Lloyd. PRObE: a thousand-node experimental cluster for computer systems research. USENIX ;login:, 38, 2013.
[7] J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.
[8] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences (PNAS), 5228–5235, 2004.
[9] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger, and E. P. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[10] M. Johnson, J. Saunderson, and A. Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In NIPS, 2013.
[11] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. Gibson, and E. P. Xing. Primitives for dynamic big model parallelism. In NIPS, 2014.
[12] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. J. Smola. Parameter server for distributed machine learning. In Workshop on Big Learning, NIPS, 2013.
[13] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. PLDA+: parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning, 2011.
[14] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB, 2012.
[15] R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: an experimental evaluation of speed and scalability. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, 2007.
[16] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, 2007.
[17] B. Recht, C. Ré, S. J. Wright, and F. Niu. Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[19] F. Yan, N. Xu, and Y. Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, 2009.
[20] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2009.
[21] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In HotCloud, 2010.
[22] K. Zhai, J. Boyd-Graber, N. Asadi, and M. Alkhouja. Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce. In Proceedings of the 21st International World Wide Web Conference (WWW), 2012.
[23] J. Zhu, A. Ahmed, and E. Xing. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research, (13):2237–2278, 2012.