SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs


Authors: Kaiwei Li, Jianfei Chen, Wenguang Chen, Jun Zhu

Kaiwei Li, Jianfei Chen, Wenguang Chen, Jun Zhu
Tsinghua University
{likw14, chenjian14}@mails.tsinghua.edu.cn, {cwg, dcszj}@tsinghua.edu.cn

Abstract

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that implements a sparsity-aware algorithm to achieve sublinear time complexity and scales well to learn a large number of topics. To address the challenges introduced by sparsity, we propose a novel data layout, a new warp-based sampling kernel, and an efficient sparse count matrix updating algorithm that improve locality, make efficient use of GPU warps, and reduce memory consumption. Experiments show that SaberLDA can learn from billions-of-tokens-scale data with up to 10,000 topics, almost two orders of magnitude more than previous GPU-based systems. With a single GPU card, SaberLDA is able to learn 10,000 topics from a dataset of billions of tokens in a few hours, which was previously achievable only with clusters of tens of machines.

1. Introduction

Topic models provide a suite of widely adopted statistical tools for feature extraction and dimensionality reduction on bag-of-words (i.e., discrete count) data, such as text documents and images in a bag-of-words format [5].
Given an input corpus, topic models automatically extract a number of latent topics, which are unigram distributions over the words in a given vocabulary. The high-probability words in each topic are semantically correlated. Latent Dirichlet Allocation (LDA) [3] is the most popular topic model due to its simplicity, and has been deployed as a key component in data visualization [10], text analysis [4, 28], computer vision [5], network analysis [6], and recommendation systems [8].

In practice, it is not uncommon to encounter large-scale datasets; e.g., text analysis typically involves hundreds of millions of documents [24], and recommendation systems need to handle hundreds of millions of users [1]. Furthermore, as the scale of the datasets increases, the model size needs to grow as well: we need a larger number of topics in order to exploit the richer semantic structure underlying the data. A reasonable design goal for modern topic modeling systems is thousands of topics, to have good coverage for both industry-scale applications [20] and research [4, 5, 10].

Table 1. A summary of GPU-based LDA systems. D: the number of documents, K: the number of topics, V: the size of the vocabulary, T: the number of tokens.

  Implementation           D      K     V     T
  Yan et al. [21]          300K   128   100K  100M
  BIDMach [27]             300K   256   100K  100M
  Steele and Tristan [17]  50K    20    40K   3M
  SaberLDA                 19.4M  10K   100K  7.1B

However, it is highly challenging to efficiently train large LDA models. The time complexity of training LDA is high because it involves iteratively scanning the input corpus many times (e.g., 100), and the time complexity of processing each token is not constant but related to the number of topics. To train LDA in acceptable time, CPU clusters are often used. However, due to the limited memory bandwidth and low computational power of CPUs, large clusters are typically required to learn large topic models [1, 23, 24].
For example, a 32-machine cluster is used to learn 1,000 topics from a 1.5-billion-token corpus.

A promising alternative is to train LDA with graphics processing units (GPUs), leveraging their high computational power and memory bandwidth. Along this line, there have been a number of previous attempts. For example, Yan et al. [21] implement the collapsed Gibbs sampling algorithm, BIDMach [27] implements the variational Bayes algorithm as well as a modified Gibbs sampling algorithm, and Steele and Tristan [17] propose an expectation-maximization algorithm. These GPU-based systems are reported to achieve superior performance over CPU-based systems [17, 21, 27].

Unfortunately, current GPU-based LDA systems can only learn a few hundred topics (see Table 1), which may not be sufficient to capture the rich semantic structure underlying the large datasets of industry-scale applications [20]. It is fundamentally difficult for these systems to learn more topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics.

To address this problem, we propose SaberLDA, a novel GPU-based system that adopts a sparsity-aware algorithm for LDA. Sparsity-aware algorithms are based on the insight that a single document is not likely to cover many topics, and they achieve sub-linear (or even amortized constant) time complexity with respect to the number of topics, so the running time is not sensitive to the number of topics. Representative examples include AliasLDA [12], F+LDA [23], LightLDA [24], WarpLDA [7], and ESCA [26], all implemented on general-purpose CPU systems. However, it is considerably more challenging to design and implement sparsity-aware algorithms on GPUs than on CPUs.
Compared with CPUs, GPUs have a much larger number of threads, much smaller per-thread cache capacity, and longer cache lines, which makes it much more difficult to use caches to mitigate the random accesses introduced by sparsity. The branch divergence issues of GPUs suggest that we should fully vectorize the code, but for sparse data structures the loop length is no longer fixed and the data are not aligned, so straightforward vectorization is not feasible. Finally, the limited GPU memory capacity requires streaming input data and model data, which adds another dimension of complexity to data partitioning and layout in order to enable parallelism, good locality, and efficient sparse matrix updates all together.

SaberLDA addresses all these challenges to support sparsity-aware algorithms on GPUs. Our key technical contributions include:

• A novel hybrid data layout, "partition-by-document and order-by-word" (PDOW), that simultaneously maximizes locality and reduces GPU memory consumption;
• A warp-based sampling kernel that is fully vectorized and is equipped with a W-ary sampling tree supporting both efficient construction and efficient sampling;
• An efficient "shuffle and segmented count" (SSC) algorithm for updating sparse count matrices.

Our experimental results demonstrate that SaberLDA is able to train LDA models with up to 10,000 topics, more than an order of magnitude larger than previous GPU-based systems [17, 21, 27], and its throughput decreases by only 17% when the number of topics increases from 1,000 to 10,000. SaberLDA is also highly efficient: under various topic settings, it converges 5 times faster than previous GPU-based systems and 4 times faster than CPU-based systems.
With a single card, SaberLDA is able to learn 10,000 topics from a dataset of billions of tokens, which was previously achievable only with clusters of tens of machines [23].

The rest of the paper is organized as follows: Sec. 2 introduces the basics of LDA; Sec. 3 presents the design of SaberLDA; Sec. 4 contains the experiments; and Sec. 5 concludes.

2. Latent Dirichlet Allocation

In this section, we introduce the latent Dirichlet allocation (LDA) model and its sampling algorithm.

2.1 Definition

LDA is a hierarchical Bayesian model that learns latent topics from text corpora [3]. For an input text corpus, the scale of the learning task is determined by the following four numbers:

• D: the number of documents in the given corpus;
• T: the number of tokens, i.e., word occurrences, in all the documents;
• V: the number of unique words in the corpus, also known as the vocabulary size;
• K: the number of topics,

where D, T, and V are determined by the corpus, and K is a parameter that users can specify.

[Figure 1. An example token list and related count matrices: a token list L for three documents, the sparse document-topic matrix A, and the word-topic matrix B.]

The text corpus is represented as a token list L, where each occurrence of a word v ∈ [1, V] in a document d ∈ [1, D] is called a token, and is represented by a triplet (d, v, k). While d and v are specified by the corpus, training LDA involves assigning a topic assignment k ∈ [1, K] to each token (d, v, k). After the token list is given, we construct two count matrices as below:

• The document-topic count matrix A is D × K, where A_dk is the number of tokens t with (d_t = d, k_t = k).
• The word-topic count matrix B is V × K, where B_vk is the number of tokens t with (v_t = v, k_t = k),

where the matrix A is often sparse, i.e., has many zero elements (see below for an example).

We omit the mathematical details of LDA and refer interested readers to the standard LDA literature [3, 9]. Informally, the LDA model is designed to maximize an objective function (the likelihood) related to the topic assignments k. A document d can be characterized by the d-th row of the document-topic matrix A, and a topic k can be characterized by the k-th column of the word-topic matrix B.

Figure 1 is a concrete example, with D = 3 documents, T = 8 tokens, V = 5 words in the vocabulary, and K = 3 topics. Each token is assigned a topic assignment from 1 to 3, indicated as a superscript in the figure. A is the document-topic matrix; e.g., A_13 = 2 because both tokens in document 1 are assigned to topic 3. B is the word-topic matrix, and each of its columns characterizes what the corresponding topic is about: the first topic has the words "apple" and "iPhone" and is about the device iPhone; the second topic is about fruits; and the third topic is about mobile OS. Likewise, the rows of A characterize what the documents are about: the first document is about mobile OS, the second about iPhone and mobile OS, and the last about fruits. Note that the matrix A is sparse, because a document is not likely to be relevant to all the topics at the same time.

2.2 Inference

Given the token list L, our goal is to infer the topic assignment k_t for each token t. Many algorithms exist for this purpose, such as variational inference [3], Markov chain Monte Carlo [9], and expectation maximization [7, 26].
We choose the ESCA algorithm [26] to implement on GPUs for the following advantages:

• It is sparsity-aware, so its time complexity is sub-linear with respect to the number of topics. This property is critical for the efficient training of large models;
• It enjoys the best degree of parallelism, because the count matrices A and B only need to be updated once per iteration. This matches the massively parallel nature of GPUs.

The ESCA algorithm alternately performs an E-step and an M-step for a given number of iterations (e.g., 100), where the E-step updates k_t given the counts A and B, and the M-step updates A and B given k_t:

• E-step: for each token (d, v, k), sample the topic assignment k:

    p(k) ∝ (A_dk + α) B̂_vk,    (1)

where ∝ means "proportional to" and the word-topic probability matrix B̂ is a normalized version of the count matrix B, in the sense that B̂ is roughly proportional to B but each of its columns sums to 1. B̂_vk is computed as follows:

    B̂_vk = (B_vk + β) / (Σ_{v=1}^{V} B_vk + V β),    (2)

• M-step: update the counts A and B, and compute B̂.

[Figure 2. Sampling from a multinomial distribution; areas are proportional to the probabilities p(k = 1) = 0.25, p(k = 2) = 0.125, p(k = 3) = 0.375, p(k = 4) = 0.25.]

Here, α and β are two user-specified parameters that control the granularity of topics. Large α and β values mean that we want to discover a few general topics, while small α and β values mean that we want to discover many specific topics.

The updates in Eqs. (1) and (2) can be understood intuitively: a token (d, v, k) is likely to receive the topic assignment k if both A_dk and B_vk are large, i.e., many tokens in document d have topic k and many tokens of word v have topic k.
For example, if we wish to update the topic assignment of the token "apple" in document 3, it is more likely to be topic 2 than topic 1, because the other token in the same document, "orange", is assigned to topic 2.

2.3 Sampling from a multinomial distribution

The core of the above algorithm is sampling according to Eq. (1). To help readers understand the sampling procedure, we begin with a vanilla sampling algorithm, and then proceed to the more complicated sparsity-aware algorithm. Sampling according to Eq. (1) can be viewed as throwing a needle onto the ground and reporting the number of the region where the needle falls, where the area of region k is the probability p(k) (Figure 2). This can be implemented in three steps:

1. For k = 1, ..., K, compute the probabilities p(k) and their sum S = Σ_k p(k);
2. Generate a random number u ∈ [0, S);
3. Compute the prefix sums c_k = c_{k−1} + p(k), where c_0 = 0, and return the first k such that u ≤ c_k, using a binary search.

We refer to the above procedure as the vanilla algorithm, whose time complexity is dominated by steps 1 and 3, which are O(K). Step 3 is a very important routine that we will use over and over again; we refer to its result as "the position of u in the prefix-sum array of p(k)" for brevity.

While the vanilla algorithm is O(K), the sparsity-aware algorithms [12, 26] exploit the sparsity of A and improve the time complexity to O(K_d), where K_d is the average number of non-zero entries per row of A. The algorithm decomposes the original sampling problem into two easier sampling sub-problems. For each token, it returns the result of a randomly chosen sub-problem, where the probability of choosing the

Algorithm 1. ESCA algorithm for LDA.
1: Input: token list L
2: Variables: sparse matrix A, dense matrices B and B̂, vector Q, sampling trees T of all unique words
3: for i ← 1 to num_iterations do
4:   // E-step:
5:   for (d, v, k) ∈ L do
6:     k ← Sample(A_d, B̂_v, Q_v, T_v)
7:   end for
8:   // M-step:
9:   A ← CountByDZ(L)
10:  B ← CountByVZ(L)
11:  B̂ ← Preprocess(B, β)
12:  for v ← 1 to V do
13:    Q_v, T_v ← BuildTree(B̂_v)
14:  end for
15: end for

first sub-problem is S / (S + Q_v), where S = Σ_{k=1}^{K} A_dk B̂_vk and Q_v = α Σ_{k=1}^{K} B̂_vk. The two sub-problems are defined as follows:

PROBLEM 1. Sample p_1(k) ∝ A_dk B̂_vk.

This can be sampled with the vanilla algorithm described before, but sparsity can be exploited: if A_dk = 0, then p_1(k) = 0 as well. Therefore, we only need to consider the indices k where A_dk ≠ 0. There are only K_d such indices on average; therefore, the time complexity is only O(K_d) instead of O(K). Similarly, S can be computed in O(K_d).

PROBLEM 2. Sample p_2(k) ∝ α B̂_vk ∝ B̂_vk.

This problem depends only on v, not on d, so we can pre-process for each v. There are various approaches to pre-processing, which we cover in detail in Sec. 3.2. In brief, we construct a tree T_v for each v, and then each sample can be obtained with the tree in O(log_W K). The pre-processing step is not the bottleneck, because it is done only once per iteration.

2.4 Pseudo-Code

To make the presentation more concrete, we present the pseudocode of the ESCA algorithm as Alg. 1, which follows a Bulk Synchronous Parallel (BSP) programming model. In the E-step of each iteration, all the tokens in the token list L are updated independently by calling Sample for each token. All arguments of the function Sample are read-only, and the return value is the new topic assignment k. In the M-step, the matrices A and B are computed from the token list L by the functions CountByDZ and CountByVZ.
Then, B̂ is updated by the function Preprocess following Eq. (2). Finally, the sampling trees T_v and sums Q_v are generated for each word v.

Algorithm 2. Sparsity-aware sampling
1: Input: sparse vector A, dense vector B̂, scalar Q, tree T
2: S ← 0
3: P ← new sparse vector
4: for k ← non-zero elements of A do
5:   P_k ← A_k × B̂_k
6:   S ← S + P_k
7: end for
8: if random(0, 1) < S / (S + Q) then
9:   k ← sample from P
10: else
11:  k ← T.sample()
12: end if
13: return k

As shown in Alg. 2, the function Sample samples the topic assignment k given the rows A_d and B̂_v, the sum Q_v, and the tree T_v, implementing the sparsity-aware algorithm described in Sec. 2.3. First, we compute the probabilities for Problem 1, P = A_d ⊙ B̂_v, as well as S, where ⊙ is the element-wise product. Next, we flip a coin whose probability of heads is S / (S + Q). If the coin comes up heads, we sample from p_1(k) ∝ P_k, which involves finding the location of a random number in the prefix-sum array of the sparse vector P, as discussed in Sec. 2.3. Otherwise, we sample from p_2(k) ∝ B̂_vk with the pre-processed tree T.

3. Design and Implementation

In this section we present SaberLDA, a high-performance GPU-based system for training LDA. The design goals of SaberLDA are:

• supporting large models with up to 10,000 topics;
• supporting large datasets of billions of tokens;
• providing performance comparable to a moderate-size CPU cluster with a single GPU card.

It is difficult for previous GPU-based systems [17, 21, 27] to satisfy all these goals, because they are all based on the vanilla algorithm mentioned in Sec. 2.3, which has O(K) time complexity; i.e., the training time would be 100 times longer if the number of topics increased from hundreds (current) to 10,000 (our goal), which is not acceptable.
To address this problem, SaberLDA adopts the sparsity-aware sampling algorithm, whose O(K_d) time complexity is not very sensitive to the number of topics. However, exploiting sparsity is highly challenging on GPUs. Memory accesses are no longer sequential as in the dense case, and the small cache size and long cache lines of GPUs aggravate this problem, so memory accesses must be organized very carefully for good locality. Moreover, vectorization is not as straightforward as in the dense case, since the loop length is no longer fixed and the data are not aligned, so stalls and uncoalesced memory accesses can occur. Finally, updating a sparse count matrix is more complicated than updating a dense one. We now present the design of SaberLDA, which addresses all of these challenges.

3.1 Streaming Workflow

We first consider how to implement Alg. 1, deferring the details of the functions Sample, CountByDZ, CountByVZ, Preprocess, and BuildTree to subsequent sections. The implementation involves designing the storage format, data placement (host/device), and partitioning strategies for all the data items, as well as the order of sampling tokens (Line 5 of Alg. 1). The data items include the token list L, the document-topic count matrix A, the word-topic count matrix B, and the word-topic probability matrix B̂. The design should maximize locality while keeping the size of the data items within the GPU memory budget, even when both K and the data size are large. This is challenging because locality emerges as a new design goal for supporting sparsity-aware algorithms, whereas previous dense-matrix-based GPU systems [17, 21, 27] had only one goal (fitting into the memory budget) instead of two goals that must be met simultaneously.
As we will see, the simple data layouts of previous systems, such as sorting all tokens by document id [17, 21, 26, 27] or by word id [23], cannot satisfy both requirements, so we propose a hybrid data layout to address this problem.

3.1.1 Storage Format

We analyze the access patterns of the data items described in Alg. 1 and Alg. 2. The token list L is accessed sequentially, so we store it in an array. The document-topic count matrix A is accessed by row, and the Sample function iterates over its non-zero entries, so we store it in the compressed sparse row (CSR) format to avoid enumerating zero entries. Besides efficiency, the CSR format also reduces memory consumption and host/device transfer time compared with the dense matrix format of previous implementations [17, 21, 27]. The word-topic matrices B and B̂ are randomly accessed, so we store them as dense matrices. Table 2 lists the memory consumption of each data item for the PubMed dataset (see Sec. 4 for details), where we can observe that representing A as a sparse matrix indeed saves a lot of memory when K ≥ 1,000, and that the GPU memory is large enough to hold B and B̂ for the desired number of topics.

3.1.2 Data Placement and Partitioning Strategy

However, the token list L and the document-topic matrix A cannot fit in GPU memory, because L grows linearly with the number of tokens and the size of A grows linearly with the number of documents; the sizes of both data items therefore grow rapidly with the size of the dataset. Since one of our major design goals is to support large datasets

Table 2.
Memory consumption for the PubMed dataset (V = 141K, T = 738M, D = 8.2M).

  K     B and B̂ (dense)   Token List L   A (dense)   A (sparse)
  100   0.108 GB          8.65 GB        3.2 GB      5.8 GB
  1k    1.08 GB           8.65 GB        32 GB       5.8 GB
  10k   10.8 GB           8.65 GB        320 GB      5.8 GB

[Figure 3. Streaming workflow of SaberLDA. L: token list, A: document-topic count matrix, B: word-topic count matrix, B̂: word-topic probability matrix. Chunks of L and A stream between CPU memory and GPU workers, which share read access to B̂ and update B with atomic adds.]

with billions of tokens, L and A cannot be held in GPU memory, because they can grow arbitrarily large with the size of the dataset. To address this problem, we treat L and A as streams: we store them in main memory, partition them into chunks, and process a few chunks on the GPU at a time. We partition L and A by document, i.e., each chunk has all the tokens from a certain set of documents along with the corresponding rows of A. Several workers are responsible for performing the sampling (Alg. 2) for the tokens in each chunk, as illustrated in Fig. 3. Each worker is a cudaStream that fetches a chunk from main memory, samples the topic assignments for all the tokens in that chunk, updates the document-topic count matrix A, and sends the updated A back to main memory. Computation and communication are overlapped by having multiple workers. The word-topic matrices B and B̂ reside in GPU memory; during the sampling stage, all the workers read B̂ and update B. When the sampling of each iteration is finished, B̂ is recomputed from B.

3.1.3 Order of Sampling Tokens

We now discuss the order of sampling tokens, i.e., executing Sample for each token in Alg. 1. Although in theory these tokens can be sampled in any order, the ordering greatly impacts locality, as we shall see soon, and therefore calls for careful treatment. As illustrated in Fig.
4(a), sampling k for a token (d, v, k) requires evaluating an element-wise product of two rows (Lines 4-7 of Alg. 2): the d-th row of the document-topic count matrix, A_d, and the v-th row of the word-topic probability matrix, B̂_v. The element-wise product involves accessing all non-zero entries of A_d (sequentially), and accessing the elements of B̂_v indexed by the non-zero entries of A_d (randomly).

[Figure 4. Memory access pattern of sampling. (a) Sampling a token multiplies a sparse row A_d of the doc-topic matrix with a dense row B̂_v of the word-topic matrix. (b) In doc-id major order, A_d is reused in shared memory, but B̂ is accessed at random rows and random elements within a row. (c) In word-id major order, B̂_v is reused in shared memory, while A is accessed at random rows but sequentially within a row.]

There are two particular visiting orders that reuse the results of previous memory accesses for better locality. The doc-major order sorts the tokens by their document ids, so that tokens belonging to the same document are adjacent in the token list. Before processing the tokens of the d-th document, the row A_d can be fetched into shared memory and reused for sampling all the subsequent tokens in that document. On the other hand, the accesses to B̂ cannot be reused, because each token requires accessing random elements (indexed by the non-zero entries of A_d) in a random row of the word-topic probability matrix B̂, as shown in Fig. 4(b). The bottleneck of this approach is accessing B̂, where both the row index and the column index are random.
On the contrary, the word-major order sorts the tokens by their word ids, so that tokens of the same word are adjacent in the token list. Before processing the tokens of the v-th word, the row B̂_v can be fetched into shared memory and reused for sampling all the subsequent tokens of that word. Each token then needs to access all the elements of a random row A_d (Fig. 4(c)). The bottleneck of this approach is accessing A, where only the row index is random.

The memory hierarchies of CPUs and GPUs are quite different. CPUs have a larger cache (> 30 MB) and shorter cache lines (64 B), while GPUs have a smaller cache (< 2 MB) and longer cache lines. On CPUs, when K and V are small, the document-major order can have better cache locality than the word-major order, because B̂ can fit in cache [26]. But on GPUs, the word-major order has the clear advantage that it efficiently utilizes the cache line by accessing whole rows of A_d instead of random elements.

Figure 5. Warp-based sampling kernel.

  int WarpSample(SparseVector A, __shared__ DenseVector B_hat,
                 Wary_tree T, RandomSeed &seed) {
    float S = 0;
    __shared__ float P[A.size()];
    for (int i = thread_id; i < A.size(); i += 32) {
      int k = A[i].idx;
      P[i] = A[i].val * B_hat[k];
      S += P[i];
    }
    S = warp_sum(S);
    if (RandomFloat(seed) < S / (S + T.sum())) {
      float x = RandomFloat(seed) * S;
      float ps = 0;
      for (int i = 0; i < A.size(); i += 32) {
        ps += warp_prefix_sum(P[i + thread_id]);
        int vote_id = warp_vote(ps >= x);
        if (vote_id != -1)
          return A[i + vote_id].idx;
        ps = warp_copy(ps, 31);
      }
    } else
      return T.Sample(seed);
  }
Therefore, we choose the word-major order for SaberLDA.

3.1.4 Partition-by-Document and Order-by-Word

Putting the above analysis together, SaberLDA adopts a hybrid data layout called partition-by-document and order-by-word (PDOW): first partition the token list by document id, and then sort the tokens within each chunk by word id. Unlike the simple layouts of previous systems, which sort all tokens by document id or by word id [17, 21, 23, 24, 26, 27], PDOW combines the advantages of both by-document partitioning and word-major ordering: it improves cache locality with the word-major order, and keeps GPU memory consumption small with the by-document partitioning.

The number of chunks presents a trade-off between memory consumption and locality. The memory consumption is inversely proportional to the number of chunks, but the number of times each row of B̂ is loaded into shared memory is proportional to the number of chunks. The chunk size should not be too small, to ensure good locality, i.e., each pre-loaded B̂_v should be reused a sufficiently large number of times. In SaberLDA, we minimize the number of chunks as long as the memory is sufficient.

3.2 Warp-based Sampling

We now turn to the Sample function (Alg. 2), which is the most time-consuming part of LDA. To understand the challenges and efficiency considerations, it is helpful to briefly review the GPU architecture. GPUs follow a single instruction multiple data (SIMD) pattern, where the basic SIMD unit is the warp, which has 32 data lanes. Each lane has its own ALU and registers, and all the lanes in a warp execute the same instruction. In CUDA, each thread is executed on a lane, and every 32 adjacent threads share the same instruction. Readers can make an analogy between GPU warp instructions and CPU vector instructions.
The most straightforward implementation of sampling on a GPU is thread-based sampling, which samples each token with one GPU thread, so that 32 tokens are processed in parallel by a warp. Thread-based sampling is acceptable when A is dense, because the loop length (Line 4 of Alg. 2) is always K and there are no branches (Line 8 of Alg. 2). However, this approach has several disadvantages when A becomes sparse. Firstly, because each token corresponds to a different row of A, the loop lengths of the threads differ (Line 4 of Alg. 2); all the threads must wait until the longest loop finishes, introducing long waiting times. Secondly, there are branches in the sampling procedure, for example the branch that chooses which distribution to sample from (Line 8 of Alg. 2). Branches cause the thread divergence problem: if some threads take one branch and the others take another, the warp must execute the instructions of both branches, which again increases the waiting time. Finally, the accesses to A are uncoalesced (Line 5 of Alg. 2), because the threads access different, discontiguous addresses in global memory.

To overcome the disadvantages of thread-based sampling, SaberLDA adopts warp-based sampling, where all the threads in a warp collaboratively sample a single token. However, warp-based sampling has its own challenge: all the operations need to be vectorized to maximize performance.

We now present our vectorized implementation. As mentioned in Sec. 2.4, the sampling involves computing the element-wise product (Lines 4-7 of Alg. 2), randomly deciding which branch to sample (Line 8 of Alg. 2), and, depending on which branch is chosen, sampling from P (Line 9 of Alg. 2) or sampling with the pre-processed tree (Line 11 of Alg. 2).
Before starting the presentation, we remind the reader that A is in global memory, while all the other data items, such as B_v and T_v, are pre-loaded into shared memory, taking advantage of PDOW as discussed in Sec. 3.1.3.

3.2.1 Element-wise Product

The element-wise product step is vectorizable by simply letting each thread process one index, so that a warp computes the element-wise product for 32 indices at a time. All the threads are efficiently utilized, except on the last 32 indices when the number of non-zero entries of A_d is not a multiple of 32. The waste is very small since the number of non-zero entries is typically much larger than 32, e.g., 100.

3.2.2 Choosing the Branch

This step consists only of a random number generation and a comparison, whose costs are negligible. Note that thread divergence does not occur as it does for thread-based sampling, since the whole warp takes one branch or the other.

3.2.3 Sample from P

This step consists of the three steps mentioned in Sec. 2.3, of which the element-wise product is already computed. We need to generate a random number and find its position in the prefix-sum array of P.

Firstly, we need to vectorize the computation of the prefix sum, i.e., compute the prefix sum of 32 numbers using 32 threads. Because of the data dependency, the prefix sum cannot be computed with a single instruction. Instead, shuffle-down operations are required, and the prefix sum can be computed in O(log2 32) instructions [13]. We refer to this routine as warp prefix sum.

Given the prefix sum, we need to find the index of the first element that is greater than or equal to the generated value. This can be achieved in two steps: 1. the warp-vote intrinsic ballot forms a 32-bit integer whose i-th bit is one if the i-th prefix sum is greater than or equal to the given value [16]; 2.
The ffs intrinsic returns the index of the first set bit of a 32-bit integer. We refer to the combination of these two steps as warp vote; it returns the index of the first prefix sum that is greater than or equal to the given value, or -1 if there is no such index.

Deferring the discussion of the pre-processed sampling for a moment, Fig. 5 shows example code of our vectorized warp-based sampling, where we omit some details such as handling a vector length that is not a multiple of the warp size; the warp_copy(a, id) function returns the value of a on thread id. There is no waiting or thread divergence issue, as discussed above.

Besides the eliminated waiting time and thread divergence, the memory access behavior of our implementation is good as well. For the element-wise product, the accesses to A are contiguous: more specifically, the warp accesses two 128-byte cache lines from global memory, and each thread consumes two 32-bit numbers (an integer index and a float32 value). The accesses to B are random, but they are still efficient since the current row B_v is in shared memory. All the remaining operations only access data items in shared memory, e.g., P.

3.2.4 W-ary Sampling Tree

We now present the deferred details of the pre-processed sampling (Line 11 of Alg. 2). The pre-processed sampling is new in sparsity-aware algorithms, since the vanilla algorithm does not break the original sampling problem into sub-problems; therefore, previous GPU-based systems do not have this problem. As discussed in Sec. 2.3, the pre-processed sampling problem is essentially the same problem as sampling from P, but there are only V different sampling problems, so we can pre-process each problem to accelerate sampling. In previous literature, there are two main data structures for this problem, which we briefly review:

• An alias table [18] can be built in O(K) time, and each sample can be obtained in O(1) time afterwards. However, building the alias table is sequential.
• A Fenwick tree [23] can be built in O(K) time, and each sample can be obtained in O(log2 K) time afterwards. However, the branching factor of a Fenwick tree is only two, so the 32-thread GPU warp cannot be fully utilized.

Both approaches are designed for CPUs and are slow to construct on a GPU, because their construction cannot be vectorized. Vectorization is critical because using only one thread for pre-processing can be much slower than using a full warp. To allow vectorized pre-processing, we propose a W-ary tree, which can be constructed in O(K) time with full utilization of a GPU warp. Subsequent samples can be obtained in O(log_W K) time, where W is the number of threads in a GPU warp, i.e., 32.

    struct Wary_tree {
      float L1, L2;
      __shared__ Array L3, L4;
      Wary_tree(Array p) {
        L4.Alloc(p.size());
        L4 = array_prefix_sum(p);
        L3.Alloc(L4.size() / 32);
        for (int i = thread_id; i < L3.size(); i += 32)
          L3[i] = L4[i * 32 - 1];
        L2 = L3[thread_id * 32 - 1];
        L1 = warp_copy(L2, 31);
      }
      float Sum() { return L1; }
      int Sample(RandomSeed &seed) {
        float x = RandomFloat(seed) * L1;
        int off3 = warp_vote(L2 >= x) * 32;
        int off4 = (off3 + warp_vote(L3[off3 + thread_id] >= x)) * 32;
        return off4 + warp_vote(L4[off4 + thread_id] >= x);
      }
    };

Figure 6. W-ary sampling tree.
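As an illustration of the W-ary tree just introduced, here is a CPU sketch over a plain float array (our own simplified code, not the shared-memory kernel of Fig. 6, and without the paper's register-resident top levels): the bottom level stores the inclusive prefix sums, every upper-level node stores the last prefix sum of its group of W children, and sampling inspects at most W consecutive nodes per level, the sequential analogue of one warp vote.

```cpp
#include <algorithm>
#include <vector>

constexpr int W = 32;  // branching factor = warp width

struct WaryTree {
    // levels[0] holds the inclusive prefix sums of the weights;
    // node i of each upper level holds the last prefix sum of its
    // group of W children, so each level can be built in parallel.
    std::vector<std::vector<float>> levels;

    explicit WaryTree(const std::vector<float>& weights) {
        std::vector<float> leaves(weights.size());
        float s = 0.0f;
        for (size_t i = 0; i < weights.size(); ++i) leaves[i] = (s += weights[i]);
        levels.push_back(leaves);
        while (levels.back().size() > 1) {
            const std::vector<float>& lo = levels.back();
            std::vector<float> up((lo.size() + W - 1) / W);
            for (size_t i = 0; i < up.size(); ++i)
                up[i] = lo[std::min(i * W + W - 1, lo.size() - 1)];
            levels.push_back(up);
        }
    }

    // Return the index of the first leaf whose prefix sum is >= x.
    // At each level only W consecutive nodes are inspected (one
    // warp_vote per level on the GPU), giving O(log_W K) per sample.
    int sample(float x) const {
        size_t off = 0;
        for (int l = (int)levels.size() - 1; l >= 0; --l) {
            const std::vector<float>& lv = levels[l];
            size_t end = std::min(off + W, lv.size());
            size_t i = off;
            while (i + 1 < end && lv[i] < x) ++i;  // first node with value >= x
            off = (l > 0) ? i * W : i;             // descend into its children
        }
        return (int)off;
    }
};
```

Here the per-level search is a sequential scan; on the GPU it is replaced by a single ballot across the W lanes, so each level costs one instruction plus one 128-byte read.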
We emphasize that our main focus is the efficient construction of the tree rather than efficient sampling with it, because the cost of sampling with the tree is negligible compared with that of sampling from P. Moreover, sampling with our W-ary tree is efficient anyway: K is on the order of 10,000 and log_W K = 4, so the O(log_W K) cost is at the same level as that of the O(1) alias table algorithm.

Figure 7. Building a 3-ary tree and sampling from the tree (the example draws p = random(0, 1) × 9 = 7.5 and locates it with one warp_vote per level).

Our W-ary tree is designed for efficiently finding the location of a given number in a prefix-sum array. Each node of the tree stores one number; the bottom-most level stores the prefix sums of the given array. The length of an upper level equals the length of the level below divided by W, and the i-th node of an upper level equals the (iW − 1)-th node of the level below. Constructing the tree is illustrated in Fig. 7. The procedure is efficient because all the nodes of one level can be constructed in parallel; the GPU warp is thus efficiently utilized with the warp prefix sum function mentioned before, which uses W threads to compute the prefix sum of W numbers in parallel.

To find the position of a given value in the prefix-sum array, we recursively find the position at each level, from top to bottom (Fig. 7). By the construction of the tree, if the position at the l-th level is i, then the position at the (l + 1)-th level is between iW and iW + W − 1, so only W nodes need to be checked at each level. This check can be done efficiently with the warp vote function mentioned before, and the memory access is efficient because the tree is stored in shared memory (Sec.
3.1.3), and only W contiguous floating-point numbers, i.e., one 128-byte cache line, need to be read per level.

The amount of memory access can be reduced further. We use a four-level tree for SaberLDA, which supports up to W^3 = 32,768 topics. The first and second levels contain only 1 and 32 nodes, respectively, so they can be stored in thread registers. In this way, only two shared-memory cache lines (for levels 3 and 4) are accessed per query. Fig. 6 shows the code of the W-ary tree.

3.3 Efficient Count Matrix Update

Finally, we discuss how to efficiently update the count matrices A and B. When the sampling of all tokens of a word is finished, the corresponding row B_v of the word-topic count matrix is ready to be updated. The atomicAdd function must be used because multiple workers may update the same row, but the overhead is very low since the time complexity of updating is lower than that of sampling. The word-topic probability matrix B̂ can then be easily generated from B according to Eq. (2) after all the updates of the latter are finished. Maximal parallel performance can be achieved since both matrices are dense.

Figure 8. An example of segmented count: radix-sort a = [1, 8, 5, 1, 3, 5, 5, 3]; mark adjacent differences d[0] = 0, d[i] = (a[i] != a[i-1]); prefix-sum p[i] = p[i-1] + d[i]; then scatter k[p[i]] = a[i], c[p[i]]++ to obtain keys [1, 3, 5, 8] with counts [2, 2, 3, 1].

However, updating the document-topic matrix A is challenging since A is sparse. To update an entry of a sparse matrix, one must find the location of that entry, which is difficult to vectorize. Therefore, instead of updating, we rebuild the count matrix A after the sampling of each partition is finished.
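The dense update of row B_v described above can be sketched on the CPU with std::atomic standing in for CUDA's atomicAdd (the names and types here are ours, not SaberLDA's):

```cpp
#include <atomic>
#include <vector>

// Accumulate the sampled topic assignments of one word's tokens into
// row B_v of the word-topic count matrix. The add is atomic because
// several workers may update the same row concurrently (atomicAdd on
// the GPU); since the row is dense, no entry lookup is needed, which
// is exactly what makes the sparse matrix A harder to update.
void update_row_Bv(std::vector<std::atomic<int>>& Bv,
                   const std::vector<int>& sampled_topics) {
    for (int k : sampled_topics)        // one increment per sampled token
        Bv[k].fetch_add(1, std::memory_order_relaxed);
}
```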
A naïve approach to rebuilding the count matrix is to sort all the tokens first by document id d and then by topic assignment k, and perform a linear scan afterwards. However, the sorting is expensive since it accesses global memory frequently; moreover, the efficiency of sorting decreases as the chunk size increases.

We propose a procedure called shuffle and segmented count (SSC) to address this problem. To rebuild the count matrix, we first perform a shuffle to organize the token list by document id, i.e., segment the token list into smaller lists in which all tokens share the same document id. Since the document ids of the tokens are fixed, the shuffle can be accelerated with a pre-processed pointer array that points each token to its position in the shuffled list; to perform the shuffle, we only need to place each token according to the pointers. Furthermore, we can reduce the accesses to global memory by creating the counts for each smaller token list individually, because these token lists are small enough to fit in shared memory.

Creating the counts is a common problem known as segmented count: segment the tokens by d and count the occurrences of each k within each segment. Unlike similar problems such as segmented sort [15] and segmented reduce [14], which have fully optimized GPU algorithms, an efficient GPU solution for segmented count has not been well studied yet. We propose a solution that is sufficiently efficient for SaberLDA. Our procedure consists of three steps, as illustrated in Fig. 8:

1. Perform a radix sort by k in shared memory;
2. Compute the prefix sum of the adjacent differences, to obtain the number of distinct topics and the output position of each topic;
3. Assign each topic to its output position, and increase the counter of that topic.

3.4 Implementation Details and Load Balancing

In SaberLDA, a word and a token are processed by a block and a warp, respectively. Each block has its own shared memory to hold the rows B̂_v and B_v, where the former is fetched from global memory before sampling any tokens of the current word v, and the latter is written back to global memory after the sampling. To minimize the number of memory transactions, we also align the memory address of each row of A to 128 bytes.

The workload of each worker is determined by the size of its chunk, the workload of each block is determined by the number of tokens of its word, and the workload of each warp is determined by the number of non-zero entries of its document. To minimize workload imbalance, we apply dynamic scheduling at all three levels: a block fetches a new word to process when it is idle, and similarly, a warp fetches a new token to process when it is idle. Because the term frequency of a natural corpus often follows the power law [11], a few high-frequency words have many more tokens than the low-frequency words, so the block-level workload is more imbalanced than the other two levels. To address this problem, we further sort the words by decreasing number of tokens, so that the words with the most tokens are executed first and the words with few tokens are executed afterwards to fill the imbalance gaps.

4. Evaluation

In this section, we provide extensive experiments to analyze the performance of SaberLDA and compare it with other cutting-edge open-source implementations on the widely adopted datasets listed in Table 3.

Table 3. Statistics of the datasets, where D is the number of documents, T the total number of tokens in the corpus, and V the vocabulary size.

    Dataset             D       T      V      T/D
    NYTimes [2]         300K    100M   102K   332
    PubMed [2]          8.2M    738M   141K   90
    ClueWeb12 subset    19.4M   7.1B   100K   365
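Before turning to the experiments, the three-step segmented count of Sec. 3.3 (Fig. 8) can be sketched on the CPU as follows (our simplified, sequential code operating on one segment's topic assignments, not the shared-memory kernel):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Segmented count for one document segment, following Fig. 8:
// (1) sort the topic assignments (a radix sort in shared memory on the GPU),
// (2) mark adjacent differences and prefix-sum them, giving each distinct
//     topic its output slot, (3) scatter keys and accumulate counts.
std::vector<std::pair<int, int>> segmented_count(std::vector<int> a) {
    if (a.empty()) return {};
    std::sort(a.begin(), a.end());                      // step (1)
    size_t n = a.size();
    std::vector<int> p(n, 0);                           // step (2): d[i] folded in
    for (size_t i = 1; i < n; ++i)
        p[i] = p[i - 1] + (a[i] != a[i - 1] ? 1 : 0);
    std::vector<std::pair<int, int>> out(p[n - 1] + 1, {0, 0});
    for (size_t i = 0; i < n; ++i) {                    // step (3)
        out[p[i]].first = a[i];                         // k[p[i]] = a[i]
        out[p[i]].second++;                             // c[p[i]]++
    }
    return out;
}
```

With the input of Fig. 8, [1, 8, 5, 1, 3, 5, 5, 3], this yields the keys [1, 3, 5, 8] with counts [2, 2, 3, 1].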
The code of SaberLDA is about 3,000 lines of CUDA and C++, compiled with NVIDIA nvcc 8.0 and Intel ICC 2016. Our test machine has two Intel E5-2670v3 CPUs with 12 cores each, 128 GB of main memory, and an NVIDIA GTX 1080 GPU. The hyper-parameters α = 50/K and β = 0.01 are set following previous works [7, 12, 22, 23].

The training time of LDA depends on both the number of iterations to converge and the time per iteration. The former depends on the algorithm; e.g., the variational Bayes algorithm [3] typically requires fewer iterations to converge than ESCA [26]. The latter depends on the time complexity of sampling each token as well as on the implementation; e.g., the sparsity-aware algorithm has O(K_d) time complexity and is faster than the O(K) vanilla algorithms. Therefore, we use several metrics to compare LDA implementations:

• We use time per iteration or throughput to compare different implementations of the same algorithm, e.g., to compare SaberLDA, a GPU implementation of the ESCA algorithm, with a CPU implementation of the same algorithm, because they require the same number of iterations to converge. Throughput is defined as the number of processed tokens divided by the running time, in million tokens per second (Mtoken/s).

• Since different algorithms require different numbers of iterations to converge, the time-per-iteration metric is not informative across algorithms. We therefore compare different algorithms by the time required to converge to a certain model quality. The model quality is assessed by the held-out log-likelihood per token, using the partially-observed document approach [19].

Figure 9. Impact of optimizations on NYTimes (K = 1000, 100 iterations): total time decomposed into sampling, A update, pre-processing, and transfer. G0: baseline; G1: PDOW; G2: W-ary tree; G3: SSC; G4: asynchronous.
Higher log-likelihood indicates better model quality.

4.1 Impact of Optimizations

We first investigate the impact of each optimization technique proposed in Sec. 3 by training LDA on the NYTimes dataset with 1,000 topics for 100 iterations. The results are shown in Fig. 9, where the total elapsed time is decomposed into the Sample function, rebuilding the document-topic matrix A, constructing the pre-processed data structures for sampling (pre-processing), and data transfer between CPU and GPU.

G0 is the most straightforward sparsity-aware implementation on a GPU: it sorts all tokens by document, performs the pre-processed sampling with an alias table, and builds the count matrices by naïvely sorting all tokens. G1 adopts the PDOW strategy proposed in Sec. 3.1.4, which reduces the sampling time by almost 40% thanks to the improved locality. Note that rebuilding the document-topic matrix A takes more time in G1 than in G0, because the tokens are now ordered by word and the sorting becomes slower. The bottleneck of G1 is the construction of the alias table, which is hard to vectorize. In G2, we replace the alias table with the W-ary tree, which fully utilizes warps and reduces the construction time by 98%. Next, we optimize the rebuilding of the document-topic matrix A with SSC (Sec. 3.3), which reduces the rebuilding time by 89%; at this point, the time for updating A and for pre-processing is negligible. Finally, in G4, we enable multiple workers running asynchronously to hide the data transfer between CPU and GPU, which reduces the total running time by 12.3% in this case. Overall, these optimizations achieve a 2.9x speedup over the baseline G0.

Table 4. Memory bandwidth utilization, NYTimes, K = 1000.

                        Throughput (GB/s)   Utilization
    Global memory       144                 50%
    L2 cache            203                 30%
    L1 unified cache    894                 20%
    Shared memory       458                 20%
We emphasize that even G0 is already highly optimized, and it can handle more topics than previous GPU implementations because it still adopts the sparsity-aware algorithm with O(K_d) time complexity.

4.2 Performance Tuning

Tuning parameters, such as the number of workers, the number of chunks, and the number of threads per CUDA block, can largely affect performance. To fully understand their impact, we analyze the performance of SaberLDA under different parameter settings. We evaluate the total running time of the first 100 iterations on the NYTimes dataset with the number of topics varying from 1,000 to 3,000 to 5,000.

4.2.1 Number of Chunks

Firstly, we analyze the single-worker performance with various numbers of partitions, as shown in Fig. 10(a). The performance degrades with more partitions because of the degraded locality, but partitioning is necessary when the dataset is too large to be held in GPU memory. We keep the number of partitions as small as possible to maximize performance.

4.2.2 Number of Workers

Fig. 10(b) presents the performance with different numbers of workers, with the number of chunks fixed to 10. Using multiple workers hides the data transfer time between GPU and CPU and reduces the overall time consumption. We observe a 10% to 15% speedup from a single worker to 4 workers, which is close to the proportion of time spent on data transfer shown in Fig. 9.

Figure 10. Dataset: NYTimes. (a) Performance with different numbers of partitions. (b) Performance with different numbers of workers. (c) Performance with different numbers of threads.
4.2.3 Number of Threads

Tuning the number of threads per block maximizes system performance. For the kernel function, more threads imply fewer active blocks, which reduces the total shared memory usage of a multiprocessor but increases the in-block synchronization overhead of warps. Fig. 10(c) shows that 256 threads per block achieves the best performance for all numbers of topics.

4.3 Profiling Analysis

Next, we analyze the utilization of hardware resources with the NVIDIA visual profiler. We focus on memory bandwidth because LDA is a memory-intensive task [7]. Table 4 shows the memory bandwidth utilization of the first 10 iterations on NYTimes with K = 1,000. The throughput of global memory accesses exceeds 140 GB/s, which is about 50% of the available bandwidth. This shows a clear advantage over CPUs, given that the bandwidth between main memory and a CPU is only 40 to 80 GB/s. The throughputs of the L2 cache, unified L1 cache, and shared memory are 203 GB/s, 894 GB/s, and 458 GB/s respectively, at utilizations of 30% or lower, so these are not the bottleneck of overall system performance.

We further use performance counters to examine the kernel function. Memory dependency is the main cause of instruction stalls (47%), followed by execution dependency (27%). The hotspot is computing the element-wise product between the sparse vector A_d and the dense vector B̂_w, which is expected because it accesses global memory.

4.4 Comparing with Other Implementations

We compare SaberLDA with a state-of-the-art GPU-based implementation, BIDMach [27], as well as three state-of-the-art CPU-based methods: ESCA (CPU), DMLC [25], and WarpLDA [7]. BIDMach [27] is the only open-source GPU-based method; it reports better performance than Yan et al.'s method [21, 27], and Steele and Tristan's method reports only tens of topics in their paper.
Therefore, BIDMach is a strong competitor. ESCA (CPU) is a carefully optimized CPU version of the ESCA algorithm, which SaberLDA also adopts. DMLC provides various multi-threaded LDA algorithms on CPU, and we choose its best-performing FTreeLDA. WarpLDA is a state-of-the-art distributed LDA implementation on CPU based on the Metropolis-Hastings algorithm; it uses a cache-efficient O(1) sampling algorithm to obtain high per-iteration throughput.

We compare the time to converge of these implementations on the NYTimes and PubMed datasets with K = 1,000 topics. Fig. 11 shows the convergence over time. We compare the time to converge to a per-token log-likelihood of −8.0 for NYTimes and −7.3 for PubMed. SaberLDA is about 5.6 times faster than BIDMach. We also attempted the comparison with 3,000 and 5,000 topics: BIDMach is more than 10 times slower than SaberLDA with 3,000 topics and reports an out-of-memory error with 5,000 topics. This is as expected, because the time consumption of BIDMach grows linearly with the number of topics, and its dense matrix format consumes much more memory than SaberLDA's.

Finally, SaberLDA is about 4 times faster than ESCA (CPU) and 5.4 times faster than DMLC on the two datasets with K = 1,000. WarpLDA converges to a worse local optimum, possibly because of its inexact Metropolis-Hastings step and because its paper [7] uses a different metric than the one we use to assess model quality. This shows that SaberLDA is more efficient than the CPU-based implementations.

4.5 A Large Dataset

Finally, to demonstrate the ability of SaberLDA to process large datasets, we test its performance on a large subset of the ClueWeb dataset, which is a crawl of webpages.¹ We filter out the stop words and keep the 100,000 most frequent remaining words as our vocabulary, which is comparable to the vocabulary size of NYTimes.
We use the entire CPU memory to hold as many documents as possible, which is 19.4 million documents with 7.1 billion tokens in total, about 10 times more than the PubMed dataset. In this experiment, we also use a GTX Titan X (Maxwell), which has 12 GB of global memory, in addition to the GTX 1080, which has 8 GB.

¹ http://www.lemurproject.org/clueweb12.php/

Figure 11. Convergence over time with 1,000 topics, evaluated on the NYTimes dataset (left) and the PubMed dataset (right).

Figure 12. Convergence of SaberLDA on the ClueWeb subset (GTX 1080, K = 5K; Titan X, K = 5K and K = 10K).

We compare the performance of the GTX 1080 and the Titan X with 5,000 topics. The algorithm converges in 5 hours on both cards, where the throughput of the GTX 1080 is 135 Mtoken/s and the throughput of the Titan X is 116 Mtoken/s. With 10,000 topics, it also converges in 5 hours, with a throughput of 92 Mtoken/s.

5. Conclusions

We present SaberLDA, a high-performance sparsity-aware LDA system for GPUs. By adopting sparsity-aware algorithms, SaberLDA overcomes the limitation of previous GPU-based systems, which support only a small number of topics. We propose a novel data layout, a warp-based sampling kernel, and an efficient sparse count matrix updating algorithm to address the challenges induced by sparsity, and demonstrate the power of SaberLDA with extensive experiments. SaberLDA efficiently handles large-scale datasets with up to 7 billion tokens and learns large LDA models with up to 10,000 topics, which are out of reach for the existing GPU-based LDA systems.
In the future, we plan to extend SaberLDA to multiple GPUs and machines. Developing algorithms that converge faster and enjoy better locality is also future work.

References

[1] Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander J. Smola. Scalable inference in latent variable models. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 123–132. ACM, 2012.
[2] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[4] Jordan L. Boyd-Graber, David M. Blei, and Xiaojin Zhu. A topic model for word sense disambiguation. In EMNLP-CoNLL, 2007.
[5] Liangliang Cao and Li Fei-Fei. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In ICCV, 2007.
[6] J. Chang and D. Blei. Relational topic models for document networks. In AISTATS, 2009.
[7] Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. WarpLDA: a cache efficient O(1) algorithm for latent Dirichlet allocation. In VLDB, 2016.
[8] Wen-Yen Chen, Jon-Chyuan Chu, Junyi Luan, Hongjie Bai, Yi Wang, and Edward Y. Chang. Collaborative filtering for Orkut communities: discovery of user latent behavior. In WWW, 2009.
[9] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[10] Tomoharu Iwata, Takeshi Yamada, and Naonori Ueda. Probabilistic latent semantic visualization: topic model for visualizing documents. In KDD, 2008.
[11] George Kingsley Zipf. Selective studies and the principle of relative frequency in language, 1932.
[12] Aaron Q. Li, Amr Ahmed, Sujith Ravi, and Alexander J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
[13] Mark Harris, Shubhabrata Sengupta, and John D. Owens. Parallel prefix sum (scan) with CUDA.
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html, 2007.
[14] NVIDIA. Segmented reduction. https://nvlabs.github.io/moderngpu/segreduce.html, 2013.
[15] NVIDIA. Segmented sort and locality sort. https://nvlabs.github.io/moderngpu/segsort.html, 2013.
[16] NVIDIA. CUDA C programming guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions, 2015.
[17] Jean-Baptiste Tristan, Joseph Tassarotti, and Guy Steele. Efficient training of LDA on a GPU by mean-for-mode estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 59–68, 2015.
[18] Alastair J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253–256, 1977.
[19] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1105–1112. ACM, 2009.
[20] Yi Wang, Xuemin Zhao, Zhenlong Sun, Hao Yan, Lifeng Wang, Zhihui Jin, Liubin Wang, Yang Gao, Jia Zeng, Qiang Yang, et al. Towards topic modeling for big data. ACM Transactions on Intelligent Systems and Technology, 2014.
[21] Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In Advances in Neural Information Processing Systems, pages 2134–2142, 2009.
[22] Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM, 2009.
[23] Hsiang-Fu Yu, Cho-Jui Hsieh, Hyokun Yun, S. V. N. Vishwanathan, and Inderjit S. Dhillon. A scalable asynchronous distributed algorithm for topic modeling.
In Proceedings of the 24th International Conference on World Wide Web, pages 1340–1350. International World Wide Web Conferences Steering Committee, 2015.
[24] Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu, and Wei-Ying Ma. LightLDA: big topic models on modest compute clusters. In WWW, 2015.
[25] Manzil Zaheer. DMLC experimental-lda. https://github.com/dmlc/experimental-lda, 2016.
[26] Manzil Zaheer, Michael Wick, Jean-Baptiste Tristan, Alex Smola, and Guy L. Steele Jr. Exponential stochastic cellular automata for massively parallel inference. In AISTATS, 2015.
[27] Huasha Zhao, Biye Jiang, John F. Canny, and Bobby Jaros. Same but different: fast and high quality Gibbs parameter estimation. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1502. ACM, 2015.
[28] Jun Zhu, Amr Ahmed, and Eric Xing. MedLDA: maximum margin supervised topic models for regression and classification. In International Conference on Machine Learning, 2009.
