A New Unbiased and Efficient Class of LSH-Based Samplers and Estimators for Partition Function Computation in Log-Linear Models


Authors: Ryan Spring, Anshumali Shrivastava

Abstract

Log-linear models are arguably the most successful class of graphical models for large-scale applications because of their simplicity and tractability. Learning and inference with these models require calculating the partition function, which is a major bottleneck and intractable for large state spaces. Importance Sampling (IS) and MCMC-based approaches are lucrative. However, the condition of having a "good" proposal distribution is often not satisfied in practice.

In this paper, we add a new dimension to efficient estimation via sampling. We propose a new sampling scheme and an unbiased estimator that estimates the partition function accurately in sub-linear time. Our samples are generated in near-constant time using locality sensitive hashing (LSH), and so are correlated and unnormalized. We demonstrate the effectiveness of our proposed approach by comparing the accuracy and speed of estimating the partition function against other state-of-the-art estimation techniques, including IS and the efficient variant of Gumbel-Max sampling. With our efficient sampling scheme, we accurately train real-world language models using only 1-2% of computations.

1 Department of Computer Science, Rice University, Houston, TX, USA. Correspondence to: Anshumali Shrivastava.

1. Introduction

Probabilistic graphical models are some of the most flexible modeling frameworks in machine learning, physics, and statistics. A common and convenient way of modeling probabilities is to model only the proportionality function. Such a specification is sufficient because proportionality can be uniquely converted into an actual probability value by dividing it by the normalization constant. The normalization constant is more popularly known as the partition function.
Log-linear models (Koller & Friedman, 2009; Lauritzen, 1996) are arguably the most successful class of graphical models for large-scale applications. These models include multinomial logistic (Softmax) regression and conditional random fields. Even the famous skip-gram models (Goodman, 2001; Mikolov et al., 2013) are examples of log-linear models. The definition of a log-linear model states that the logarithm of the model is a linear combination of a set of features x ∈ R^D. Assume there is a set of states Y. Each state y ∈ Y is represented with a weight vector θ_y. A frequent task is estimating the probability of a state y ∈ Y. The probability distribution for a log-linear model is

    P(y | x, θ) = e^{θ_y · x} / Z_θ

where θ_y is the weight vector, x is the (current context) feature vector, and Z_θ is the partition function. The partition function is the normalization constant that ensures that P(y | x, θ) is a valid probability distribution:

    Z_θ = Σ_{y ∈ Y} e^{θ_y · x}

Computing the partition function requires summing over all of the states Y. In practice, it is an expensive, intractable operation when the size of the state space is enormous. The value of the partition function is required during training and inference. Assume there is a training set containing N labeled examples [(x_1, y_1), ..., (x_N, y_N)]. The model is trained by minimizing the negative log-likelihood using stochastic gradient descent (SGD):

    L(θ) = −(1/N) Σ_{i=1}^{N} [θ_{y_i} · x_i − log(Z_θ)]

    ∇_{θ_k} L(θ) = −(1/N) Σ_{i=1}^{N} x_i (1[y_i = k] − P(y_i = k | x_i; θ))

Here, computing P(y_i = k | x_i; θ) requires the value of the partition function Z_θ. This process is near-infeasible when the size of |Y| is huge. It is common to have scenarios in NLP (Chelba et al., 2013) and vision (Deng et al., 2009) with the size of the state space running into millions.
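To make the setup concrete, here is a minimal NumPy sketch of a softmax (log-linear) model and its partition function. All sizes and variable names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def partition_function(theta, x):
    """Z_theta = sum over all states y of exp(theta_y . x)."""
    return np.exp(theta @ x).sum()

def log_linear_prob(theta, x, y):
    """P(y | x, theta) = exp(theta_y . x) / Z_theta."""
    return np.exp(theta[y] @ x) / partition_function(theta, x)

rng = np.random.default_rng(0)
theta = rng.normal(size=(1000, 16))   # one weight vector per state
x = rng.normal(size=16)               # context feature vector
probs = np.array([log_linear_prob(theta, x, y) for y in range(1000)])
```

Because every probability is normalized by Z_θ, the probabilities over all states sum to one, which is exactly why the sum over an enormous state space becomes the bottleneck.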
Due to the popularity of log-linear models, reducing the associated computational challenges has been one of the well-studied and emerging topics in the large-scale machine learning literature. For convenience, we classify the existing lines of work concerning efficient log-linear models into three broad categories: 1) Classical Sampling or Monte Carlo Based, 2) Estimation via Gumbel-Max Trick (Mussmann & Ermon, 2016), and 3) Heuristic-Based.

1. Classical Sampling or Monte Carlo: Since the partition function is a summation, it can be very well approximated in a provably unbiased fashion using Monte Carlo or Importance Sampling (IS). IS and its variants, such as annealed importance sampling (AIS) (Neal, 2001), are probably the most widely used Monte Carlo methods for estimating the partition function in general graphical models. IS works by drawing samples y from a tractable proposal (or reference) distribution y ∼ g(y), and estimates the target partition function Z_θ by averaging the importance weights f(y)/g(y) across the samples [see Section 2.1], where f(y) = e^{θ_y · x} is the unnormalized target density.

It is widely known that the IS estimate often has very high variance if the choice of proposal distribution is very different from the target, especially when they are peaked differently. In fact, there is no known effective class of proposal distributions in the literature for log-linear models. This line of work is considered a dead end because sampling from a good proposal is almost as hard as sampling from the target. Our solution changes this belief and shows a provable, efficient proposal distribution (unnormalized) and a corresponding unbiased estimator for the partition function whose computational complexity is amortized sub-linear time.
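As a point of reference, the classical IS estimator with the simplest possible proposal, uniform over the N states, can be sketched as follows. The scores and sample counts are illustrative assumptions:

```python
import numpy as np

# Importance Sampling of Z_theta with a uniform proposal g(y) = 1/N.
rng = np.random.default_rng(1)
N = 10_000
theta_dot_x = rng.normal(size=N)      # illustrative scores theta_y . x
f = np.exp(theta_dot_x)               # unnormalized target density
Z_true = f.sum()

def is_estimate(num_samples):
    idx = rng.integers(0, N, size=num_samples)
    # importance weights f(y)/g(y) with g(y) = 1/N
    return np.mean(f[idx] * N)

est = is_estimate(100_000)
```

With a uniform proposal, the estimator is unbiased but its variance grows with the mismatch between f and g, which is the weakness discussed above.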
2. Gumbel-Max Trick: Previous work (Gumbel & Lieblein, 1954) has shown an elegant connection between the partition function of log-linear models and the maximum of a sequence of numbers perturbed by the Gumbel distribution (see Section 3.1). The bottom line is that the value of the log partition function is estimated by finding the maximum value over the state space Y perturbed by Gumbel noise. This observation does not directly lead to any computational gains in partition function estimation, because computing the maximum still requires enumerating the entire state space.

Very recently, (Mussmann & Ermon, 2016) showed that computing the same maximum can be reformulated as a maximum inner product search (MIPS) problem, which can be approximately solved efficiently using recent algorithmic advances (Shrivastava & Li, 2014; 2015a;b). The overall method requires a single costly pre-processing phase. The initial cost of the pre-processing phase is amortized over several fast inference queries. Since the cost of approximate MIPS is much smaller, estimating the partition function is more efficient than with the brute-force Gumbel-Max Trick.

Unfortunately, as we show in this paper (Section 3.3), even small perturbations in the identity of the maximum lead to significant performance deviation and poor accuracy. An important thing to note is that since this method needs a MIPS (approximate nearest-neighbor) query for generating a single sample, it is quite inefficient. Although MIPS and other near-neighbor queries are sub-linear time operations, they still have a significant cost. In theory, the time complexity is N^ρ where ρ < 1. Moreover, the accuracy is very sensitive to the approximation of the maximum value [see Section 3.3 for details]. We empirically demonstrate that estimating the partition function using a MIPS query per sample is not only inaccurate but also prohibitively slow.
3. Heuristic-Based: There are other approaches that avoid estimating the partition function completely. Instead, they approximate the original log-linear model with an altogether different model, which is cheaper to train. The most popular is the Hierarchical Softmax (Morin & Bengio, 2005). Changing the model's assumptions based on some heuristic hierarchy may not be desirable in many applications, and its effect on the accuracy is not very well understood. It is further known that such models are sensitive to the structure of the Hierarchical Softmax approach (Mikolov et al., 2013).

Focus of this paper: The focus of this article is on efficient partition function estimation in log-linear models without any additional assumptions. We will focus on simple, efficient, and unbiased estimators with superior properties. We want to retain the original modeling assumptions, and therefore we will not focus on techniques like hierarchical softmax, which make an explicit hierarchical assumption. All these assumptions are unreliable. In light of our goals, we will ignore (3) Heuristic-Based techniques and only focus on (1) Classical Sampling or Monte Carlo Based and (2) Estimation via Gumbel-Max Trick as our baselines.

Our Contributions: Our proposal also exploits the MIPS (Maximum Inner Product Search) data structure. The key difference is that the existing approaches (Mussmann & Ermon, 2016) require a MIPS query (relatively costly) to generate each informative sample. On the other hand, our approach can generate a large set of samples from an elegant, informative proposal distribution using a single MIPS query! We explain this difference in Section 6. Our work is in fact completely different from all existing works. Instead of relying on the Gumbel-Max Trick, we return to the basics of sampling and estimation. We reveal a very unusual, but super-efficient class of samplers, which produces a set of correlated samples that are not normalized.
We further show an unbiased estimator of the partition function using these unusual samples. Our proposal opens a new dimension for sampling and unbiased estimation beyond classical IS, which is worthy of study in its own right. The estimators are generic for any partition function estimation.

For log-linear models, we show that our sampling scheme P_MIPS(y) has many properties similar to the target distribution, such as the same modes and the same peaks, making it very informative for estimation. Most interestingly, it is possible to generate T samples from this distribution P_MIPS(y) in sub-linear time. To the best of our knowledge, this is also the first work that constructs a provably efficient and informative sampling distribution for log-linear models.

We show that our LSH sampler provides the perfect balance between speed and accuracy when compared to the other approaches - Uniform IS, Exact Gumbel, and MIPS Gumbel. Our LSH method is more accurate at estimating the partition function than the Uniform IS method while being equally fast. In addition, our method is several orders of magnitude faster than the Exact Gumbel and MIPS Gumbel techniques while maintaining competitive accuracy. Furthermore, our method successfully trains real-world language models accurately while only requiring 1-2% of the states to estimate the partition function.

2. Background

2.1. Importance Sampling

Assume we have a proposal distribution g(y) with ∫ g(y) dy = 1. Using the proposal distribution, we obtain an unbiased estimator of the partition function Z_θ:

    E[f(y)/g(y)] = Σ_y g(y) · f(y)/g(y) = Σ_y f(y) = Z_θ

We draw N samples from the proposal distribution, y_i ∼ g(y) for i = 1...N. Using those samples, we have a Monte Carlo approximation of the partition function Z_θ:

    Ẑ_θ = (1/N) Σ_{i=1}^{N} f(y_i)/g(y_i)

2.2. Locality Sensitive Hashing (LSH)

Locality-Sensitive Hashing (LSH) (Gionis et al.
, 1999; Huang et al., 2015; Gao et al., 2014; Shinde et al., 2010) is a popular, sub-linear time algorithm for approximate nearest-neighbor search. The high-level idea is to place similar items into the same bucket of a hash table with high probability. An LSH hash function maps an input data vector to an integer key - h(x): R^D → {0, 1, 2, ..., N}. A collision occurs when the hash values for two data vectors are equal - h(x) = h(y). The collision probability of most LSH hash functions is generally a monotonic function of the similarity - Pr[h(x) = h(y)] = M(sim(x, y)), where M is a monotonically increasing function. Essentially, similar items are more likely to collide with each other under the same hash fingerprint.

The algorithm uses two parameters - (K, L). We construct L independent hash tables from the collection C. Each hash table has a meta-hash function H that is formed by concatenating K random independent hash functions from F. Given a query, we collect one bucket from each hash table and return the union of the L buckets. Intuitively, the meta-hash function makes the buckets sparse (less crowded) and reduces the number of false positives, because only valid nearest-neighbor items are likely to match all K hash values for a given query.

[Figure 1. Locality Sensitive Hashing - Signed Random Projections. (1) Compute the projection of the item x ∈ R^D using a signed random matrix of size D × KL. (2) Generate a bit from the sign of each entry of the projection in R^{KL}. (3) From the KL bits, create L integer fingerprints with K bits per fingerprint. (4) Add the item x into each hash table using the corresponding integer key. The illustration uses K=2 bits and L=3 tables.]
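The fingerprint construction in Figure 1 can be sketched as follows. The dimensions, parameter values, and names are illustrative assumptions:

```python
import numpy as np

# (K, L) fingerprints from signed random projections: K*L sign bits are
# packed into L integer keys of K bits each, matching Figure 1's K=2, L=3.
rng = np.random.default_rng(2)
D, K, L = 32, 2, 3
W = rng.normal(size=(K * L, D))       # signed random projection matrix

def fingerprints(v):
    bits = (W @ v >= 0).astype(int).reshape(L, K)   # one row of K bits per table
    return (bits * (1 << np.arange(K))).sum(axis=1) # L integer keys

x = rng.normal(size=D)
keys = fingerprints(x)
# Each key addresses one bucket in the corresponding hash table, so a
# query probes exactly L buckets regardless of the collection size.
```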
The union of the L buckets decreases the number of false negatives by increasing the number of potential buckets that could hold valid nearest-neighbor items. The candidate generation algorithm works in two phases (see (Andoni & Indyk, 2004) for details):

1. Pre-processing Phase: We construct L hash tables from the data by storing all elements x ∈ C. We only store pointers to the vectors in the hash tables, because storing whole data vectors is very memory inefficient.

2. Query Phase: Given a query Q, we search for its nearest-neighbors. We report the union of all of the buckets collected from the L hash tables. Note, we do not scan all the elements in C; we only probe L different buckets, one bucket for each hash table.

After generating the set of potential candidates, the nearest-neighbor is computed by comparing the distance between each item in the candidate set and the query.

2.3. SimHash

SimHash or Signed Random Projection (SRP) (Charikar, 2002) is the LSH family for the Cosine Similarity metric. The Cosine Similarity metric is the angle between two vectors x, y ∈ R^D. Since the definition of an inner product is x · y = Σ_{i=1}^{D} x_i y_i = ||x|| ||y|| cos(θ), the formula for the angle between two vectors is θ = cos^{-1}(x · y / (||x|| ||y||)).

For a vector x, the SRP function generates a random hyperplane w and returns the sign of the projection of x onto w. Two vectors share the same sign only if the random hyperplane does not fall in between them. Since all angles are equally likely for a random hyperplane, the probability that two vectors x, y share the same sign for a given random projection is 1 − θ/π. Using the signed random projection hash function, we can create a (θ_1, θ_2, 1 − θ_1/π, 1 − θ_2/π)-sensitive LSH family.

3. MIPS Reduction using the Gumbel Distribution
3.1. Gumbel Distribution

The Gumbel distribution G (Gumbel, 1941) is a continuous probability distribution with the cumulative distribution function (CDF):

    P(G ≤ x) = e^{−e^{−(x − µ)/β}}

A key technique that uses the Gumbel distribution is the Gumbel-Max Trick (Gumbel & Lieblein, 1954):

    H = max_{y ∈ Y} [φ_y + G(y)] ∼ log(Σ_{y ∈ Y} e^{φ_y}) + G

where φ_y = θ_y · x is the unnormalized log probability of the log-linear model, G(y) is an independent Gumbel random variable for each state y, and G is an independent Gumbel random variable. Using the Gumbel-Max Trick, we can estimate the inverse partition function Z_θ^{-1}:

    Z_θ^{-1} = E[e^{−H}],    Ẑ_θ^{-1} = (1/N) Σ_{i=1}^{N} e^{−H_i}

(Mussmann & Ermon, 2016) proposed using two algorithms, Maximum Inner Product Search (MIPS) and the Gumbel-Max Trick, to estimate the partition function Z_θ efficiently. Their idea was to convert the Gumbel-Max Trick into a Maximum Inner Product Search (MIPS) problem. We will provide a brief overview of their approach.

[Figure 2. Mean Absolute Error vs. number of samples. (1) Exact Max Gumbel - estimate the partition function using the maximum value over all states. (2) 2nd Largest Gumbel - the maximum (top-1) value is replaced with the 2nd largest (top-2) value in the partition function estimate. Notice that there is a large gap in accuracy when the 2nd largest value is used to estimate the partition function. Therefore, using the exact maximum value is a requirement for an accurate estimate.]

3.2. Algorithm

For their MIPS reduction, the first step is to build the MIPS data structure. A vector of k independent Gumbel random variables is concatenated to each weight vector θ_y to form the Gumbel-weight vector v_y = (θ_y, {G_{y,j}}_{j=1}^{k}). These Gumbel-weight vectors v_y are added to the MIPS data structure. During the partition estimate phase, a subset of the Gumbel-weight vectors is queried from the MIPS data structure.
A one-hot vector e_j, where j = 1...k, is concatenated to the feature vector x to form the query q_j = (x, e_j). The one-hot vector ensures that only the j-th Gumbel variable is selected at one time. The MIPS data structure returns a subset S_j of Gumbel-weight vectors that is likely to produce the maximum inner product with the query q_j. The exact maximum inner product is then computed for this small subset of Gumbel-weight vectors S_j and the query q_j. The overall process is still sub-linear time, reducing the search space for the exact Gumbel-Max Trick and improving the performance of the partition function estimation process. The efficiency comes at the cost of approximate answers to MIPS queries.

3.3. MIPS-Gumbel Reduction is Inaccurate and Inefficient

The MIPS-Gumbel Reduction has two main weaknesses in terms of speed and accuracy. The MIPS data structure obtains an approximate nearest-neighbor set for a query. Due to the randomness in the MIPS data structure, there is a chance that the set may not contain the exact maximum value. In Figure 2, we empirically explore the accuracy of the partition function estimate when the 2nd largest value is substituted for the exact maximum value. It shows that missing the exact maximum value drastically increases the error of the partition function estimate. Since the estimates are extremely sensitive to perturbations in the maximum value, it is necessary to have a high-confidence MIPS algorithm. It is well known that high-confidence search is inefficient, requiring a large number of hash functions and tables.

In addition, there is another issue that affects the performance of the MIPS-Gumbel Reduction. Each sample requires querying the MIPS data structure for a subset S_j and then taking the maximum inner product between the subset and the query. The cost of an approximate MIPS query, although sub-linear, is still a significant fraction of N.
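For illustration, here is a minimal sketch of the reduction from Sections 3.1 and 3.2, with the approximate MIPS retrieval replaced by an exact maximum over all states. All sizes and names are illustrative assumptions:

```python
import numpy as np

# MIPS-Gumbel reduction: augment each weight vector with k Gumbel draws;
# the one-hot part of the query selects exactly one Gumbel coordinate, so
# the maximum inner product equals max_y [theta_y . x + G(y)].
rng = np.random.default_rng(6)
N, D, k = 400, 8, 1000
theta = rng.normal(size=(N, D))            # weight vectors theta_y
gumbels = rng.gumbel(size=(N, k))          # G_{y,j}, j = 1..k
v = np.hstack([theta, gumbels])            # Gumbel-weight vectors v_y = (theta_y, G_y)

x = rng.normal(size=D)
Z_true = np.exp(theta @ x).sum()

H = np.empty(k)
for j in range(k):
    q = np.zeros(D + k)
    q[:D], q[D + j] = x, 1.0               # query q_j = (x, e_j)
    H[j] = (v @ q).max()                   # max_y [theta_y . x + G_{y,j}]
Z_inv_hat = np.exp(-H).mean()              # estimate of 1 / Z_theta
```

Averaging e^{-H} over the k queries estimates the inverse partition function; the approximate version of the scheme computes each maximum only over the retrieved subset S_j.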
Furthermore, we need a large enough number of samples T for accurate estimation, which amounts to T MIPS queries and is therefore likely to be inefficient. Our evaluations clearly validate the inefficiency of this approach on large datasets.

4. Key Observation: LSH is an Efficient Informative Sampler in Disguise

The traditional LSH algorithm retrieves a subset of potential candidates for a given query in sub-linear time. We compute the actual distances for this candidate subset and then report the closest nearest-neighbor. A close observation reveals that an item returned as a candidate from a (K, L)-parameterized LSH algorithm is sampled with probability 1 − (1 − p^K)^L, where p is the collision probability of the LSH function. For the classic LSH algorithm, the probability of retrieving any item y for a given query context x can be computed exactly as follows (Leskovec et al., 2014):

1. The probability that the hash fingerprints match for a random LSH function - Pr[h(x) = h(y)] = p.

2. The probability that the hash fingerprints match for a meta-LSH function - Pr[H(x) = H(y)] = p^K.

3. The probability that there is at least one mismatch among the K hash fingerprints that compose the meta-LSH function - Pr[H(x) ≠ H(y)] = 1 − p^K.

4. The probability that none of the L meta-hash fingerprints match - (1 − p^K)^L.

5. The probability that at least one of the L meta-hash fingerprints matches, making the two items a candidate pair - 1 − (1 − p^K)^L.

The precise form of p is defined by the LSH family used to build the hash tables. We can construct a MIPS hashing scheme such that p = M(θ_y · x), where M is a monotonically increasing function. However, the traditional LSH algorithm does not represent a valid probability distribution: Σ_{i=1}^{N} Pr(y_i) ≠ 1. Also, due to the nature of LSH, the sampled candidates are likely to be very correlated.
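The enumeration above yields the retrieval probability directly. A small sketch, with illustrative parameter values:

```python
# Probability that a (K, L)-parameterized LSH scheme reports an item whose
# per-function collision probability with the query is p.
def retrieval_prob(p, K, L):
    # at least one of the L meta-hashes (K concatenated functions) matches
    return 1.0 - (1.0 - p ** K) ** L

# Larger collision probability (more similar to the query) means a larger
# chance of entering the sample set, which is what makes the sampler
# informative rather than uniform.
probs = [retrieval_prob(p, K=10, L=16) for p in (0.5, 0.7, 0.9)]
```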
Thus, standard techniques like IS are not applicable to this kind of samples. It turns out that there is a simple, unbiased estimator for the partition function using the samples from the LSH algorithm. We take a detour to define a general class of sampling schemes and partition function estimators of which LSH sampling is a special case.

5. A New Class of Estimators for the Partition Function

Assume there is a set of states Y = [y_1 ... y_N]. We associate a probability value with each state, [p_1 ... p_N]. Define the sampling process as follows: we select each state y_i to be a part of the sample set S with probability p_i. Note, the probabilities need not sum to 1, and the sampling process is allowed to be correlated. Thus, we get a correlated sample set S. It can be seen that MIPS sampling is a special case of this sampling process with p_i = 1 − (1 − p^K)^L. Given the sample set S, we have an unbiased estimator for any partition function Σ_{y_i ∈ Y} f(y_i).

Theorem 5.1. Assume that every state y_i has a weight given by f(y_i), with partition function Σ_{y_i ∈ Y} f(y_i) = Z_θ. Then we have the following as an unbiased estimator of Z_θ:

    Est = Σ_{y_i ∈ S} f(y_i)/p_i = Σ_{i=1}^{N} 1[y_i ∈ S] · f(y_i)/p_i    (1)

    E[Est] = Σ_{i=1}^{N} f(y_i) = Z_θ    (2)

Theorem 5.2. The variance of the partition function estimator is:

    Var[Est] = Σ_{i=1}^{N} f(y_i)²/p_i − Σ_{i=1}^{N} f(y_i)²
             + Σ_{i ≠ j} (f(y_i) f(y_j))/(p_i p_j) · Cov(1[y_i ∈ S], 1[y_j ∈ S])    (3)

If the states are selected independently, then we can write the variance as:

    Var[Est] = Σ_{i=1}^{N} f(y_i)²/p_i − Σ_{i=1}^{N} f(y_i)²

Note 1: In general, this sampling process is inefficient. We need to flip coins for every state in order to generate the sample set S.
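A quick numerical check of Theorem 5.1 under the independent-selection assumption (the LSH sampler produces correlated inclusions with the same marginal probabilities). The weights, inclusion probabilities, and names here are illustrative assumptions:

```python
import numpy as np

# Include each state y_i in S with probability p_i, then average
# f(y_i)/p_i over the selected states (Theorem 5.1, independent case).
rng = np.random.default_rng(4)
N = 2_000
f = np.exp(rng.normal(size=N))           # unnormalized weights f(y_i)
p = np.clip(f / f.max(), 0.05, 1.0)      # inclusion probs aligned with f
Z_true = f.sum()

trials = 4_000
ests = np.empty(trials)
for t in range(trials):
    in_S = rng.random(N) < p             # one coin flip per state
    ests[t] = (f[in_S] / p[in_S]).sum()  # Est = sum_{y_i in S} f(y_i)/p_i
mean_est = ests.mean()
```

Averaged over many trials, the estimate converges to Z_θ even though each individual sample set S is small and its inclusion probabilities do not sum to one.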
For log-linear models with feature vector x and function f(y_i) = e^{θ_{y_i} · x}, we show a particular form of probability, p_i = 1 − (1 − M(θ_{y_i} · x)^K)^L, for which this sampling scheme is very efficient. In particular, we can efficiently sample for a sequence of queries with varying x in amortized near-constant time [see Section 2.2].

Note 2: In our case, where these probabilities p_i are generated from LSH (or ALSH for MIPS), the term Σ_{i ≠ j} (f(y_i) f(y_j))/(p_i p_j) · Cov(1[y_i ∈ S], 1[y_j ∈ S]) contains large negative terms. For each dissimilar pair y_i, y_j, the term Cov(1[y_i ∈ S], 1[y_j ∈ S]) is negative. When 1[y_i ∈ S] = 1 and 1[y_j ∈ S] = 1, it implies that y_i and y_j are both similar to the query; therefore, they are similar to each other by the triangle inequality (Charikar, 2002). Thus, for random pairs y_i, y_j, the covariance will be negative, i.e., if y_i is sampled, then y_j has less chance of being sampled, and vice versa. Hence, we can expect the overall variance with LSH-based sampling to be significantly lower than with uncorrelated sampling. This is something unique about LSH: it is super-efficient, and its correlations are beneficial.

5.1. Why is MIPS the correct LSH function for Log-Linear Models?

The term Σ_{i=1}^{N} f(y_i)²/p_i in the variance is similar in nature to the χ²(f || p) term in the variance of Importance Sampling (IS) (Liu et al., 2015). The variance of the IS estimate is high when the target f and the proposal p distributions are peaked differently, i.e., they give high mass to different parts of the sample space or have different modes. Therefore, for similar reasons as in importance sampling, our scheme is likely to have low variance when f and the p_i are aligned. It should be noted that there are very specific forms of probability p_i for which the sampling is efficient.
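The alignment argument can be illustrated numerically with the independent-selection variance formula of Theorem 5.2: inclusion probabilities p_i that are monotone in f give a lower-variance estimator than uniform probabilities with the same expected sample size. The weights and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5_000
f = np.exp(rng.normal(size=N))               # unnormalized target weights

def variance(f, p):
    # Var[Est] = sum f_i^2 / p_i - sum f_i^2   (independent selection)
    return (f ** 2 / p).sum() - (f ** 2).sum()

p_aligned = np.clip(f / f.max(), 1e-3, 1.0)  # monotone in f
budget = p_aligned.sum()                     # expected number of samples
p_uniform = np.full(N, budget / N)           # same budget, no alignment
var_aligned = variance(f, p_aligned)
var_uniform = variance(f, p_uniform)
```

Under these illustrative weights, the aligned probabilities give a substantially smaller variance at the same sampling budget, mirroring the χ² intuition above.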
We show that with the MIPS LSH function, the probabilities p_i and the function f(y_i) = e^{θ_{y_i} · x} align well. We have the following relationship between the probability p_i of each state and the unnormalized log-linear target distribution P(y | x, θ).

Theorem 5.3. For any two states y_1 and y_2:

    P(y_1 | x; θ) ≥ P(y_2 | x; θ) ⟺ p_1 ≥ p_2

where p_i = 1 − (1 − M(θ_{y_i} · x)^K)^L and P(y | x, θ) ∝ e^{θ_y · x}.

Corollary 5.3.1. The modes of the sample and the target distributions are identical.

Therefore, we can expect hashing for MIPS to be a good choice for low variance.

6. Estimating the Partition Function using LSH Sampling

The combination of these observations is a fast, scalable approach for estimating the partition function of log-linear models. The pseudo-code for this process is shown in Algorithms 1 and 2. Here is an overview of our LSH sampling process:

Algorithm 1 LSH Sampling - Initialization
  Input: weight vectors [θ_y]_{y ∈ Y}, parameters K, L
  HT = Create(K, L)
  for each y ∈ Y do
    Insert(HT, θ_y)
  end for
  Return: HT

1. During the pre-processing phase, we use randomized hash functions to build hash tables from the weight vectors θ_y for each state y ∈ Y.

2. For each partition function estimate, we sample weight vectors from the hash tables with probability that increases monotonically with the unnormalized density e^{θ_y · x} of the state for the feature vector x.

3. For each weight vector θ_y in the sample set S, we determine p, the probability of a hash collision with the feature vector x. This probability depends on the LSH family used to build the hash tables. We chose an LSH family such that the probability p is monotonic with respect to the inner product θ_y · x.

4. The partition function estimate for the feature vector x is the sum of the unnormalized density e^{θ_y · x} of each state in the sample set S, divided by its retrieval probability.
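Putting Algorithms 1 and 2 together, here is a minimal end-to-end sketch using SimHash on unit-normalized vectors, so the collision probability is monotone in the inner product; a real system would instead use an asymmetric MIPS transformation (Shrivastava & Li, 2014). All sizes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, K, L = 5_000, 16, 4, 8
theta = rng.normal(size=(N, D))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit norms

W = rng.normal(size=(L, K, D))                         # K*L random hyperplanes

def fingerprints(v):
    """L integer keys, each from K sign bits."""
    bits = (np.einsum('lkd,d->lk', W, v) >= 0).astype(int)
    return (bits * (1 << np.arange(K))).sum(axis=1)

# Algorithm 1: insert every state into L hash tables (pointers only).
tables = [dict() for _ in range(L)]
for y in range(N):
    for l, key in enumerate(fingerprints(theta[y])):
        tables[l].setdefault(int(key), []).append(y)

# Algorithm 2: one query retrieves the whole sample set S.
def estimate_Z(x):
    x = x / np.linalg.norm(x)
    S = set()
    for l, key in enumerate(fingerprints(x)):
        S.update(tables[l].get(int(key), []))
    S = np.fromiter(S, dtype=int)
    scores = theta[S] @ x
    p = 1 - np.arccos(np.clip(scores, -1.0, 1.0)) / np.pi  # SRP collision prob
    weight = 1 - (1 - p ** K) ** L                         # Pr[y in S]
    return (np.exp(scores) / weight).sum()

x = rng.normal(size=D)
Z_hat = estimate_Z(x)
Z_true = np.exp(theta @ (x / np.linalg.norm(x))).sum()
```

A single LSH query yields hundreds of correlated samples, and weighting each by the inverse of its retrieval probability gives the unbiased estimate of Theorem 5.1.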
In summary, the estimator is

    Ẑ_θ = Σ_{i=1}^{N} 1[y_i ∈ S] · f(y_i)/p_i

Running Time: There is a key distinction in performance between the MIPS-Gumbel Reduction and our LSH sampler. The MIPS-Gumbel Reduction needs to query the MIPS data structure for each individual sample. Our LSH sampler uses a single query to the LSH data structure to retrieve the entire sample set S for the partition function estimate. Our sampling is roughly constant time.

In terms of complexity, the MIPS-Gumbel Reduction requires T full nearest-neighbor queries, which includes the costly filtering of retrieved candidates, while the running time of our LSH Sampling estimate is the cost of a single LSH query for all the samples. In Section 7.3, we support our complexity analysis with empirical results showing that our LSH Sampling estimate is significantly faster than the MIPS-Gumbel Reduction.

Algorithm 2 LSH Sampling - Partition Estimate
  Input: LSH data structure HT; parameters K, L; weight vectors [θ_y]_{y ∈ Y}; feature vector x; LSH collision probability p(x, y)
  S = Query(HT, x)
  total = 0
  for each y ∈ S do
    weight = 1 − (1 − p(x, y)^K)^L
    total += e^{θ_y · x} / weight
  end for
  Return: Ẑ_θ = total

7. Experiments

We design experiments to answer the following four important questions:

1. How accurately does our LSH Sampling approach estimate the partition function?
2. What is the running time of our LSH Sampling approach?
3. How does our LSH Sampling approach compare with the alternative approaches in terms of speed and accuracy?
4. How does using our LSH Sampling approach affect the accuracy (perplexity) of real-world language models?

For evaluation, we implemented the following three approaches to compare and contrast against our approach:

• Uniform Importance Sampling: An IS estimate where the proposal distribution is a uniform distribution U[0, N]. All samples are weighted equally.
• Exact Gumbel: The Gumbel-Max Trick is used to estimate the partition function. The maximum over all of the states is used for this estimate.

• MIPS Gumbel (Mussmann & Ermon, 2016): A MIPS data structure is used to collect a subset of the states efficiently. This subset contains the states that are most likely to have a large inner product with the query. The Gumbel-Max Trick estimates the partition function using the subset instead of all the states.

7.1. Datasets

• Penn Tree Bank (PTB) (Marcus et al., 1993) [1] - This dataset contains a vocabulary of 10K words. It is split into 929k training words, 73k validation words, and 82k test words.

• Text8 (Mikolov et al., 2014) [2] - This dataset is a pre-processed version of the first 100 million characters from Wikipedia. It is split into a training set (first 99M characters) and a test set (last 1M characters). It has a vocabulary of 44k words.

[1] http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
[2] http://mattmahoney.net/dc/text8.zip

7.2. Training Language Models

The goal of a neural network language model is to predict the next word in the text given the previous history of words. The performance of a language model is measured by its perplexity. The perplexity score e^{loss} measures how well the language model is likely to predict a word from a dataset. The loss function is the average negative log-likelihood for the target words:

    loss = −(1/N) Σ_{i=1}^{N} log p_{y_i}

For our experiments, our language model is a single-layer LSTM with 512 hidden units. The size of the input word embeddings is equal to the number of hidden units. The model is unrolled for 20 steps for back-propagation through time (BPTT). We use the Adagrad optimizer with an initial learning rate of 0.1 and an epsilon parameter of 1e-5. We also clip the norm of the gradients to 1. The models are trained for 10 epochs with a mini-batch size of 32 examples.
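The perplexity measure used above is simply the exponential of the average negative log-likelihood. A minimal sketch, with illustrative names:

```python
import numpy as np

def perplexity(target_probs):
    """Perplexity = exp(loss), where loss is the average negative
    log-likelihood of the target words."""
    loss = -np.mean(np.log(target_probs))
    return np.exp(loss)

# A model that assigns uniform probability over a 10K-word vocabulary has
# perplexity equal to the vocabulary size.
ppl = perplexity(np.full(100, 1 / 10_000.0))
```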
The output layer is a softmax classifier that predicts the next word in the text using the context vector x generated by the LSTM. The entire vocabulary for the text is the state space Y for the softmax classifier. In this experiment, we test how well the various approaches estimate the partition function by training a language model. At test time, we measure the effectiveness of each approach by using the original partition function and comparing the models' perplexity scores.

The settings for our LSH data structure were K=10 bits and L=16 tables. Using these settings, our approach samples around 1.5-2% of the entire vocabulary for its partition function estimate (i.e., 200 samples for PTB, 800 samples for Text8). The Uniform IS estimate uses the same number of samples as our LSH estimate. For each Exact Gumbel estimate, we randomly sample 50 out of 1000 Gumbel random variables. For the MIPS Gumbel approach, we use 50 samples per estimate and an LSH data structure with K=5 bits and L=16 tables that collects around 35% of the entire vocabulary per sample.

From Table 3, our approach closely matches the accuracy of the standard partition function with minimal error. In addition, the poor estimate from the Uniform IS approach results in terrible performance for the language model. This highlights the fact that an accurate, stable estimate of the partition function is necessary for successful training of the log-linear model. The Exact Gumbel approach is the most accurate approach but is significantly slower than our LSH approach. The MIPS Gumbel approach diverged during training because of its poor accuracy in estimating the partition function.

7.3. Accuracy and Speed of Estimation

For this experiment, we take a snapshot of the weights θ_y and the context vector x after training the language model for a single epoch. The number of examples in the snapshot is the mini-batch size × BPTT steps, i.e.,
(32 examples × 20 steps = 640 total). Using the snapshot, we show how well each approach estimates the partition function in Figure 6. The x-axis is the number of samples used for the partition function estimate. The partition function is estimated with [50, 150, 400, 1000] samples for the PTB dataset and [50, 400, 1500, 5000] samples for the Text8 dataset. The accuracy of the partition function estimate is measured with the Mean Absolute Error (MAE). Tables 4 and 5 show the total computation time for estimating the partition function for all of the examples.

Table 3. Language Model Performance (Perplexity) for the PTB (Top) and Text8 (Bottom) datasets.

Dataset | Standard | LSH   | Uniform | Exact Gumbel | MIPS Gumbel
PTB     | 91.8     | 98.8  | 524.3   | 91.9         | Diverged
Text8   | 140.7    | 162.7 | 1347.5  | 152.9        | —

From Figure 6 and Tables 4 and 5, we conclude the following:

• Exact Gumbel is the most accurate estimate of the partition function, with the lowest MAE for both datasets.

• MIPS Gumbel is 50% faster than Exact Gumbel, but its accuracy is significantly worse.

• Exact Gumbel and MIPS Gumbel are slower than the Uniform IS and LSH approaches by several orders of magnitude.

• Our LSH estimate is more accurate than the Uniform IS and MIPS Gumbel estimates.

• As the number of samples increases, the MAE for the Uniform IS and LSH estimates decreases.

Table 4. Wall-Clock Time (seconds) for the Partition Function Estimate - PTB Dataset

Samples | Uniform | LSH   | Exact Gumbel | MIPS Gumbel
50      | 0.103   | 0.191 | 79.34        | 45.72
150     | 0.325   | 0.604 | 248.47       | 140.91
400     | 0.944   | 1.743 | 690.39       | 406.11
1000    | 1.874   | 3.440 | 1,646.59     | 1,064.31

Table 5. Wall-Clock Time (seconds) for the Partition Function Estimate - Text8 Dataset

Samples | Uniform | LSH   | Exact Gumbel | MIPS Gumbel
50      | 0.13    | 0.23  | 531.37       | 260.75
400     | 0.92    | 1.66  | 3,962.25     | 1,946.22
1500    | 3.41    | 6.14  | 14,686.73    | 7,253.44
5000    | 9.69    | 17.40 | 42,034.58    | 20,668.61

8. Discussion

In this section, we briefly discuss the implementation details.
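Before the details, the overall procedure under discussion — SimHash fingerprints, retrieval from L tables of K bits each, and the collision-probability-weighted estimate — can be summarized in a short end-to-end sketch. This is our own illustrative toy: the sizes, variable names, and unit-normalized weights (so the inner product equals the cosine similarity) are assumptions, and the MIPS transformations of (Shrivastava & Li, 2015a) are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: weights theta_y for N states, one context vector x.
N, d = 1000, 16
theta = rng.normal(size=(N, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # unit norm
x = rng.normal(size=d)
x /= np.linalg.norm(x)

K, L = 8, 16
# L tables, each defined by K signed random projections (SimHash).
planes = rng.normal(size=(L, K, d))
bits = np.einsum('lkd,nd->lnk', planes, theta) > 0       # fingerprints of all states
query_bits = np.einsum('lkd,d->lk', planes, x) > 0

# Retrieve S: states whose K-bit fingerprint matches the query
# in at least one of the L tables.
match = (bits == query_bits[:, None, :]).all(axis=2)     # shape (L, N)
in_S = match.any(axis=0)

# Collision probabilities p_i = 1 - (1 - M^K)^L, where
# M = 1 - arccos(cos_sim)/pi is the single-bit SimHash collision rate.
cos = theta @ x
M = 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
p = 1.0 - (1.0 - M ** K) ** L

# Weight each retrieved state by 1/p_i for an unbiased estimate of Z.
f = np.exp(theta @ x)
Z_est = np.sum(f[in_S] / p[in_S])
Z = f.sum()                                              # exact, for comparison
```

Because each state y enters S with exactly p_y = 1 - (1 - M^K)^L, the 1/p_y weighting keeps the estimate unbiased even though the samples are correlated and unnormalized.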
LSH Family: For the experiments, we used the SimHash LSH family to estimate the partition function. [See (Shrivastava & Li, 2015a) for the MIPS formulation for the SimHash LSH family.] It is computationally efficient to generate the LSH fingerprints and the collision probability values p for the LSH function. Generating the LSH fingerprints takes advantage of fast matrix multiplication operations, while calculating the hash collision probability only requires normalizing the inner product between the weights θ_y and the context x. [See Figure 1 and Section 2.3.]

Fixed sample set S size: It is often desirable to have a fixed-size sample set S. However, the size of the sample set retrieved from the LSH data structure is stochastic and not directly controlled. Here is our approach for controlling the size of the sample set S. First, we tune the (K, L) parameters for the LSH data structure to retrieve a sample set S whose size is close to the desired threshold. Then, we randomly sub-sample the set S so that its size meets the desired threshold. The old sampling probability is multiplied by the sub-sampling probability to get the new values.

Figure 6. Accuracy of the Partition Function Estimate, Mean Absolute Error (MAE) - PTB (Top) and Text8 (Bottom). [Plot omitted; both panels show MAE versus the number of samples for the Uniform, LSH, Exact Gumbel, and MIPS Gumbel estimates.]

9. Acknowledgments

The work of Ryan Spring was supported by NSF Award 1547433. This work was supported by Rice Faculty Initiative Award 2016.

References

Andoni, Alexandr and Indyk, Piotr. E2LSH: Exact Euclidean locality sensitive hashing. Technical report, MIT, 2004.

Charikar, Moses S. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pp. 380-388. ACM, 2002.
Chelba, Ciprian, Mikolov, Tomas, Schuster, Mike, Ge, Qi, Brants, Thorsten, Koehn, Phillipp, and Robinson, Tony. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255. IEEE, 2009.

Gao, Jinyang, Jagadish, Hosagrahar Visvesvaraya, Lu, Wei, and Ooi, Beng Chin. DSH: data sensitive hashing for high-dimensional k-NN search. In Proceedings of the 2014 ACM SIGMOD, pp. 1127-1138. ACM, 2014.

Gionis, Aristides, Indyk, Piotr, Motwani, Rajeev, et al. Similarity search in high dimensions via hashing. VLDB, 99(6):518-529, 1999.

Goodman, Joshua T. A bit of progress in language modeling. Computer Speech & Language, 15(4):403-434, 2001.

Gumbel, E. J. The return period of flood flows. Ann. Math. Statist., 12(2):163-190, 1941. doi: 10.1214/aoms/1177731747. URL http://dx.doi.org/10.1214/aoms/1177731747.

Gumbel, Emil Julius and Lieblein, Julius. Statistical theory of extreme values and some practical applications: a series of lectures. US Government Printing Office, Washington, 1954.

Huang, Qiang, Feng, Jianlin, Zhang, Yikai, Fang, Qiong, and Ng, Wilfred. Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proceedings of the VLDB Endowment, 9(1):1-12, 2015.

Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Lauritzen, Steffen L. Graphical models, volume 17. Clarendon Press, 1996.

Leskovec, Jure, Rajaraman, Anand, and Ullman, Jeffrey David. Mining of massive datasets. Cambridge University Press, 2014.

Liu, Qiang, Peng, Jian, Ihler, Alexander, and Fisher III, John. Estimating the partition function by discriminance sampling.
In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 514-522. AUAI Press, 2015.

Marcus, Mitchell P., Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

Mikolov, Tomas, Joulin, Armand, Chopra, Sumit, Mathieu, Michael, and Ranzato, Marc'Aurelio. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014.

Morin, Frederic and Bengio, Yoshua. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246-252. Citeseer, 2005.

Mussmann, Stephen and Ermon, Stefano. Learning and inference via maximum inner product search. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2587-2596, 2016.

Neal, Radford M. Annealed importance sampling. Statistics and Computing, 11(2):125-139, 2001.

Shinde, Rajendra, Goel, Ashish, Gupta, Pankaj, and Dutta, Debojyoti. Similarity search and locality sensitive hashing using ternary content addressable memories. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 375-386. ACM, 2010.

Shrivastava, Anshumali and Li, Ping. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, pp. 2321-2329, 2014.

Shrivastava, Anshumali and Li, Ping. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Conference on Uncertainty in Artificial Intelligence (UAI), 2015a.

Shrivastava, Anshumali and Li, Ping. Asymmetric minwise hashing for indexing binary inner products and set containment.
In Proceedings of the 24th International Conference on World Wide Web, pp. 981-991. International World Wide Web Conferences Steering Committee, 2015b.

A. Appendix

Theorem A.1. Assume there is a set of states Y, where each state y_i appears in the sample set S with probability p_i, for i = 1, ..., N. For some feature vector x, let f(y_i) = e^{θ_{y_i} · x}. Then there is a random variable whose expected value is the partition function:

Est = \sum_{i=1}^{N} \mathbf{1}[y_i \in S] \cdot \frac{f(y_i)}{p_i},
\qquad
E[Est] = \sum_{i=1}^{N} f(y_i) = Z_\theta

Theorem A.2. The variance of the partition function estimator is:

Var[Est] = \sum_{i=1}^{N} \frac{f(y_i)^2}{p_i} - \sum_{i=1}^{N} f(y_i)^2
+ \sum_{i \neq j} \frac{f(y_i) f(y_j)}{p_i p_j} \,\mathrm{Cov}\left(\mathbf{1}[y_i \in S], \mathbf{1}[y_j \in S]\right)    (5, 6)

If the states are selected independently, then the covariance terms vanish and we can write the variance as:

Var[Est] = \sum_{i=1}^{N} \frac{f(y_i)^2}{p_i} - \sum_{i=1}^{N} f(y_i)^2

Proof. The variance of the partition function estimator is Var[Est] = E[Est^2] - E[Est]^2, where

Est^2 = \sum_{i,j} \mathbf{1}[y_i \in S]\, \mathbf{1}[y_j \in S]\, \frac{f(y_i) f(y_j)}{p_i p_j}    (7)
= \sum_{i} \mathbf{1}[y_i \in S]\, \frac{f(y_i)^2}{p_i^2}
+ \sum_{i \neq j} \mathbf{1}[y_i \in S]\, \mathbf{1}[y_j \in S]\, \frac{f(y_i) f(y_j)}{p_i p_j}    (8)

Notice that E[\mathbf{1}[y_i \in S]\, \mathbf{1}[y_j \in S]] = E[\mathbf{1}[y_i \in S]]\, E[\mathbf{1}[y_j \in S]] + \mathrm{Cov}(\mathbf{1}[y_i \in S], \mathbf{1}[y_j \in S]) and E[\mathbf{1}[y_i \in S]] = p_i. Taking expectations,

E[Est^2] = \sum_{i=1}^{N} \frac{f(y_i)^2}{p_i} + \sum_{i=1}^{N} f(y_i)\,[Z_\theta - f(y_i)]
+ \sum_{i \neq j} \frac{f(y_i) f(y_j)}{p_i p_j} \,\mathrm{Cov}\left(\mathbf{1}[y_i \in S], \mathbf{1}[y_j \in S]\right)    (9, 10)

= \sum_{i=1}^{N} \frac{f(y_i)^2}{p_i} + Z_\theta^2 - \sum_{i=1}^{N} f(y_i)^2
+ \sum_{i \neq j} \frac{f(y_i) f(y_j)}{p_i p_j} \,\mathrm{Cov}\left(\mathbf{1}[y_i \in S], \mathbf{1}[y_j \in S]\right)    (11, 12)

using \sum_{i=1}^{N} f(y_i) = Z_\theta and \sum_{j \neq i} f(y_j) = \sum_{j=1}^{N} f(y_j) - f(y_i) = Z_\theta - f(y_i). Since E[Est]^2 = Z_\theta^2, we have

Var[Est] = \sum_{i=1}^{N} \frac{f(y_i)^2}{p_i} - \sum_{i=1}^{N} f(y_i)^2
+ \sum_{i \neq j} \frac{f(y_i) f(y_j)}{p_i p_j} \,\mathrm{Cov}\left(\mathbf{1}[y_i \in S], \mathbf{1}[y_j \in S]\right)    (13, 14)

Theorem A.3.
For any two states y_1 and y_2:

P(y_1 | x; θ) ≥ P(y_2 | x; θ) ⟺ p_1 ≥ p_2

where p_i = 1 - (1 - M(θ_{y_i} · x)^K)^L and P(y | x; θ) ∝ e^{θ_y · x}.

Proof. Follows immediately from the monotonicity of e^t and of 1 - (1 - M(t)^K)^L in the inner product t = θ_y · x. Thus, the target and the sample distributions induce the same ranking over all of the states.
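The appendix results admit a quick numerical check on a toy instance. The f(y_i) and p_i values below are illustrative, the inclusions are taken to be independent (the case in which the covariance terms vanish), and M(t) = 1 - arccos(t)/π is the single-bit SimHash collision probability for unit vectors — all assumptions of this sketch, not details fixed by the paper:

```python
import math
from itertools import product

# Toy instance of Theorem A.1: state y_i is included in S
# independently with probability p_i.
f = [1.0, 2.0, 4.0, 8.0]   # illustrative f(y_i) = e^{theta_{y_i} . x} values
p = [0.2, 0.4, 0.6, 0.8]
Z = sum(f)

# Exact expectation of Est by enumerating all 2^4 inclusion patterns.
E_est = 0.0
for pattern in product([0, 1], repeat=4):
    prob = math.prod(pi if b else 1.0 - pi for b, pi in zip(pattern, p))
    est = sum(fi / pi for b, fi, pi in zip(pattern, f, p) if b)
    E_est += prob * est

assert abs(E_est - Z) < 1e-9   # E[Est] = Z_theta, as Theorem A.1 states

# Theorem A.3: p(t) = 1 - (1 - M(t)^K)^L is monotone in t = theta_y . x,
# so the sample distribution ranks states exactly like the softmax.
def sample_prob(t, K=10, L=16):
    M = 1.0 - math.acos(t) / math.pi   # SimHash bit collision probability
    return 1.0 - (1.0 - M ** K) ** L

ts = [-0.5, 0.0, 0.3, 0.7, 0.9]
assert [sample_prob(t) for t in ts] == sorted(sample_prob(t) for t in ts)
```

The first check confirms the unbiasedness claim of Theorem A.1; the second confirms that the retrieval probability grows with θ_y · x, which is the ranking property used in Theorem A.3.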
