A Sparse Johnson--Lindenstrauss Transform
Dimension reduction is a key algorithmic tool with many applications including nearest-neighbor search, compressed sensing and linear algebra in the streaming model. In this work we obtain a sparse version of the fundamental tool in dimension reduction…
Authors: Anirban Dasgupta, Ravi Kumar, Tamas Sarlos
Dimension reduction is a fundamental primitive with many algorithmic applications including nearest-neighbor search [2,19], compressed sensing [11], data stream computations [5], computational geometry [13], numerical linear algebra [14,17,26,28], machine learning [8,33], graph sparsification [30], and more; see the monograph [32] for further applications. The seminal random projection method of Johnson and Lindenstrauss [20] consists of just multiplying the input vector by a suitably sampled random projection matrix: n vectors in d-dimensional space can be mapped into an O((1/ε²) log n)-dimensional subspace such that the length of each vector is distorted by at most (1 ± ε). This simple and elegant method has the following desirable properties: (i) it is linear, (ii) it is oblivious to the input, (iii) it works with high probability for a given set of input points, and (iv) the target dimension is independent of d.
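As a point of reference, the classic dense construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's construction: the constant 8 in the target dimension and the i.i.d. Gaussian entries are assumed illustrative choices.

```python
import numpy as np

def jl_project(X, eps, n_points, rng=None):
    """Dense JL sketch: project rows of X (n x d) into k = O(eps^-2 log n) dims.

    The constant 8 in k is an illustrative choice, not a tight one.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    k = int(np.ceil(8 * np.log(n_points) / eps ** 2))
    # i.i.d. Gaussian entries, scaled so squared lengths are preserved in expectation
    A = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))
    return X @ A.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10_000))
Y = jl_project(X, eps=0.2, n_points=50, rng=rng)
ratios = np.linalg.norm(Y, axis=1) / np.linalg.norm(X, axis=1)
# with high probability every ratio lies in [1 - eps, 1 + eps]
```

Note that the projection is dense: computing Y costs O(kd) per vector, which is exactly the cost the sparse constructions below aim to reduce.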
Given its algorithmic importance, much effort has been devoted to speeding up the mapping. One line of work achieves this goal by making the projection matrix sparse, and hence its multiplication with the input vectors faster. Sparsity is typically achieved by independently setting each matrix entry to zero with a certain probability [1,2,23]. There is, however, a limit on the extent of sparsity achievable by this approach: a result of Matousek [23, Theorem 4.1] states that such matrices need to contain Ω(α²/ε²) non-zeros in expectation per column if they are to preserve the length of a unit vector with infinity norm at most α.
Our results. We obtain a sparse random projection matrix of size k × d that has O((1/ε) log²(k/δ) log(1/δ)) non-zero entries per column, where k = O((1/ε²) log(1/δ)). This is the first construction with o(1/ε²) non-zero entries per column in the projection matrix. (For our results to be improvements, we need to assume that log²(k/δ) = o(1/ε); our analysis, however, does not need this assumption.)
A highlight of our approach is to construct the projection matrix itself with care. Instead of using independent random variables, as is typically done, we construct it out of a hash function that entails some dependency among the entries. This construction is implicit in the work of Langford et al. [21] and Weinberger et al. [33], where it played a role mostly as a practical heuristic. The hash-based construction introduces new technical difficulties, but ensures that we have exactly a fixed number of non-zero entries in each column, thereby relaxing the requirements on the density of input vectors.
Specifically, whereas prior work requires ‖x‖_∞ = O(ε) for a unit vector x in order to have a constant expected number of entries per column of the projection matrix, we only need ‖x‖_∞ = O(√ε). In order to achieve this level of densification, we can use a simple replication technique on x [33].
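The replication technique can be sketched as follows. This is a minimal illustration, with the replication factor c as an assumed parameter: repeating each coordinate c times and scaling by 1/√c preserves the ℓ2 norm exactly while shrinking the ∞-norm by a factor of √c.

```python
import numpy as np

def replicate(x, c):
    """Densification by replication: repeat each coordinate c times, scaled 1/sqrt(c).

    Preserves the l2 norm exactly and shrinks the infinity norm by sqrt(c).
    """
    return np.repeat(x, c) / np.sqrt(c)

x = np.zeros(100)
x[0] = 1.0                 # worst case: all mass on one coordinate
y = replicate(x, c=16)
# ||y||_2 == ||x||_2 == 1, while ||y||_inf == ||x||_inf / 4
```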
To manage the technical difficulties that arise from the dependencies, we show that the contribution from each hash bucket is bounded, and that the total amount of noise arising from the collisions in each hash bucket is small. The reduction in overall variance comes from the fact that each dimension is mapped to exactly one hash bucket, and the lack of self-collisions (which would be present if the entries in the matrix were i.i.d.) leads to a reduction in the variance of the cross-product error. There are several subtleties in analyzing this, in particular that the errors from different hash buckets are correlated. We handle this by an application of the FKG inequality to the product of the moment generating functions of the random variables capturing the errors. This helps us obtain a concentration bound on the sum of the errors. Our choice of ±1 random variables (instead of Gaussian random variables¹) plays a critical role in making our proofs work.
Implications for sparse vectors. The resulting running time for an input vector x having nnz(x) non-zeros is Õ(nnz(x)/ε), better than the running time obtained by [22,23] for sparse vectors both in terms of the sparsity ratio nnz(x)/d and by a factor of 1/ε. Furthermore, using a block-Hadamard based preconditioner instead of a global Hadamard transform, we can ensure that our running time for all vectors is Õ(min(nnz(x)/ε, d)), which is once again an improvement over existing results. The qualitative difference in the running times is starker in the turnstile model of streaming. Since the updates in the stream come as pairs (i, v_i), updating any sketch that requires computing a global Hadamard transform is very expensive, taking Õ(d) time per update. Our update time, on the other hand, is only Õ(1/ε) per entry. Our technique speeds up nearest-neighbor computation for sparse vectors as well: we can use our construction to preprocess the input vectors before applying the algorithm described in [2, Theorem 3.2]. The effective running time is then Õ(nnz(x)/ε). For sparse vectors, this could represent a significant improvement.
Since the original Johnson-Lindenstrauss result, several authors have shown that the projection matrix can be constructed element-wise using Gaussian or uniform ±1 variables [1,7,16,19]. Alon showed a lower bound of Ω((1/ε²) · log n / log(1/ε)) on the target dimensionality [4].
In order to circumvent the sparsity lower bound of Matousek [23], the ingenious Fast Johnson-Lindenstrauss transform (FJLT) of Ailon and Chazelle preconditions the input with a randomized Hadamard transform thereby making it dense, and then applies a sparse projection matrix [2]. The computation of the Hadamard transform (via a fast Hadamard transform), however, forces an Õ(d) running time irrespective of the number of non-zeros in the input vector. This makes it less desirable for sparse input vectors.
Ailon and Liberty [3] showed that the sparse projection matrix in [2] can be replaced by a dense, deterministic, but well-structured code matrix, improving the running time to O(d log k) over a wide range of parameters; however, as before, the running time of these methods is unable to take advantage of the sparsity of the input vector. Liberty, Ailon, and Singer [22] proved that there exist projection matrices that are applicable in O(d) time if the input satisfies density conditions that are significantly stricter than those required for hashing. Since hashing works in linear time, our work improves upon these results. Finally, we remark that although [3,22] contain a spectral condition derived from Talagrand's inequality that can be applied to our hashing construct², the resulting bound is too weak; it fails to show that hashing improves over even the most basic Johnson-Lindenstrauss transform.
Charikar, Chen, and Farach-Colton [12] introduced the COUNT SKETCH data structure, which uses hash tables combined with pairwise independent ±1 random variables for finding the most frequent items in a data stream. Thorup and Zhang [31] observed that this hashing trick can be used to speed up the celebrated AMS sketch [5] for estimating F2; this was also noted by Cormode and Garofalakis [15]. Hashing decreases the update time from Õ((1/ε²) log(1/δ)) to Õ(log(1/δ)). These estimators, however, are nonlinear: they return the median of estimates obtained from O(log(1/δ)) independent hash functions, which makes them less desirable for some applications. Our results essentially show that by increasing the update time to Õ((1/ε) log(1/δ)), the median can be replaced by an average.
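The hashing trick for F2 estimation can be sketched as follows, in the spirit of [12,31]. This is an assumed illustration, not the papers' exact constructions: fully random hashing via a seeded RNG stands in for the pairwise-independent hash families, and the class name and parameters are hypothetical.

```python
import numpy as np

class CountSketchF2:
    """Count-Sketch style F2 estimator: t independent hash tables of k buckets each.

    Illustrative sketch; fully random hashing replaces pairwise independence.
    """
    def __init__(self, d, k, t, seed=0):
        rng = np.random.default_rng(seed)
        self.h = rng.integers(0, k, size=(t, d))       # bucket per (table, coord)
        self.s = rng.choice([-1.0, 1.0], size=(t, d))  # random sign per (table, coord)
        self.tables = np.zeros((t, k))

    def update(self, i, v):
        # turnstile update (i, v): O(t) work, one bucket touched per table
        rows = np.arange(self.tables.shape[0])
        self.tables[rows, self.h[:, i]] += self.s[:, i] * v

    def estimate_f2(self):
        # per-table estimate: sum of squared buckets; median taken over tables
        return float(np.median((self.tables ** 2).sum(axis=1)))

# usage: estimate F2 = ||x||_2^2 of a stream of turnstile updates
d, k, t = 1000, 400, 9
cs = CountSketchF2(d, k, t, seed=42)
x = np.zeros(d)
rng = np.random.default_rng(7)
for _ in range(5000):
    i, v = int(rng.integers(0, d)), float(rng.integers(-5, 6))
    x[i] += v
    cs.update(i, v)
true_f2 = float((x ** 2).sum())
est = cs.estimate_f2()
```

The median over t tables is exactly the nonlinearity discussed above; the paper's point is that a sufficiently sparse linear map lets an average play this role instead.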
Lastly, we note that random projection using hashing has found practical applications in machine learning [21,29,33]. In particular, the densification by replication was suggested by Weinberger et al. [33]. Although they claim a concentration bound for hashing-based dimensionality reduction, their claim is unfortunately false, due to an error in the application of Talagrand's inequality.
For integers a and b, let δ_ab denote the Kronecker delta: δ_ab = 1 if a = b and zero otherwise. Let nnz(x) denote the number of non-zero entries in vector x.
Let h′ : [cd] → [k] be a hash function chosen uniformly at random and let H′ ∈ {0, ±1}^{k×cd} be defined as H′_ij = δ_{i,h′(j)} r_j. Let the pre-conditioner P ∈ {0, ±1}^{cd×d} be defined as
Theorem 1 For any given vector x ∈ R^d, with probability 1 − 4δ, Φ satisfies the following property:
(1 − ε)‖x‖₂² ≤ ‖Φx‖₂² ≤ (1 + ε)‖x‖₂². (1)
The time required to compute Φx is O(c · nnz(x)) = O((nnz(x)/ε) log(1/δ) log²(k/δ)).
This is easily implied by the following. Let h : [d] → [k] be a hash function chosen uniformly at random. Let H ∈ {0, ±1}^{k×d} be defined as H_ij = δ_{i,h(j)} r_j; note that the matrix H has only d non-zero entries, exactly one per column.
Theorem 2 For any given vector x ∈ R^d such that ‖x‖_∞ ≤ 1/√c, for ε < 1 and δ < 1/10, with probability 1 − 3δ, H satisfies the following property:
(1 − ε)‖x‖₂² ≤ ‖Hx‖₂² ≤ (1 + ε)‖x‖₂².
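The matrix H of Theorem 2 can be sketched directly. This is a minimal illustration under assumed parameters: a fully random h and fully random signs r stand in for the paper's hash functions, and the function name is hypothetical.

```python
import numpy as np

def sparse_jl(x, k, seed=0):
    """Hashing-based projection H with H_ij = delta_{i,h(j)} r_j.

    Each column has exactly one nonzero (+1 or -1), so Hx costs O(nnz(x)).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    h = rng.integers(0, k, size=d)         # h: [d] -> [k], one bucket per coordinate
    r = rng.choice([-1.0, 1.0], size=d)    # Rademacher signs r_j
    y = np.zeros(k)
    np.add.at(y, h, r * x)                 # y_i = sum_{j: h(j)=i} r_j x_j
    return y

# a dense unit vector satisfies the density condition ||x||_inf <= 1/sqrt(c)
rng = np.random.default_rng(1)
x = rng.choice([-1.0, 1.0], size=4096) / np.sqrt(4096)
y = sparse_jl(x, k=256)
# E ||y||_2^2 = ||x||_2^2 = 1; the theorem controls the deviation
```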
For dense vectors, Theorem 1 gives a run-time of O((d/ε) log³(1/(εδ))); this, for small enough ε, could be significantly worse than the running times obtained by Ailon and Liberty in [3] and Matousek in [23]. However, we can modify the construction of the preconditioner so that we guarantee a running time of O(d log c log log c) for all vectors. Our new preconditioner is based on the randomized Hadamard construction of Ailon et al. [2,3].
Theorem 3 There exists a preconditioner G ∈ R^{d×d} such that for any input vector x ∈ R^d, with probability
The time required to compute (HG)x is given by O(min((nnz(x)/ε) log⁴(1/(εδ)), d log(1/(εδ)))).
Without loss of generality, we can assume ‖x‖₂ = 1. Let Y_i = Σ_j H_ij x_j = Σ_j δ_{i,h(j)} r_j x_j, and let
where E_r is the expectation taken with respect to the random variables r = {r_j}. Thus,
E_r[Y_i²] = Σ_j δ_{i,h(j)} x_j², since the cross-product terms cancel out by independence, i.e., E_r[r_j r_g] = 0 for j ≠ g.
The outline of the proof is as follows. We need to prove that Σ_i Y_i² ∈ (1 ± ε) with high probability. Writing Z_i = Y_i² − E_r[Y_i²], we will show that Σ_i Z_i is concentrated around zero. Indeed, since our hash function guarantees that each coordinate j ∈ [d] is mapped to one and exactly one hash bucket, we have that Σ_i E_r[Y_i²] = Σ_i Σ_j δ_{i,h(j)} x_j² = ‖x‖₂² = 1. Showing that Σ_i Z_i is concentrated around zero is thus enough.
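The identity Σ_i E_r[Y_i²] = ‖x‖₂², which holds for every fixed h precisely because each coordinate lands in exactly one bucket, can be verified by brute force over all sign vectors. The dimensions below are assumed small so the enumeration over r is exact.

```python
import numpy as np
from itertools import product

# For any fixed h, E_r[ sum_i Y_i^2 ] = sum_j x_j^2 = ||x||_2^2, because each
# coordinate j contributes x_j^2 to exactly one bucket.
d, k = 6, 3
rng = np.random.default_rng(2)
x = rng.normal(size=d)
h = rng.integers(0, k, size=d)            # an arbitrary fixed hash function

total = 0.0
for signs in product([-1.0, 1.0], repeat=d):   # enumerate all 2^d sign vectors r
    y = np.zeros(k)
    np.add.at(y, h, np.array(signs) * x)       # Y_i = sum_{j: h(j)=i} r_j x_j
    total += float((y ** 2).sum())
avg = total / 2 ** d                           # exact E_r[ sum_i Y_i^2 ]
assert np.isclose(avg, float((x ** 2).sum()))  # holds for every choice of h
```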
We will utilize the following form of the FKG inequality [6, Theorem 6.2.1].
Theorem 4 (FKG inequality) Let L be a finite distributive lattice and let μ : L → R⁺ be a log-supermodular function. Then, for an increasing function f and a decreasing function g, we have that
(Σ_{x∈L} f(x) g(x) μ(x)) · (Σ_{x∈L} μ(x)) ≤ (Σ_{x∈L} f(x) μ(x)) · (Σ_{x∈L} g(x) μ(x)).
we will assume α ≥ 3. We define the following function G(u, ·) as a shorthand for the upper bound on the conditional expectation of the MGF with respect to the {r_j} variables.
For a given h, let G_i denote the event that the ith hash bucket is good. Let G be the event that the hash function h is good. Abusing notation, we use G and G_i to represent the indicator variables of the corresponding events.
where Y_i = Σ_j δ_{i,h(j)} r_j x_j, i.e., Z_i = Σ_{j≠g ∈ h⁻¹(i)} r_j r_g x_j x_g.
Observe that E[Z_i] = 0 and our goal is to show that Σ_i Z_i is concentrated around 0.
Here is an overview of the proof. We first show in Lemma 6 that most h are good. In Lemma 7, we bound the moment generating function (MGF) of the random variable Z_i for a fixed h. A usual step at this point would be to remove the effect of a bad choice of the random variables from the MGF, perhaps by considering a truncated random variable Ẑ_i = min(Z_i, M). In our case, however, such a construction would introduce a dependence among the {r_j} and h variables, which appears to be insurmountable when trying to apply the FKG inequality. We instead have to utilize the notion of goodness of h only in defining the truncated random variable Ẑ_i. Using the result of Lemma 7, we first get Corollary 8, which gives the expected and worst-case bounds on the MGF for a good hash function h. We utilize these bounds to define Ẑ_i in (5). Next, in Lemma 9, we define two set functions f_s and g_s and show that they are monotone, in accordance with the requirements of the FKG inequality (Theorem 4). These functions are then used in Lemma 10 to show that the MGF of Σ_i Z_i can be bounded by the product of the MGFs of the individual Ẑ_i. We then bound the probability of an ε-deviation of Σ_i Z_i in Theorem 11. Subsequently, we use Theorem 11 to prove Theorems 1 and 2. Section 4 gives the proof of Theorem 3.
The proof (Appendix 9.1) is an application of Bernstein's inequality [24, Theorem 2.7] and utilizes the fact that since ‖x‖_∞ ≤ 1/√c and the hash function is random, with high probability no σ_i can be too large. In essence, this generalizes well-known facts about the maximum load in the balls-into-bins problem to the weighted case³.
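This weighted balls-into-bins fact can be probed empirically. The sketch below uses assumed illustrative parameters, not the paper's: it checks that when ‖x‖_∞ ≤ 1/√c, the largest bucket weight σ_i² = Σ_{j: h(j)=i} x_j² stays within a small constant factor of the average load 1/k.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, c = 4096, 64, 256
# dense unit vector: ||x||_inf = 1/64 <= 1/sqrt(c) = 1/16
x = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)
max_loads = []
for _ in range(200):
    h = rng.integers(0, k, size=d)                       # fresh random hash
    sigma2 = np.bincount(h, weights=x ** 2, minlength=k) # bucket weights sigma_i^2
    max_loads.append(float(sigma2.max()))
# the mean maximum load stays close to the average load 1/k
print(np.mean(max_loads), 1 / k)
```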
The following lemma gives a bound on the MGF of the variable Zi for a fixed h. The proof can be found in Appendix 9.2.
Lemma 7 If u < 1/(4σ_i²), then for a fixed h, E_r[exp(u Z_i)] ≤ G(u, σ_i⁴).
Lemma 7 leads to the following.
Corollary 8 If u < 1/(4σ_*²), then the expectation of the MGF can be bounded as
Similarly,
PROOF. By taking expectation over h and using
we have that
The upper bound on E_r[exp(uZ_i) | G] is given by
where we use E_r[exp(uZ_i)] ≤ G(u, σ_i⁴) from Lemma 7.
Next, we have to handle the fact that the Z_i variables are not independent. Yet, intuitively, since Z_i is roughly the cross-product of the set of entries x_j that map into the ith hash bucket, conditioned on one of the Z_i variables having achieved a large value, the probability that another Z_{i′} is also large decreases.
In fact, we show that we can apply the FKG inequality (Theorem 4) to the MGF of the Z_i random variables. Note that this situation is more involved than the simple negative dependence obtained on a set of random variables by conditioning their sum to be a constant: we cannot make such claims about Σ_i Z_i. For all i = 1, . . . , k let us define
We first need the following lemma in preparation for the application of the FKG inequality (Theorem 4).
Then f_s is an increasing and g_s is a decreasing set function.
PROOF. First we prove that f_s is increasing by showing that for all A ⊆ [d] and for all a ∈ [d] \ A, it holds that f_s(A ∪ {a}) ≥ f_s(A).
Observe that if h⁻¹(s) is good (i.e., if G_s holds), then by Corollary 8, we have E_r[exp(uZ_s)] ≤ G(u, σ_*⁴). Thus for all h and s, it holds from (5) that
There are two cases to consider. Suppose A ∪ {a} is bad. Then,
Also note that if h⁻¹(s) = A ∪ {a} and the sth bucket is good, then Ẑ_s = Z_s = V_A + r_a W_A holds. Therefore we have that
(By Jensen's inequality,
Here, (a) follows since only r_a is random in the inner expectation and
And, (b) follows since a ∉ A and V_A does not depend on h(a), by the independence of the values of r and h. Finally, (c) follows since if A ∪ {a} is good then so is A; therefore if h⁻¹(s) = A, then we have that Ẑ_s = Z_s = V_A. The proof that f_s is increasing is complete.
The proof that g_s is a decreasing function is similar, and can be found in Appendix 9.3.
Given our construction of the two functions f_s and g_s, we can now apply the FKG inequality (Theorem 4) to show that the MGF of the random variable Σ_{i=1}^k Ẑ_i is bounded by the product of the MGFs of the individual Ẑ_i variables.
where the expectation is taken over both h and r = {rj}.
PROOF. For all 1 ≤ s ≤ k, we prove
by induction on s. The base case of s = 1 is obvious. Now assume that the inductive hypothesis (7) holds for s − 1.
It is easy to check that μ_s is a log-supermodular measure⁴. Furthermore, observe that for any random variable X we have
and consequently,
Combining the latter with the induction hypothesis for s − 1 concludes the proof.
For the variables Zi we have
The proof of Theorem 11 involves a standard but tedious calculation that is similar to one done by Matousek [23]. The proof can be found in Appendix 9.4. Finally, we are ready to prove the main result.
Recall that Y_i = Σ_j H_ij x_j, thus
Plugging these values into Theorem 11(i), we have Σ_i Z_i > ε with probability at most exp(−(9/5) ln(1/δ) + 4δ) + δ < 2δ, for δ < 1/10. Similarly, from Theorem 11(ii), we have Σ_i Z_i < −ε with probability at most 2δ. Putting them together, with probability at least 1 − 4δ, the claim holds. Using multiple small copies of the randomized Hadamard matrix, we create the following preconditioner. Without loss of generality, we assume that d/b is an integer, for the given value of b. We note that [3] also contains a similar construct; here we present a more straightforward analysis using a different vector norm.
PROOF. If A is a b×b randomized Hadamard matrix, then for any b-dimensional vector z with ‖z‖₂ = 1 it holds that ‖Az‖₂ = 1. Using a Chernoff-type argument, Ailon and Chazelle [2] showed that
holds as well. Observe that the previous inequality trivially holds for ‖z‖₂ ≤ 1 as well. Let y = Gx, let G_j denote the jth diagonal block of G, partition x and y into d/b blocks x_j, and define y_j = G_j x_j. Now for a block j, if ‖x_j‖₂ ≤ 1/√c, then ‖y_j‖_∞ ≤ ‖y_j‖₂ ≤ 1/√c holds as well, since G_j is an isometry. Since x is of unit length, there can be at most c blocks j such that ‖x_j‖₂ ≥ 1/√c. Thus setting s to 1/√c in (8) and taking the union bound over these at most c blocks, we have that
establishing the claim.
Using the block-Hadamard preconditioner, we are ready to prove Theorem 3. The ε-approximation guarantee of the projection matrix Φ follows directly from the statements of Theorem 2 and Lemma 13.
In order to bound the running time, let nnz_b(x) denote the number of blocks of x that have non-zero coordinates. Then the running time of the block-Hadamard based hashing is
Note that if δ is not too small, then the running time of Theorem 3 is comparable to the best existing methods for dense vectors [3], yet it is much faster for sparse vectors. We remark that the localized Hadamard preconditioner presented in this section could also be combined with suitably sparse random matrices from [23] by making b larger, approximately equal to k. This variant would reproduce the results of [3], but it fails to show any improvement for sparse vectors over the naive construction, as the running time would be Ω(1/ε²) per non-zero element.
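A block-Hadamard preconditioner along these lines can be sketched as follows. This is an assumed illustration, not the exact construction of Theorem 3: the block size b, the scaling, and the function names are illustrative choices.

```python
import numpy as np

def fwht(v):
    """In-place fast Walsh-Hadamard transform (unnormalized); len(v) a power of 2."""
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v

def block_hadamard_precondition(x, b, seed=0):
    """Apply an independent randomized Hadamard transform H D to each length-b block.

    Each block transform is an isometry; only O(d log b) work in total.
    """
    rng = np.random.default_rng(seed)
    y = x.astype(float).copy()
    for start in range(0, len(x), b):
        blk = y[start:start + b]
        blk *= rng.choice([-1.0, 1.0], size=b)        # random diagonal D
        y[start:start + b] = fwht(blk) / np.sqrt(b)   # normalized Hadamard H
    return y

x = np.zeros(1024)
x[17] = 1.0                    # a maximally spiky unit vector
y = block_hadamard_precondition(x, b=64)
# the block containing coordinate 17 is spread out: ||y||_inf = 1/sqrt(64)
```

Only the blocks containing non-zeros need to be transformed, which is the source of the nnz_b(x)-dependent term in the running time above.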
A random matrix Φ is said to have the JL property if for every vector x, Φx satisfies (1) with probability 1 − δ over the choice of Φ.
We show a lower bound on the sparsity for a class of constructions of matrices with the JL property. The construction of the matrix is modeled as a two-stage process: first, the set of indices that have non-zero entries is chosen, and then each column is chosen independently at random. Note that we do not assume that the random variables are independent within a column.
The lower bound argument of Matousek [23] shows that if the set of non-zero indices in the first stage is chosen by independent coin tosses and if the random variables in the second stage are independent (scaled) ±1 with equal probability, then Ω(1/ε²) non-zero entries per column are needed in expectation to guarantee that the resulting matrix has the JL property.
We show a lower bound on the sparsity for the case when the non-zero indices are chosen arbitrarily. As mentioned earlier, if the random variables in the second stage are N(0, 1), then it is easy to obtain a lower bound of Ω(1/ε²) on the number of non-zero entries per column: indeed, the lower bound follows since Ω(1/ε²) such random variables are needed so that their sum is (1 ± ε), w.h.p.
Under mild technical conditions on the random variables, we can prove the lower bound stated in Theorem 14. It is easy to see that the conditions of Theorem 14 are satisfied if the random entries are independent (scaled) ±1 or when they are generated by the replicated hashing construct of Theorem 1. Thus the upper bound of Theorem 1 is tight with respect to ε. The bound on the number of non-zeros per column implies a bound on the worst-case update time over all vectors as well.
Theorem 14 Let 1 ≤ c ≤ k < d be integers and let M be an arbitrary, fixed or random, k × d 0-1 matrix with at most c non-zeros per column. Let P be a k × d random matrix of the following form. Here the vector-valued U_{*j} random variables are independent and for each j it holds that
PROOF. For all i = 1, . . . , d let C_i = {s ∈ [k] : M_si ≠ 0} denote the index set of non-zeros in the ith column of P. Furthermore, let V = {e_1, . . . , e_d}, where e_i denotes the ith unit vector. For i ≠ j we also define X_ij = C_i ∩ C_j and S = Σ_{t ∈ X_ij} U_ti U_tj. Then we have that
Using the fourth moment method [9], we show that S has a large deviation with constant probability unless c is large enough. Towards this goal, for all t ∈ X_ij set Y_t = U_ti U_tj and let x_ij = |X_ij|. W.l.o.g. we can assume that each column of M contains exactly c non-zeros and that
hold as well; otherwise we replace P with a copy of P whose rows are randomly permuted. Furthermore, we can also assume that E[U_si U_ti] = 0 holds, as multiplying each row of P with independent uniformly distributed ±1 random variables does not change (9) or the theorem's conditions. Finally, w.l.o.g. we can assume that for all s, t_1, t_2, t_3 with s ∉ {t_1, t_2, t_3} it holds that E[Y_s Y_{t_1} Y_{t_2} Y_{t_3}] = 0, as multiplying the rows of P with random ±1 ensures the latter condition as well.
Similarly note that
By our assumptions it holds that
holds by independence and hence from Hölder's inequality we have that
. Lastly, recall that for all s, t_1, t_2, t_3 with s ∉ {t_1, t_2, t_3} we have that E[Y_s Y_{t_1} Y_{t_2} Y_{t_3}] = 0. Now [9, Theorem 3.5] states that
Therefore we have that
On the other hand, it follows from the assumed JL property of P that with probability 1 − o(1), for all 1 ≤ i < j ≤ d, we have that |‖P(e_i + e_j)‖₂² − 2| ≤ 2ε and that
Therefore, combining equality (9) with inequality (10), it follows that
≤ 4ε must hold for all i ≠ j, or equivalently |C_i ∩ C_j| ≤ z with z = 16ε²c² for all i ≠ j.
If z < 1, then the C_i are pairwise disjoint and therefore k ≥ dc ≥ d, a contradiction. Thus z ≥ 1, and hence c ≥ 1/(16ε) immediately. In what follows we strengthen the latter lower bound for a large range of d and k, as claimed.
If c ≥ 1/(32ε²), then the lemma clearly holds, as Ω(1/ε²) is the largest of the lower bounds claimed. Now note that c ≥ 2: if c = 1 were to hold, then from ε < 1/4 it would follow that z < 1, which is a contradiction as before. Therefore if c ≤ 1/(32ε²), then all C_i's are distinct as
Observe that any (z + 1)-element set is contained in at most one C_i. Therefore the number of distinct C_i is at most f(k, c, z + 1) = (k choose z+1)/(c choose z+1),
a well-known fact from block designs and set packing [18]. From the Stirling formula, it follows that for all 1 ≤ y < x it holds that
(ex/y)^y · (0.8/√(2π)) · √((x−y)/(xy)) ≤ (x choose y) ≤ (ex/y)^y · (1.1/√(2π)) · √(x/(y(x−y))).
Therefore we have that
Now observe that d ≤ f(k, c, z + 1), as all C_i are distinct. Combining the latter with inequality (11), we have that log_k(d) − 3 ≤ z. Recalling that 1 ≤ z = 16ε²c² concludes the proof.
Using a replication argument, it is easy to see that if a matrix P only has the JL property for vectors x with ‖x‖_∞/‖x‖₂ ≤ α for some α, then under the conditions of Theorem 14 we have that at least one
If the fourth moment of the random entries per column scales with the number of non-zeros per column, the next theorem strengthens the previous claim by bounding the average number of non-zeros per column. This condition is satisfied, for example, if the non-zero entries are independent scaled ±1 random variables.
Theorem 15 Let 0 < ε ≤ 1/4 and let M be an arbitrary k × d 0-1 matrix with 2k² < d. Let c_j denote the number of non-zeros in the jth column of M. Let P be a k × d random matrix of the following form. Here the vector-valued U_{*j} random variables are independent and for each j it holds that
For all j = 1, . . . , k, assemble the n_j columns of P with c_i = j into the k × n_j matrix P_j. For all j, if n_j > k, then from the assumed JL property of P it follows that P_j satisfies the conditions of Theorem 14 with c = j and thus j ≥ s. Therefore for all j < s we have that n_j ≤ k. The number of non-zeros in P is Σ_{i=1}^d c_i = Σ_{j=1}^k n_j · j, which we lower bound as follows
We can show the following result for the case where the target metric is ℓ1. The result and the corresponding proof are similar to those of Ailon and Chazelle [2]. We construct the matrix H as follows: H_ij = δ_{i,h(j)} r_j, where the r_j are now drawn i.i.d. from N(0, 1) instead of being ±1. We then have the following. Let β_0 = E[|z|] where z ∼ N(0, 1). By the 2-stability of the normal distribution, Y_i = Σ_j x_j r_j δ_{i,h(j)} ∼ N(0, σ_i), where
Theorem 16 There exists a constant ε_0 such that for all ε < ε_0, if c = k/ε, and
The proof is omitted in this version.
The most important open question is resolving the gap between the upper and lower bounds with respect to the error probability. It would be interesting to see whether our claims could be proven more directly using stronger concentration inequalities.
Application of the current result to streaming settings would also require proving the claims for a k-wise independent hash function and ±1 variables. The chief hurdle in applying the techniques of Clarkson and Woodruff [14] seems to be proving the FKG inequality in the limited-independence case. Note that Nisan's pseudorandom generator construction [25] can be used to derandomize the hash function, but the naive way of doing this increases the update time to k. We leave efficient derandomization as an open question.
It is worthwhile to note that the hash function represents a bipartite expander. In a similar vein, Berinde et al. [10] use an unbalanced expander graph based construction to create matrices with the restricted isometry property for sparse signal recovery. Their argument crucially uses two facts: that the error norm is ℓ1, and that the input vector is sparse. It would be interesting to investigate possible connections between these results.
PROOF. We show that σ₁² ≤ σ_*² with probability 1 − δ/k; the proof will then follow from the union bound.
c . We also have
Also, Σ_j X_j = σ₁² − 1/k. Plugging this into Bernstein's inequality [24, Theorem 2.7],
Choosing c = (4k/(3α)) log(k/δ), we get the above probability to be smaller than δ/k. Since α = 1/(ε log(k/δ)) and k = (12/ε²) log(1/δ), we have that choosing c = (16/ε) log(1/δ) log²(k/δ) is sufficient.
We first compute the expectation of the MGF for different conditions on the hashing function. We begin by proving Lemma 7.
PROOF. We have that Z_i = Σ_{j≠g ∈ h⁻¹(i)} r_j r_g x_j x_g. Hence
where Y_i = Σ_{j: h(j)=i} x_j r_j. Then,
By Markov's inequality, we get the probability of Y_i being larger than t as
Note that we do not need to worry about σ_i being zero, as then Y_i = 0. Then, we bound E_r[exp(uZ_i)] as follows. Denote p(t) = Pr[Z_i = t]. We first compute the expectation with respect to r. For any value of θ > 0, we have
The first term can be bounded as follows:
where the last inequality follows since in the range [θ, ∞], the integral is positive. Then, the calculation can be simplified as follows:
For the second term, we have Σ_{t>θ} exp(ut) p(t)
By the assumption of the lemma, since u < 1/(4σ_i²), we have that
By putting together the two parts, we have that
Choosing θ proportional to σ_*² ln(k/δ), the proof is complete.
We finish the proof of Lemma 9 by showing that g_s is decreasing. To this end, we prove that for all A ⊆ [d] and for all a ∈ [d] \ A, g_s(A ∪ {a}) ≤ g_s(A). Recalling the definition of g_s(A), we have
where the inner expectation is over the random variables {rj} only. Since h is completely independent we have that
and similarly
Therefore it is sufficient to show that for all (h_1, . . . , h_{a−1}, h_{a+1}, . . . , h_d) ∈ [k]^{d−1}
We shall prove the following stronger inequality: for all (h_1, . . . , h_{a−1}, h_{a+1}, . . . , h_d) ∈ [k]^{d−1}
Now observe that only r are random in the above expectations and that Ẑi are conditionally independent given h. Therefore,
From the non-negativity of the exponential function, it follows that it is sufficient to show that for all i = 1, . . . , s − 1 and for all (h_1, . . . , h_{a−1}, h_{a+1}, . . . , h_d) ∈ [k]^{d−1} with h_j = s ⇔ j ∈ A for all j ≠ a, and for all h_a ∈ [k] \ {s}, it holds that E_L ≥ E_R, where
We prove inequality (13) by a case analysis. If h_a ≠ i, then E_L = E_R by definition. If h_a = i and the ith bucket of E_L's hash function is bad, then E_L = G(u, σ_*⁴) ≥ E_R, as shown earlier in Corollary 8.
If h_a = i and the ith bucket of E_L's hash function is good, then the ith bucket of E_R's hash function is also good. As before, define V_a = Σ_{j≠a} Σ_{g≠a, j≠g} r_j r_g x_j x_g δ_{h(j),i} δ_{h(g),i} and W_a = Σ_{j≠a} r_j x_j x_a δ_{h(j),i} δ_{h(a),i}.
Again, note that if the ith bucket is good as assumed, then Ẑ_i = Z_i = V_a + r_a W_a. Therefore we have that
Here the last equality follows from the fact that all r and h values are independent and V_a does not depend on h(a). If h⁻¹(s) = A ∪ {a} and the ith hash bucket is good as assumed, then Ẑ_i = Z_i = V_a and we observe that E[exp(uV_a) | ∀j ≠ a : h(j) = h_j, h(a) = s] = E_R. (16) Combining (15) and (16), we conclude that E_L ≥ E_R in all cases, and hence g_s is decreasing as claimed.
PROOF. Recall the definition of the random variable Ẑ_i from (5). Note that G(u, σ_*⁴) > 1, and hence for u > 0, (1/u) log G(u, σ_*⁴) > 0. Also recall that G_i is the indicator for bucket i being good, and G is the indicator for the hash function being good. By the definition of Ẑ_i, since (1/u) log G(u, σ_*⁴) > 0, on the event G we have Σ_i Z_i ≤ Σ_i Ẑ_i, and hence we have that
Taking the product over the k terms, by using Lemma 10,
≤ exp((2/(kθ²))(exp(uθ) − 1 − uθ) + 4δ).
For completeness, we show how to determine the optimal u. Note that we need to restrict u <
¹In fact, we need an average of 1/ε Gaussians to get a (1 ± ε)-approximation.
²It is not hard to see that the σ of [22] equals max{σ_i} studied in Lemma 6.
³Sanders [27] contains a proof of the expected load for the weighted balls-and-bins problem, but does not contain a proof of the high-probability statement.
⁴See [6, Section 6.2, page 87] for a precise definition and proof of this fact.