Kronecker Determinantal Point Processes


Authors: Zelda Mariet, Suvrit Sra

Zelda Mariet, Massachusetts Institute of Technology, Cambridge, MA 02139, zelda@csail.mit.edu
Suvrit Sra, Massachusetts Institute of Technology, Cambridge, MA 02139, suvrit@mit.edu

Abstract

Determinantal Point Processes (DPPs) are probabilistic models over all subsets of a ground set of N items. They have recently gained prominence in several applications that rely on "diverse" subsets. However, their applicability to large problems is still limited due to the O(N^3) complexity of core tasks such as sampling and learning. We enable efficient sampling and learning for DPPs by introducing KronDPP, a DPP model whose kernel matrix decomposes as a tensor product of multiple smaller kernel matrices. This decomposition immediately enables fast exact sampling. But contrary to what one may expect, leveraging the Kronecker product structure for speeding up DPP learning turns out to be more difficult. We overcome this challenge, and derive batch and stochastic optimization algorithms for efficiently learning the parameters of a KronDPP.

1 Introduction

Determinantal Point Processes (DPPs) are discrete probability models over the subsets of a ground set of N items. They provide an elegant model that assigns probabilities to an exponentially large sample space, while permitting tractable (polynomial time) sampling and marginalization. They are often used to provide models that balance "diversity" and quality, characteristics valuable to numerous problems in machine learning and related areas [17]. The antecedents of DPPs lie in statistical mechanics [24], but since the seminal work of [15] they have made inroads into machine learning. By now they have been applied to a variety of problems such as document and video summarization [6, 21], sensor placement [14], recommender systems [31], and object retrieval [2].
More recently, they have been used to compress fully-connected layers in neural networks [26] and to provide optimal sampling procedures for the Nyström method [20]. The more general study of DPP properties has also garnered a significant amount of interest; see e.g. [1, 5, 7, 12, 16-18, 23].

However, despite their elegance and tractability, widespread adoption of DPPs is impeded by the O(N^3) cost of basic tasks such as (exact) sampling [12, 17] and learning [10, 12, 17, 25]. This cost has motivated a string of recent works on approximate sampling methods such as MCMC samplers [13, 20] or core-set based samplers [19]. The task of learning a DPP from data has received less attention; the methods of [10, 25] cost O(N^3) per iteration, which is clearly unacceptable for realistic settings. This burden is partially ameliorated in [9], who restrict to learning low-rank DPPs, though at the expense of being unable to sample subsets larger than the chosen rank.

These considerations motivate us to introduce KronDPP, a DPP model that uses Kronecker (tensor) product kernels. As a result, KronDPP enables us to learn large DPP kernels, while also permitting efficient (exact and approximate) sampling. The use of Kronecker products to scale matrix models is a popular and effective idea in several machine-learning settings [8, 27, 28, 30]. But as we will see, its efficient execution for DPPs turns out to be surprisingly challenging.

To make our discussion more concrete, we recall some basic facts now. Suppose we have a ground set of N items Y = {1, ..., N}. A discrete DPP over Y is a probability measure P on 2^Y parametrized by a positive definite matrix K (the marginal kernel) with 0 ⪯ K ⪯ I, such that for a random subset Y drawn from P,

    ∀ A ⊆ Y,  P(A ⊆ Y) = det(K_A),    (1)

where K_A is the submatrix of K indexed by the elements of A (i.e., K_A = [K_ij]_{i,j ∈ A}).
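As a small numeric illustration of (1): the diagonal of K gives singleton inclusion probabilities, and off-diagonal entries penalize co-occurrence. The kernel below is an arbitrary toy example, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A valid marginal kernel: symmetric with eigenvalues strictly inside (0, 1).
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
K = Q @ np.diag([0.2, 0.5, 0.9]) @ Q.T

def inclusion_prob(K, A):
    """P(A ⊆ Y) = det(K_A), the principal minor of K indexed by A."""
    return np.linalg.det(K[np.ix_(A, A)])

p0  = inclusion_prob(K, [0])      # singleton: just K[0, 0]
p01 = inclusion_prob(K, [0, 1])   # pair: K00*K11 - K01^2

# Negative association ("diversity"): items repel each other, so the joint
# inclusion probability never exceeds the product of the marginals.
assert p01 <= p0 * inclusion_prob(K, [1]) + 1e-12
```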
If a DPP with marginal kernel K assigns nonzero probability to the empty set, the DPP can alternatively be parametrized by a positive definite matrix L (the DPP kernel) so that

    P(Y) ∝ det(L_Y)  ⟹  P(Y) = det(L_Y) / det(L + I).    (2)

A brief manipulation (see e.g. [17, Eq. 15]) shows that when the inverse exists, L = K(I − K)^{-1}. The determinants, such as in the normalization constant in (2), make operations over DPPs typically cost O(N^3), which is a key impediment to their scalability. Therefore, if we consider a class of DPP kernels whose structure makes it easy to compute determinants, we should be able to scale up DPPs. An alternative approach towards scalability is to restrict the size of the subsets, as done in k-DPPs [16] or when using rank-k DPP kernels [9] (where k ≪ N). Both these approaches still require O(N^3) preprocessing for exact sampling; another caveat is that they limit the DPP model by assigning zero probability to sets of cardinality greater than k.

In contrast, KronDPP uses a kernel matrix of the form L = L_1 ⊗ ⋯ ⊗ L_m, where each sub-kernel L_i is a smaller positive definite matrix. This decomposition has two key advantages: (i) it significantly lowers the number of parameters required to specify the DPP from N^2 to O(N^{2/m}) (assuming the sub-kernels are of roughly equal size); and (ii) it enables fast sampling and learning. For ease of exposition, we describe the specific details of KronDPP for m = 2; as will become clear from the analysis, typically the special cases m = 2 and m = 3 should suffice to obtain low-complexity sampling and learning algorithms.

Contributions. Our main contribution is the KronDPP model along with efficient algorithms for sampling from it and learning a Kronecker factored kernel.
Specifically, inspired by the algorithm of [25], we develop KRK-Picard (Kronecker-Kernel Picard), a block-coordinate ascent procedure that generates a sequence of Kronecker factored estimates of the DPP kernel while ensuring monotonic progress on its (difficult, nonconvex) objective function. More importantly, we show how to implement KRK-Picard to run in O(N^2) time when implemented as a batch method, and in O(N^{3/2}) time and O(N) space when implemented as a stochastic method. As alluded to above, unlike many other uses of Kronecker models, KronDPP does not admit trivial scaling up, largely due to the extensive dependence of DPPs on arbitrary submatrices of the DPP kernel. An interesting theoretical nugget that arises from our analysis is the combinatorial problem we call subset clustering, a problem whose (even approximate) solution can lead to further speedups of our algorithms.

2 Preliminaries

We begin by recalling basic properties of Kronecker products needed in our analysis; we omit proofs of these well-known results for brevity. The Kronecker (tensor) product of two matrices A ∈ R^{p×q} and B ∈ R^{r×s} is defined as the pr × qs block matrix A ⊗ B = [a_ij B]_{i,j=1}^{p,q}. We denote the block a_ij B in A ⊗ B by (A ⊗ B)^{(ij)} for any valid pair (i, j), and extend this notation to non-Kronecker-product matrices to indicate the submatrix of size r × s at position (i, j).

Proposition 2.1. Let A, B, C, D be matrices of sizes such that AC and BD are well-defined. Then,
(i) if A, B ⪰ 0, then A ⊗ B ⪰ 0;
(ii) if A and B are invertible, then so is A ⊗ B, with (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1};
(iii) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).

An important consequence of Prop. 2.1(iii) is the following corollary.

Corollary 2.2. Let A = P_A D_A P_A^T and B = P_B D_B P_B^T be the eigenvector decompositions of A and B. Then A ⊗ B diagonalizes as (P_A ⊗ P_B)(D_A ⊗ D_B)(P_A ⊗ P_B)^T.
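These identities are easy to check numerically; a small NumPy sketch (sizes and seeds chosen arbitrarily) verifying Prop. 2.1(ii) and Cor. 2.2, the property that later underlies fast KronDPP sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(n):
    """Random symmetric positive definite matrix."""
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

A, B = random_spd(3), random_spd(4)

# Prop 2.1(ii): (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
lhs = np.linalg.inv(np.kron(A, B))
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(B))
assert np.allclose(lhs, rhs)

# Cor 2.2: eigendecomposing A and B separately diagonalizes A ⊗ B.
dA, PA = np.linalg.eigh(A)
dB, PB = np.linalg.eigh(B)
P, D = np.kron(PA, PB), np.kron(np.diag(dA), np.diag(dB))
assert np.allclose(P @ D @ P.T, np.kron(A, B))
```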
We will also need the notion of partial trace operators, which are perhaps less well known.

Definition 2.3. Let A ∈ R^{N_1 N_2 × N_1 N_2}. The partial traces Tr_1(A) and Tr_2(A) are defined as follows:

    Tr_1(A) := [Tr(A^{(ij)})]_{1≤i,j≤N_1} ∈ R^{N_1×N_1},    Tr_2(A) := Σ_{i=1}^{N_1} A^{(ii)} ∈ R^{N_2×N_2}.

The action of partial traces is easy to visualize: indeed, Tr_1(A ⊗ B) = Tr(B)·A and Tr_2(A ⊗ B) = Tr(A)·B. For us, the most important property of partial trace operators is their positivity.

Proposition 2.4. Tr_1 and Tr_2 are positive operators, i.e., for A ⪰ 0, Tr_1(A) ⪰ 0 and Tr_2(A) ⪰ 0.

Proof. Please refer to [4, Chap. 4].

3 Learning the kernel matrix for KronDPP

In this section, we consider the key difficult task for KronDPPs: learning a Kronecker product kernel matrix from n observed subsets Y_1, ..., Y_n. Using the definition (2) of P(Y_i), maximum-likelihood learning of a DPP with kernel L results in the optimization problem:

    argmax_{L≻0} φ(L),   φ(L) = (1/n) Σ_{i=1}^n (log det(L_{Y_i}) − log det(L + I)).    (3)

This problem is nonconvex and conjectured to be NP-hard [15, Conjecture 4.1]. Moreover, the constraint L ≻ 0 is nontrivial to handle. Writing U_i for the indicator matrix of Y_i, of size N × |Y_i|, so that L_{Y_i} = U_i^T L U_i, the gradient of φ is easily seen to be

    Δ := ∇φ(L) = (1/n) Σ_{i=1}^n U_i L_{Y_i}^{-1} U_i^T − (L + I)^{-1}.    (4)

In [25], the authors derived an iterative method ("the Picard iteration") for computing an L that solves Δ = 0 by running the simple iteration

    L ← L + LΔL.    (5)

Moreover, iteration (5) is guaranteed to monotonically increase the log-likelihood φ [25]. But these benefits accrue at a cost of O(N^3) per iteration, and furthermore a direct application of (5) cannot guarantee the Kronecker structure required by KronDPP.
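The partial traces of Definition 2.3 reduce to simple block manipulations; a NumPy sketch (the function names are ours), together with a check of the identities Tr_1(A ⊗ B) = Tr(B)·A and Tr_2(A ⊗ B) = Tr(A)·B:

```python
import numpy as np

def partial_trace_1(A, n1, n2):
    """Tr_1(A): the n1 x n1 matrix of traces of the n2 x n2 blocks A^{(ij)}."""
    blocks = A.reshape(n1, n2, n1, n2)   # blocks[i, :, j, :] is A^{(ij)}
    return np.einsum('iaja->ij', blocks)

def partial_trace_2(A, n1, n2):
    """Tr_2(A): the sum of the n1 diagonal blocks A^{(ii)}."""
    blocks = A.reshape(n1, n2, n1, n2)
    return np.einsum('iaib->ab', blocks)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
M = np.kron(A, B)

assert np.allclose(partial_trace_1(M, 3, 4), np.trace(B) * A)
assert np.allclose(partial_trace_2(M, 3, 4), np.trace(A) * B)
```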
3.1 Optimization algorithm

Our aim is to obtain an efficient algorithm to (locally) optimize (3). Beyond its nonconvexity, the Kronecker structure L = L_1 ⊗ L_2 imposes another constraint. As in [25], we first rewrite φ as a function of S = L^{-1} and re-arrange terms to write it as

    φ(S) = log det(S) + (1/n) Σ_{i=1}^n log det(U_i^T S^{-1} U_i) − log det(I + S),    (6)

where f(S) := log det(S) denotes the first term and g(S) the remaining terms. It is easy to see that f is concave, while a short argument shows that g is convex [25]. An appeal to the convex-concave procedure [29] then shows that updating S by solving ∇f(S^{(k+1)}) + ∇g(S^{(k)}) = 0, which is what (5) does [25, Thm. 2.2], is guaranteed to monotonically increase φ.

But for KronDPP this idea does not apply so easily: due to the constraint L = L_1 ⊗ L_2, the function

    g_⊗ : (S_1, S_2) ↦ (1/n) Σ_{i=1}^n log det(U_i^T (S_1 ⊗ S_2)^{-1} U_i) − log det(I + S_1 ⊗ S_2)

fails to be convex, precluding an easy generalization. Nevertheless, for fixed S_1 or S_2 the functions

    f_1 : S_1 ↦ f(S_1 ⊗ S_2),  g_1 : S_1 ↦ g(S_1 ⊗ S_2),  f_2 : S_2 ↦ f(S_1 ⊗ S_2),  g_2 : S_2 ↦ g(S_1 ⊗ S_2)

are once again concave or convex. Indeed, the map ⊗ : S_1 ↦ S_1 ⊗ S_2 is linear and f is concave, so f_1 = f ∘ ⊗ is also concave; similarly, f_2 is seen to be concave and g_1 and g_2 are convex. Hence, by generalizing the arguments of [29, Thm. 2] to our "block-coordinate" setting, updating via

    ∇f_i(S_i^{(k+1)}) = −∇g_i(S_i^{(k)}),  for i = 1, 2,    (7)

should increase the log-likelihood φ at each iteration. We prove below that this is indeed the case, and that updating as per (7) ensures positive definiteness of the iterates as well as monotonic ascent.

3.1.1 Positive definite iterates and ascent

In order to show the positive definiteness of the solutions to (7), we first derive their closed form.
F or S 1  0 , S 2  0 , the solutions to ( 7 ) ar e given by the following expr essions: ∇ f 1 ( X ) = −∇ g 1 ( S 1 ) ⇐ ⇒ X − 1 = T r 1 (( I ⊗ S 2 )( L + L ∆ L )) / N 2 ∇ f 2 ( X ) = −∇ g 2 ( S 2 ) ⇐ ⇒ X − 1 = T r 2 (( S 1 ⊗ I )( L + L ∆ L )) / N 1 . Mor eover , these solutions ar e positive definite. Pr oof. The details are somewhat technical, and are hence gi ven in Appendix A . W e know that L  0 = ⇒ L + L ∆ L ≥ 0 , because L − L ( I + L ) − 1 L  0 . Since the partial trace operators are positiv e (Prop. 2.4 ), it follo ws that the solutions to ( 7 ) are also positive definite. W e are now ready to establish that these updates ensure monotonic ascent in the log-likelihood. Theorem 3.2 (Ascent) . Starting with L (0) 1  0 , L (0) 2  0 , updating according to ( 7 ) generates positive definite iterates L ( k ) 1 and L ( k ) 2 , and the sequence  φ  L ( k ) 1 ⊗ L ( k ) 2  k ≥ 0 is non-decr easing. Pr oof. Updating according to ( 7 ) generates positiv e definite matrices S i , and hence positiv e definite subkernels L i = S i . Moreov er , due to the con ve xity of g 1 and conca vity of f 1 , for matrices A, B  0 f 1 ( B ) ≤ f 1 ( A ) + ∇ f 1 ( A ) > ( B − A ) , g 1 ( A ) ≥ g 1 ( B ) + ∇ g 1 ( B ) > ( A − B ) . Hence, f 1 ( A ) + g 1 ( A ) ≥ f 1 ( B ) + g 1 ( B ) + ( ∇ f 1 ( A ) + ∇ g 1 ( B )) > ( A − B ) . Thus, if S ( k ) 1 , S ( k +1) 1 verify ( 7 ), by setting A = S ( k +1) 1 and B = S ( k ) 1 we hav e φ  L ( k +1) 1 ⊗ L ( k ) 2  = f 1  S ( k +1) 1  + g 1  S ( k +1) 1  ≥ f 1  S ( k ) 1  + g 1  S ( k ) 1  = φ  L ( k ) 1 ⊗ L ( k ) 2  . The same reasoning holds for L 2 , which prov es the theorem. As T r 1 (( I ⊗ S 2 ) L ) = N 2 L 1 (and similarly for L 2 ), updating as in ( 7 ) is equiv alent to updating L 1 ← L 1 + T r 1  ( I ⊗ L − 1 2 )( L ∆ L )  / N 2 , L 2 ← L 2 + T r 2  ( L − 1 1 ⊗ I )( L ∆ L )  / N 1 . Genearlization. 
We can generalize the updates to take an additional step-size parameter a:

    L_1 ← L_1 + a·Tr_1((I ⊗ L_2^{-1})(LΔL)) / N_2,
    L_2 ← L_2 + a·Tr_2((L_1^{-1} ⊗ I)(LΔL)) / N_1.

Experimentally, a > 1 (as long as the updates remain positive definite) can provide faster convergence, although the monotonicity of the log-likelihood is no longer guaranteed. We found experimentally that the range of admissible a is larger than for Picard, but decreases as N grows larger.

The arguments above easily generalize to the multiblock case. Thus, when learning L = L_1 ⊗ ⋯ ⊗ L_m, writing E_ij for the matrix with a 1 in position (i, j) and zeros elsewhere, we update L_k via

    (L_k)_ij ← (L_k)_ij + N_k/(N_1 ⋯ N_m) · Tr[(L_1^{-1} ⊗ ⋯ ⊗ L_{k−1}^{-1} ⊗ E_ij ⊗ L_{k+1}^{-1} ⊗ ⋯ ⊗ L_m^{-1})(LΔL)].

From the above updates it is not transparent whether the Kronecker product saves us any computation. In particular, it is not clear whether the updates can be implemented to run faster than O(N^3). We show in the next section how to implement these updates efficiently.

3.1.2 Algorithm and complexity analysis

From Theorem 3.2, we obtain Algorithm 1 (which is different from the Picard iteration in [25], because it operates alternatingly on each subkernel). It is important to note that a further speedup to Algorithm 1 can be obtained by performing stochastic updates: instead of computing the full gradient of the log-likelihood, we perform our updates using only one subset Y_i (or a small minibatch) at each step instead of iterating over the entire training set; this uses the stochastic gradient Δ = U_i L_{Y_i}^{-1} U_i^T − (I + L)^{-1}. The crucial strength of Algorithm 1 lies in the following result:

Algorithm 1  KRK-Picard iteration
Input: matrices L_1, L_2, training set T, parameter a.
  for i = 1 to maxIter do
    L_1 ← L_1 + a·Tr_1((I ⊗ L_2^{-1})(LΔL)) / N_2    // or update stochastically
    L_2 ← L_2 + a·Tr_2((L_1^{-1} ⊗ I)(LΔL)) / N_1    // or update stochastically
  end for
  return (L_1, L_2)

Theorem 3.3 (Complexity). For N_1 ≈ N_2 ≈ √N, the updates in Algorithm 1 can be computed in O(nκ^3 + N^2) time and O(N^2) space, where κ is the size of the largest training subset. Furthermore, stochastic updates can be computed in O(Nκ^2 + N^{3/2}) time and O(N + κ^2) space.

Indeed, by leveraging the properties of the Kronecker product, the updates can be obtained without explicitly computing LΔL. This result is non-trivial: the two components of Δ, namely (1/n) Σ_i U_i L_{Y_i}^{-1} U_i^T and (I + L)^{-1}, must be treated separately for computational efficiency. The proof is provided in App. B. However, it seems that considering more than two subkernels does not lead to further speed-ups.

To summarize, for N_1 ≈ N_2 ≈ √N the complexities are:
– non-stochastic updates: O(nκ^3 + N^2) time, O(N^2) space;
– stochastic updates: O(Nκ^2 + N^{3/2}) time, O(N + κ^2) space.

This is a marked improvement over [25], which runs in O(N^2) space and O(nκ^3 + N^3) time (non-stochastic) or O(N^3) time (stochastic); Algorithm 1 also provides faster stochastic updates than [9]. However, one may wonder whether, by learning the sub-kernels through alternating updates, the log-likelihood converges to a sub-optimal limit. The next section discusses how to jointly update L_1 and L_2.
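As a concrete but deliberately naive illustration, the sketch below implements one KRK-Picard sweep by forming L and LΔL explicitly, in O(N^3) time; it is meant only to make the updates and the ascent property tangible, not to reproduce the O(N^2) implementation of Theorem 3.3. All names and the toy data are our own.

```python
import numpy as np

def ptr1(A, n1, n2):  # Tr_1: traces of the n2 x n2 blocks
    return np.einsum('iaja->ij', A.reshape(n1, n2, n1, n2))

def ptr2(A, n1, n2):  # Tr_2: sum of the diagonal n2 x n2 blocks
    return np.einsum('iaib->ab', A.reshape(n1, n2, n1, n2))

def grad(L, subsets):
    """Eq. (4): ∆ = (1/n) Σ_i U_i L_{Y_i}^{-1} U_i^T − (I + L)^{-1}."""
    N = L.shape[0]
    Theta = np.zeros_like(L)
    for Y in subsets:
        Theta[np.ix_(Y, Y)] += np.linalg.inv(L[np.ix_(Y, Y)])
    return Theta / len(subsets) - np.linalg.inv(np.eye(N) + L)

def loglik(L, subsets):
    """φ(L) from Eq. (3)."""
    ld_norm = np.linalg.slogdet(L + np.eye(L.shape[0]))[1]
    return np.mean([np.linalg.slogdet(L[np.ix_(Y, Y)])[1] for Y in subsets]) - ld_norm

def krk_picard_step(L1, L2, subsets, a=1.0):
    """One alternating sweep of Algorithm 1 (naive O(N^3) version)."""
    N1, N2 = L1.shape[0], L2.shape[0]
    L = np.kron(L1, L2)
    M = L @ grad(L, subsets) @ L                     # L∆L, formed explicitly
    L1 = L1 + a * ptr1(np.kron(np.eye(N1), np.linalg.inv(L2)) @ M, N1, N2) / N2
    L = np.kron(L1, L2)                              # refresh before second block
    M = L @ grad(L, subsets) @ L
    L2 = L2 + a * ptr2(np.kron(np.linalg.inv(L1), np.eye(N2)) @ M, N1, N2) / N1
    return L1, L2

rng = np.random.default_rng(0)
spd = lambda n: (lambda X: X @ X.T + n * np.eye(n))(rng.standard_normal((n, n)))
L1, L2 = spd(3), spd(3)
subsets = [[0, 1, 2], [3, 4], [1, 5, 6], [0, 7, 8]]

ll_before = loglik(np.kron(L1, L2), subsets)
L1, L2 = krk_picard_step(L1, L2, subsets)
ll_after = loglik(np.kron(L1, L2), subsets)
assert ll_after >= ll_before - 1e-8                  # monotone ascent (Thm. 3.2)
```

With a = 1 each block update is the exact solution of (7) given in Prop. 3.1, so the likelihood cannot decrease; larger a trades this guarantee for speed.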
3.2 Joint updates

We also analyzed the possibility of updating L_1 and L_2 jointly: we update L ← L + LΔL and then recover the Kronecker structure of the kernel by defining the updates L'_1 and L'_2 such that

    (L'_1, L'_2) minimizes ‖L + LΔL − L'_1 ⊗ L'_2‖_F^2,  subject to L'_1 ⪰ 0, L'_2 ⪰ 0, ‖L'_1‖ = ‖L'_2‖.    (8)

We show in Appendix C that such solutions exist and can be computed from the first singular value and singular vectors of the matrix R = [vec((L^{-1} + Δ)^{(ij)})^T]_{i,j=1}^{N_1}. Note however that in this case there is no guaranteed increase in the log-likelihood. The pseudocode for the related algorithm (Joint-Picard) is given in Appendix C.1. An analysis similar to the proof of Thm. 3.3 shows that the updates can be obtained in O(nκ^3 + max(N_1, N_2)^4) time.

3.3 Memory-time trade-off

Although KronDPPs have tractable learning algorithms, the memory requirements remain high for non-stochastic updates, as the matrix Θ = (1/n) Σ_i U_i L_{Y_i}^{-1} U_i^T needs to be stored, requiring O(N^2) memory. However, if the training set can be subdivided such that

    {Y_1, ..., Y_n} = ∪_{k=1}^m S_k   s.t.   ∀k, |∪_{Y ∈ S_k} Y| < z,    (9)

then Θ can be decomposed as (1/n) Σ_{k=1}^m Θ_k with Θ_k = Σ_{Y_i ∈ S_k} U_i L_{Y_i}^{-1} U_i^T. Due to the bound in Eq. (9), each Θ_k will be sparse, with only z^2 non-zero coefficients. We can then store each Θ_k with minimal storage and update L_1 and L_2 in O(nκ^3 + mz^2 + N^{3/2}) time and O(mz^2 + N) space.

Determining the existence of such a partition of size m is a variant of the NP-hard Subset-Union Knapsack Problem (SUKP) [11] with m knapsacks and where the value of each item (i.e., each Y_i) is equal to 1: a solution to SUKP of value n with m knapsacks is equivalent to a solution to (9). However, an approximate partition can also be constructed simply via a greedy algorithm.

4 Sampling

Sampling exactly (see Alg.
2 and [17]) from a full DPP kernel costs O(N^3 + Nk^3), where k is the size of the sampled subset. The bulk of the computation lies in the initial eigendecomposition of L; the k orthonormalizations cost O(Nk^3). Although the eigendecomposition need only happen once for many iterations of sampling, exact sampling is nonetheless intractable in practice for large N.

Algorithm 2  Sampling from a DPP kernel L
Input: matrix L.
  Eigendecompose L as {(λ_i, v_i)}_{1≤i≤N}.
  J ← ∅
  for i = 1 to N do
    J ← J ∪ {i} with probability λ_i/(λ_i + 1)
  end for
  V ← {v_i}_{i∈J}, Y ← ∅
  while |V| > 0 do
    Sample i from {1, ..., N} with probability (1/|V|) Σ_{v∈V} v_i^2
    Y ← Y ∪ {i}
    V ← V_⊥, where V_⊥ is an orthonormal basis of the subspace of V orthogonal to e_i
  end while
  return Y

It follows from Cor. 2.2 that for KronDPPs, the eigenvalues λ_i can be obtained in O(N_1^3 + N_2^3) operations, and the k eigenvectors needed for sampling in O(kN) operations. For N_1 ≈ N_2 ≈ √N, exact sampling thus only costs O(N^{3/2} + Nk^3). If L = L_1 ⊗ L_2 ⊗ L_3, the same reasoning shows that exact sampling becomes linear in N, requiring only O(Nk^3) operations.

One can also resort to MCMC sampling; for instance, such a sampler was considered in [13] (though with an incorrect mixing time analysis). The results of [20] hold only for k-DPPs, but suggest their MCMC sampler may take O(N^2 log(N/ε)) time for full DPPs, which is impractical. Nevertheless, if faster MCMC samplers are developed, they should also be able to profit from the Kronecker product structure offered by KronDPP.

5 Experimental results

In order to validate our learning algorithm, we compared KRK-Picard to Joint-Picard and to the Picard iteration (Picard) on multiple real and synthetic datasets.¹

5.1 Synthetic tests

All three algorithms were used to learn from synthetic data drawn from a "true" kernel.
The sub-kernels were initialized by L_i = X^T X, with the coefficients of X drawn uniformly from [0, √2]; for Picard, L was initialized with L_1 ⊗ L_2. For Figures 1a and 1b, training data was generated by sampling 100 subsets from the true kernel, with sizes uniformly distributed between 10 and 190.

¹ All experiments were repeated 5 times and averaged, using MATLAB on a Linux Mint system with 16GB of RAM and an i7-4710HQ CPU @ 2.50GHz.

[Figure 1: normalized log-likelihood vs. time (s) for Picard, Joint-Picard, KRK-Picard, and KRK-Picard (stochastic), with a = 1; (a) N_1 = N_2 = 50, (b) N_1 = N_2 = 100, (c) N_1 = 100, N_2 = 500. The thin dotted lines indicate the standard deviation from the mean.]

To evaluate KRK-Picard on matrices too large to fit in memory and with large κ, we drew samples from a 50·10^3 × 50·10^3 kernel of rank 1,000 (on average |Y_i| ≈ 1,000) and learned the kernel stochastically (only KRK-Picard could be run, due to the memory requirements of the other methods); the likelihood drastically improves in only two steps (Fig. 1c).

As shown in Figures 1a and 1b, KRK-Picard converges significantly faster than Picard, especially for large values of N. However, although Joint-Picard also increases the log-likelihood at each iteration, it converges much slower and has a high standard deviation, whereas the standard deviations for Picard and KRK-Picard are barely noticeable. For these reasons, we drop the comparison to Joint-Picard in the subsequent experiments.

5.2 Small-scale real data: baby registries

We compared KRK-Picard to Picard and EM [10] on the baby registry dataset (described in depth in [10]), which has also been used to evaluate other DPP learning algorithms [9, 10, 25].
The dataset contains 17 categories of baby-related products obtained from Amazon. We learned kernels for the 6 largest categories (N = 100); at this scale, Picard is efficient enough to be preferred to KRK-Picard, so this comparison serves only to evaluate the quality of the final kernel estimates.

The initial marginal kernel K for EM was sampled from a Wishart distribution with N degrees of freedom and an identity covariance matrix, then scaled by 1/N; for Picard, L was set to K(I − K)^{-1}; for KRK-Picard, L_1 and L_2 were chosen (as in Joint-Picard) by minimizing ‖L − L_1 ⊗ L_2‖. Convergence was declared when the objective change dipped below a threshold δ. As one EM iteration takes longer than one Picard iteration but increases the likelihood more, we set δ_PIC = δ_KRK = 10^{-4} and δ_EM = 10^{-5}. The final log-likelihoods are shown in Table 1; we set the step-sizes to their largest possible values, i.e., a_PIC = 1.3 and a_KRK = 1.8. Table 1 shows that KRK-Picard obtains comparable, albeit slightly worse, log-likelihoods than Picard and EM, which confirms that for tractable N, the better modeling capability of full kernels makes them preferable to KronDPPs.

5.3 Large-scale real dataset: GENES

Finally, to evaluate KRK-Picard on large matrices of real-world data, we train it on data from the GENES dataset [3] (which has also been used to evaluate DPPs in [3, 19]). This dataset consists of 10,000 genes, each represented by 331 features corresponding to the distance of the gene to hubs in the BioGRID gene interaction network. We construct a ground-truth Gaussian DPP kernel on the GENES dataset and use it to obtain 100 training samples with sizes uniformly distributed between 50 and 200 items.
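The kernel initializations described above (Section 5.2) can be sketched as follows; the eigenvalue clipping is our own safeguard to keep K a valid marginal kernel with I − K invertible, and is not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# K ~ Wishart(N, I) scaled by 1/N, i.e. K = X X^T / N for a Gaussian X.
X = rng.standard_normal((N, N))
K = X @ X.T / N

# Clip eigenvalues into (0, 1) so that I - K is invertible (our safeguard).
w, V = np.linalg.eigh(K)
K = V @ np.diag(np.clip(w, 1e-3, 0.99)) @ V.T

# Picard initialization: L = K (I - K)^{-1}, which inverts to K = L (L + I)^{-1}.
L = K @ np.linalg.inv(np.eye(N) - K)

assert np.allclose(K, L @ np.linalg.inv(L + np.eye(N)), atol=1e-8)
```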
Table 1: Final log-likelihoods for each large category of the baby registries dataset

(a) Training set
Category   EM      Picard   KRK-Picard
apparel    -10.1   -10.2    -10.7
bath       -8.6    -8.8     -9.1
bedding    -8.7    -8.8     -9.3
diaper     -10.5   -10.7    -11.1
feeding    -12.1   -12.1    -12.5
gear       -9.3    -9.3     -9.6

(b) Test set
Category   EM      Picard   KRK-Picard
apparel    -10.1   -10.2    -10.7
bath       -8.6    -8.8     -9.1
bedding    -8.7    -8.8     -9.3
diaper     -10.6   -10.7    -11.2
feeding    -12.2   -12.2    -12.6
gear       -9.2    -9.2     -9.5

Similarly to the synthetic experiments, we initialized KRK-Picard's kernel by setting L_i = X_i^T X_i, where X_i is a random matrix of size N_1 × N_1; for Picard, we set the initial kernel L = L_1 ⊗ L_2.

[Figure 2: normalized log-likelihood vs. time (s) on GENES, n = 150, a = 1; (a) non-stochastic learning (Picard vs. KRK-Picard), (b) stochastic vs. non-stochastic KRK-Picard.]

Figure 2 shows the performance of both algorithms. As with the synthetic experiments, KRK-Picard converges much faster; stochastic updates increase its performance even more, as shown in Fig. 2b. Average runtimes and speed-ups are given in Table 2: KRK-Picard runs almost an order of magnitude faster than Picard, and stochastic updates are more than two orders of magnitude faster, while providing slightly larger initial increases to the log-likelihood.

6 Conclusion and future work

We introduced KronDPPs, a variant of DPPs with kernels structured as the Kronecker product of m smaller matrices, and showed that typical operations over DPPs, such as sampling and learning the kernel from data, can be made efficient for KronDPPs on previously intractable ground set sizes.
By carefully leveraging the properties of the Kronecker product, we derived for m = 2 a low-complexity algorithm to learn the kernel from data which guarantees positive definite iterates and a monotonic increase of the log-likelihood, and runs in O(nκ^3 + N^2) time. This algorithm provides even more significant speed-ups and memory gains in the stochastic case, requiring only O(N^{3/2} + Nκ^2) time and O(N + κ^2) space.

Table 2: Average runtime and performance on the GENES dataset for N_1 = N_2 = 100

                          Picard             KRK-Picard         KRK-Picard (stochastic)
Average runtime           161.5 ± 17.7 s     8.9 ± 0.2 s        1.2 ± 0.02 s
NLL increase (1st iter.)  (2.81 ± 0.03)·10^4 (2.96 ± 0.02)·10^4 (3.13 ± 0.04)·10^4

Experiments on synthetic and real data showed that KronDPPs can be learned efficiently on sets large enough that L does not fit in memory.

While discussing learning the kernel, we showed that L_1 and L_2 cannot be updated simultaneously in a CCCP-style iteration, since g is not convex over (S_1, S_2). However, it can be shown that g is geodesically convex over the Riemannian manifold of positive definite matrices, which suggests that deriving an iteration that takes advantage of the intrinsic geometry of the problem may be a viable line of future work.

KronDPPs also enable fast sampling, in O(N^{3/2} + Nk^3) operations when using two sub-kernels and in O(Nk^3) operations when using three sub-kernels; this allows for exact sampling at comparable or even better costs than previous algorithms for approximate sampling. However, as we improve the computational efficiency of operations on L, the subset size k becomes the limiting factor, due to the O(Nk^3) cost of sampling and learning. A necessary line of future work to allow for truly scalable DPPs is thus to overcome this computational bottleneck.

References

[1] R. Affandi, A. Kulesza, E. Fox, and B. Taskar. Nyström approximation for large-scale Determinantal Point Processes.
In Artificial Intelligence and Statistics (AISTATS), 2013.
[2] R. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of Determinantal Point Process kernels. In ICML, 2014.
[3] N. K. Batmanghelich, G. Quon, A. Kulesza, M. Kellis, P. Golland, and L. Bornn. Diversifying sparsity using variational determinantal point processes, 2014.
[4] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.
[5] A. Borodin. Determinantal point processes, 2009.
[6] W. Chao, B. Gong, K. Grauman, and F. Sha. Large-margin determinantal point processes. In Uncertainty in Artificial Intelligence (UAI), 2015.
[7] L. Decreusefond, I. Flint, N. Privault, and G. L. Torrisi. Determinantal point processes, 2015.
[8] S. Flaxman, A. Wilson, D. Neill, H. Nickisch, and A. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In ICML, pages 607–616, 2015.
[9] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes for recommendation, 2016.
[10] J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-Maximization for learning Determinantal Point Processes. In NIPS, 2014.
[11] O. Goldschmidt, D. Nehme, and G. Yu. Note: On the set-union knapsack problem. Naval Research Logistics, 41:833–842, 1994.
[12] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.
[13] B. Kang. Fast determinantal point process sampling with application to clustering. In Advances in Neural Information Processing Systems 26, pages 2319–2327. Curran Associates, Inc., 2013.
[14] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
[15] A. Kulesza. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.
[16] A. Kulesza and B. Taskar.
k-DPPs: Fixed-size Determinantal Point Processes. In ICML, 2011.
[17] A. Kulesza and B. Taskar. Determinantal Point Processes for machine learning, volume 5 of Foundations and Trends in Machine Learning. 2012.
[18] F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):853–877, 2015.
[19] C. Li, S. Jegelka, and S. Sra. Efficient sampling for k-determinantal point processes, 2015.
[20] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. arXiv:1603.06052, 2016.
[21] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
[22] C. Van Loan and N. Pitsianis. Approximation with Kronecker products. In Linear Algebra for Large Scale and Real Time Applications, pages 293–314. Kluwer Publications, 1993.
[23] R. Lyons. Determinantal probability measures. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 98(1):167–212, 2003.
[24] O. Macchi. The coincidence approach to stochastic point processes. Adv. Appl. Prob., 7(1), 1975.
[25] Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In ICML, 2015.
[26] Z. Mariet and S. Sra. Diversity networks. Int. Conf. on Learning Representations (ICLR), 2016. URL arXiv:1511.05077.
[27] J. Martens and R. B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.
[28] G. Wu, Z. Zhang, and E. Y. Chang. Kronecker factorization for speeding up kernel machines. In SIAM Data Mining (SDM), pages 611–615, 2005.
[29] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems 14, pages 1033–1040. MIT Press, 2002.
[30] X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, and S.-F.
Chang. Fast orthogonal projection based on Kronecker product. In ICCV, 2015.
[31] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y.-C. Zhang. Solving the apparent diversity-accuracy dilemma of recommender systems. PNAS, 107(10):4511–4515, 2010.

Appendix: Kronecker Determinantal Point Processes

A Proof of Prop. 3.1

We use 'vec' to denote the operator that stacks the columns of a matrix to form a vector; conversely, 'mat' takes a vector with $k^2$ coefficients and returns a $k \times k$ matrix. Let $L = L_1 \otimes L_2$, $S_1 = L_1^{-1}$, $S_2 = L_2^{-1}$ and $S = S_1 \otimes S_2 = L^{-1}$. We write $E_{ij}$ for the matrix with all zeros except for a 1 at position $(i, j)$, its size being clear from context. We wish to solve
$$\nabla f_1(X) = -\nabla g_1(S_1) \quad\text{and}\quad \nabla f_2(X) = -\nabla g_2(S_2). \tag{10}$$
It follows from the fact that $\log\det(S_1 \otimes S_2) = N_2 \log\det S_1 + N_1 \log\det S_2$ that $\nabla f_1(X) = N_2 X^{-1}$ and $\nabla f_2(X) = N_1 X^{-1}$. Moreover, we know that
$$\begin{aligned}
\nabla g(S) &= -(I+S)^{-1} - S^{-1}\Big(\tfrac{1}{n}\sum\nolimits_i U_i (U_i^\top S^{-1} U_i)^{-1} U_i^\top\Big) S^{-1} \\
&= -S^{-1} - S^{-1}\Big(\tfrac{1}{n}\sum\nolimits_i U_i (U_i^\top S^{-1} U_i)^{-1} U_i^\top - (I + S^{-1})^{-1}\Big) S^{-1} \\
&= -(L + L\Delta L).
\end{aligned}$$
The Jacobian of $S_1 \mapsto S_1 \otimes S_2$ is given by $J = \big[\operatorname{vec}(E_{11} \otimes S_2), \ldots, \operatorname{vec}(E_{N_1 N_1} \otimes S_2)\big]$. Hence,
$$\begin{aligned}
\nabla f_1(X)_{ij} = -(\nabla g_1(S_1))_{ij}
&\iff N_2 X^{-1}_{ij} = \big(J^\top \operatorname{vec}(-\nabla g(S))\big)_{ij} \\
&\iff N_2 X^{-1}_{ij} = \operatorname{vec}(E_{ij} \otimes S_2)^\top \operatorname{vec}(L + L\Delta L) \\
&\iff N_2 X^{-1}_{ij} = \operatorname{Tr}\big((E_{ij} \otimes S_2)(L + L\Delta L)\big) \\
&\iff N_2 X^{-1}_{ij} = \operatorname{Tr}\big(S_2 (L + L\Delta L)^{(ij)}\big) \\
&\iff N_2 X^{-1}_{ij} = \operatorname{Tr}\Big(\big((I \otimes S_2)(L + L\Delta L)\big)^{(ij)}\Big)
\end{aligned}$$
The last equivalence is simply the result of index manipulation. Thus, we have
$$\nabla f_1(X) = -\nabla g_1(S_1) \iff X^{-1} = \tfrac{1}{N_2} \operatorname{Tr}_1\big((I \otimes S_2)(L + L\Delta L)\big).$$
Similarly, by setting $J' = \big(\operatorname{vec}(S_1 \otimes E_{11}), \ldots$
$\ldots, \operatorname{vec}(S_1 \otimes E_{N_2 N_2})\big)$, we have
$$\begin{aligned}
\nabla f_2(X)_{ij} = -(\nabla g_2(S_2))_{ij}
&\iff N_1 X^{-1}_{ij} = \big(J'^\top \operatorname{vec}(-\nabla g(S))\big)_{ij} \\
&\iff N_1 X^{-1}_{ij} = \operatorname{vec}(S_1 \otimes E_{ij})^\top \operatorname{vec}(L + L\Delta L) \\
&\iff N_1 X^{-1}_{ij} = \operatorname{Tr}\big((S_1 \otimes E_{ij})(L + L\Delta L)\big) \\
&\iff N_1 X^{-1}_{ij} = \Big(\sum\nolimits_{k,\ell=1}^{N_1} (S_1)_{k\ell}\, (L + L\Delta L)^{(\ell k)}\Big)_{ij} \\
&\iff N_1 X^{-1}_{ij} = \Big(\sum\nolimits_{\ell=1}^{N_1} \big((S_1 \otimes I)(L + L\Delta L)\big)^{(\ell\ell)}\Big)_{ij}
\end{aligned}$$
Hence, $\nabla f_2(X) = -\nabla g_2(S_2) \iff X^{-1} = \tfrac{1}{N_1} \operatorname{Tr}_2\big((S_1 \otimes I)(L + L\Delta L)\big)$, which proves Prop. 3.1.

B Efficient updates for KRK-PICARD

The updates to $L_1$ and $L_2$ are obtained efficiently through different methods; hence, the proof of Thm. 3.3 is split into two sections. We write
$$\Theta = \frac{1}{n} \sum_{i=1}^n U_i L_{Y_i}^{-1} U_i^\top \qquad \big(\text{or } \Theta = U_i L_{Y_i}^{-1} U_i^\top \text{ for stochastic updates}\big),$$
so that $\Delta = \Theta - (I + L)^{-1}$. Recall that $(A \otimes B)^{(ij)} = a_{ij} B$.

B.1 Updating $L_1$

We wish to compute $X = \operatorname{Tr}_1\big((I \otimes L_2^{-1})(L\Delta L)\big)$ efficiently. We have
$$\begin{aligned}
X_{ij} &= \operatorname{Tr}\Big(\big((I \otimes L_2^{-1})(L\Delta L)\big)^{(ij)}\Big) = \operatorname{Tr}\big(L_2^{-1}(L\Delta L)^{(ij)}\big) \\
&= \operatorname{Tr}\Big(L_2^{-1} \sum\nolimits_{k,\ell=1}^{N_1} L^{(ik)} \Delta^{(k\ell)} L^{(\ell j)}\Big) = \sum\nolimits_{k,\ell=1}^{N_1} (L_1)_{ik} (L_1)_{\ell j} \operatorname{Tr}\big(L_2^{-1} L_2 \Delta^{(k\ell)} L_2\big) \\
&= \sum\nolimits_{k,\ell=1}^{N_1} (L_1)_{ik} (L_1)_{\ell j} \Big[\underbrace{\operatorname{Tr}\big(\Theta^{(k\ell)} L_2\big)}_{A_{k\ell}} - \underbrace{\operatorname{Tr}\big(((I+L)^{-1})^{(k\ell)} L_2\big)}_{B_{k\ell}}\Big] = (L_1 A L_1 - L_1 B L_1)_{ij}.
\end{aligned}$$
The $N_1 \times N_1$ matrix $A$ can be computed in $O(n\kappa^3 + N_1^2 N_2^2)$ simply by pre-computing $\Theta$ in $O(n\kappa^3)$ and then computing all $N_1^2$ traces in $O(N_2^2)$ time each. When doing stochastic updates, for which $\Theta$ is sparse with only $\kappa^2$ non-zero coefficients, computing $A$ can be done in $O(N_1^2 \kappa^2 + \kappa^3)$.

By diagonalizing $L_1 = P_1 D_1 P_1^\top$ and $L_2 = P_2 D_2 P_2^\top$, we have $(I + L)^{-1} = P D P^\top$ with $P = P_1 \otimes P_2$ and $D = (I + D_1 \otimes D_2)^{-1}$. $P_1$, $P_2$, $D_1$, $D_2$ and $D$ can all be obtained in $O(N_1^3 + N_2^3 + N_1 N_2)$ as a consequence of Prop. 2.1.
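The block-trace identities that drive the proof of Prop. 3.1 above can be checked numerically. The following sketch (our own illustration, not the paper's code) verifies that $\operatorname{Tr}((E_{ij} \otimes S_2)M) = \operatorname{Tr}(S_2 M^{(ij)})$ when $M$ and $S_2$ are symmetric (as $L + L\Delta L$ and $L_2^{-1}$ are), and that $((I \otimes S_2)M)^{(ij)} = S_2 M^{(ij)}$:

```python
import numpy as np

# Numerical check of the block-trace identities (sketch).
# M plays the role of the symmetric matrix L + L*Delta*L; S2 is symmetric.
rng = np.random.default_rng(0)
N1, N2 = 3, 4

S2 = rng.standard_normal((N2, N2)); S2 = S2 + S2.T
M = rng.standard_normal((N1 * N2, N1 * N2)); M = M + M.T

def block(M, i, j, N1, N2):
    """Return the (i, j) block M^{(ij)} of size N2 x N2."""
    return M.reshape(N1, N2, N1, N2)[i, :, j, :]

i, j = 1, 2
E = np.zeros((N1, N1)); E[i, j] = 1.0          # E_ij

lhs = np.trace(np.kron(E, S2) @ M)             # Tr((E_ij x S2) M)
rhs = np.trace(S2 @ block(M, i, j, N1, N2))    # Tr(S2 M^{(ij)})
assert np.isclose(lhs, rhs)

# ((I x S2) M)^{(ij)} = S2 M^{(ij)}, which justifies the final Tr_1 rewriting
P = np.kron(np.eye(N1), S2) @ M
assert np.allclose(block(P, i, j, N1, N2), S2 @ block(M, i, j, N1, N2))
```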
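As a sanity check on this factorization (a sketch under our own naming, not the paper's code), $(I + L_1 \otimes L_2)^{-1}$ can be assembled from the two small eigendecompositions and compared against direct inversion:

```python
import numpy as np

# Assemble (I + L1 x L2)^{-1} = P (I + D1 x D2)^{-1} P^T with P = P1 x P2.
rng = np.random.default_rng(0)
N1, N2 = 3, 4

def spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

L1, L2 = spd(N1), spd(N2)
d1, P1 = np.linalg.eigh(L1)            # L1 = P1 diag(d1) P1^T
d2, P2 = np.linalg.eigh(L2)            # L2 = P2 diag(d2) P2^T
P = np.kron(P1, P2)                    # eigenvectors of L1 x L2
d = 1.0 / (1.0 + np.kron(d1, d2))      # diagonal of (I + D1 x D2)^{-1}

inv_eig = (P * d) @ P.T                # P diag(d) P^T
inv_direct = np.linalg.inv(np.eye(N1 * N2) + np.kron(L1, L2))
assert np.allclose(inv_eig, inv_direct)
```

Only the small factors $P_1, P_2, d_1, d_2$ need to be stored; the full $P$ is formed here only to verify the identity.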
Then
$$B_{ij} = \operatorname{Tr}\big(((I+L)^{-1})^{(ij)} L_2\big) = \sum_k \operatorname{Tr}\big(P^{(ik)} D^{(kk)} (P^\top)^{(kj)} L_2\big) = \sum_k (P_1)_{ik} (P_1)_{jk} \operatorname{Tr}\big(P_2 D^{(kk)} P_2^\top P_2 D_2 P_2^\top\big) = \sum_k (P_1)_{ik} (P_1)_{jk} \underbrace{\operatorname{Tr}\big(D^{(kk)} D_2\big)}_{\alpha_k}.$$
Let $\hat{D} = \operatorname{diag}(\alpha_1, \ldots, \alpha_{N_1})$, which can be computed in $O(N_1 N_2)$. Then $B = P_1 \hat{D} P_1^\top$, so $L_1 B L_1 = P_1 D_1 \hat{D} D_1 P_1^\top$ is computable in $O(N_1^3 + N_2^3)$.

Overall, the update to $L_1$ can be computed in $O(n\kappa^3 + N_1^2 N_2^2 + N_1^3 + N_2^3)$, or in $O(N_1^2 \kappa^2 + \kappa^3 + N_1^3 + N_2^3)$ if the updates are stochastic. Moreover, if $\Theta$ is sparse with only $z$ non-zero coefficients (for stochastic updates $z = \kappa$), $A$ can be computed in $O(\kappa^2)$ space, leading to an overall $O(z^2 + N_1^2 + N_2^2)$ memory cost.

B.2 Updating $L_2$

We wish to compute $X = \operatorname{Tr}_2\big((L_1^{-1} \otimes I)(L\Delta L)\big)$ efficiently.
$$\begin{aligned}
X &= \sum_{i=1}^{N_1} \big((L_1^{-1} \otimes I)(L\Delta L)\big)^{(ii)} = \sum_{i=1}^{N_1} \big((I \otimes L_2)(\Theta - (I+L)^{-1})(L_1 \otimes L_2)\big)^{(ii)} \\
&= \sum_{i,j=1}^{N_1} (L_1)_{ij}\, L_2 \Theta^{(ij)} L_2 - \sum_{i=1}^{N_1} \big((I \otimes L_2)(I+L)^{-1}(L_1 \otimes L_2)\big)^{(ii)} \\
&= \underbrace{L_2 \Big(\sum\nolimits_{i,j=1}^{N_1} (L_1)_{ij}\, \Theta^{(ij)}\Big) L_2}_{A} - \underbrace{\sum\nolimits_{i=1}^{N_1} \big((I \otimes L_2)(I+L)^{-1}(L_1 \otimes L_2)\big)^{(ii)}}_{B}
\end{aligned}$$
$A$ can be computed in $O(n\kappa^3 + N_1^2 N_2^2 + N_2^3)$. As before, when doing stochastic updates, $A$ can be computed in $O(N_1^2 \kappa^2 + \kappa^3 + N_2^3)$ time and $O(N_2^2 + N_1^2 + \kappa^2)$ space due to the sparsity of $\Theta$.

Regarding $B$: as all matrices commute, we can write
$$(I \otimes L_2)(I+L)^{-1}(L_1 \otimes L_2) = (P_1 \otimes P_2)\, \Lambda\, (P_1 \otimes P_2)^\top,$$
where $\Lambda = (I \otimes D_2)(I + D_1 \otimes D_2)^{-1}(D_1 \otimes D_2)$ is diagonal and is obtained in $O(N_1^3 + N_2^3 + N_1 N_2)$. Moreover,
$$B = \sum_{i=1}^{N_1} (P \Lambda P^\top)^{(ii)} = P_2 \Big(\sum\nolimits_{i,k=1}^{N_1} (P_1)_{ik}\, \Lambda^{(kk)}\, (P_1)_{ik}\Big) P_2^\top,$$
which allows us to compute $B$ in $O(N_1^2 N_2 + N_2^3 + N_1^3)$ total.

Overall, we can obtain $X$ in $O(n\kappa^3 + N_1^2 N_2^2 + N_1^3 + N_2^3)$, or in $O(N_1^2 \kappa^2 + N_1^2 N_2 + N_1^3 + N_2^3)$ for stochastic updates, in which case only $O(N_1^2 + N_2^2 + \kappa^2)$ space is necessary.
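The $N_1^2$ traces $A_{k\ell} = \operatorname{Tr}(\Theta^{(k\ell)} L_2)$ appearing in the $L_1$ update can be batched into a single tensor contraction. The sketch below (illustrative code, not from the paper) checks an `einsum` formulation against the naive double loop:

```python
import numpy as np

# Batch computation of A_{kl} = Tr(Theta^{(kl)} L2) over all N1^2 blocks.
rng = np.random.default_rng(2)
N1, N2 = 3, 4
Theta = rng.standard_normal((N1 * N2, N1 * N2))
L2 = rng.standard_normal((N2, N2)); L2 = L2 @ L2.T + N2 * np.eye(N2)

T = Theta.reshape(N1, N2, N1, N2)       # T[k, :, l, :] = Theta^{(kl)}
A = np.einsum('kalb,ba->kl', T, L2)     # all N1^2 traces in one contraction

# Naive reference: one trace per block
A_naive = np.array([[np.trace(T[k, :, l, :] @ L2) for l in range(N1)]
                    for k in range(N1)])
assert np.allclose(A, A_naive)
```

Both versions cost $O(N_1^2 N_2^2)$ arithmetic, matching the bound in the text; the batched form simply avoids Python-level looping.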
C Proof of validity for joint updates

In order to minimize the number of matrix multiplications, we equivalently (due to the properties of the Frobenius norm) minimize
$$\|L^{-1} + \Delta - X \otimes Y\|_F^2 \tag{11}$$
and set $L_1' \leftarrow L_1 X L_1$, $L_2' \leftarrow L_2 Y L_2$.

Theorem C.1. Let $L \succ 0$. Define $R := \big[\operatorname{vec}(L^{(11)})^\top; \ldots; \operatorname{vec}(L^{(N_1 N_1)})^\top\big] \in \mathbb{R}^{N_1 N_1 \times N_2 N_2}$, whose rows are the vectorized blocks $L^{(ij)}$, $i, j = 1, \ldots, N_1$. Suppose that $R$ has an eigengap between its largest singular value and the next, and let $u$, $v$, $\sigma$ be the first singular vectors and value of $R$. Let $U = \operatorname{mat}(u)$ and $V = \operatorname{mat}(v)$. Then $U$ and $V$ are either both positive definite or both negative definite. Moreover, for any value $\alpha \neq 0$, the pair $(\alpha U, (\sigma/\alpha) V)$ minimizes $\|L - X \otimes Y\|_F^2$.

The proof is a consequence of [22, Thm. 11]. This shows that if $L$ is initially positive definite, setting the sign of $\alpha$ based on whether $U$ and $V$ are positive or negative definite², and updating
$$L_1 \leftarrow \alpha\, L_1 U L_1, \qquad L_2 \leftarrow (\sigma/\alpha)\, L_2 V L_2,$$
maintains positive definite iterates. Given that $L_1 \succ 0$ and $L_2 \succ 0$ imply $L_1 \otimes L_2 \succ 0$, a simple induction then shows that by choosing an initial kernel estimate $L \succ 0$, subsequent values of $L$ will remain positive definite. By choosing $\alpha$ such that the new estimates $L_1$ and $L_2$ verify $\|L_1\| = \|L_2\|$, we verify all the conditions of Eq. 8.

² This can easily be done simply by checking the sign of the first diagonal coefficient of $U$, which will be positive if and only if $U \succ 0$.

C.1 Algorithm for joint updates

Theorem C.1 leads to a straightforward iteration for learning matrices $L_1$ and $L_2$ based on the decomposition of the Picard estimate as a Kronecker product.

Algorithm 3 JOINT-PICARD iteration
Input: matrices $L_1$, $L_2$, training set $T$, step-size $a \geq 1$.
for $i = 1$ to maxIter do
    $U, \sigma, V \leftarrow$ power_method$(L^{-1} + \Delta)$ to obtain the first singular value and vectors of the matrix $R$.
    $\alpha \leftarrow \operatorname{sgn}(U_{11}) \sqrt{\sigma\, \|L_2 V L_2\| / \|L_1 U L_1\|}$
    $L_1 \leftarrow L_1 + a\,(\alpha\, L_1 U L_1 - L_1)$
    $L_2 \leftarrow L_2 + a\,((\sigma/\alpha)\, L_2 V L_2 - L_2)$
end for
return $(L_1, L_2)$
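The rearrangement $R$ in Theorem C.1 can be illustrated numerically. The sketch below is our own illustration (not the paper's code); `vec`/`mat` follow the column-stacking convention of Appendix A. It builds $R$ from the blocks of a pure Kronecker product $L = L_1 \otimes L_2$, takes its leading singular triplet, and checks that $\sigma\, \operatorname{mat}(u) \otimes \operatorname{mat}(v)$ recovers $L$ and that $\operatorname{mat}(u)$ is definite, as the theorem states:

```python
import numpy as np

# Illustration of Thm. C.1 / the Kronecker-product rearrangement (sketch).
rng = np.random.default_rng(0)
N1, N2 = 3, 4

def vec(A):                      # column-stacking vec, as in Appendix A
    return A.reshape(-1, order='F')

def mat(x, n):                   # inverse of vec for an n x n matrix
    return x.reshape(n, n, order='F')

def spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

L1, L2 = spd(N1), spd(N2)
L = np.kron(L1, L2)

# One row of R per N2 x N2 block L^{(ij)}, rows ordered like vec over (i, j)
blocks = L.reshape(N1, N2, N1, N2)
R = np.stack([vec(blocks[i, :, j, :]) for j in range(N1) for i in range(N1)])

Us, s, Vt = np.linalg.svd(R)
U, V, sigma = mat(Us[:, 0], N1), mat(Vt[0], N2), s[0]

# (sigma*U) x V is the best Kronecker approximation of L; it is exact here
# because L is itself a Kronecker product, and U is definite (Thm. C.1).
assert np.allclose(sigma * np.kron(U, V), L)
assert np.all(np.linalg.eigvalsh(U) > 0) or np.all(np.linalg.eigvalsh(U) < 0)
```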
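For concreteness, one step of the loop body above might be written as follows (a sketch under our own naming; the singular triplet $(U, \sigma, V)$ is taken as given rather than produced by `power_method`). The choice of $\alpha$ balances the two factors: with $a = 1$ the new iterates satisfy $\|L_1\|_F = \|L_2\|_F$, as required by Eq. 8.

```python
import numpy as np

def joint_picard_step(L1, L2, U, V, sigma, a=1.0):
    """One JOINT-PICARD-style update (sketch; names are ours, not the paper's).

    (U, sigma, V) is the leading singular triplet of the rearranged matrix R;
    alpha is chosen so that, for a = 1, the two new factors have equal
    Frobenius norm.
    """
    M1 = L1 @ U @ L1
    M2 = L2 @ V @ L2
    alpha = np.sign(U[0, 0]) * np.sqrt(sigma * np.linalg.norm(M2)
                                       / np.linalg.norm(M1))
    L1_new = L1 + a * (alpha * M1 - L1)
    L2_new = L2 + a * ((sigma / alpha) * M2 - L2)
    return L1_new, L2_new

# Norm-balance check with a = 1 on arbitrary SPD inputs
rng = np.random.default_rng(0)
def spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

L1, L2, U, V = spd(3), spd(4), spd(3), spd(4)
L1n, L2n = joint_picard_step(L1, L2, U, V, sigma=2.0, a=1.0)
assert np.isclose(np.linalg.norm(L1n), np.linalg.norm(L2n))
```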
