Vertical Consensus Inference for High-Dimensional Random Partition
Authors: Khai Nguyen, Yang Ni, Peter Mueller
Khai Nguyen is Ph.D. Candidate, Department of Statistics and Data Sciences, University of Texas at Austin, Texas, USA (e-mail: khainb@utexas.edu). Yang Ni is Associate Professor, Department of Statistics and Data Sciences, University of Texas at Austin, Texas, USA (e-mail: yang.ni@austin.utexas.edu). Peter Mueller is Professor, Department of Mathematics, University of Texas at Austin, Texas, USA (e-mail: pmueller@math.utexas.edu).

Abstract. We review recently proposed Bayesian approaches for clustering high-dimensional data. After identifying the main limitations of available approaches, we introduce an alternative framework based on vertical consensus inference (VCI) to mitigate the curse of dimensionality in high-dimensional Bayesian clustering. VCI builds on the idea of consensus Monte Carlo by dividing the data into multiple shards (smaller subsets of variables), performing posterior inference on each shard, and then combining the shard-level posteriors to obtain a consensus posterior. The key distinction is that VCI splits the data vertically, producing vertical shards that retain the same number of observations but have lower dimensionality. We use an entropic regularized Wasserstein barycenter to define a consensus posterior. The shard-specific barycenter weights are constructed to favor shards that provide meaningful partitions, distinct from a trivial single cluster or all singleton clusters, favoring balanced cluster sizes and precise shard-specific posterior random partitions. We show that VCI can be interpreted as a variational approximation to the posterior under a hierarchical model with a generalized Bayes prior. For relatively low-dimensional problems, experiments suggest that VCI closely approximates inference based on clustering the entire multivariate data. For high-dimensional data and in the presence of many noninformative dimensions, VCI introduces a new framework for model-based and principled inference on random partitions. Although our focus here is on random partitions, VCI can be applied to any dimension-independent parameters and serves as a bridge to emerging areas in statistics such as consensus Monte Carlo, optimal transport, variational inference, and generalized Bayes.

Key words and phrases: Random partition, high-dimensional Bayesian clustering, consensus Monte Carlo, optimal transport, variational inference.

1. INTRODUCTION

1.1 High-dimensional clustering

Model-based inference for random partitions is often based on setting up discrete mixture models. Introducing latent indicators that link data points with the terms in the mixture model defines a random partition, with clusters given by all items sharing the same latent indicator. Setting up a prior model for a random partition then reduces to introducing a prior probability model for the mixing measure of the random partition. In fact, it can be shown that, subject to some reasonable symmetry constraints (exchangeability), all random partitions can be represented this way [27].
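To make the link between latent mixture indicators and partitions concrete, here is a minimal sketch (ours, not from the paper; a two-component Gaussian mixture is assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-component Gaussian mixture: latent indicators z_i link each
# observation to a mixture term; ties among indicators define the partition.
n = 8
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])

z = rng.choice(2, size=n, p=weights)       # latent cluster indicators
x = rng.normal(loc=means[z], scale=1.0)    # observations

# The induced partition: clusters are the groups of items with equal z_i.
partition = {h: np.flatnonzero(z == h).tolist() for h in np.unique(z)}
print(partition)
```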
Priors placed on these random probability measures are known as Bayesian nonparametric models, including Dirichlet process mixtures [29, 33], Pitman–Yor process mixtures [45], and other related constructions that induce distributions on partitions and allow the complexity of the clustering structure to grow with the data [30, 31]. We refer the reader to De Blasi et al. (2013) [19] for a more general discussion of Bayesian nonparametric priors on random partitions. For low to moderate dimensional data, such inference is well established and widely used.

In high-dimensional settings where the number of dimensions p far exceeds the number of observations n, Bayesian clustering methods based on mixture models can fail in a fundamental way: as the dimension grows, the posterior distribution over partitions tends to degenerate to trivial solutions, assigning either all observations to a single cluster or each observation to its own singleton cluster, regardless of the true underlying structure [15]. This phenomenon arises from the interplay between high dimensionality, the choice of likelihood (e.g., multivariate Gaussian kernels), and prior specifications, which can either overly penalize or insufficiently penalize model complexity [15].

Beyond model-based clustering based on mixture models, high-dimensional clustering is well known as a challenging problem. One approach is subspace clustering [2, 43], which seeks clusters within different combinations of dimensions (i.e., subspaces) and, unlike many other methods, does not assume that all clusters exist in the same set of dimensions. However, these methods are typically designed for point estimation and algorithmic recovery of cluster structure, rather than full probabilistic inference on partitions. Another approach is projection-based clustering [57], which first reduces the data to a lower-dimensional space before performing clustering. While often effective computationally, projection can obscure uncertainty and may discard cluster-relevant information when the low-dimensional representation is not well aligned with the latent partition structure. Bagging-based methods [21] generate clusters by repeatedly sampling random subsets of the data, clustering each subset, and then aggregating the results to produce a dissimilarity measure for the full dataset. Although such methods can improve robustness, they generally provide heuristic aggregation rather than a coherent posterior distribution over partitions.

In the context of high-dimensional, small-sample data, Rahman, Johnson and Rao [48, 49] leverage a transformed Gram matrix to concentrate feature representations in a low-dimensional space, enabling accurate recovery of both cluster assignments and the unknown number of clusters. In [49] they develop a computationally efficient method based on a transformed left Gram matrix that preserves cluster structure without requiring dimension reduction or tuning parameters. Earlier work by Rahman and Johnson (2018) [47] introduces a model-based, hyperparameter-free clustering algorithm that exploits the Gram matrix representation to shift computational dependence from the feature dimension to the sample size, achieving improved accuracy and efficiency in high-dimensional settings.
Chandra, Canale and Dunson (2023) [15] recently proposed a Bayesian latent factor mixture model, which assumes that high-dimensional observations arise from a lower-dimensional set of latent variables following a flexible mixture distribution, allowing parsimonious and accurate modeling of clusters in high-dimensional data. This provides an elegant model-based solution to high-dimensional clustering, although it is tailored to a specific latent variable formulation rather than a general posterior combination framework over subsets of dimensions.

1.2 Vertical consensus inference

In this note, we offer a new perspective on the high-dimensional clustering problem that we argue addresses some of the mentioned limitations. Our approach, named vertical consensus inference (VCI), begins by dividing the data matrix ($n \times p$) into $K > 1$ matrices ($n \times p_k$) with $\sum_{k=1}^{K} p_k = p$, called vertical shards. We then perform parallel inference on the random partition within each shard. The final inference on the random partition is obtained by combining the shard posteriors using an entropic-regularized Wasserstein barycenter [3, 16], with an appropriate ground metric defined on the space of partitions. The approach is justified by recognizing it as an approximation of the posterior of a hierarchical model with a generalized Bayes [11] construction via variational inference [12]. The approach enjoys the computational advantages of consensus inference or consensus Monte Carlo (CMC) [53] while mitigating the curse of dimensionality in high-dimensional clustering, and it remains interpretable through a principled modeling and inference framework. Experiments on real datasets demonstrate that in low-dimensional problems VCI closely approximates inference based on clustering the entire multivariate data, while it performs favorably in settings with many noninformative dimensions and in high-dimensional regimes, as they arise, for example, in inference for single-cell data. While we focus on random partitions, VCI is applicable for any parameters of interest that are dimension-independent. VCI bridges four emerging areas: consensus Monte Carlo, optimal transport, variational inference, and generalized Bayes.

VCI leverages the divide-and-conquer principle of CMC [53]. The key distinction is that VCI partitions data into vertical shards (splitting outcome dimensions), rather than the horizontal shards (splitting the sample size) used in conventional CMC. Traditional CMC is primarily motivated by the poor scalability of Bayesian inference algorithms, especially sampling-based methods, with respect to the number of observations $n$. In contrast, VCI is designed to address the curse of dimensionality in random partition models and, more broadly, in models with dimension-independent parameters. The concept of vertical partitioning has been explored in federated learning, particularly in vertical federated learning (VFL) [32], positioning VCI as a VFL-like framework for Bayesian random partition models. Related work in distributed clustering [56] investigates clustering from partial sets of dimensions, but only at the point-estimate level, whereas VCI operates directly at the posterior level. Furthermore, VCI can be interpreted as an instance of subspace clustering [43] within the Bayesian random partition context. Finally, VCI can also be interpreted as an approximation of model-based inference in high-dimensional Bayesian random partition models by way of the already mentioned generalized Bayes construction.
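As a minimal illustration of the vertical split (our own sketch; array names are hypothetical), the following divides an $n \times p$ data matrix into $K$ non-overlapping vertical shards:

```python
import numpy as np

def vertical_shards(X, sizes):
    """Split an (n x p) data matrix into K vertical shards of widths sizes[k].

    Assumes non-overlapping shards with sum(sizes) == p; the column grouping
    is a user choice (e.g., prior grouping of genes) -- here we split in order.
    """
    assert sum(sizes) == X.shape[1]
    cuts = np.cumsum(sizes)[:-1]
    return np.split(X, cuts, axis=1)   # list of (n x p_k) arrays

X = np.random.default_rng(1).normal(size=(100, 6))
shards = vertical_shards(X, [2, 2, 2])      # K = 3 shards, same n
print([S.shape for S in shards])            # [(100, 2), (100, 2), (100, 2)]
```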
Like CMC, VCI naturally lends itself to distributed and parallel computing: each shard can be assigned to a separate worker machine [25], which independently performs posterior inference using methods such as Markov chain Monte Carlo (MCMC) [24], sequential Monte Carlo (SMC) [20], variational inference (VI) [12], expectation propagation (EP) [35] or approximate Bayesian computation [8]. These methods can be used either to draw samples from the posterior or to construct a tractable posterior approximation.

For the choice of the consensus mechanism in general CMC, there are many available methods to combine posterior beliefs across all shards. Early work on consensus Monte Carlo distributes data across machines and combines local posterior draws via weighted averaging [25, 53]. Other approaches approximate the full posterior as a product of local posterior densities using parametric, semiparametric, or nonparametric estimators [37, 62]. Alternative strategies include the Weierstrass sampler [61], sequential Monte Carlo and mixtures [52], iterative consensus with global CMC [51], geometric median aggregation in reproducing kernel Hilbert spaces [36], Wasserstein barycenters [54, 55], variational aggregation of local posteriors [46], and likelihood inflating sampling (LISA), which inflates the likelihood to make each shard posterior resemble the full-data posterior [22]. CMC for random partition models was introduced in [41], however still relying on horizontal splitting, rather than the vertical splitting employed in VCI.

2. VERTICAL CONSENSUS INFERENCE FOR RANDOM PARTITION MODELS

2.1 Wasserstein Consensus Monte Carlo

VCI uses a version of the Wasserstein barycenter to define a consensus posterior. The use of a Wasserstein barycenter for CMC was first introduced in [54, 55]. We introduce some notation by way of a brief review of the Wasserstein distance. Given a metric space $\mathcal{X}$ equipped with a ground metric $c: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ ($\mathbb{R}_+$ is the positive real line), the Wasserstein distance [44, 59] between two probability measures $p_1 \in \mathcal{P}_c(\mathcal{X})$ and $p_2 \in \mathcal{P}_c(\mathcal{X})$ is defined as follows:

\[
W_c(p_1, p_2) = \min_{\pi \in \Pi(p_1, p_2)} \int_{\mathcal{X} \times \mathcal{X}} c(x_1, x_2)\,\mathrm{d}\pi(x_1, x_2), \tag{1}
\]

where $\mathcal{P}_c(\mathcal{X})$ denotes the set of all probability measures on $\mathcal{X}$ such that for any $p \in \mathcal{P}_c(\mathcal{X})$ there exists $x_0 \in \mathcal{X}$ with $\int_{\mathcal{X}} c(x, x_0)\,\mathrm{d}p(x) < \infty$, and $\Pi(p_1, p_2)$ denotes the set of couplings, i.e., joint probability measures whose marginals are $p_1$ and $p_2$, respectively. If $c$ is a valid metric between two partitions, then the Wasserstein distance is also a valid metric on $\mathcal{P}_c(\mathcal{X})$ [59].

With the Wasserstein distance, we can generalize the notion of "averaging" to $\mathcal{P}_c(\mathcal{X})$. The generalized average is called the barycenter, or Fréchet mean [7, 58]. Given $K > 1$ probability measures $p_1, p_2, \ldots, p_K \in \mathcal{P}_c(\mathcal{X})$ and a ground metric $c$ on $\mathcal{X}$, the Wasserstein barycenter [3] of $p_1, \ldots, p_K$ is defined as follows:

\[
\bar{p} = \arg\min_{p \in \mathcal{P}(\mathcal{X})} \sum_{k=1}^{K} \lambda_k W_c(p, p_k), \tag{2}
\]

where $\lambda = (\lambda_1, \ldots, \lambda_K) \in \Delta_K$ is a simplex vector of barycenter weights which control the contributions of the marginals $p_1, \ldots, p_K$ to the barycenter. Here, we use $\Delta_K$ to denote the $K$-simplex.
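To make (1) concrete on a finite space, the following sketch (ours; toy measures and toy ground costs) computes the exact discrete Wasserstein distance by solving the coupling linear program with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(a, b, C):
    """Exact W_c between discrete measures a (m1,) and b (m2,)
    with ground-cost matrix C (m1 x m2), solved as a linear program."""
    m1, m2 = C.shape
    # Row-sum constraints: sum_j pi_ij = a_i; column sums: sum_i pi_ij = b_j.
    A_rows = np.kron(np.eye(m1), np.ones((1, m2)))
    A_cols = np.kron(np.ones((1, m1)), np.eye(m2))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Two toy posteriors over three atoms with a made-up ground metric:
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.2, 0.6])
C = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
print(wasserstein_discrete(a, b, C))
```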
Using uniform weights, Srivastava et al. (2015, 2018) [54, 55] introduced a CMC scheme using a Wasserstein consensus posterior. Consider a dataset of $n > 0$ samples $X = (X_1, \ldots, X_n)$, where $X_i \in \mathcal{X}^p$ for $i = 1, \ldots, n$, with $\mathcal{X}$ denoting the data space (e.g., $\mathbb{R}$) of $p > 0$ dimensions. In statistical inference, our primary goal is to infer latent variables $\theta$ that capture the underlying structure of the data $X$. CMC implements inference for global parameters $\theta$ that are used across shards. CMC, as well as the upcoming discussion, assumes that any parameters specific to particular shards are integrated out. Under a model-based Bayesian framework, inference on $\theta$ proceeds via the posterior distribution $p(\theta \mid X) = p(\theta, X)/p(X)$, derived from the joint distribution $p(\theta, X)$. We divide $X$ into $K > 1$ shards $X^{(1)}, \ldots, X^{(K)}$ with $X^{(k)} = (X^{(k)}_1, \ldots, X^{(k)}_{n_k})$ ($\sum_{k=1}^{K} n_k = n$). In [54, 55], the shards are defined by splitting the data into subsets of observations (in contrast to splitting along the dimensions of the outcome, which we will introduce later). Letting $p_k(\theta \mid X^{(k)}) = p_k(\theta, X^{(k)})/p_k(X^{(k)})$ be the $k$-th shard posterior and $c = \|\cdot\|_2^2$, the Wasserstein consensus posterior is the barycenter of $p_1(\theta \mid X^{(1)}), \ldots, p_K(\theta \mid X^{(K)})$ with uniform barycenter weights in (2).

2.2 A Vertical Consensus Inference Approach for High-Dimensional Clustering

In contrast to the horizontal (along samples) split of CMC, the proposed VCI scheme splits the feature dimension into $K > 1$ vertical shards: $X_i = (X^{(1)}_i, \ldots, X^{(K)}_i)$ for $i = 1, \ldots, n$, with $X^{(k)}_i \in \mathbb{R}^{p_k}$. For simplicity, we assume that $\sum_{k=1}^{K} p_k = p$ (non-overlapping shards), although the case $\sum_{k=1}^{K} p_k > p$ is also valid. The shard splitting process can be performed manually or arise from distributed data storage or some prior knowledge about the data, e.g., groups of genes. In the upcoming discussion, we focus on inference for a random partition $\theta = z$, keeping in mind that the approach remains valid for any other global parameter $\theta$. Here $z = (z_1, \ldots, z_n)$ denotes a partition represented by cluster membership indicators, with $z_i = h$ if the $i$-th data point is in cluster $h$. For the $k$-th shard $X^{(k)} = (X^{(k)}_1, \ldots, X^{(k)}_n)$, let $p_k = p_k(z \mid X^{(k)})$ denote the shard-specific posterior (on the partition) for shard $k$. We then define a consensus posterior on $z$ as the Wasserstein barycenter of the shard posteriors $p_1(z \mid X^{(1)}), \ldots, p_K(z \mid X^{(K)})$.

Wasserstein barycenter. Let $\mathcal{Z}$ denote the space of partitions of $n$ samples represented by cluster membership indicators, i.e., $z \in \mathcal{Z}$. As $\mathcal{Z}$ is a finite space, given a ground metric $c: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_+$, the Wasserstein distance (1) between two discrete probability measures $p_1 \in \mathcal{P}(\mathcal{Z})$ and $p_2 \in \mathcal{P}(\mathcal{Z})$ can be rewritten as follows:

\[
W_c(p_1, p_2) = \min_{\pi \in \Pi(p_1, p_2)} \sum_{i=1}^{|\mathcal{Z}|} \sum_{j=1}^{|\mathcal{Z}|} c(z_i, z_j)\,\pi(z_i, z_j).
\]

For computational reasons (to turn the minimization into a strongly convex optimization), in practice one often uses an entropic approximation of the Wasserstein distance [16]:

\[
W_{c,\epsilon}(p_1, p_2) = \min_{\pi \in \Pi(p_1, p_2)} \sum_{i=1}^{|\mathcal{Z}|} \sum_{j=1}^{|\mathcal{Z}|} c(z_i, z_j)\,\pi(z_i, z_j) - \epsilon H_\pi, \tag{3}
\]

where $H_\pi = -\sum_{i=1}^{|\mathcal{Z}|} \sum_{j=1}^{|\mathcal{Z}|} \pi(z_i, z_j) \log \pi(z_i, z_j)$ is the entropy of the coupling, and $\epsilon > 0$ is the regularization strength coefficient.
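A minimal numpy sketch (our own, not the authors' implementation) of the Sinkhorn fixed-point iterations commonly used to evaluate the entropic objective (3); it reuses the toy measures and cost matrix from the previous example:

```python
import numpy as np

def sinkhorn(a, b, C, eps, iters=500):
    """Entropic-regularized OT as in (3): returns cost - eps * H_pi,
    via Sinkhorn iterations on the Gibbs kernel xi = exp(-C/eps)."""
    xi = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (xi.T @ u)      # match column marginals
        u = a / (xi @ v)        # match row marginals
    pi = u[:, None] * xi * v[None, :]           # optimal entropic coupling
    cost = np.sum(pi * C)
    ent = -np.sum(pi * np.log(pi + 1e-300))     # H_pi
    return cost - eps * ent

a = np.array([0.5, 0.3, 0.2]); b = np.array([0.2, 0.2, 0.6])
C = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
print(sinkhorn(a, b, C, eps=0.05))
```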
When $p_1$ and $p_2$ have at most $m$ atoms, the time complexity for $W_{c,\epsilon}(p_1, p_2)$ is $O(m^2)$ [4], compared to $O(m^3 \log m)$ for $W_c(p_1, p_2)$ [44].

We then define a consensus posterior on $z$ from the shard posteriors $p_1(z \mid X^{(1)}), \ldots, p_K(z \mid X^{(K)})$ using the barycenter (2) with $W_{c,\epsilon_k}$:

\[
\bar{p}(z \mid X, \lambda) = \arg\min_{p \in \mathcal{P}(\mathcal{Z})} \sum_{k=1}^{K} \lambda_k W_{c,\epsilon_k}(p, p_k), \tag{4}
\]

where $(\lambda_1, \ldots, \lambda_K) \in \Delta_K$ is a simplex vector of barycenter weights. The probability measure $\bar{p}(z \mid X, \lambda)$ is well-defined and reflects the belief on $z$ from $X$.

Ground metric. For the choice of ground metric $c$, there are many options, e.g., Binder loss [10], variation of information (VoI) [34], normalized variation of information, information distance, normalized information distance [40], one minus the adjusted Rand index [23, 50], generalized Binder, and generalized VoI [18]. In this work, we focus on the VoI distance, which is a valid metric on $\mathcal{Z}$. Consider two partitions (in the upcoming construction, posterior Monte Carlo samples) of $n$ elements, represented as cluster membership indicators $z_\ell = (z_{\ell,1}, \ldots, z_{\ell,n})$, $\ell = 1, 2$. We define cluster proportions $p^{(\ell)}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(z_{\ell,i} = j)$ for $\ell = 1, 2$ and $j = 1, \ldots, H(z_\ell)$ ($H(z_\ell)$ is the number of clusters of $z_\ell$), and joint proportions

\[
p^{(12)}_{jh} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(z_{1,i} = j,\; z_{2,i} = h),
\]

for $j = 1, \ldots, H(z_1)$ and $h = 1, \ldots, H(z_2)$. The entropies are $\mathbb{H}(z_\ell) = -\sum_{j=1}^{H(z_\ell)} p^{(\ell)}_j \log p^{(\ell)}_j$ for $\ell = 1, 2$, and their mutual information is

\[
I(z_1, z_2) = \sum_{j=1}^{H(z_1)} \sum_{h=1}^{H(z_2)} p^{(12)}_{jh} \log \frac{p^{(12)}_{jh}}{p^{(1)}_j p^{(2)}_h}.
\]

The VoI distance between the two partitions $z_1$ and $z_2$ is defined as

\[
\mathrm{VoI}(z_1, z_2) = \mathbb{H}(z_1) + \mathbb{H}(z_2) - 2 I(z_1, z_2).
\]

We use the VoI distance as the ground metric in (3).
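The VoI computation above translates directly into code; the following sketch (ours) evaluates VoI between two label vectors via their contingency table:

```python
import numpy as np

def voi(z1, z2):
    """Variation of information between two partitions given as
    cluster-membership label vectors z1, z2 of length n."""
    n = len(z1)
    # Joint proportions p^{(12)}_{jh} via a contingency table.
    labels1, inv1 = np.unique(z1, return_inverse=True)
    labels2, inv2 = np.unique(z2, return_inverse=True)
    joint = np.zeros((len(labels1), len(labels2)))
    np.add.at(joint, (inv1, inv2), 1.0 / n)
    p1 = joint.sum(axis=1)      # cluster proportions of z1
    p2 = joint.sum(axis=0)      # cluster proportions of z2
    h1 = -np.sum(p1 * np.log(p1))
    h2 = -np.sum(p2 * np.log(p2))
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p1, p2)[nz]))
    return h1 + h2 - 2.0 * mi

print(voi(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])))
```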
Barycenter weights. In (4), the barycenter weights $\lambda_k$ control the contribution of the $k$-th shard posterior to the consensus posterior. We can make $\lambda_k$ a function of $p_k$ to prioritize shards with meaningful partitions. For example, one could use the expected entropy to penalize a single cluster and to prioritize shard posteriors with more balanced clusters:

\[
\omega^H_k(p_k) = \mathbb{E}[\mathbb{H}(z) \mid X^{(k)}], \quad k = 1, \ldots, K, \qquad \lambda^H = P_{\Delta_K}(\omega^H),
\]

where $\omega^H = (\omega^H_1(p_1), \ldots, \omega^H_K(p_K))$ and $P_{\Delta_K}: \mathbb{R}^K_+ \to \Delta_K$ is a projection function onto the simplex; e.g., we can set

\[
P_{\Delta_K}(\omega_1, \ldots, \omega_K) = \left( \frac{\omega_1^t}{\sum_{k=1}^{K} \omega_k^t}, \ldots, \frac{\omega_K^t}{\sum_{k=1}^{K} \omega_k^t} \right)
\]

for any $t \geq 1$, or use a softmax function.

Instead of this simple example, we recommend more structured choices that better reflect investigator preferences for non-trivial partitions (different from $n$ singletons, or a single cluster), balanced partitions (with comparable cluster sizes), and precisely estimated partitions (low posterior uncertainty about the random partition). In our implementation, we use the following construction:

\[
\omega^P_k(p_k) =
\underbrace{\mathbb{E}\!\left[\left.\frac{4(\tilde{H}-1)(n-\tilde{H})}{(n-1)^2}\,\right|\, X^{(k)}\right]}_{\text{(I): cluster complexity}}
\cdot
\underbrace{\mathbb{E}\!\left[\exp\left(-a\,E(z)\right) \,\middle|\, X^{(k)}\right]}_{\text{(II): entropy control}}
\cdot
\underbrace{(1 - 4U)}_{\text{(III): uncertainty penalty}},
\qquad \lambda^P = P_{\Delta_K}(\omega^P),
\]

where $a \in \mathbb{R}$, $\tilde{H} = \exp(\mathbb{H}(z))$ is the exponentiated entropy (an effective number of clusters), $E(z) = -\mathbb{H}(z)\log(\mathbb{H}(z))$, $p_{ij} = P(z_i = z_j \mid X^{(k)})$, and

\[
U = \frac{2}{n(n-1)} \sum_{i<j} p_{ij}(1 - p_{ij}).
\]

Term (I) vanishes at the two trivial partitions ($\tilde{H} = 1$ for a single cluster, $\tilde{H} = n$ for all singletons) and favors balanced cluster sizes; term (III) penalizes posterior uncertainty in the pairwise co-clustering probabilities (since $p_{ij}(1-p_{ij}) \leq 1/4$, we have $1 - 4U \in [0, 1]$); and when $a > 0$, term (II) penalizes entropy. Overall, $\lambda^P$ strongly penalizes both the case of a single cluster and that of all singletons, favors shards with low posterior uncertainty, and controls the desired entropy through the parameter $a$. Other constructions can also be used, depending on the desired properties of the shard posteriors.

Again, we recall that VCI can extend beyond partitions to any dimension-independent variables equipped with a well-defined metric on their space of realizations. The definition of barycenter weights would have to be adjusted accordingly. But in this work, we focus on random partitions, as they constitute a widely studied inference target.
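As a sketch of how shard-level weights might be computed in practice (our own illustration, using the entropy-based $\lambda^H$ with the power-$t$ simplex projection; $\lambda^P$ would replace the inner average by the product of terms (I)-(III)):

```python
import numpy as np

def entropy_weights(shard_samples, t=1.0):
    """lambda^H: expected-entropy weights from per-shard MCMC samples.

    shard_samples[k] is an (N_k x n) array of sampled label vectors from
    p_k(z | X^(k)); the power-t simplex projection follows the text."""
    def ent(z):
        _, counts = np.unique(z, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))
    omega = np.array([np.mean([ent(z) for z in S]) for S in shard_samples])
    w = omega ** t
    return w / w.sum()

rng = np.random.default_rng(0)
shards = [rng.integers(0, 3, size=(100, 50)),   # balanced-ish partitions
          rng.integers(0, 1, size=(100, 50))]   # degenerate single cluster
print(entropy_weights(shards))                  # upweights the first shard
```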
2.3 Interpretation via a Generalized Bayes Hierarchical Model and Variational Inference

While the consensus posterior in (4) is well-defined, it was not constructed as principled model-based inference. We now show that $\bar{p}$ can in fact be interpreted as a variational approximation of the posterior under the following hierarchical model:

\[
\begin{aligned}
p(X^{(k)} \mid z^{(k)}) &= p_k(X^{(k)} \mid z^{(k)}),\\
p(z^{(k)} \mid z) &\propto \exp\left[-\zeta_k\, c(z^{(k)}, z)\right],\\
p(z) &\propto \prod_{k=1}^{K} C_k(z),
\end{aligned} \tag{5}
\]

where $C_k(z) = \sum_{z^{(k)} \in \mathcal{Z}} \exp(-\zeta_k\, c(z^{(k)}, z))$ ($C_k(z) < \infty$ as $\mathcal{Z}$ is a finite space), $\zeta_k > 0$ for all $k = 1, \ldots, K$, and $c$ is a ground metric between partitions. The definition of $p(z^{(k)} \mid z)$ in the second line is analogous to centered partition processes [42] and can be seen as a generalized Bayes [11] prior. It constructs a conditional distribution by way of a loss function (metric $c$) rather than starting with an assumed sampling or prior model.

Next, we consider variational inference [12] to approximate the posterior on $z$ under model (5). We consider a variational family $\mathcal{Q} \subset \mathcal{P}(\mathcal{Z}^{K+1})$ and solve for

\[
q^\star = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}(q, p), \tag{6}
\]

where $\mathrm{KL}(q, p)$ is the Kullback–Leibler divergence. It is well known that (6) admits a dual problem as minimizing a negative evidence lower bound (NELBO):

\[
\min_{q \in \mathcal{Q}} \mathcal{L}(q) := \mathbb{E}_{(z, z^{(1):(K)}) \sim q}\left[-\log p(z, z^{(1):(K)}, X)\right] - H_q(z, z^{(1):(K)}). \tag{7}
\]

In (7), we minimize the NELBO over a joint distribution on $(z, z^{(1)}, \ldots, z^{(K)})$. We can cast the above problem as a nested optimization problem, i.e., minimizing over the marginals of the joint distribution and minimizing over the couplings of all marginals [63]. Letting $q_0(z), q_1(z^{(1)}), \ldots, q_K(z^{(K)})$ be the marginals of $q$, we can rewrite (7) as

\[
\min_{q_0 \in \mathcal{Q}_0, q_1 \in \mathcal{Q}_1, \ldots, q_K \in \mathcal{Q}_K} \;\; \min_{q \in \Pi(q_0, q_1, \ldots, q_K)} \mathcal{L}(q),
\]

where $\mathcal{Q}_0, \mathcal{Q}_1, \ldots, \mathcal{Q}_K \subset \mathcal{P}(\mathcal{Z})$ are the corresponding variational families for the marginals, and $\Pi(q_0, q_1, \ldots, q_K)$ is the set of joint probability measures of $(z, z^{(1)}, \ldots, z^{(K)})$ with marginals $q_0, q_1, \ldots, q_K$, respectively. At this moment we have written the variational problem to include optimization with respect to the coupling $q$. In the following result we go one step further and relate the variational solution to a Wasserstein barycenter of $q_1, \ldots, q_K$ (to be further related to the VCI posterior (4) below).

PROPOSITION 2.1. For $q_1, \ldots, q_K$ the marginals of the variational posterior $q(z^{(1)}, \ldots, z^{(K)})$, the entropic Wasserstein barycenter (4) corresponds to an upper bound for the minimal NELBO over all possible couplings of $q_1, \ldots, q_K$. Specifically:

\[
\min_{q \in \Pi(q_0, q_1, \ldots, q_K)} \mathcal{L}(q) \leq \sum_{k=1}^{K} \zeta_k W_{c, \frac{1}{K\zeta_k}}(q_0, q_k) - \sum_{k=1}^{K} \mathbb{E}_{z^{(k)} \sim q_k}\left[\log p_k(X^{(k)} \mid z^{(k)})\right] + C, \tag{8}
\]

where $C = \sum_{z \in \mathcal{Z}} \prod_{k=1}^{K} C_k(z) > 0$, for $C_k(z)$ in (5), is a constant.

The proof appears in Appendix A. Next, fixing $q_k$ as the shard posteriors, $q_k(z^{(k)}) = p_k(z^{(k)} \mid X^{(k)})$, minimizing the right-hand side of (8) with respect to $q_0$ leads to $\bar{p}(z \mid X, \lambda)$ in (4), as $\mathbb{E}_{z^{(k)} \sim q_k}[\log p_k(X^{(k)} \mid z^{(k)})]$ is a constant with respect to $q_0$. In summary, solving the barycenter in (4) can be seen as minimizing an upper bound of the NELBO with respect to the variational posterior (consensus posterior) $q_0$. In (8), if we divide each $\zeta_k$ by $\sum_{k=1}^{K} \zeta_k$, the minimization with respect to $q_0$ stays the same. Therefore, for the correspondence to (4), we have $\zeta_k = \frac{1}{K\epsilon_k}$ and $\lambda_k = \frac{\zeta_k}{\sum_{k=1}^{K} \zeta_k}$. We recall that multiplying the entropy $H_q(z, z^{(k)})$ in (11) in the proof of the proposition by any positive real number smaller than 1 still preserves the bound. Hence, the choice of $\lambda_k$ and $\epsilon_k$ is flexible as in (4).

The gap in (8) is tight when the entropy inequality (11) and the inequality in (12) in the proof of the proposition (in Appendix A) are tight. For (11), the inequality is tight when $H_q(z^{-(k)} \mid z, z^{(k)}) = 0$, meaning that the partitions (up to relabeling) $z^{-(k)}$ are deterministically determined by $z, z^{(k)}$ under the variational posterior $q$ for all $k = 1, \ldots, K$. Consequently, the uncertainty under $q$ of $(z, z^{(1)}, \ldots, z^{(K)})$ is fully determined once $z$ and any $z^{(k)}$ are known. The inequality (12) is tight when the optimal variational posterior has a star structure, i.e., we can decompose $q(z, z^{(1)}, \ldots, z^{(K)}) = q_0(z) \prod_{k=1}^{K} q(z^{(k)} \mid z)$.
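For intuition about the generalized Bayes prior $p(z^{(k)} \mid z) \propto \exp[-\zeta_k c(z^{(k)}, z)]$ in (5), the following sketch (ours, not from the paper) enumerates the partition space for a small $n$ and normalizes the prior explicitly, using the Binder loss [10] as the ground metric purely for simplicity:

```python
import numpy as np

def all_partitions(n):
    """Enumerate all set partitions of n items as restricted growth
    strings (canonical cluster-label vectors)."""
    out = []
    def grow(prefix, maxlab):
        if len(prefix) == n:
            out.append(np.array(prefix))
            return
        for lab in range(maxlab + 2):   # reuse a label or open a new cluster
            grow(prefix + [lab], max(maxlab, lab))
    grow([0], 0)
    return out

def binder(z1, z2):
    """Binder loss [10]: number of discordant pairs between two partitions."""
    iu = np.triu_indices(len(z1), k=1)
    a1 = (z1[:, None] == z1[None, :])[iu]
    a2 = (z2[:, None] == z2[None, :])[iu]
    return float(np.sum(a1 != a2))

def gb_prior(z_center, zeta, c=binder):
    """p(z^(k) | z) proportional to exp(-zeta * c(z^(k), z)) over the finite
    partition space, as in model (5); the normalizer is C_k(z)."""
    Z = all_partitions(len(z_center))
    w = np.array([np.exp(-zeta * c(zk, z_center)) for zk in Z])
    return Z, w / w.sum()

Z, probs = gb_prior(np.array([0, 0, 1, 1]), zeta=1.0)
print(len(Z), round(probs.max(), 3))   # 15 partitions; mode at the center z
```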
2.4 Computational Aspects

We now discuss how to solve the entropic Wasserstein barycenter problem in practice. As the space of partitions is discrete, we can write any probability measure on $\mathcal{Z}$ as $p(z) = \sum_{i=1}^{|\mathcal{Z}|} \alpha_i \delta_{z_i}$, where $\alpha = (\alpha_1, \ldots, \alpha_{|\mathcal{Z}|}) \in \Delta_{|\mathcal{Z}|}$. Without loss of generality, we can write the $k$-th shard posterior as $p_k(z \mid X^{(k)}) = \sum_{i=1}^{|\mathcal{Z}|} \alpha^{(k)}_i \delta_{z_i}$, where $\alpha^{(k)} = (\alpha^{(k)}_1, \ldots, \alpha^{(k)}_{|\mathcal{Z}|}) \in \Delta_{|\mathcal{Z}|}$. Therefore, we can rewrite the consensus posterior problem in (4) as

\[
\min_{\alpha \in \Delta_{|\mathcal{Z}|}} \sum_{k=1}^{K} \lambda_k W_{c,\epsilon_k}(\alpha, \alpha^{(k)}; M), \tag{9}
\]

where $M \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{Z}|}_+$ is the cost matrix with $M_{ij} = c(z_i, z_j)$. The optimization in (9) is widely known as a fixed support barycenter problem [3, 17]. We can further rewrite (9) as

\[
\min_{\alpha \in \Delta_{|\mathcal{Z}|}} \sum_{k=1}^{K} \lambda_k \min_{\gamma_k \in \Gamma(\alpha, \alpha^{(k)})} \langle \gamma_k, M \rangle - \epsilon_k E(\gamma_k),
\]

where $\Gamma(\alpha, \alpha^{(k)}) = \{\gamma_k \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{Z}|}_+ \mid \gamma_k \mathbf{1} = \alpha,\; \gamma_k^\top \mathbf{1} = \alpha^{(k)}\}$, $E(\gamma_k) = -\sum_{i=1}^{|\mathcal{Z}|} \sum_{j=1}^{|\mathcal{Z}|} \gamma_{k,ij} \log \gamma_{k,ij}$, and $\langle \gamma_k, M \rangle$ is the sum over $\mathcal{Z} \times \mathcal{Z}$ as in (3). Letting $\gamma = (\gamma_1, \ldots, \gamma_K)$, the optimization is equivalent to

\[
\min_{\gamma} \sum_{k=1}^{K} \lambda_k \left[\langle \gamma_k, M \rangle - \epsilon_k E(\gamma_k)\right],
\]

subject to $\gamma_k \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{Z}|}_+$, $\gamma_k^\top \mathbf{1} = \alpha^{(k)}$ for $k = 1, \ldots, K$, and $\gamma_1 \mathbf{1} = \gamma_2 \mathbf{1} = \cdots = \gamma_K \mathbf{1}$. We can interpret the constraint on $\gamma$ as the intersection of two constraint sets, $\mathcal{C}_1 = \{\gamma \in (\mathbb{R}^{|\mathcal{Z}| \times |\mathcal{Z}|}_+)^K \mid \gamma_k^\top \mathbf{1} = \alpha^{(k)},\; \forall k = 1, \ldots, K\}$ and $\mathcal{C}_2 = \{\gamma \in (\mathbb{R}^{|\mathcal{Z}| \times |\mathcal{Z}|}_+)^K \mid \exists \alpha \in \Delta_{|\mathcal{Z}|},\; \gamma_k \mathbf{1} = \alpha,\; \forall k = 1, \ldots, K\}$. By defining $\mathrm{KL}_\lambda(\gamma, \xi) = \sum_{k=1}^{K} \lambda_k \mathrm{KL}(\gamma_k, \xi_k)$ with $\mathrm{KL}(\gamma_k, \xi_k) = \sum_{i=1}^{|\mathcal{Z}|} \sum_{j=1}^{|\mathcal{Z}|} \gamma_{k,ij}\left(\log \frac{\gamma_{k,ij}}{\xi_{k,ij}} - 1\right)$ (abusing KL notation), we can write the optimization problem in its final form:

\[
\min_{\gamma \in \mathcal{C}_1 \cap \mathcal{C}_2} \mathrm{KL}_\lambda(\gamma, \xi), \tag{10}
\]

where $\xi_k = \exp(-M/\epsilon_k)$.

The optimization in (10) is known to have a unique solution, which can be obtained using iterative Bregman projections [9]. In summary, the algorithm iteratively performs two steps: (1) $\gamma = \arg\min_{\gamma \in \mathcal{C}_1} \mathrm{KL}_\lambda(\gamma, \xi)$, and (2) $\gamma = \arg\min_{\gamma \in \mathcal{C}_2} \mathrm{KL}_\lambda(\gamma, \xi)$. The two steps have the following closed-form updates: for $k = 1, \ldots, K$,

1. $\gamma_k = \gamma_k\, \mathrm{diag}\!\left(\frac{\alpha^{(k)}}{\gamma_k^\top \mathbf{1}}\right)$,
2. $\gamma_k = \mathrm{diag}\!\left(\frac{\alpha}{\gamma_k \mathbf{1}}\right) \gamma_k$, where $\alpha = \prod_{k=1}^{K} (\gamma_k \mathbf{1})^{\lambda_k}$ ($\prod$ and $(\cdot)^{\lambda_k}$ are entry-wise operators).

As mentioned, the algorithm converges to the unique solution as the number of iterations tends to infinity. In practice, there are some clever implementation choices, such as normalizing $\alpha$ at each iteration, and memory-efficient and parallel implementations (see details in [9]).

Considering all partitions in $\mathcal{Z}$ as atoms of the distributions is computationally impractical, as the number of possible partitions is enormous. Therefore, we need to restrict partitions to a feasible set. For example, we use the much smaller sets of posterior MCMC samples from $p_k(z \mid X^{(k)})$, or samples from its variational approximations. In particular, letting $z^{(k)}_1, \ldots, z^{(k)}_{N_k} \sim p_k(z \mid X^{(k)})$ ($N_k > 0$ small compared to $|\mathcal{Z}|$), we can approximate $p_k(z \mid X^{(k)}) \approx \frac{1}{N_k} \sum_{i=1}^{N_k} \delta_{z^{(k)}_i}$ and consider finding the consensus posterior $\bar{p}(z \mid X, \lambda) = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \alpha_{ki} \delta_{z^{(k)}_i}$, where $(\alpha_{11}, \ldots, \alpha_{1N_1}, \ldots, \alpha_{K1}, \ldots, \alpha_{KN_K}) \in \Delta_{\sum_k N_k}$. As we do not need the consensus barycenter to have shared atoms with the shard posteriors, we can further reduce the number of atoms of the consensus posterior using shard posterior summarization [6, 18, 39, 60] or random sampling. For these cases, we just update $\xi$ in (10), i.e., $\xi_k = \exp(-M^{(k)}/\epsilon_k)$, with $M^{(k)}$ being the ground cost matrix between the atoms of the consensus posterior and the atoms of the $k$-th shard posterior. We refer the reader to Figure 1 for a simple illustration of VCI.

[FIG 1. A simple illustration of VCI with 2 shards and 4 posterior samples per shard: (1) dividing the data $X$ into vertical shards $X^{(1)}, X^{(2)}$; (2) obtaining shard posteriors $p_1(z \mid X^{(1)}) \approx \frac{1}{4}\sum_{i=1}^{4} \delta_{z^{(1)}_i}$ and $p_2(z \mid X^{(2)}) \approx \frac{1}{4}\sum_{i=1}^{4} \delta_{z^{(2)}_i}$ on $(\mathcal{Z}, c)$; (3) aggregating the shard posteriors by solving $\min_{\alpha \in \Delta_8} \sum_{k=1}^{2} \lambda_k W_{c,\epsilon_k}(\alpha, 1/4; M^{(k)})$ for the consensus weights $\alpha_1, \ldots, \alpha_8$.]

In principle, we can also search for the atoms of the consensus posterior during optimization (known as the free support barycenter [17]); however, searching the space of partitions $\mathcal{Z}$ is expensive. Therefore, we leave this investigation to future work.
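The two closed-form updates translate into a short fixed-support barycenter solver; the following is our own numpy sketch (not the authors' code), supporting shard-specific cost matrices $M^{(k)}$ and regularizations $\epsilon_k$:

```python
import numpy as np

def fixed_support_barycenter(alphas, Ms, lam, eps, iters=1000):
    """Entropic fixed-support Wasserstein barycenter via the iterative
    Bregman projections of [9], using the two closed-form updates above.

    alphas[k]: shard posterior weights (N_k,); Ms[k]: cost matrix (m0 x N_k)
    between consensus atoms and shard atoms; lam: barycenter weights (K,).
    Returns the consensus weights alpha (m0,)."""
    gammas = [np.exp(-M / e) for M, e in zip(Ms, eps)]   # xi_k = exp(-M^(k)/eps_k)
    for _ in range(iters):
        # Step 1: project onto C1 (match shard marginals alpha^(k)).
        gammas = [g * (a / g.sum(axis=0))[None, :]
                  for g, a in zip(gammas, alphas)]
        # Step 2: project onto C2 (equal consensus marginals), with the
        # lambda-weighted geometric mean alpha = prod_k (gamma_k 1)^{lam_k}.
        row = np.stack([g.sum(axis=1) for g in gammas])
        alpha = np.exp(lam @ np.log(row + 1e-300))
        alpha /= alpha.sum()            # normalize alpha for stability
        gammas = [(alpha / r)[:, None] * g for g, r in zip(gammas, row)]
    return alpha

# Toy usage with 2 shards sharing a 3-atom support and toy costs:
a1 = np.array([0.5, 0.3, 0.2]); a2 = np.array([0.2, 0.2, 0.6])
M = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
print(fixed_support_barycenter([a1, a2], [M, M],
                               lam=np.array([0.5, 0.5]), eps=[0.05, 0.05]))
```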
3. EXPERIMENTS

We demonstrate VCI in three scenarios: Scenario 1 validates VCI as an approximation of model-based clustering in low-dimensional problems; Scenario 2 verifies that VCI can recognize the noisy dimensions in a moderate-dimensional problem with irrelevant features; and Scenario 3 is designed to validate VCI as meaningful inference with high-dimensional data.

3.1 Scenario 1

For an easy and well-recognized example we consider the two-dimensional Old Faithful Geyser dataset [5]. We fit conjugate truncated Dirichlet process mixtures of Gaussians [26] to this dataset. The posterior distribution $p_0(z \mid X)$ of the random partition based on the full data is treated as the ground truth, represented as an empirical distribution over 1,000 MCMC samples after a burn-in of 9,000 samples. Hyperparameters are kept consistent between the full model and the models for shards. We partition the data into two one-dimensional shards. For each shard, we obtain 1,000 MCMC samples after a burn-in of 9,000 iterations. The consensus posterior is constructed as a discrete distribution supported on the union of the two sets of MCMC samples, with weights determined by (9) using $\epsilon_k = 0.05$ (the smallest value that remains numerically stable). For the barycenter weights, we consider three choices: uniform weights, the entropy-based weights $\lambda^H$, and the proposed weights $\lambda^P$ ($a = 1$ as the default choice), as described in Section 2.2. We report Wasserstein distances, using the variation of information (VoI) ground metric, between the target posterior and the shard posteriors, their mixtures, and the consensus posteriors in Table 1. We observe that the consensus posteriors achieve the smallest distances, outperforming even the best individual shard posterior. The weights $\lambda^P$ provide the lowest distance.

TABLE 1
Scenario 1. Results on the Old Faithful Geyser dataset. The right column reports distances to the ground truth $p_0(z \mid X)$.

Posterior                                          W_VoI(·, p_0(z | X))
p_1(z | X^(1))                                     0.3030
p_2(z | X^(2))                                     0.3629
(1/K) Σ_k p_k(z | X^(k))                           0.3142
p̄(z | X, 1/K)                                      0.2510
Σ_k λ^H_k p_k(z | X^(k))                           0.3139
p̄(z | X, λ^H)                                      0.2501
Σ_k λ^P_k p_k(z | X^(k))                           0.3109
p̄(z | X, λ^P)                                      0.2465

3.2 Scenario 2

We add 18 dimensions to the Old Faithful Geyser dataset (see the sketch after Table 2). The first 10 dimensions are sampled as $n$ observations from a Gaussian with mean $(3, 70, \ldots, 3, 70)^\top$ and covariance $\mathrm{diag}([4, 36, \ldots, 4, 36])$, while the last 8 dimensions are sampled as $n$ observations from a Gaussian with mean $(1, 10, \ldots, 1, 10)^\top$ and covariance $\mathrm{diag}([1, 4, \ldots, 1, 4])$. We then repeat the same procedure as in Scenario 1, but with 10 shards of 2 dimensions each. For each shard, we save 100 posterior MCMC samples after a burn-in of 9,900. Table 2 reports Wasserstein distances to the clean posterior (under the first shard). For shards including noisy dimensions the distances range from 0.88 to 3.43. The consensus barycenters mitigate the contamination with the noisy dimensions, as desired. For the choice of barycenter weights, the weights $\lambda^P$ ($a = 10$, to penalize a large number of clusters) result in the lowest distance.

TABLE 2
Scenario 2. Results on the noisy Old Faithful Geyser dataset. The right column reports distances to the posterior under the first two (meaningful) dimensions.

Posterior                                          W_VoI(·, p_1(z | X^(1)))
p_0(z | X)                                         0.9464
p_1(z | X^(1))                                     0
p_k(z | X^(k)), k = 2, ..., 10                     [0.8847, 3.4301]
(1/K) Σ_k p_k(z | X^(k))                           1.9794
p̄(z | X, 1/K)                                      0.8850
Σ_k λ^H_k p_k(z | X^(k))                           3.2301
p̄(z | X, λ^H)                                      0.8987
Σ_k λ^P_k p_k(z | X^(k))                           0.2154
p̄(z | X, λ^P)                                      0.0031
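For reproducibility-minded readers, here is a small sketch (ours) of the Scenario 2 noise construction; `X_geyser` is a placeholder for the $(n \times 2)$ Old Faithful data, and $n = 272$ assumes the standard version of that dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 272  # number of Old Faithful observations (standard dataset; assumption)

# Scenario 2 noise: 10 dims ~ N((3,70,...), diag(4,36,...)) and
# 8 dims ~ N((1,10,...), diag(1,4,...)), appended to the 2 real dims.
mean10 = np.tile([3.0, 70.0], 5);  sd10 = np.sqrt(np.tile([4.0, 36.0], 5))
mean8  = np.tile([1.0, 10.0], 4);  sd8  = np.sqrt(np.tile([1.0, 4.0], 4))
noise = np.hstack([rng.normal(mean10, sd10, size=(n, 10)),
                   rng.normal(mean8,  sd8,  size=(n, 8))])
# X_noisy = np.hstack([X_geyser, noise])   # X_geyser: the (n x 2) data
print(noise.shape)   # (272, 18)
```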
3.3 Scenario 3

We consider high-dimensional single-cell data with 25,348 dimensions/genes [1]. After pre-processing, we obtain 800 cells from 10 cell types annotated by known cell markers (observed pseudo labels). For model-based inference on random partitions we fit the following model:

\[
X_i \mid \theta_i \sim \prod_{d=1}^{p} \mathrm{Poisson}(N_i \theta_{id}), \qquad
\theta_i \mid G \sim G, \qquad
G \sim \mathrm{DP}\!\left(\alpha,\; \prod_{d=1}^{p} \mathrm{Gamma}(a, b)\right),
\]

where $N_i = \sum_{d=1}^{p} X_{id}$ is the sequencing depth. We consider $K = 100$ shards with approximately equal dimensions. We fit the model on the full data and on each shard, saving 100 MCMC samples after a burn-in of 900. For the consensus posteriors we use the last 10 MCMC samples per shard to construct empirical shard posteriors (over partitions) with 1,000 atoms. For the barycenters, we use $\epsilon_k = 0.05$ for the entropic regularization. In Table 3, we report expected VoI distances relative to the observed pseudo labels. For the full posterior, the expected distance is 2.3026. For the shards, the expected distances range from 1.1897 to 2.5237. All consensus posteriors achieve lower expected distances than the full posterior. Regarding the choice of barycenter weights, those based on entropy yield the smallest distances, as all shards produce relatively few clusters compared to the pseudo labels. We also observe that smaller shards tend to generate more clusters; however, this does not increase the expected VoI distance to the pseudo truth. It is important to note that the labels are pseudo, so they do not fully capture the true quality of the methods.

TABLE 3
Scenario 3. Results on the single-cell dataset. The right column reports expected VoI distances to the pseudo truth $z^\star$.

Posterior                                          E[VoI(·, z⋆)]
p(z | X)                                           2.3026
p_k(z | X^(k)), k = 1, ..., 100                    [1.1897, 2.5237]
(1/K) Σ_k p_k(z | X^(k))                           1.7901
p̄(z | X, 1/K)                                      1.7756
Σ_k λ^H_k p_k(z | X^(k))                           1.6036
p̄(z | X, λ^H)                                      1.5496
Σ_k λ^P_k p_k(z | X^(k))                           1.7375
p̄(z | X, λ^P)                                      1.7194

4. CONCLUSION

VCI provides a principled and scalable framework for high-dimensional Bayesian clustering. By operating on vertical shards and leveraging an entropic-regularized Wasserstein barycenter, it mitigates the curse of dimensionality while retaining interpretability through a generalized Bayes perspective. The framework favors shards with desired properties through the choice of barycenter weights. Empirical results demonstrate that VCI achieves accurate approximations in low-dimensional settings and performs favorably in high-dimensional scenarios with many irrelevant features. Future work may explore alternative regularized Wasserstein barycenter formulations [14], which could correspond to variational inference under probabilistic discrepancies beyond the KL divergence [28]. Another direction is to iteratively update both the variational and consensus posteriors in (8), rather than fixing the variational distributions to be the shard posteriors. Developing principled approaches for constructing problem-specific barycenter weights remains an important challenge. Extending VCI beyond random partitions to more general dimension-independent parameterizations is another promising direction. Finally, improving the computational efficiency of barycenter estimation, e.g., through sliced optimal transport [13, 38], is another important direction for future work.
APPENDIX A: PROOF OF PROPOSITION 2.1

First, we utilize the fact that the entropy of discrete random variables is non-negative to obtain a lower bound on the joint entropy $H_q(z, z^{(1):(K)})$:

\[
\begin{aligned}
H_q(z, z^{(1):(K)}) &= H_q(z) + H_q(z^{(1):(K)} \mid z)\\
&= H_q(z) + \frac{1}{K} \sum_{k=1}^{K}\left[H_q(z^{(k)} \mid z) + H_q(z^{-(k)} \mid z, z^{(k)})\right]\\
&\geq H_q(z) + \frac{1}{K} \sum_{k=1}^{K} H_q(z^{(k)} \mid z) = \frac{1}{K} \sum_{k=1}^{K} H_q(z, z^{(k)}),
\end{aligned} \tag{11}
\]

where $H_q(z^{(1):(K)} \mid z) = -\mathbb{E}_{(z, z^{(1):(K)}) \sim q}[\log q(z^{(1):(K)} \mid z)]$ is the conditional entropy. We then define the upper bound

\[
\bar{\mathcal{L}}(q) = \mathbb{E}_{(z, z^{(1):(K)}) \sim q}\left[-\log p(z, z^{(1):(K)}, X)\right] - \frac{1}{K} \sum_{k=1}^{K} H_q(z, z^{(k)}) \geq \mathcal{L}(q).
\]

We further have

\[
\log p(z, z^{(1):(K)}, X) = \log\left[p(z) \prod_{k=1}^{K} p(z^{(k)} \mid z)\, p_k(X^{(k)} \mid z^{(k)})\right]
= \sum_{k=1}^{K} \left[-\zeta_k c(z^{(k)}, z) + \log p_k(X^{(k)} \mid z^{(k)})\right] - C,
\]

for a constant $C = \sum_{z \in \mathcal{Z}} \prod_{k=1}^{K} C_k(z) > 0$. Let $q_{0,k}(z, z^{(k)}) \in \Pi(q_0, q_k)$ be the marginal of the joint $q(z, z^{(1):(K)})$ at $(z, z^{(k)})$, and let $\Pi^*(q_0, q_1, \ldots, q_K)$ be the set of star couplings such that $q(z, z^{(1)}, \ldots, z^{(K)}) = q_0(z) \prod_{k=1}^{K} q(z^{(k)} \mid z)$ for any $q \in \Pi^*$. We have:

\[
\begin{aligned}
\min_{q \in \Pi(q_0, q_1, \ldots, q_K)} \mathcal{L}(q)
&\leq \min_{q \in \Pi(q_0, q_1, \ldots, q_K)} \bar{\mathcal{L}}(q)\\
&= \min_{q \in \Pi(q_0, q_1, \ldots, q_K)} \mathbb{E}_{(z, z^{(1):(K)}) \sim q}\left[\sum_{k=1}^{K}\left(\zeta_k c(z^{(k)}, z) - \log p_k(X^{(k)} \mid z^{(k)}) + \tfrac{1}{K}\log q(z, z^{(k)})\right)\right] + C\\
&\leq \min_{q \in \Pi^*(q_0, q_1, \ldots, q_K)} \mathbb{E}_{(z, z^{(1):(K)}) \sim q}\left[\sum_{k=1}^{K}\left(\zeta_k c(z^{(k)}, z) - \log p_k(X^{(k)} \mid z^{(k)}) + \tfrac{1}{K}\log q(z, z^{(k)})\right)\right] + C \qquad (12)\\
&= \sum_{k=1}^{K} \min_{q_{0,k} \in \Pi(q_0, q_k)} \mathbb{E}_{(z, z^{(k)}) \sim q_{0,k}}\left[\zeta_k c(z^{(k)}, z) - \log p_k(X^{(k)} \mid z^{(k)}) + \tfrac{1}{K}\log q_{0,k}(z, z^{(k)})\right] + C\\
&= \sum_{k=1}^{K} \zeta_k \min_{q_{0,k} \in \Pi(q_0, q_k)} \left(\mathbb{E}_{(z, z^{(k)}) \sim q_{0,k}}\left[c(z^{(k)}, z)\right] - \tfrac{1}{K\zeta_k} H_{q_{0,k}}(z, z^{(k)})\right) - \sum_{k=1}^{K} \mathbb{E}_{z^{(k)} \sim q_k}\left[\log p_k(X^{(k)} \mid z^{(k)})\right] + C\\
&= \sum_{k=1}^{K} \zeta_k W_{c, \frac{1}{K\zeta_k}}(q_0, q_k) - \sum_{k=1}^{K} \mathbb{E}_{z^{(k)} \sim q_k}\left[\log p_k(X^{(k)} \mid z^{(k)})\right] + C, \qquad (13)
\end{aligned}
\]

which completes the proof.

REFERENCES

[1] 10X GENOMICS (2024). 5k Human PBMCs (Donor 1) with Automated Cell Annotation. Dataset, License: Creative Commons Attribution 4.0 International (CC BY 4.0).
[2] AGRAWAL, R., GEHRKE, J., GUNOPULOS, D. and RAGHAVAN, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data 94-105.
[3] AGUEH, M. and CARLIER, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43 904-924.
[4] ALTSCHULER, J., NILES-WEED, J. and RIGOLLET, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 1964-1974.
[5] AZZALINI, A. and BOWMAN, A. W. (1990). A look at some data on the Old Faithful geyser. Journal of the Royal Statistical Society: Series C (Applied Statistics) 39 357-365.
[6] BALOCCHI, C. and WADE, S. (2025). Understanding uncertainty in Bayesian cluster analysis. arXiv preprint arXiv:2506.16295.
[7] BANERJEE, A., MERUGU, S., DHILLON, I. S. and GHOSH, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research 6 1705-1749.
[8] BEAUMONT, M. A., ZHANG, W. and BALDING, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics 162 2025-2035.
[9] BENAMOU, J.-D., CARLIER, G., CUTURI, M., NENNA, L. and PEYRÉ, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37 A1111-A1138.
[10] BINDER, D. A. (1978). Bayesian cluster analysis. Biometrika 65 31-38.
[11] BISSIRI, P. G., HOLMES, C. C. and WALKER, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology 78 1103-1130.
[12] BLEI, D. M., KUCUKELBIR, A. and MCAULIFFE, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association 112 859-877.
[13] BONNEEL, N., RABIN, J., PEYRÉ, G. and PFISTER, H. (2015). Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision 1 22-45.
[14] BRESCH, J. and STEIN, V. (2026). Interpolating between optimal transport and KL regularized optimal transport using Rényi divergences. Results in Mathematics 81 23.
[15] CHANDRA, N. K., CANALE, A. and DUNSON, D. B. (2023). Escaping the curse of dimensionality in Bayesian model-based clustering. Journal of Machine Learning Research 24 1-42.
[16] CUTURI, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc.
[17] CUTURI, M. and DOUCET, A. (2014). Fast computation of Wasserstein barycenters. In International Conference on Machine Learning 685-693. PMLR.
[18] DAHL, D. B., JOHNSON, D. J. and MÜLLER, P. (2022). Search algorithms and loss functions for Bayesian clustering. Journal of Computational and Graphical Statistics 31 1189-1201.
[19] DE BLASI, P., FAVARO, S., LIJOI, A., MENA, R. H., PRÜNSTER, I. and RUGGIERO, M. (2013). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence 37 212-229.
[20] DOUCET, A., DE FREITAS, N. and GORDON, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice 3-14. Springer.
[21] DUDOIT, S. and FRIDLYAND, J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19 1090-1099.
[22] ENTEZARI, R., CRAIU, R. V. and ROSENTHAL, J. S. (2018). Likelihood inflating sampling algorithm. Canadian Journal of Statistics 46 147-175.
[23] FRITSCH, A. and ICKSTADT, K. (2009). Improved criteria for clustering based on the posterior similarity matrix. Bayesian Analysis 4 367-391. https://doi.org/10.1214/09-BA414
[24] GEYER, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science 473-483.
[25] HUANG, Z. and GELMAN, A. (2005). Sampling for Bayesian computation with large datasets. Available at SSRN 1010107.
[26] ISHWARAN, H. and JAMES, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96 161-173.
[27] KINGMAN, J. F. (1978). The representation of partition structures. Journal of the London Mathematical Society 2 374-380.
[28] KNOBLAUCH, J., JEWSON, J. and DAMOULAS, T. (2022). An optimization-centric view on Bayes' rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research 23 1-109.
[29] LAU, J. W. and GREEN, P. J. (2007). Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics 16 526-558.
[30] LIJOI, A., MENA, R. H. and PRÜNSTER, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association 100 1278-1291.
[31] LIJOI, A., MENA, R. H. and PRÜNSTER, I. (2007). Controlling the reinforcement in Bayesian non-parametric mixture models. Journal of the Royal Statistical Society Series B: Statistical Methodology 69 715-740.
[32] LIU, Y., KANG, Y., ZOU, T., PU, Y., HE, Y., YE, X., OUYANG, Y., ZHANG, Y.-Q. and YANG, Q. (2024). Vertical federated learning: Concepts, advances, and challenges. IEEE Transactions on Knowledge and Data Engineering 36 3615-3634.
[33] LO, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics 351-357.
[34] MEILĂ, M. (2007). Comparing clusterings: an information based distance. Journal of Multivariate Analysis 98 873-895.
[35] MINKA, T. P. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence 362-369.
[36] MINSKER, S., SRIVASTAVA, S., LIN, L. and DUNSON, D. (2014). Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning 1656-1664. PMLR.
[37] NEISWANGER, W., WANG, C. and XING, E. P. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence 623-632.
[38] NGUYEN, K. (2025). An introduction to sliced optimal transport: foundations, advances, extensions, and applications. Foundations and Trends® in Computer Graphics and Vision 17 171-391.
[39] NGUYEN, K. and MUELLER, P. (2026). Summarizing nonparametric Bayesian mixture posteriors: sliced optimal transport metrics for Gaussian mixtures. Journal of Computational and Graphical Statistics, just-accepted, 1-22.
[40] NGUYEN, V., EPPS, J. and BAILEY, J. (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11 2837-2854.
[41] NI, Y., JI, Y. and MÜLLER, P. (2020). Consensus Monte Carlo for random subsets using shared anchors. Journal of Computational and Graphical Statistics 29 703-714.
[42] PAGANIN, S., HERRING, A. H., OLSHAN, A. F., DUNSON, D. B. and THE NATIONAL BIRTH DEFECTS PREVENTION STUDY (2020). Centered partition processes: Informative priors for clustering (with discussion). Bayesian Analysis 16 301.
[43] PARSONS, L., HAQUE, E. and LIU, H. (2004). Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter 6 90-105.
[44] PEYRÉ, G., CUTURI, M. et al. (2019). Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11 355-607.
[45] PITMAN, J. and YOR, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 855-900.
[46] RABINOVICH, M., ANGELINO, E. and JORDAN, M. I. (2015). Variational consensus Monte Carlo. Advances in Neural Information Processing Systems 28.
[47] RAHMAN, S. and JOHNSON, V. E. (2018). A fast algorithm for clustering high dimensional feature vectors. arXiv preprint arXiv:1811.00956.
[48] RAHMAN, S., JOHNSON, V. E. and RAO, S. S. (2022). A hyperparameter-free, fast and efficient framework to detect clusters from limited samples based on ultra high-dimensional features. IEEE Access 10 116844-116857.
[49] RAHMAN, S., JOHNSON, V. E. and RAO, S. S. (2022). Using the left Gram matrix to cluster high dimensional data. arXiv preprint arXiv:2202.08236.
[50] RAND, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 846-850.
[51] RENDELL, L. J., JOHANSEN, A. M., LEE, A. and WHITELEY, N. (2020). Global consensus Monte Carlo. Journal of Computational and Graphical Statistics 30 249-259.
[52] SCOTT, S. L. (2017). Comparing consensus Monte Carlo strategies for distributed Bayesian computation. Brazilian Journal of Probability and Statistics 668-685.
[53] SCOTT, S. L., BLOCKER, A. W., BONASSI, F. V., CHIPMAN, H. A., GEORGE, E. I. and MCCULLOCH, R. E. (2022). Bayes and big data: The consensus Monte Carlo algorithm. In Big Data and Information Theory 8-18. Routledge.
[54] SRIVASTAVA, S., CEVHER, V., DINH, Q. and DUNSON, D. (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics 912-920. PMLR.
[55] SRIVASTAVA, S., LI, C. and DUNSON, D. B. (2018). Scalable Bayes via barycenter in Wasserstein space. Journal of Machine Learning Research 19 1-35.
[56] STREHL, A. and GHOSH, J. (2002). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3 583-617.
[57] THRUN, M. C. and ULTSCH, A. (2021). Using projection-based clustering to find distance- and density-based clusters in high-dimensional data. Journal of Classification 38 280-312.
[58] VELDHUIS, R. N. (2002). The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters 9 96-99.
[59] VILLANI, C. (2009). Optimal Transport: Old and New 338. Springer.
[60] WADE, S. and GHAHRAMANI, Z. (2018). Bayesian cluster analysis: Point estimation and credible balls (with discussion). Bayesian Analysis 13 559-626.
[61] WANG, X. and DUNSON, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
[62] WHITE, S. R., KYPRAIOS, T. and PRESTON, S. (2015). Piecewise approximate Bayesian computation: fast inference for discretely observed Markov models using a factorised posterior distribution. Statistics and Computing 25 289-301.
[63] WU, B. and BLEI, D. M. (2026). Extending mean-field variational inference via entropic regularization: theory and computation. Journal of Machine Learning Research 27 1-68.