Block clustering with collapsed latent block models

We introduce a Bayesian extension of the latent block model for model-based block clustering of data matrices. Our approach considers a block model in which the block parameters may be integrated out, yielding a posterior defined over the numbers of row and column clusters and the cluster memberships. The numbers of row and column clusters need not be known in advance, as they are sampled along with the cluster memberships using Markov chain Monte Carlo. This differs from existing work on latent block models, where the number of clusters is assumed known or is chosen using an information criterion. We analyze both simulated and real data to validate the technique.


💡 Research Summary

The paper presents a fully Bayesian extension of the latent block model (LBM) that enables simultaneous inference of row‑ and column‑cluster memberships and the numbers of clusters themselves. Traditional LBMs assume that the numbers of row clusters (K) and column clusters (L) are known a priori or select them post‑hoc using information criteria such as AIC, BIC, or ICL. This assumption limits applicability when the underlying block structure is unknown.

To overcome this limitation, the authors introduce a “collapsed” Bayesian LBM. They place conjugate priors on the block‑specific parameters (e.g., Beta–Bernoulli for binary data, Normal–Inverse‑Wishart for continuous data) and analytically integrate these parameters out of the joint posterior. The resulting collapsed likelihood depends only on sufficient statistics of each block (counts for binary data, sums of squares for Gaussian data), dramatically reducing the dimensionality of the parameter space and simplifying computation.
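For the binary case, the collapsed likelihood of a single block has a closed form: integrating a Bernoulli parameter against its Beta prior leaves a ratio of Beta functions that depends only on the count of ones and the block size. A minimal sketch of this marginal (function name and default hyper-parameters are illustrative, not from the paper):

```python
import math

def log_marginal_binary_block(n_ones, n_total, alpha=1.0, beta=1.0):
    """Log marginal likelihood of one binary block under a Beta(alpha, beta)
    prior, with the Bernoulli parameter integrated out (Beta-Bernoulli).
    Depends only on the block's sufficient statistics: n_ones and n_total."""
    n_zeros = n_total - n_ones
    return (math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
            + math.lgamma(alpha + n_ones) + math.lgamma(beta + n_zeros)
            - math.lgamma(alpha + beta + n_total))
```

Summing this quantity over all K × L blocks gives the collapsed likelihood of the whole matrix; the Gaussian case is analogous, with Normal–Inverse-Wishart integrals replacing the Beta functions.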

Crucially, the numbers of clusters K and L are themselves treated as random variables with discrete priors (e.g., Poisson or geometric). During Markov chain Monte Carlo (MCMC) sampling, the algorithm proposes birth‑death or split‑merge moves that can increase or decrease K and L, allowing the model to explore different dimensionalities. Row‑cluster assignments z_i and column‑cluster assignments w_j are updated via Gibbs steps conditioned on the current column (or row) configuration, using the collapsed likelihood and Dirichlet priors on the allocation probabilities. A Metropolis–Hastings acceptance ratio governs the cluster‑number moves, incorporating the prior on K/L and the change in collapsed likelihood.
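The collapsed Gibbs update for a row label can be sketched as follows: with row i removed from the sufficient statistics, the full conditional for z_i combines a Dirichlet prior weight with the change in collapsed likelihood of the blocks the row would join. This is a sketch under the Beta–Bernoulli case; all function and argument names are hypothetical, not from the paper:

```python
import numpy as np
from math import lgamma

def beta_binom_delta(ones, total, d_ones, d_total, a=1.0, b=1.0):
    """Change in a block's collapsed Beta-Bernoulli log-likelihood when
    d_total extra cells (d_ones of them equal to 1) are added to it."""
    zeros, d_zeros = total - ones, d_total - d_ones
    return (lgamma(a + ones + d_ones) - lgamma(a + ones)
            + lgamma(b + zeros + d_zeros) - lgamma(b + zeros)
            + lgamma(a + b + total) - lgamma(a + b + total + d_total))

def gibbs_row_step(x_row, w, block_ones, block_tot, row_counts,
                   alpha=1.0, rng=None):
    """One collapsed Gibbs draw for a single row label z_i, conditioned on the
    current column partition w. block_ones / block_tot hold per-block
    sufficient statistics with row i already removed; row_counts are the
    sizes of the K row clusters (also with row i removed)."""
    rng = rng or np.random.default_rng()
    K, L = block_ones.shape
    # This row's contribution to each column cluster's sufficient statistics.
    d_ones = np.array([x_row[w == l].sum() for l in range(L)])
    d_tot = np.array([(w == l).sum() for l in range(L)])
    logp = np.empty(K)
    for k in range(K):
        lp = np.log(row_counts[k] + alpha)  # Dirichlet prior weight
        for l in range(L):
            lp += beta_binom_delta(block_ones[k, l], block_tot[k, l],
                                   d_ones[l], d_tot[l])
        logp[k] = lp
    p = np.exp(logp - logp.max())  # normalize in log space for stability
    return int(rng.choice(K, p=p / p.sum()))
```

Column labels w_j are updated symmetrically with rows and columns swapped; the birth-death / split-merge moves on K and L then reuse the same collapsed-likelihood terms inside a Metropolis–Hastings acceptance ratio.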

Because label switching is inherent in mixture‑type models, the authors apply a post‑processing alignment (minimum‑distance permutation) to the MCMC output, ensuring interpretable cluster labels. Convergence diagnostics are performed with multiple chains and Gelman‑Rubin statistics.
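A minimum-distance relabelling of this kind can be sketched by searching over permutations of the cluster labels for the one that best matches a reference draw (a brute-force version, practical for the small K typical here; the function name is illustrative):

```python
from itertools import permutations
import numpy as np

def align_labels(z_ref, z_sample, K):
    """Relabel z_sample by the permutation of {0..K-1} that maximizes
    agreement with the reference labelling z_ref, undoing label switching."""
    best_perm, best_match = None, -1
    for perm in permutations(range(K)):
        relabelled = np.array([perm[z] for z in z_sample])
        match = int((relabelled == z_ref).sum())
        if match > best_match:
            best_match, best_perm = match, perm
    return np.array([best_perm[z] for z in z_sample])
```

For larger K, the same matching can be solved in polynomial time as a linear assignment problem rather than by enumerating all K! permutations.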

The computational cost per iteration scales as O(nK + mL), where n and m are the numbers of rows and columns, respectively, and only the block sufficient statistics need to be stored, yielding an overall memory footprint of O(KL). This makes the method feasible for moderately large matrices.
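The O(KL) memory footprint follows because a reassignment only touches the sufficient statistics of the blocks a row leaves and joins, so the data matrix never needs to be rescanned. A minimal sketch of such an incremental update (names are illustrative):

```python
import numpy as np

def move_row(block_ones, block_tot, x_row, w, k_old, k_new):
    """Reassign one row from row cluster k_old to k_new by updating only the
    2L affected block sufficient statistics in place; the full data matrix
    is never revisited, so storage stays O(KL)."""
    for l in range(block_ones.shape[1]):
        sel = (w == l)
        ones, tot = x_row[sel].sum(), sel.sum()
        block_ones[k_old, l] -= ones
        block_tot[k_old, l] -= tot
        block_ones[k_new, l] += ones
        block_tot[k_new, l] += tot
```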

Experimental validation includes two simulation studies. In the binary case, data are generated from known K and L (ranging from 3 to 5) with varying signal‑to‑noise ratios. The collapsed Bayesian LBM accurately recovers both the true number of clusters and the allocation, achieving higher Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) than a conventional LBM with fixed K/L. In the Gaussian case, the method remains robust when block means are close together, avoiding over‑splitting that plagues information‑criterion‑based selection.

Real‑world applications are demonstrated on (1) a gene‑expression matrix (thousands of genes × hundreds of samples) and (2) a congressional voting matrix (legislators × bills). In the gene‑expression data, the model discovers biologically meaningful gene modules and sample groups without pre‑specifying the number of modules, outperforming the standard LBM in predictive log‑likelihood and interpretability. In the voting data, the inferred blocks correspond closely to party affiliation and ideological blocs, again with fewer clusters than required by fixed‑K approaches.

The main contributions are:

  1. A collapsed Bayesian formulation of the LBM that eliminates the need to sample block‑specific parameters.
  2. A fully Bayesian treatment of the numbers of row and column clusters, enabling data‑driven model complexity selection within the MCMC framework.
  3. An efficient implementation based on block sufficient statistics, allowing the method to scale to realistic data sizes.
  4. Empirical evidence that the approach yields more accurate cluster number estimates and higher clustering quality than traditional LBM variants.

Limitations include sensitivity to hyper‑parameter choices for the priors on K, L, and the Dirichlet concentration parameters, and the computational burden of MCMC for very large matrices (e.g., >10⁵ rows or columns). The authors suggest future work on variational approximations, sparse‑matrix extensions, and automatic hyper‑parameter tuning to further improve scalability and robustness.