Multidimensional counting grids: Inferring word order from disordered bags of words

Models of bags of words typically assume topic mixing, so that the words in a single bag come from a limited number of topics. We show here that many bag-of-words datasets exhibit a pattern of variation very different from the patterns efficiently captured by topic mixing. In many cases, from one bag of words to the next, words disappear and new ones appear, as if the theme shifted slowly and smoothly across documents (provided that the documents are somehow ordered). Latent structures that describe such an ordering are easily imagined. For example, as the dates of news stories advance, the theme of the day changes smoothly: certain evolving stories fall out of favor while new events create new ones. Overlaps among the stories of consecutive days can be modeled with windows over linearly arranged tight distributions over words. We show here that such a strategy can be extended to multiple dimensions, and to cases where the ordering of the data is not readily apparent. We demonstrate that this way of modeling covariation in word occurrences outperforms standard topic models in classification and prediction tasks in biology, text modeling, and computer vision.


💡 Research Summary

The paper introduces a novel probabilistic framework called the counting grid (CG) to capture smooth, sequential variations in word usage that are not well modeled by conventional bag‑of‑words topic models such as LDA. Traditional topic models assume that each document is generated from a mixture of a small number of static topics, treating words as exchangeable within a document. However, many real‑world corpora exhibit a gradual drift: as one document follows another, certain words fade out while new ones appear, reflecting an underlying temporal or thematic progression. The authors model this phenomenon by arranging the entire vocabulary on a discrete grid (typically two‑dimensional) and assigning a multinomial word distribution to each grid cell. A document is generated by selecting a contiguous window of fixed size on the grid and sampling words from the cells covered by that window. Adjacent documents share overlapping windows, which naturally induces smooth transitions in word composition without explicitly encoding document order.
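The generative process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the grid extent, window size, and vocabulary size below are arbitrary placeholders, and the window's word distribution is taken to be the average of the cell distributions it covers (with toroidal wrap-around).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- chosen for illustration, not from the paper.
E = (16, 16)   # grid extent (rows, cols)
W = (4, 4)     # window size
V = 50         # vocabulary size

# pi[i, j] is the multinomial over the vocabulary attached to grid cell (i, j).
pi = rng.dirichlet(np.ones(V), size=E)          # shape (16, 16, V)

def sample_document(k, n_words=100):
    """Sample a bag of words from the window whose corner cell is k.

    The window's word distribution is the average of the cell
    distributions it covers; indices wrap around the grid (torus).
    """
    rows = [(k[0] + a) % E[0] for a in range(W[0])]
    cols = [(k[1] + b) % E[1] for b in range(W[1])]
    h = pi[np.ix_(rows, cols)].mean(axis=(0, 1))  # averaged distribution
    return rng.multinomial(n_words, h)            # word-count vector

doc = sample_document((3, 7))
print(doc.sum())  # 100: the sampled bag contains n_words tokens
```

Because windows at nearby grid positions share most of their cells, two documents generated from adjacent windows draw from almost the same averaged distribution, which is exactly the smooth drift the model is designed to capture.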

Learning is performed via variational Bayesian inference. The latent variables are the word‑distribution parameters for each cell (θ) and the discrete window location for each document (z). A variational distribution q(θ)q(z) is introduced, and the evidence lower bound (ELBO) is maximized. The window location is treated as a categorical variable; its update involves computing expected counts of words contributed by each possible window position, allowing the model to infer an implicit ordering even when the original data are unordered. Dirichlet priors on θ encourage sparsity and prevent over‑fitting.
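The core of the E-step for a single document can be sketched as follows. This is a hedged simplification: it assumes the per-window word distributions h (one row per candidate window position k) have already been assembled from the cell parameters, and it computes the categorical posterior q(z = k) ∝ p(doc | k) p(k) in log space for numerical stability.

```python
import numpy as np

def window_posterior(counts, h, prior=None):
    """Posterior over window locations for one bag of words.

    counts : (V,) word-count vector for the document
    h      : (K, V) word distribution of each candidate window
    prior  : optional (K,) prior over window locations (uniform if None)
    """
    # log p(doc | k) up to a constant: sum_v counts[v] * log h[k, v]
    log_lik = counts @ np.log(h + 1e-12).T
    if prior is not None:
        log_lik += np.log(prior)
    log_lik -= log_lik.max()          # subtract max before exponentiating
    q = np.exp(log_lik)
    return q / q.sum()                # normalized categorical posterior
```

In a full variational EM loop, these per-document posteriors would weight the expected word counts assigned to each grid cell, after which the cell distributions are re-estimated under their Dirichlet priors.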

A major contribution is the extension to multidimensional counting grids. While a one‑dimensional grid captures a single ordering dimension (e.g., time), many datasets involve multiple interacting factors such as time × topic, space × time, or experimental condition × developmental stage. By embedding the vocabulary in a d‑dimensional toroidal lattice and defining a window with size (w₁,…,w_d) along each axis, the model can capture several smooth variations simultaneously. Documents are then generated from the d‑dimensional window, and inference proceeds analogously, with the variational updates now summing over all possible positions in the multidimensional lattice.
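The only change the multidimensional case requires in the generative sketch is that the window average runs over a d-dimensional toroidal block rather than a 2-D patch. A minimal helper, assuming the per-cell distributions are stored in an array whose last axis is the vocabulary:

```python
import numpy as np

def window_mean(pi, k, w):
    """Average the cell distributions in a d-dimensional toroidal window.

    pi : array of shape (E_1, ..., E_d, V), per-cell word distributions
    k  : length-d window corner
    w  : length-d window sizes (w_1, ..., w_d)
    """
    d = len(w)
    # Build wrap-around index ranges along each lattice axis.
    idx = np.ix_(*[np.arange(ki, ki + wi) % Ei
                   for ki, wi, Ei in zip(k, w, pi.shape[:d])])
    # Average over all d window axes, leaving the vocabulary axis.
    return pi[idx].mean(axis=tuple(range(d)))
```

Since every cell distribution sums to one, so does their average, so the returned vector can be used directly as the window's multinomial parameter.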

The authors evaluate the approach on three distinct domains. In a news‑article corpus ordered by publication date, the CG uncovers a coherent daily thematic drift, achieving higher log‑likelihood and classification accuracy than LDA, Correlated Topic Model, and Hierarchical Dirichlet Process. In a time‑course gene‑expression dataset, a two‑dimensional grid (time × experimental condition) reveals biologically meaningful clusters and outperforms standard clustering and factor analysis methods in predicting future expression states. In a computer‑vision experiment, image patches are mapped onto a 2‑D spatial grid; the resulting CG features improve object‑recognition performance over traditional Bag‑of‑Visual‑Words pipelines. Across all experiments, careful tuning of window size and grid resolution demonstrates a trade‑off between capturing fine‑grained local changes and maintaining generalization.

The paper concludes that counting grids provide a principled way to recover latent order from unordered bags of words, bridging the gap between exchangeable bag‑of‑words models and sequential models such as hidden Markov models. The multidimensional extension offers a flexible, domain‑agnostic framework for modeling complex, intertwined variations in textual, biological, and visual data. Future work may explore hierarchical counting grids, integration with neural embeddings, and applications to streaming data where the ordering is only partially observed.