Discovering Structure in High-Dimensional Data Through Correlation Explanation
We introduce a method to learn a hierarchy of successively more abstract representations of complex data based on optimizing an information-theoretic objective. Intuitively, the optimization searches for a set of latent factors that best explain the correlations in the data as measured by multivariate mutual information. The method is unsupervised, requires no model assumptions, and scales linearly with the number of variables, which makes it an attractive approach for very high-dimensional systems. We demonstrate that Correlation Explanation (CorEx) automatically discovers meaningful structure for data from diverse sources including personality tests, DNA, and human language.
💡 Research Summary
The paper introduces Correlation Explanation (CorEx), an unsupervised method for discovering hierarchical latent representations in high‑dimensional data by optimizing an information‑theoretic objective. The core idea is to use total correlation (TC), also known as multivariate mutual information, which quantifies the amount of dependence among a set of variables. TC is defined as the sum of individual entropies minus the joint entropy; it is zero if and only if the variables are independent.
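The definition above is easy to check empirically. The following sketch estimates TC from samples by plugging empirical entropies into TC(X) = Σ H(X_i) − H(X); the data and function names are illustrative, not from the paper's code.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical Shannon entropy (in bits) of a sequence of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def total_correlation(X):
    """TC(X) = sum_i H(X_i) - H(X_1, ..., X_n) over the columns of a 2-D array."""
    joint = [tuple(row) for row in X]
    return sum(entropy(X[:, i]) for i in range(X.shape[1])) - entropy(joint)

rng = np.random.default_rng(0)
coin = rng.integers(0, 2, size=1000)
# Two copies of one fair coin: H(X1) = H(X2) = H(joint) = 1 bit, so TC = 1 bit.
X_dep = np.column_stack([coin, coin])
# Two independent coins: TC should be near zero (up to finite-sample bias).
X_ind = np.column_stack([coin, rng.integers(0, 2, size=1000)])
print(total_correlation(X_dep))  # close to 1.0
print(total_correlation(X_ind))  # close to 0.0
```

The dependent pair carries one full bit of redundancy, while the independent pair carries essentially none, matching the "zero if and only if independent" property.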
CorEx seeks latent factors Y that, when conditioned upon, minimize the total correlation of the observed variables X. The reduction in TC, denoted TC(X;Y)=TC(X)−TC(X|Y), measures how well Y explains the dependencies in X. By allowing Y to be a discrete variable with k states and optimizing over all conditional distributions p(y|x), the method maximizes TC(X;Y). Direct optimization would be intractable because it would require estimating an exponential number of parameters, but the authors overcome this by introducing multiple latent factors Y₁,…,Y_m and assigning each observed variable X_i to exactly one group G_j associated with a single latent factor. This restriction yields a tractable objective (Equation 4) that can be expressed as a sum of mutual informations between each X_i and its assigned Y_j minus the mutual information among the Y_j themselves.
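The quantities in this paragraph can be written compactly. This is a sketch assembled from the definitions stated above; the notation $X_{G_j}$ for the variables assigned to group $j$ is ours:

```latex
\begin{aligned}
TC(X) &= \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n), \\
TC(X; Y) &\equiv TC(X) - TC(X \mid Y), \\
&\max_{\{G_j\},\; p(y_j \mid x_{G_j})} \ \sum_{j=1}^{m} TC\!\left(X_{G_j}; Y_j\right).
\end{aligned}
```

The last line expresses the grouped restriction: each latent factor $Y_j$ only has to explain the dependence within its own group, which is what makes the objective tractable.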
The implementation rewrites the problem using binary assignment variables α_{i,j} indicating whether X_i belongs to group j. For a fixed α, the optimal conditional distribution p(y_j|x) has a closed‑form solution (Equations 7‑8) that depends only on marginal probabilities, making the number of parameters linear in the number of observed variables. The α matrix is updated iteratively with a soft‑max rule (Equation 9) controlled by a learning rate λ and a temperature‑like parameter γ. The algorithm alternates between updating α and the marginals until convergence; its computational complexity scales linearly with the number of variables and can be further reduced by minibatching.
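The α update described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the mutual-information matrix is assumed to be precomputed from the current marginals, and the row normalization of the soft-max target is our assumption.

```python
import numpy as np

def update_alpha(alpha, mi, lam=0.3, gamma=10.0):
    """One soft-max assignment step (illustrative sketch, not the paper's code).

    alpha : (n_vars, m) soft assignments of observed variables to latent factors
    mi    : (n_vars, m) current estimates of I(X_i; Y_j)
    lam   : learning rate (lambda); gamma plays the role of an inverse temperature.
    """
    # For each variable i, favor the factor j with the largest I(X_i; Y_j);
    # subtracting the row max keeps the exponentials numerically stable.
    target = np.exp(gamma * (mi - mi.max(axis=1, keepdims=True)))
    target /= target.sum(axis=1, keepdims=True)  # row-normalize (our assumption)
    return (1 - lam) * alpha + lam * target

# Toy example: 4 observed variables, 2 latent factors.
mi = np.array([[0.9, 0.1],
               [0.8, 0.2],
               [0.1, 0.7],
               [0.0, 0.6]])
alpha = np.full((4, 2), 0.5)   # start with uniform soft assignments
for _ in range(50):
    alpha = update_alpha(alpha, mi)
print(alpha.round(2))  # rows converge toward near-one-hot assignments
```

With a large γ the update behaves like a softened arg-max, so each variable drifts toward the single factor that shares the most information with it, recovering the hard grouping described in the previous paragraph.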
The authors evaluate CorEx on synthetic data, personality‑survey responses, DNA genotype data, and natural‑language text. In synthetic experiments based on a latent tree model, CorEx perfectly recovers the underlying clusters and latent factors, outperforming spectral clustering, ICA, NMF, and other baselines, especially as dimensionality grows. On a 5‑point Likert personality questionnaire (≈5,000 respondents), CorEx automatically discovers five clusters that exactly match the “Big Five” personality traits, a rare case of perfect unsupervised recovery of ground truth. In genomic data, the method isolates latent variables that predict gender, geographic ancestry, and ethnicity with near‑perfect accuracy. Applied to text corpora, CorEx yields both stylistic features and a hierarchical topic structure, demonstrating its ability to capture multiple levels of abstraction.
Theoretical discussion connects TC reduction to redundant information, local Markov properties, and Bayesian network structure, showing that the CorEx objective provides a lower bound on the total correlation of the data. Limitations include the need to pre‑specify the number of latent factors (m) and their cardinality (k), and the non‑convex nature of the optimization, which only guarantees local optima. Future work suggested includes automatic model selection, extensions to continuous variables, integration with Bayesian structure learning, and scaling to massive streaming datasets.
In summary, CorEx offers a principled, scalable, and model‑free framework for extracting meaningful latent structure from high‑dimensional discrete data, outperforming existing unsupervised methods across diverse domains while providing interpretable hierarchical representations.