Learning by Analogy: A Causal Framework for Composition Generalization
Compositional generalization – the ability to understand and generate novel combinations of learned concepts – enables models to extend their capabilities beyond limited experiences. Despite its importance, the data structures and principles that enable this capability remain poorly understood. We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, much as humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice. In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes. We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms. Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions such as additive effects. Critically, we also prove that this latent hierarchical structure is recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.
💡 Research Summary
The paper tackles the long‑standing problem of compositional generalization – the ability of a model to produce novel combinations of learned concepts that were never seen together during training. The authors argue that this ability hinges on two cognitive operations that humans routinely perform: (1) decomposing high‑level concepts into a set of low‑level, modular components, and (2) recombining those components across contexts while changing as little as possible. They formalize these operations using the language of causal inference, introducing two principles: causal modularity (or invariant mechanisms) and the minimal‑change principle.
To embody these principles, the authors propose a hierarchical latent variable generative model. The observable variables are an image x and a discrete text description d that encodes which high‑level concepts are present (e.g., “peacock”, “rice”). The first latent layer z₁ is conditioned directly on d; each entry z₁,i represents a low‑level concept associated with a particular high‑level token. Subsequent layers z₂,…,z_L are generated recursively: each variable v in the hierarchy is a (potentially non‑linear, non‑parametric) function g_v of its parents Pa(v) and independent noise ε_v. This yields the compact formulation (1) and the graphical model G shown in Figure 3. Crucially, the functions g_v are shared modules; the same module can be a child of multiple parents, allowing a “beak & rice” interaction to be reused for both “chicken eating rice” and “peacock eating rice”.
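The generative process described here can be sketched in a toy form. Everything in this sketch is an illustrative assumption – the concept embeddings, the graph layout, the stand-in mechanism `g_module`, and the Gaussian noise model are not the paper's implementation; the real g_v are learned, non-parametric functions.

```python
import random

# Toy sketch of the hierarchical data-generating process (assumed form):
# z1 is conditioned on the discrete description d; each deeper layer is
# produced by shared modules g_v applied to parent values plus noise.

def g_module(parent_values, noise):
    """Stand-in for a non-parametric mechanism g_v (a fixed non-linear
    mixing here; in the paper these modules are learned and shared)."""
    return sum(p * p for p in parent_values) * 0.1 + noise

def sample_hierarchy(d, embed, graph, num_layers=3, seed=0):
    """d: tuple of concept tokens, e.g. ('peacock', 'rice').
    embed: maps each token to a scalar low-level code z1_i.
    graph: graph[layer][i] lists parent indices in the previous layer."""
    rng = random.Random(seed)
    layers = [[embed[tok] for tok in d]]            # z1 conditioned on d
    for ell in range(1, num_layers):
        prev = layers[-1]
        layer = []
        for parents in graph[ell]:
            eps = rng.gauss(0.0, 0.01)              # independent noise eps_v
            layer.append(g_module([prev[p] for p in parents], eps))
        layers.append(layer)
    return layers                                   # z1, ..., zL

# Hypothetical embeddings and parent structure for illustration only.
embed = {'peacock': 1.0, 'chicken': 0.8, 'rice': 0.3}
graph = [None, [[0], [0, 1], [1]], [[0, 1], [1, 2]]]
zs = sample_hierarchy(('peacock', 'rice'), embed, graph)
```

Note how the same module (e.g. the one fed by indices `[0, 1]`) would be reused unchanged if `'peacock'` were swapped for `'chicken'` – only its inputs change, which is the modularity the summary describes.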
Theoretical contributions are twofold. First, Theorem 3.1 gives a necessary and sufficient condition for a new concept combination d to be composable (i.e., to lie in the compositional space Ω_comp). The condition states that for every latent variable z, the support of its parent distribution under d must be contained in the support of the parent distribution under some training combination d̃ that lies in the training support Ω_supp. In plain language, every low-level module needed for the new combination must have been observed in at least one training example – possibly a different example for each module. This captures the essence of analogical reasoning: learn the "beak & rice" module from one example, the "colorful tail" module from another, then compose them. The theorem also highlights the role of graph sparsity: a sparser parent set Pa(z) imposes fewer constraints, raising the chance that all required parents have been seen and directly linking sparsity to compositional power.
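A discrete toy version of this support-inclusion condition can be written as a simple check. The module names and value encodings below are hypothetical; the point is only that each module's required parent configuration may be matched by a *different* training combination.

```python
# Illustrative check of the support-inclusion condition (discrete toy
# version): a new combination d is composable iff, for every latent
# module, its required parent configuration appears in the parent
# support of SOME training combination -- possibly a different one
# per module.

def is_composable(parent_config, training_supports):
    """parent_config: {module: parent_value} required by the new d.
    training_supports: list of {module: set(parent_values)} observed
    for each training combination. Returns True iff every module's
    value was seen under at least one training combination."""
    return all(
        any(value in supp.get(module, set()) for supp in training_supports)
        for module, value in parent_config.items()
    )

# Hypothetical supports: "beak & rice" was seen with a chicken,
# "colorful tail" was seen with a peacock eating seed.
train = [
    {'beak_food': {('beak', 'rice')}, 'tail': {'plain'}},     # chicken + rice
    {'beak_food': {('beak', 'seed')}, 'tail': {'colorful'}},  # peacock + seed
]
new_d = {'beak_food': ('beak', 'rice'), 'tail': 'colorful'}   # peacock + rice
```

Here `is_composable(new_d, train)` is `True`: each module borrows its support from a different training example, mirroring the "peacock eating rice" analogy.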
Second, the authors address identifiability: can the latent hierarchy (both the graph structure and the individual latent variables) be uniquely recovered from only observable text‑image pairs p(d,x)? They prove that under mild assumptions—non‑degenerate, invertible g_v functions, sufficient variability in parent configurations, and the absence of deterministic “colliders”—the true latent model is identifiable up to a smooth bijection and permutation of latent dimensions. Unlike prior work that required linearity or discrete latents, this result holds for fully non‑linear, continuous hierarchies, making it applicable to realistic image‑text data.
Empirically, the theory is instantiated in a diffusion‑based text‑to‑image generator. The diffusion timesteps are interpreted as hierarchical levels, and an explicit sparsity regularizer (combined ℓ₁/ℓ₂ penalty) is applied to the attention maps that correspond to the parent‑child relationships. This encourages the learned graph to be sparse, as suggested by the theory. Experiments on standard compositional benchmarks (e.g., CLEVR‑CoGenT, CUB‑200 with novel attribute‑object pairs) show substantial improvements in FID and CLIPScore over strong baselines that rely on additive or polynomial interaction models. Notably, the model successfully generates images of concept pairs never seen together during training, such as “peacock eating rice”, demonstrating genuine analogical composition. Visualizations of the learned latent modules reveal interpretable components (beak, wing, rice) that align with human intuition, confirming that the model has indeed discovered the intended modular structure.
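The combined ℓ₁/ℓ₂ penalty on attention maps can be sketched as follows. The exact weighting and grouping used in the paper are not given in this summary, so `alpha`, `beta`, and the row-wise grouping are assumptions; treating each child's attention row as one group gives the usual group-lasso effect of pruning whole parent-child edges.

```python
import math

# Sketch of a combined l1/l2 sparsity penalty on attention weights
# (assumed form): l1 drives individual weights to zero, while the
# row-wise l2 term encourages entire parent groups to drop out,
# yielding the sparse parent sets Pa(z) the theory favors.

def sparsity_penalty(attn_rows, alpha=1.0, beta=1.0):
    """attn_rows: rows of an attention map, one row per child latent,
    each row holding that child's weights over candidate parents."""
    l1 = sum(abs(w) for row in attn_rows for w in row)
    group_l2 = sum(math.sqrt(sum(w * w for w in row)) for row in attn_rows)
    return alpha * l1 + beta * group_l2

# Example: a nearly-sparse map is penalized less than a dense one.
sparse_attn = [[0.5, 0.0, 0.0], [0.0, 0.4, 0.0]]
dense_attn = [[0.3, 0.3, 0.3], [0.3, 0.3, 0.3]]
```

In training this term would simply be added to the diffusion loss, scaled by a hyperparameter; the sketch only shows the penalty itself.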
In summary, the paper makes four key contributions: (1) a principled causal formulation of analogy‑driven compositionality, (2) a hierarchical, non‑parametric latent variable model that captures modularity and minimal change, (3) rigorous theorems establishing when compositional generalization is possible and when the latent hierarchy is identifiable, and (4) a practical implementation that translates these insights into state‑of‑the‑art performance on compositional generation tasks. The work bridges cognitive insights, causal theory, and modern generative modeling, offering a clear roadmap for building models that can reason by analogy and generalize beyond their training distribution.