Multi-Modal Mean-Fields via Cardinality-Based Clamping
Mean Field inference is central to statistical physics. It has attracted much interest in the Computer Vision community as an efficient way to solve problems expressible in terms of large Conditional Random Fields. However, since it models the posterior probability distribution as a product of marginal probabilities, it may fail to properly account for important dependencies between variables. We therefore replace the fully factorized distribution of Mean Field by a weighted mixture of such distributions, which similarly minimizes the KL-Divergence to the true posterior. Two new ideas make this minimization efficient: conditioning on groups of variables instead of single ones, and selecting those groups using a parameter of the conditional random field potentials that we identify with the temperature in the sense of statistical physics. Our extension of the clamping method proposed in previous works allows us both to produce a more descriptive approximation of the true posterior and, inspired by the diverse-MAP paradigm, to fit a mixture of Mean Field approximations. We demonstrate that this positively impacts real-world algorithms that initially relied on Mean Field.
💡 Research Summary
The paper addresses a fundamental limitation of the standard Mean Field (MF) approximation when applied to large Conditional Random Fields (CRFs). MF replaces the true posterior distribution with a fully factorized product of marginal distributions, which makes inference tractable but forces the approximation to concentrate on a single mode. In many vision problems the true posterior is multimodal: distinct configurations can be equally probable yet mutually exclusive. The authors propose a Multi‑Modal Mean Field (MMMF) framework that represents the posterior as a weighted mixture of several fully factorized distributions, each capturing a different mode.
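The gap between a single factorized distribution and a mixture of them can be seen on a toy two-variable example (illustrative only, not taken from the paper):

```python
import numpy as np

# Toy bimodal posterior over two binary variables: all mass sits on the
# two mutually exclusive "agreeing" configurations (0,0) and (1,1).
P = np.array([[0.5, 0.0],
              [0.0, 0.5]])  # P[x1, x2]

# The fully factorized distribution matching the true marginals
# q_i(x_i) = 0.5 assigns probability 0.25 to *every* configuration,
# including the impossible ones (0,1) and (1,0).
q1 = P.sum(axis=1)       # marginal of x1 -> [0.5, 0.5]
q2 = P.sum(axis=0)       # marginal of x2 -> [0.5, 0.5]
Q_mf = np.outer(q1, q2)  # [[0.25, 0.25], [0.25, 0.25]]

# A two-component mixture of factorized distributions recovers P
# exactly: each component is a (degenerate) factorized point mass
# sitting on one mode.
Q_mix = 0.5 * np.outer([1, 0], [1, 0]) + 0.5 * np.outer([0, 1], [0, 1])

print(np.allclose(Q_mix, P))  # True
```

This is the failure mode MMMF targets: one component per mode, with mixture weights balancing their relative mass.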
The key technical contributions are twofold. First, instead of clamping a single variable at a time (as in prior work), the method clamps groups of variables simultaneously. The groups are selected by an entropy‑based criterion that operates on a family of “temperature‑scaled” versions of the original CRF. By multiplying the energy by a temperature parameter T, the posterior is smoothed; at different temperatures, variables that are ambiguous across modes exhibit higher marginal entropy. Those high‑entropy variables are grouped together. Second, the actual partition of the state space is performed by a cardinality‑based rule: for a chosen group G={i1,…,iL} and a set of target labels {v1,…,vL}, the algorithm creates two subsets – one where at least C of the variables in G take the specified labels, and the complementary subset where fewer than C do. This “cardinality clamping” can split the space dramatically with a single operation, which is essential for dense, highly connected CRFs.
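The group-selection step might be sketched as follows (a simplified toy, assuming the marginals come from unary potentials alone rather than a full MF fit on the temperature-scaled CRF; `marginal_entropies` and `select_group` are hypothetical helper names):

```python
import numpy as np

def marginal_entropies(unaries, T):
    # Temperature-scaled marginals: the energy is multiplied by T
    # (T < 1 smooths the posterior), then normalized per variable.
    logits = -T * unaries                        # shape (n_vars, n_labels)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(logits)
    q /= q.sum(axis=1, keepdims=True)
    return -(q * np.log(q + 1e-12)).sum(axis=1)  # entropy per variable

def select_group(unaries, T=0.5, group_size=3):
    # Variables whose temperature-scaled marginals are most ambiguous
    # (highest entropy) are the ones that vary across modes; they are
    # grouped together for cardinality clamping.
    H = marginal_entropies(unaries, T)
    return np.argsort(H)[-group_size:]

# Variables 1 and 3 have near-tied labels, hence high entropy.
unaries = np.array([[0.0, 5.0], [0.0, 0.1], [0.0, 4.0], [0.0, 0.05]])
group = select_group(unaries, T=0.5, group_size=2)
```

In the actual method, the entropies are read off MF marginals of the temperature-scaled CRF, including pairwise terms; the unary-only version above only illustrates the selection rule.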
Mathematically, the state space X is recursively partitioned into disjoint subsets Xk (k=1…K). For each subset a standard MF approximation Qk(x)=∏i qki(xi) is computed under the hard constraints imposed by the clamping. The overall approximation is QMM(x)=∑k mk Qk(x) where mk are mixture weights. The authors derive a tractable objective by minimizing KL(QMM‖P) under a “near‑disjointness” assumption (overlap ≤ ε). This leads to a closed‑form expression for mk (mk ∝ exp(Ak)) where Ak is the maximized expected log‑likelihood of Qk under the true energy, subject to the cardinality constraints. The optimization of each Ak is reduced to a series of unconstrained MF problems via Lagrangian duality. When the cardinality threshold C is near the extremes (0 or |G|) the constraint becomes a higher‑order pattern potential; for intermediate C the sum of indicator variables is approximated by a Gaussian, yielding a linear constraint that can be encoded as additional unary potentials.
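The two closed-form pieces, the mixture weights and the Gaussian treatment of the cardinality constraint, can be sketched numerically (an illustrative sketch; `mixture_weights` and `prob_at_least` are hypothetical names, and the paper encodes the Gaussian-approximated constraint as additional unary potentials rather than evaluating the probability directly):

```python
import math
import numpy as np

def mixture_weights(A):
    # m_k ∝ exp(A_k), where A_k is the maximized expected
    # log-likelihood of component k under the true energy.
    # Computed stably via the log-sum-exp trick.
    A = np.asarray(A, dtype=float)
    w = np.exp(A - A.max())
    return w / w.sum()

def prob_at_least(q, C):
    # Gaussian (moment-matching) approximation to P(sum_i b_i >= C)
    # for independent Bernoulli indicators b_i ~ q_i, the regime used
    # for intermediate cardinality thresholds C.  The 0.5 term is a
    # standard continuity correction.
    q = np.asarray(q, dtype=float)
    mu, var = q.sum(), (q * (1 - q)).sum()
    z = (C - 0.5 - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For example, two components with equal `A_k` receive weight 0.5 each, and `prob_at_least([0.5]*100, 50)` is close to the exact binomial tail.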
The algorithm builds a binary tree in a breadth‑first manner: starting from the whole space, it repeatedly (1) evaluates temperature‑scaled posteriors, (2) selects a high‑entropy variable group, (3) applies cardinality clamping to split the current node, and (4) runs MF on each child. The process stops when a predefined number of modes is reached or when further splits become insignificant. The leaf nodes constitute the final mixture components.
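The breadth-first construction above might be skeletonized like this (a structural sketch only; `run_mf` and `select_split` stand in for the MF solver under clamping constraints and the temperature/cardinality split selection, and are placeholders rather than the paper's API):

```python
from collections import deque

def build_mode_tree(root, run_mf, select_split, max_modes=4):
    """Breadth-first partitioning of the state space into modes.

    run_mf(node)          -> (Q, score): MF fit under node's constraints.
    select_split(node, Q) -> pair of child nodes (the two complementary
                             cardinality-clamped subsets), or None when
                             a further split would be insignificant.
    """
    leaves = [root]
    frontier = deque([root])
    while frontier and len(leaves) < max_modes:
        node = frontier.popleft()
        Q, _ = run_mf(node)
        split = select_split(node, Q)
        if split is None:
            continue                 # keep this node as a leaf
        leaves.remove(node)          # replace the node by its children
        for child in split:
            leaves.append(child)
            frontier.append(child)
    # Leaves are the final mixture components (MF re-run for clarity).
    return [run_mf(leaf) for leaf in leaves]

# Minimal stub usage: split the root once, then stop.
run_mf = lambda node: (node, 0.0)
select_split = lambda node, Q: (node + "0", node + "1") if node == "" else None
components = build_mode_tree("", run_mf, select_split, max_modes=4)
```

The stubs only exercise the control flow; in the real algorithm each node carries its accumulated cardinality constraints and the per-leaf `Q` and score feed the mixture weights.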
Empirical evaluation spans four representative vision tasks: binary image segmentation with strong pairwise attraction, stereo depth estimation, multi‑person tracking, and semantic segmentation. In each case MMMF discovers multiple plausible configurations, whereas standard MF either collapses them into a single dominant mode or fails to represent them at all. Quantitatively, MMMF improves accuracy, reduces energy gaps, and provides richer posterior information that can be exploited for downstream decisions (e.g., temporal consistency in tracking). Compared to the earlier single‑variable clamping approach of prior work, cardinality‑based clamping remains effective on dense, highly connected CRFs, where clamping one variable at a time barely partitions the state space.