Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs
We propose an extension of the Restricted Boltzmann Machine (RBM) that allows the joint shape and appearance of foreground objects in cluttered images to be modeled independently of the background. We present a learning scheme that learns this representation directly from cluttered images with only very weak supervision. The model generates plausible samples and performs foreground-background segmentation. We demonstrate that representing foreground objects independently of the background can be beneficial in recognition tasks.
💡 Research Summary
The paper introduces a novel extension of the Restricted Boltzmann Machine (RBM) designed to separate foreground objects from cluttered backgrounds using only weak supervision. The authors build upon their earlier Masked RBM (MRBM) framework, but unlike the original model where all image regions are treated as equivalent, the new formulation assumes a single dominant foreground object occluding a complex background. The observed image is modeled as a pixel‑wise binary mixture: a binary mask m determines whether a pixel is generated by the foreground appearance image v_F or by the background image v_B. The background is modeled with a Beta‑RBM that captures both mean and variance of continuous pixel values, while the foreground is modeled jointly by an RBM that has two visible layers – a binary layer for the mask and a continuous layer for appearance – coupled through a mixed energy function E_mixed = E_bin(m, h_S) + E_beta(v_F, h_A). This joint modeling allows shape (mask) and appearance to be statistically dependent, addressing a limitation of earlier work where they were assumed independent.
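The pixel-wise binary mixture described above can be sketched in NumPy. All shapes and values here are illustrative (random draws standing in for the RBMs' samples), not outputs of the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 16x16 patch, flattened to D pixels (matching the toy-data size).
D = 16 * 16

# Hypothetical latent variables, drawn at random purely for illustration:
m = rng.integers(0, 2, size=D)    # binary mask (1 = foreground pixel)
v_F = rng.random(D)               # foreground appearance image
v_B = rng.random(D)               # background appearance image

# Pixel-wise binary mixture: each observed pixel is copied from the
# foreground where the mask is on, and from the background elsewhere.
x = m * v_F + (1 - m) * v_B
```

The key point is that `m`, `v_F`, and `v_B` are all latent; only `x` is observed, so inference must recover a plausible decomposition.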
Inference is performed by Gibbs sampling in three steps: (1) sample hidden units of foreground and background RBMs given their respective visible layers; (2) sample the mask and the two latent images pixel‑wise conditioned on the hidden states and the observed image; (3) decompose the joint pixel distribution into a mask‑dependent selection of either the foreground or background value. Closed‑form expressions for the mask posterior (Eq. 4) and the conditional distributions of v_F and v_B (Eq. 5) are derived, enabling efficient block Gibbs updates.
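Step (2) can be illustrated with a simplified per-pixel mask update. This is a sketch of the general form of the mask posterior, not the paper's exact Eq. 4: the shape model's conditional prior log-odds on each mask bit are combined with the foreground and background likelihoods of the observed pixel. Gaussian likelihoods stand in here for the Beta-RBM conditionals:

```python
import numpy as np

def sample_mask(x, mu_F, mu_B, sigma_F, sigma_B, mask_logit, rng):
    """One pixel-wise Gibbs update of the mask (illustrative sketch).

    x:           observed pixel values
    mu_*, sigma_*: per-pixel means/std devs of the foreground (F) and
                 background (B) appearance models given their hidden units
    mask_logit:  prior log-odds on m_i = 1 from the shape RBM's hiddens
    """
    log_lik_F = -0.5 * ((x - mu_F) / sigma_F) ** 2 - np.log(sigma_F)
    log_lik_B = -0.5 * ((x - mu_B) / sigma_B) ** 2 - np.log(sigma_B)
    # Posterior log-odds that pixel i belongs to the foreground:
    logit = mask_logit + log_lik_F - log_lik_B
    p = 1.0 / (1.0 + np.exp(-logit))
    return (rng.random(x.shape) < p).astype(int), p
```

A pixel well explained by the foreground model and poorly by the background model gets a posterior foreground probability near 1, and vice versa.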
Learning follows an EM‑like procedure. Since only the observed image x is available, the algorithm alternates between (i) inferring the latent variables (v_F, v_B, m) using the current model parameters, and (ii) treating these inferred variables as observed data to update the RBM parameters via standard Contrastive Divergence (CD) or Stochastic Maximum Likelihood (SML). Purely unsupervised learning proved difficult, so the authors adopt a weakly supervised scheme: a background model is first trained on a large collection of natural image patches, providing a generic statistical prior for background pixels. During foreground learning, pixels that are poorly explained by the background model are treated as outliers; the foreground RBM then learns the regularities of these outliers across the training set, effectively discovering the foreground without any pixel‑level annotations. To improve robustness when the background model is insufficient, an “outlier component” (uniform distribution with probability p = 0.3) is added to each background pixel, allowing the sampler to fall back on a non‑informative prior when needed.
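The robustifying "outlier component" amounts to mixing the learned background density with a uniform density. A minimal sketch, where `bg_density` is a placeholder callable for the background model's per-pixel density (an assumption of this sketch, not an API from the paper):

```python
import numpy as np

P_OUTLIER = 0.3  # weight of the uniform outlier component (value from the paper)

def background_pixel_likelihood(x, bg_density):
    """Robustified background likelihood for a pixel value x in [0, 1]:
    a mixture of the learned background density and a uniform outlier
    component, so poorly explained pixels fall back on a non-informative
    prior instead of receiving near-zero likelihood."""
    uniform_density = 1.0  # density of Uniform(0, 1)
    return (1 - P_OUTLIER) * bg_density(x) + P_OUTLIER * uniform_density
```

When `bg_density(x)` is close to zero, the likelihood is floored at 0.3, which keeps such pixels available to be claimed by the foreground model during inference.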
The approach is evaluated on two datasets. The first is a synthetic toy set of 16 × 16 patches containing two classes of foreground objects (rectangles of varying sizes and four types of round shapes) placed on random natural background patches. The second is a down-scaled version of the Labeled Faces in the Wild-A (LFW-A) dataset, resized to 32 × 32 pixels. For the toy data, the model learns accurate shape masks and appearance textures; generated samples closely resemble the training objects, and segmentation on 5,000 test patches achieves 96% pixel-wise accuracy. For the face data, after initializing the foreground appearance RBM with a Beta-RBM trained on full face images (including background), the model learns meaningful variations in pose, hair style, and gender. Sampled faces display recognizable structure despite the limited number of hidden units. Segmentation results show that the inferred masks correctly capture facial regions in most test images; errors occur mainly for extreme head poses or poorly modeled background regions. Randomly permuting masks across test images dramatically degrades performance, confirming that the model is not simply learning a fixed spatial prior.
To assess whether foreground-background separation aids recognition, the authors conduct a simple classification task: distinguishing rectangle versus round shapes in the toy set using only the hidden activations of the foreground RBM (h_F). With abundant training data (100 examples per class) both the standard RBM and the foreground-background model achieve near-perfect accuracy. However, when the number of training examples is reduced, the foreground-background model retains substantially higher accuracy (e.g., 66% vs. 8% with 10 examples per class), demonstrating that isolating foreground features makes the representation more robust to background clutter.
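The feature-extraction step of this experiment is just the RBM's mean-field hidden activations. A toy sketch, with a nearest-centroid rule standing in for the classifier (the paper does not commit to this particular classifier; it is an assumption here):

```python
import numpy as np

def hidden_activations(v, W, b):
    """Mean-field hidden activations of an RBM: sigmoid(W v + b).
    With v = v_F, these are the foreground features h_F used as the
    classifier's input."""
    return 1.0 / (1.0 + np.exp(-(W @ v + b)))

def nearest_centroid_predict(feats, centroids):
    """Toy stand-in classifier (hypothetical choice): assign each feature
    vector to the class whose centroid is nearest in Euclidean distance."""
    d = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```

Because h_F is computed from the inferred foreground alone, background clutter never enters the features, which is the mechanism behind the robustness reported above.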
In summary, the paper presents a principled probabilistic framework that jointly models foreground shape and appearance while treating the background as a separate generative process. By leveraging a weakly supervised learning scheme—training a generic background model first and then learning foreground as outliers—the method achieves effective foreground‑background segmentation, realistic generative sampling, and improved feature robustness for downstream tasks. The work bridges a gap between deep generative models and classic layered representations in computer vision, offering a scalable approach for learning object‑centric representations from largely unlabeled data.