Learning Active Basis Models by EM-Type Algorithms
EM algorithm is a convenient tool for maximum likelihood model fitting when the data are incomplete or when there are latent variables or hidden states. In this review article we explain that EM algorithm is a natural computational scheme for learning image templates of object categories where the learning is not fully supervised. We represent an image template by an active basis model, which is a linear composition of a selected set of localized, elongated and oriented wavelet elements that are allowed to slightly perturb their locations and orientations to account for the deformations of object shapes. The model can be easily learned when the objects in the training images are of the same pose, and appear at the same location and scale. This is often called supervised learning. In the situation where the objects may appear at different unknown locations, orientations and scales in the training images, we have to incorporate the unknown locations, orientations and scales as latent variables into the image generation process, and learn the template by EM-type algorithms. The E-step imputes the unknown locations, orientations and scales based on the currently learned template. This step can be considered self-supervision, which involves using the current template to recognize the objects in the training images. The M-step then relearns the template based on the imputed locations, orientations and scales, and this is essentially the same as supervised learning. So the EM learning process iterates between recognition and supervised learning. We illustrate this scheme by several experiments.
💡 Research Summary
The paper presents a statistical framework for learning deformable object templates in images by combining an “active basis” representation with EM‑type algorithms. An active basis model expresses a template as a linear superposition of a small set of Gabor‑like wavelet elements selected from a large dictionary. Each element is allowed to undergo slight perturbations in position, orientation, and scale, thereby accommodating shape deformations. In a fully supervised setting—where all training images are already aligned (same pose, location, scale)—the template can be learned directly by a “shared sketch” algorithm, which sequentially selects wavelet elements that are shared across all images while estimating sparse coefficients for each image.
The core contribution lies in extending this learning to the more realistic weakly supervised scenario where objects appear at unknown locations, orientations, and scales. The authors treat these unknown transformations as latent variables in a generative model of the images and apply an EM‑type iterative scheme:
-
E‑step (self‑supervision): Using the current template as a detector, the algorithm searches each training image for the transformation (translation, rotation, scaling) that yields the highest matching score. This step imputes the missing alignment information, effectively providing a “complete‑data” set for the next step.
-
M‑step (supervised re‑learning): With the images now aligned according to the E‑step estimates, the shared sketch algorithm is run again. Crucially, the M‑step does not only update the coefficients of the selected wavelets; it also re‑selects the wavelet elements themselves, thereby optimizing both model structure and parameters. The complete‑data log‑likelihood is guaranteed to increase at each iteration, ensuring convergence.
The paper emphasizes that this EM formulation differs from the classic EM where the model structure is fixed. Here, the structure (the set of active basis elements) is part of the optimization, reflecting the variable‑selection nature of wavelet regression. Computationally, the authors employ matching pursuit‑style greedy selection and parallel processing to handle the huge dictionary of wavelets across multiple scales and orientations.
Experimental results on several object categories (e.g., deer, birds, bicycles) demonstrate that the EM‑type learning achieves performance close to fully supervised training, with only modest drops in detection accuracy (typically 2–3 %). The method remains robust when objects undergo substantial deformation or appear against cluttered backgrounds. Evaluation metrics include detection precision/recall, template reconstruction error, and likelihood improvement across EM iterations.
Key insights and contributions include:
-
Statistical formulation of active basis: By viewing wavelet elements as predictors in a sparse linear regression, the authors connect computer‑vision template modeling with well‑studied statistical concepts such as variable selection, Bayesian coefficient estimation, and likelihood maximization.
-
Latent‑variable EM for alignment: Treating translation, rotation, and scale as hidden variables enables a principled self‑supervised alignment step, turning the detection problem into the E‑step of EM.
-
Joint structure‑parameter learning: The M‑step simultaneously selects wavelet elements and estimates their coefficient distributions, extending EM beyond parameter estimation to model‑structure learning.
-
Biological relevance: The active basis elements resemble V1 simple‑cell receptive fields, providing a bridge between statistical modeling and neuro‑biological vision theories.
The authors conclude that the EM‑type framework offers a clean, interpretable, and statistically sound approach to learning deformable object templates without full supervision. Future directions suggested include handling more complex, non‑rigid deformations, extending to multi‑object scenes, and integrating deep‑learning based priors for the wavelet dictionary. Overall, the paper makes a compelling case for revisiting classic EM methods in modern vision tasks, especially when combined with sparse, biologically motivated representations.
Comments & Academic Discussion
Loading comments...
Leave a Comment