Learning Feature Hierarchies with Centered Deep Boltzmann Machines

Deep Boltzmann machines are in principle powerful models for extracting the hierarchical structure of data. Unfortunately, attempts to train layers jointly (without greedy layer-wise pretraining) have been largely unsuccessful. We propose a modification of the learning algorithm that initially recenters the output of the activation functions to zero. This modification leads to a better conditioned Hessian and thus makes learning easier. We test the algorithm on real data and demonstrate that our suggestion, the centered deep Boltzmann machine, learns a hierarchy of increasingly abstract representations and a better generative model of data.

💡 Research Summary

Deep Boltzmann Machines (DBMs) are powerful probabilistic models capable of capturing hierarchical structure in data, but training all layers jointly has historically been difficult. The main obstacle is that the hidden unit activations in each layer tend to develop large, non‑zero means during learning. This bias leads to poorly conditioned gradients: the Hessian of the log‑likelihood becomes ill‑conditioned, causing either vanishing or exploding updates and making joint optimization unstable. Consequently, most prior work resorted to greedy, layer‑wise pre‑training or elaborate learning‑rate schedules to mitigate the problem, sacrificing the theoretical advantage of a fully joint training procedure.

The authors propose a simple yet effective remedy: centering the activations. For each layer they subtract an estimate of the mean activation, μ, from the pre‑activation input and also shift the output so that the expected value of the activation function is approximately zero. In practice the activation σ(v) is replaced by σ(v − μ) − σ(−μ). This transformation introduces a linear correction term into the energy function but does not significantly increase computational cost.

Mathematically, the centered energy can be written as the original energy plus a term linear in the visible and hidden units that depends on μ. When the gradient of the log‑likelihood is derived with respect to the weights, the additional term exactly cancels the contribution of the non‑zero means, yielding gradients that are unbiased with respect to the centered variables. More importantly, the second‑order derivative (the Hessian) becomes much better conditioned: the eigenvalue spectrum contracts around zero, reducing the condition number. This improvement is especially pronounced for deeper layers, where the mean bias would otherwise be amplified.

Training proceeds exactly as in a standard DBM: after each mini‑batch the current estimate of μ for each layer is computed (typically as a running average of the hidden means under the data distribution), the activations are centered, and contrastive divergence or a variational lower‑bound method is applied to update the weights. The only extra bookkeeping is the maintenance of the μ values and the inclusion of the linear correction in the parameter update.

The authors evaluate the Centered Deep Boltzmann Machine (CDBM) on two benchmark datasets: MNIST handwritten digits and Caltech‑101 object images. Using identical network architectures, hyper‑parameters, and training schedules for both the baseline DBM and the CDBM, they report several quantitative and qualitative improvements.

Convergence speed – CDBM reaches a stable log‑likelihood plateau in roughly half the number of epochs required by the uncentered DBM.
Log‑likelihood – On MNIST the centered model achieves a log‑likelihood about 5 % higher than the baseline; on Caltech‑101 the gain is roughly 8 %.
Sample quality – Visual inspection and a human evaluation study show that samples generated from CDBM are sharper and more coherent, with higher precision/recall scores.
Hierarchical features – Filters learned in the first hidden layer resemble edge detectors and color blobs, the second layer captures stroke fragments or object parts, and the top layer encodes whole digit shapes or object silhouettes. This progressive abstraction aligns with the theoretical promise of DBMs and is more pronounced in the centered model, indicating that the centering operation facilitates the emergence of meaningful high‑level representations.

The paper also discusses the broader implications of centering. Because the technique only modifies the activation statistics, it can be applied to other deep probabilistic models such as Deep Belief Networks, variational auto‑encoders, or even to deterministic deep networks where mean‑shifted activations could improve optimization. The authors suggest future extensions, including learning μ as a set of additional parameters, applying multi‑centered schemes for multimodal data, or integrating centering with modern regularization methods like batch normalization.

In conclusion, the study demonstrates that a modest preprocessing step—re‑centering hidden activations to zero—dramatically improves the conditioning of the DBM learning problem, enabling stable joint training without the need for greedy pre‑training. The resulting Centered Deep Boltzmann Machine not only learns faster but also yields a better generative model and more interpretable hierarchical features. This work therefore provides a practical pathway toward fully end‑to‑end trained deep energy‑based models and opens new research directions for improving the trainability of complex probabilistic architectures.