A Generalized Information Bottleneck Theory of Deep Learning

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

The Information Bottleneck (IB) principle offers a compelling theoretical framework for understanding how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a *Generalized Information Bottleneck (GIB)* framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through the joint processing of features. We provide theoretical and empirical evidence that synergistic functions generalize better than their non-synergistic counterparts. Building on these foundations, we reformulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We show that the original IB objective is upper-bounded by our GIB under perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experiments demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with *ReLU* activations, where the standard IB fails), yields interpretable dynamics in both CNNs and Transformers, and aligns more closely with our understanding of adversarial robustness.


💡 Research Summary

The paper revisits the Information Bottleneck (IB) principle, which frames deep learning as a trade‑off between predictive accuracy (I(T;Y)) and representation complexity (I(X;T)). While IB has offered valuable insights, its practical applicability is hampered by three major issues: (1) the difficulty of accurately estimating mutual information in high‑dimensional, deterministic networks; (2) the theoretical problem that the complexity term can become infinite or constant for ReLU‑based models, preventing observable compression; and (3) the dependence of compression phases on the choice of activation function, which conflicts with empirical evidence that ReLU networks still generalize well without compression.
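For reference, the trade-off described above is conventionally written as the IB Lagrangian, minimized over stochastic encodings p(t|x); this is the standard formulation from the IB literature, not an equation taken from this paper:

```latex
\min_{p(t\mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X;T) \;-\; \beta\, I(T;Y)
```

Here β ≥ 0 sets the balance: small β favors compression of X into T, large β favors preserving information about Y.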

To address these gaps, the authors introduce a Generalized Information Bottleneck (GIB) that incorporates the concept of synergy: information that is only available when multiple input features are processed jointly. They define a computable synergy measure by averaging the interaction information (II) of each feature X_i with the remaining features X_{−i}:
Syn(X→Y) = I(X;Y) – (1/N) Σ_i [ I(X_i;Y) + I(X_{−i};Y) ]
where X_{−i} denotes all features except X_i. This is exactly the average of the interaction informations II(X_i; X_{−i}; Y) = I(X;Y) – I(X_i;Y) – I(X_{−i};Y), which are positive when features are synergistic.
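To make the synergy measure concrete, here is a minimal sketch that evaluates it on the classic XOR example, where each feature alone carries no information about Y but the pair determines Y exactly. The plug-in entropy estimator and the completion of the truncated formula (averaging the interaction information II(X_i; X_{−i}; Y) over features) are our assumptions for illustration, not code from the paper:

```python
import numpy as np

def entropy(*cols):
    """Empirical Shannon entropy (bits) of the joint distribution of the given discrete columns."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mi(a, b):
    """Mutual information I(A;B) = H(A) + H(B) - H(A,B), in bits."""
    return entropy(a) + entropy(b) - entropy(a, b)

# Toy XOR dataset, enumerated exhaustively: Y = X1 XOR X2.
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
y = x1 ^ x2

features = [x1, x2]
rest = [x2, x1]            # for N = 2, X_{-i} is simply the other feature
N = len(features)
x_joint = x1 * 2 + x2      # encode the pair (X1, X2) as one discrete variable

# Syn(X->Y) = I(X;Y) - (1/N) * sum_i [ I(X_i;Y) + I(X_{-i};Y) ]
syn = mi(x_joint, y) - sum(mi(xi, y) + mi(r, y) for xi, r in zip(features, rest)) / N

print(syn)  # -> 1.0: each feature alone gives 0 bits about Y, jointly they give 1 bit
```

XOR is the canonical purely synergistic function: every individual term I(X_i;Y) and I(X_{−i};Y) vanishes, so the measure reduces to the full joint information I(X;Y) = 1 bit.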

