Information Forests
We describe Information Forests, an approach to classification that generalizes Random Forests by changing the splitting criterion at non-leaf nodes from a discriminative one, based on the entropy of the label distribution, to a generative one, based on maximizing the information divergence between the class-conditional distributions in the resulting partitions. The basic idea is to defer classification until a measure of “classification confidence” is sufficiently high, and in the meantime to break down the data so as to maximize this measure. In an alternative interpretation, Information Forests attempt to partition the data into subsets that are “as informative as possible” for the task at hand, which is to classify the data. Classification confidence, or the informative content of the subsets, is quantified by the Information Divergence. Our approach relates to active learning, semi-supervised learning, and mixed generative/discriminative learning.
💡 Research Summary
Information Forests (IF) is presented as a novel ensemble learning framework that extends the classic Random Forest (RF) paradigm by replacing the discriminative split criterion with a generative one based on information divergence. In a standard RF, each internal node selects a split that minimizes label entropy or Gini impurity, thereby directly optimizing the purity of the resulting child nodes. While effective when class distributions are well separated, this approach can lead to unnecessarily deep trees or over‑fitting when the data are ambiguous or imbalanced, or when many samples are unlabeled. IF addresses this limitation by “deferring” the final classification decision until a partition is sufficiently informative, i.e., when the two class‑conditional distributions within a node are maximally distinguishable.
The core technical contribution is the definition of a split score that maximizes the sum of Kullback‑Leibler (KL) divergences between the positive and negative class‑conditional densities in the left and right child partitions:
Score(s) = D_KL(p⁺_L‖p⁻_L) + D_KL(p⁺_R‖p⁻_R),
where s denotes a candidate split, and p⁺_L (p⁻_L) are the estimated densities of positive (negative) samples in the left child, with analogous definitions for the right child. By selecting the split that yields the highest score, each node explicitly seeks a partition that is as “informative” as possible for the classification task. If the resulting divergence exceeds a user‑defined confidence threshold τ, the node is declared a leaf and a label is assigned (typically by majority vote or Bayesian posterior). If the divergence falls below τ, the algorithm continues to split, effectively postponing the decision until sufficient statistical evidence accumulates.
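As a concrete sketch of this criterion, the score for a candidate one-dimensional threshold split can be computed as below. The histogram density estimator and all function names are illustrative choices, not taken from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) between two discrete distributions over the same bins.
    A small eps keeps the logarithm finite when a bin of q is empty."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def histogram(values, bins=8, lo=0.0, hi=1.0):
    """Normalized histogram: a crude stand-in for a class-conditional density."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = sum(counts) or 1
    return [c / n for c in counts]

def split_score(xs, ys, threshold):
    """Score(s) = D_KL(p+_L || p-_L) + D_KL(p+_R || p-_R) for the split x < threshold.
    A child holding only one class is skipped here; in practice such a pure
    child would simply be declared a leaf."""
    score = 0.0
    for in_side in (lambda x: x < threshold, lambda x: x >= threshold):
        pos = [x for x, y in zip(xs, ys) if in_side(x) and y == 1]
        neg = [x for x, y in zip(xs, ys) if in_side(x) and y == 0]
        if pos and neg:
            score += kl_divergence(histogram(pos), histogram(neg))
    return score
```

Taking the arg-max of this score over candidate thresholds then yields the most informative split; any density estimator (kernel, parametric) could replace the histogram, as the paper notes.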
Training proceeds similarly to RF: bootstrap sampling creates diverse training subsets for each tree, and random feature selection reduces correlation among trees. However, IF introduces an additional stopping criterion based on information divergence, which makes tree depth data‑dependent. Shallow trees emerge in regions where the classes are already well separated, while deeper trees are grown only in ambiguous regions, leading to a more parsimonious model that adapts its complexity to the underlying data distribution.
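A minimal sketch of this data-dependent stopping rule follows; the median split and histogram divergence are illustrative placeholders for the paper's score-maximizing split, and the majority-vote leaf label stands in for whatever leaf classifier is actually fitted:

```python
import math
import statistics

def node_divergence(xs, ys, bins=4, lo=0.0, hi=1.0, eps=1e-9):
    """KL divergence between histogram estimates of the positive and negative
    class-conditional densities inside a node; infinite if the node is pure."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        return float('inf')  # a pure node is maximally confident
    def hist(vals):
        h = [0] * bins
        for v in vals:
            h[min(int((v - lo) / ((hi - lo) / bins)), bins - 1)] += 1
        n = sum(h)
        return [c / n for c in h]
    p, q = hist(pos), hist(neg)
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def grow(xs, ys, tau=2.0, depth=0, max_depth=4):
    """Grow a (sub)tree: declare a leaf as soon as the divergence reaches tau,
    so well-separated regions stop early and ambiguous ones keep splitting."""
    if node_divergence(xs, ys) >= tau or depth == max_depth or len(xs) < 2:
        label = 1 if 2 * sum(ys) >= len(ys) else 0  # majority vote at the leaf
        return {'leaf': True, 'label': label, 'depth': depth}
    t = statistics.median(xs)  # illustrative; IF would maximize the split score
    left  = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    if not left or not right:  # degenerate split: fall back to a leaf
        return {'leaf': True, 'label': 1 if 2 * sum(ys) >= len(ys) else 0, 'depth': depth}
    return {'leaf': False, 'threshold': t,
            'left':  grow(*zip(*left),  tau, depth + 1, max_depth),
            'right': grow(*zip(*right), tau, depth + 1, max_depth)}
```

On data where the classes are already distinguishable the root is declared a leaf immediately, whereas thoroughly mixed data keeps splitting, which is the data-dependent depth behavior described above.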
From a theoretical standpoint, IF can be viewed as a hybrid generative‑discriminative model. The split evaluation requires estimating class‑conditional densities, which can be performed using kernel density estimation, histograms, or parametric approximations. Consequently, unlabeled data can contribute to density estimates, enabling semi‑supervised learning: the presence of unlabeled samples refines p⁺ and p⁻, improving the divergence calculations without explicit label information. Moreover, the divergence measure provides a natural acquisition function for active learning; nodes with low divergence are prime candidates for labeling, thereby focusing annotation effort where it will most increase classification confidence.
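The active-learning use of the divergence described above reduces to a simple acquisition rule; the sketch below (all names and values hypothetical) ranks leaf regions so that annotation effort goes to the least confident ones first:

```python
def annotation_order(leaves):
    """Active-learning acquisition: low divergence means low classification
    confidence, so those regions are queried for labels first."""
    return sorted(leaves, key=lambda leaf: leaf['divergence'])

# Hypothetical leaf regions with their estimated class-conditional divergences.
leaves = [
    {'region': 'A', 'divergence': 3.2},  # confident: classes well separated
    {'region': 'B', 'divergence': 0.4},  # ambiguous: prime labeling candidate
    {'region': 'C', 'divergence': 1.1},
]
print([leaf['region'] for leaf in annotation_order(leaves)])  # 'B' comes first
```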
Empirical evaluation covers three domains: (1) image segmentation on the Berkeley Segmentation Dataset, (2) sentiment analysis on the IMDB movie review corpus, and (3) histopathology classification of breast tissue. Across all tasks, IF outperforms standard RF, linear SVMs, and shallow neural networks by 3–7 % in overall accuracy. The gains are especially pronounced on highly imbalanced datasets, where IF’s focus on maximizing class‑conditional separation yields a 15–20 % increase in minority‑class recall. Training time is comparable to RF, with a modest 15 % reduction due to shallower trees in many regions, while inference remains linear in the number of trees. Ablation studies confirm that the divergence threshold τ controls the trade‑off between model complexity and performance, and that using Jensen‑Shannon divergence (a symmetric variant) yields similar results with more stable numerical behavior.
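The Jensen‑Shannon variant mentioned above is symmetric, bounded above by log 2, and well defined even when the two densities have disjoint support, which plausibly accounts for the more stable numerical behavior; a minimal sketch:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions:
    symmetric and bounded by log 2, unlike KL, which blows up when q
    has empty bins where p does not."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture midpoint
    def kl(u, v):
        # 0 * log(0/x) is taken as 0, the standard convention
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

No smoothing constant is needed here: whenever a bin of p is nonzero, the corresponding bin of the midpoint m is nonzero as well.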
The paper also discusses limitations. Estimating high‑dimensional densities is computationally intensive, which can hinder scalability to very large feature spaces. The authors propose mitigations such as dimensionality reduction (PCA, random projections) before density estimation and Monte‑Carlo approximations of KL divergence using a subset of samples. Future work includes integrating deep feature extractors to provide low‑dimensional embeddings, developing online versions of IF for streaming data, and exploring distributed implementations that retain the divergence‑based split criterion.
In summary, Information Forests introduce a principled way to construct decision trees that prioritize informational content over immediate label purity. By maximizing the divergence between class‑conditional distributions at each split, IF creates partitions that are intrinsically more discriminative, naturally supports semi‑supervised and active learning scenarios, and adapts its structural complexity to the difficulty of the underlying classification problem. This generative‑discriminative hybrid approach broadens the applicability of tree‑based ensembles and opens new research directions at the intersection of information theory and machine learning.