Myopic Bayesian Decision Theory for Batch Active Learning with Partial Batch Label Sampling

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Over the past couple of decades, many active learning acquisition functions have been proposed, leaving practitioners with an unclear choice of which to use. Bayesian Decision Theory (BDT) offers a universal principle to guide decision-making. In this work, we derive BDT for (Bayesian) active learning in the myopic framework, where we imagine we only have one more point to label. This derivation leads to effective algorithms such as Expected Error Reduction (EER), Expected Predictive Information Gain (EPIG), and other algorithms that appear in the literature. A key challenge of such methods is the difficult scaling to large batch sizes, leading to either computational challenges (BatchBALD) or dramatic performance drops (top-$B$ selection). Here, using a particular formulation of the decision process, we derive Partial Batch Label Sampling (ParBaLS) for the EPIG algorithm. We show experimentally for several datasets that ParBaLS EPIG gives superior performance for a fixed budget and Bayesian Logistic Regression on Neural Embeddings. Our code is available at https://github.com/ADDAPT-ML/ParBaLS.


💡 Research Summary

The paper tackles a long‑standing practical dilemma in active learning (AL): which acquisition function should be used, especially when labels are requested in batches rather than one‑by‑one. The authors ground their approach in Bayesian Decision Theory (BDT), which prescribes choosing actions that minimize expected loss under uncertainty. By adopting a myopic perspective—optimizing the expected test loss after acquiring a single additional label—they derive a unified formulation that encompasses several classic acquisition strategies, notably Expected Error Reduction (EER) and Expected Predictive Information Gain (EPIG). In this setting, the optimal point to label is the one that maximally reduces the expected entropy of predictions on a validation set, which is mathematically equivalent to maximizing the mutual information between the candidate point and the validation points conditioned on the current labeled set.
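This mutual-information view lends itself to a direct Monte-Carlo estimate: with k posterior parameter samples, the joint predictive over a candidate point and a test point is the average of the per-sample outer products, and EPIG is the mutual information of that joint, averaged over test points. The sketch below illustrates this; the function name and array layout are our own, not taken from the paper.

```python
import numpy as np

def epig_scores(probs_pool, probs_test):
    """Monte-Carlo EPIG estimate from posterior predictive samples.

    probs_pool: (k, N, C) class probabilities for N candidate pool points
                under k posterior parameter samples.
    probs_test: (k, M, C) class probabilities for M test/validation points.
    Returns:    (N,) EPIG score per pool point, averaged over test points.
    """
    k = probs_pool.shape[0]
    # Joint predictive p(y, y*) ~ (1/k) sum_j p_j(y|x) p_j(y*|x*): (N, M, C, C)
    joint = np.einsum('knc,kmd->nmcd', probs_pool, probs_test) / k
    # Marginals p(y), p(y*): posterior predictive means over parameter samples.
    marg_pool = probs_pool.mean(axis=0)  # (N, C)
    marg_test = probs_test.mean(axis=0)  # (M, C)
    indep = marg_pool[:, None, :, None] * marg_test[None, :, None, :]
    eps = 1e-12
    # Mutual information = KL(joint || product of marginals), per (x, x*) pair.
    mi = (joint * (np.log(joint + eps) - np.log(indep + eps))).sum(axis=(2, 3))
    return mi.mean(axis=1)  # average over test points
```

With a single posterior sample (k = 1) the joint factorizes exactly, so every score is zero: the candidate label carries no information about test predictions once the parameters are fixed.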

While this derivation is elegant for single‑point selection, real‑world labeling pipelines typically operate in batches. The paper reviews three families of existing batch strategies: (1) Top‑B, which simply picks the B highest‑scoring points but ignores inter‑point dependencies; (2) heuristic diversity methods that inject randomness or submodular constraints but require dataset‑specific hyper‑parameter tuning; and (3) greedy subset selection (e.g., BatchBALD) that attempts to account for dependencies but suffers from exponential growth in the number of required posterior samples as batch size increases. Consequently, these methods either become computationally infeasible for moderate‑sized batches or suffer severe performance degradation.

To overcome these limitations, the authors introduce Partial Batch Label Sampling (ParBaLS). The key idea is to treat a partially constructed batch S as a set of unlabeled points whose true labels are unknown, and to approximate the expectation over all possible label configurations using Monte‑Carlo sampling. Specifically, after training a Bayesian model on the currently labeled set L, they draw m independent label vectors y^{(i)} for the entire unlabeled pool D\L from the posterior predictive distribution. Each sampled label vector defines a “universe” i, and a separate Bayesian model M_i is instantiated (or updated) with those pseudo‑labels. For each candidate point x in D, the EPIG acquisition score I(Y_x; Y_{x̂} | Y_S = y^{(i)}_S, L) is computed within each universe, and the scores are averaged across the m universes. The point with the highest averaged score is added to the partial batch, its pseudo‑label in each universe is recorded, and all M_i are updated accordingly. This process repeats until the batch reaches the desired size B. Because the Monte‑Carlo estimate’s standard error decays as O(1/√m) (its variance as O(1/m)) and does not depend on B, a modest number of universes (often m ≤ k, where k is the number of posterior parameter samples) suffices to obtain a low‑variance estimate. The computational complexity becomes O(B·N·m·k·C) in time and O(m·k·d·C) in space, which is linear in the batch size and dramatically cheaper than the exponential cost of exact batch Bayesian methods.
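The universe bookkeeping in this loop can be sketched in a few lines. The following is schematic only: `score_fn` and `sample_label_fn` stand in for the real EPIG computation and posterior predictive sampling, and their interfaces are our own simplification, not the paper's API.

```python
import numpy as np

def parbals_select(score_fn, sample_label_fn, batch_size, n_universes):
    """Schematic ParBaLS batch construction.

    score_fn(universe):           returns an (n_pool,) array of acquisition
                                  scores given that universe's pseudo-labels
                                  for the partial batch so far.
    sample_label_fn(i, universe): draws a pseudo-label for pool point i from
                                  that universe's posterior predictive.
    """
    universes = [{} for _ in range(n_universes)]  # point index -> pseudo-label
    batch = []
    for _ in range(batch_size):
        # Average the acquisition scores across pseudo-label universes.
        scores = np.mean([score_fn(u) for u in universes], axis=0)
        scores[batch] = -np.inf            # never re-select a batch member
        best = int(np.argmax(scores))
        batch.append(best)
        for u in universes:                # fix one pseudo-label per universe
            u[best] = sample_label_fn(best, u)
    return batch
```

In the real algorithm each universe also carries its own updated model M_i; here that state is abstracted into the two callbacks.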

The authors also present a MAP variant (ParBaLS‑MAP) that uses a single deterministic label assignment (the maximum‑a‑posteriori label for each unlabeled point) instead of sampling, offering a faster but slightly less robust alternative.

Empirically, the paper evaluates ParBaLS‑EPIG on a diverse suite of 24 experimental settings, covering tabular datasets (Airline Passenger, Credit Card fraud detection), text classification (AG News), and image classification (CIFAR‑10/100, iWildCam, fMoW) using pretrained neural embeddings (DINOv2, CLIP, BERT). Bayesian Logistic Regression (BLR) serves as the downstream model; for deep embeddings, only the final linear layer is Bayesian, enabling efficient inference while preserving uncertainty estimates. Dimensionality reduction via 99 % variance‑preserving PCA is applied to keep computational costs manageable.
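The 99 %‑variance PCA step admits a simple SVD-based implementation. The sketch below is a plausible reading of that preprocessing, not the paper's actual code; the function name and interface are our own.

```python
import numpy as np

def pca_99(X, var_frac=0.99):
    """Project X (n_samples, n_features) onto the fewest principal
    components whose cumulative explained variance reaches var_frac."""
    Xc = X - X.mean(axis=0)                       # center the embeddings
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / (S**2).sum()               # variance fraction per component
    k = int(np.searchsorted(np.cumsum(explained), var_frac)) + 1
    return Xc @ Vt[:k].T                          # (n_samples, k) projection
```

For high-dimensional embeddings such as DINOv2 or CLIP features, this typically cuts the dimension substantially while changing the downstream linear model very little.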

Results consistently show that ParBaLS‑EPIG outperforms or matches strong baselines such as Top‑B, BatchBALD, PowerBALD, SoftmaxBALD, and various diversity‑enhanced methods across all datasets, especially under label‑imbalanced or low‑budget regimes. The MAP variant remains competitive, confirming that even a single pseudo‑label universe can capture much of the benefit. The authors also provide a thorough complexity analysis (Table 1) and release all code, facilitating reproducibility.

In summary, the paper makes three major contributions: (1) it unifies several classic acquisition functions under a single Myopic Bayesian Decision Theory (MBDT) framework, offering a principled interpretation of why they work; (2) it proposes ParBaLS, a scalable and theoretically grounded batch acquisition method that sidesteps the exponential blow‑up of exact batch Bayesian approaches; and (3) it validates the approach with extensive experiments using Bayesian Logistic Regression on neural embeddings, demonstrating superior practical performance. The work bridges the gap between elegant Bayesian decision‑theoretic formulations and the pragmatic demands of batch active learning, providing a clear path forward for researchers and practitioners seeking both theoretical soundness and computational feasibility.

