A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
Datasets may contain observations with multiple labels. When the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that includes enough observations with the scarcer labels to support inferences about those labels and that deviates from the population frequencies in a known manner. In this paper, we model the multi-label problem with an underlying multivariate Bernoulli distribution. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate the multivariate Bernoulli distribution parameters and to calculate weights for each label combination. This approach ensures that the weighted sample attains the target distribution characteristics while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve the frequency order of the categories, reduce the frequency differences between the most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
💡 Research Summary
This paper addresses a significant challenge in sampling from multi-label data, where each observation can belong to multiple, non-exclusive categories, and label frequencies are highly imbalanced. The core problem is obtaining a sample that contains enough observations with scarce labels for reliable inference, while deliberately deviating from the population frequencies in a controlled and mathematically sound manner that accounts for dependencies between labels.
The authors propose a novel sampling framework grounded in probability theory. They model the underlying distribution of the multi-label data using a Multivariate Bernoulli (MVB) distribution. The observed frequencies of each unique combination of labels in the population are used to estimate the parameters of this MVB distribution. The innovation lies in formulating the sampling task as an optimization problem. The goal is to solve for a set of weights, one for each possible label combination, such that when a weighted sample is drawn, its resulting label distribution meets pre-specified criteria.
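As a minimal sketch of the estimation step described above, the snippet below computes the empirical probability of each observed label combination and the label marginals that result once those combinations are reweighted. The function names (`combo_probs`, `weighted_marginals`) and the toy data are illustrative assumptions; this summary does not give the paper's exact parameterization of the MVB distribution.

```python
from collections import Counter

def combo_probs(label_sets):
    """Empirical probability of each distinct label combination."""
    counts = Counter(frozenset(s) for s in label_sets)
    n = sum(counts.values())
    return {combo: c / n for combo, c in counts.items()}

def weighted_marginals(probs, weights):
    """Marginal probability of each label after reweighting combinations.

    The reweighted distribution is q[c] = weights[c] * probs[c] / z,
    where z normalizes q to sum to one.
    """
    z = sum(weights[c] * p for c, p in probs.items())
    marg = {}
    for combo, p in probs.items():
        for label in combo:
            marg[label] = marg.get(label, 0.0) + weights[combo] * p / z
    return marg

# Toy population (hypothetical): label "A" is common, "B" and "C" are scarce.
data = [{"A"}, {"A"}, {"A", "B"}, {"B"}, {"A"}, {"A", "C"}, {"A"}, {"A"}]
probs = combo_probs(data)

# With all weights equal to 1 the marginals are the population frequencies.
uniform = {combo: 1.0 for combo in probs}
print(weighted_marginals(probs, uniform))
```

Because the weights attach to whole label combinations rather than to individual labels, co-occurrence between labels is carried through the reweighting automatically.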
Two specific sampling objectives are formalized:
- Balanced Sampling: The weights are optimized so that every label has the same marginal probability of appearing in the sample. In expectation, all labels are then equally frequent in the sample.
- Compressed Imbalanced Sampling: A more flexible objective in which the weights are optimized to preserve the ordinal ranking of label frequencies from the population while compressing the ratio between the most and least frequent labels. A “compression strength” parameter lets the researcher control the degree of balancing, with stronger compression leading to a more uniform distribution.
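The two objectives above can be illustrated with a small sketch. Two things here are assumptions rather than the paper's method: the compression rule in `compressed_targets` (keep the largest marginal fixed and raise every ratio to it to the power 1/strength) is a hypothetical stand-in for the paper's definition of compression strength, and the solver is iterative proportional fitting (raking), swapped in for the paper's optimization procedure, which this summary does not specify.

```python
def marginals(dist):
    """Marginal probability of each label under a combination distribution."""
    out = {}
    for combo, p in dist.items():
        for label in combo:
            out[label] = out.get(label, 0.0) + p
    return out

def compressed_targets(marg, strength):
    """Hypothetical compression rule (an assumption, not the paper's):
    the largest marginal is kept and each ratio to it is raised to 1/strength,
    which preserves the frequency order and shrinks the max/min ratio."""
    top = max(marg.values())
    return {k: top * (v / top) ** (1.0 / strength) for k, v in marg.items()}

def rake_weights(probs, targets, iters=500, tol=1e-10):
    """Iterative proportional fitting over label combinations.

    Returns one weight per combination such that the reweighted
    distribution matches the target marginals (when they are attainable).
    """
    q = dict(probs)  # start from the population distribution
    for _ in range(iters):
        worst = 0.0
        for label, t in targets.items():
            m = sum(p for combo, p in q.items() if label in combo)
            if not 0.0 < m < 1.0:
                raise ValueError(f"label {label!r} must occur in some but not all mass")
            worst = max(worst, abs(m - t))
            up, down = t / m, (1.0 - t) / (1.0 - m)
            for combo in q:
                q[combo] *= up if label in combo else down
        if worst < tol:
            break
    return {combo: q[combo] / probs[combo] for combo in probs}

# Toy combination probabilities (hypothetical).
probs = {
    frozenset({"A"}): 0.625,
    frozenset({"A", "B"}): 0.125,
    frozenset({"B"}): 0.125,
    frozenset({"A", "C"}): 0.125,
}
targets = compressed_targets(marginals(probs), strength=2)
weights = rake_weights(probs, targets)

z = sum(weights[c] * probs[c] for c in probs)
sample_dist = {c: weights[c] * probs[c] / z for c in probs}
print(marginals(probs))        # population marginals
print(marginals(sample_dist))  # compressed marginals under the weights
```

In this sketch, a balanced objective corresponds to passing the same target value for every label. Whether such targets are attainable depends on which label combinations actually occur, so equal targets may have no solution for a given population.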
The paper demonstrates the practical utility of the method, particularly for meta-research. The algorithm is applied to sample scientific articles from Web of Science tagged with biomedical topic categories. In one example, a sample of 3,000 articles is drawn from a corpus of over 220,000 articles with 64 categories. In another, 200 articles are sampled from 1,341 articles across 6 categories. In both cases, simple random sampling yields a sample that mirrors the severe imbalance of the original corpus, potentially omitting rare categories. In contrast, the proposed MVB-based compressed sampling (with a compression strength of 2) successfully produces a more balanced sample where the frequency order of categories is largely maintained, but the representation of minority categories is substantially enhanced.
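To make the contrast with simple random sampling concrete, the sketch below draws a weighted sample without replacement, giving each article the weight of its label combination. The summary does not say how the paper draws its sample; the key-based method of Efraimidis and Spirakis is used here as one standard option, and the article records, weights, and the name `weighted_sample` are hypothetical.

```python
import math
import random

def weighted_sample(items, weights, k, seed=None):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys).

    Each item gets the key u ** (1 / w) for u uniform on (0, 1]; the k
    largest keys are kept. Keys are compared on the log scale for stability.
    Items with non-positive weight are never selected.
    """
    rng = random.Random(seed)
    keyed = []
    for item, w in zip(items, weights):
        if w <= 0:
            continue
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        keyed.append((math.log(u) / w, item))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]

# Hypothetical corpus: 90 articles with only a common label, 10 with a rare one.
articles = [(i, frozenset({"common"})) for i in range(90)]
articles += [(i, frozenset({"common", "rare"})) for i in range(90, 100)]

# Hypothetical per-combination weights favoring the rare combination.
combo_weight = {frozenset({"common"}): 1.0, frozenset({"common", "rare"}): 5.0}

sample = weighted_sample(
    articles, [combo_weight[labels] for _, labels in articles], k=20, seed=0
)
rare_count = sum("rare" in labels for _, labels in sample)
print(len(sample), rare_count)
```

With sampling without replacement, the realized inclusion probabilities are only approximately proportional to the weights, and the approximation degrades as the sample size approaches the number of available articles in a heavily weighted combination.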
The discussion draws a parallel between the weighting scheme and Inverse Probability Weighting Estimation (IPWE). Both use weights to achieve a desired distribution, but the proposed method optimizes for a property of the marginal distribution (e.g., equality or compressed ratios) rather than for a known target distribution. The method provides a principled, model-based alternative to heuristic multi-label sampling algorithms. It explicitly incorporates label dependencies through the MVB model to create samples that are both balanced and structurally representative of the original data’s correlation patterns.
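The inverse-weighting idea behind that parallel can be sketched at the level of distributions. Because the sample deviates from the population through known combination weights, dividing the sample distribution by those weights and renormalizing recovers the population distribution. This is a generic illustration under assumed toy numbers, not a procedure taken from the paper; `recover_population` is a hypothetical name.

```python
def recover_population(sample_dist, weights):
    """Undo known combination weights: p[c] is proportional to q[c] / w[c]."""
    raw = {combo: q / weights[combo] for combo, q in sample_dist.items()}
    z = sum(raw.values())
    return {combo: v / z for combo, v in raw.items()}

# Toy population distribution and sampling weights (hypothetical).
population = {frozenset({"A"}): 0.7, frozenset({"A", "B"}): 0.2, frozenset({"B"}): 0.1}
w = {frozenset({"A"}): 1.0, frozenset({"A", "B"}): 2.0, frozenset({"B"}): 4.0}

# Distribution of label combinations in the weighted sample.
z = sum(w[c] * p for c, p in population.items())
sampled = {c: w[c] * p / z for c, p in population.items()}

recovered = recover_population(sampled, w)
print(recovered)
```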