Multiple Hypothesis Testing in Pattern Discovery
The problem of multiple hypothesis testing arises when more than one hypothesis must be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, simultaneously assessing the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I errors). Our contribution in this paper is to extend the multiple hypothesis testing framework for use with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.
💡 Research Summary
The paper addresses a fundamental challenge in data mining: how to assess the statistical significance of a massive number of patterns (e.g., frequent itemsets, association rules, subgraphs) that are generated by a mining algorithm in a single run. Traditional multiple‑hypothesis‑testing procedures such as Bonferroni correction or Benjamini‑Hochberg false‑discovery‑rate control assume a fixed set of hypotheses and often become overly conservative or inapplicable when the hypotheses are produced dynamically. The authors propose a generic, algorithm‑independent framework that integrates a mining procedure with a resampling‑based multiple‑testing correction capable of controlling the family‑wise error rate (FWER) in the strong sense.
The core idea is to create a collection of surrogate data sets that preserve the structural properties of the original data (e.g., item frequencies, transaction sizes) by random permutation or bootstrap sampling. The same mining algorithm is run on each surrogate, yielding a set of candidate patterns and their associated test statistics for every resample. From these, two families of critical values are derived: (1) the “min‑p” method records the smallest p‑value obtained in each surrogate and uses the α quantile of this distribution as the global significance threshold (small p‑values are the extreme ones, so the lower tail of the min‑p distribution is the relevant cutoff); (2) the “max‑stat” method records the largest test statistic in each surrogate and uses its (1‑α) quantile as the cutoff. Both approaches implicitly account for dependencies among patterns because the entire mining process is replicated on each surrogate.
The authors provide a rigorous proof that the min‑p and max‑stat procedures guarantee FWER ≤ α for any possible configuration of true and false null hypotheses, i.e., they control the error in the strong sense. The proof hinges on the exchangeability of the original data and its surrogates under the global null hypothesis, ensuring that the distribution of the minimum p‑value (or maximum statistic) is the same across all resamples.
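The shape of such an argument can be sketched in standard Westfall–Young-style notation (ours, not necessarily the paper's):

```latex
\mathrm{FWER}
  \;=\; \Pr\!\Bigl(\min_{i \in \mathcal{T}} p_i \le c_\alpha\Bigr)
  \;\le\; \Pr\!\Bigl(\min_{i} p_i \le c_\alpha\Bigr)
  \;=\; \alpha
```

Here $\mathcal{T}$ indexes the true null hypotheses and $c_\alpha$ is the $\alpha$-quantile of the surrogate min‑p distribution. The inequality holds because minimizing over all hypotheses can only shrink the minimum, and the final equality follows from the exchangeability of the original data with its surrogates. Lifting this from the global null (weak control) to arbitrary configurations of true nulls (strong control) additionally requires a condition in the spirit of subset pivotality, which the exchangeability argument supplies here.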
Empirical evaluation is performed on two fronts. First, a real‑world retail transaction database containing thousands of items and hundreds of thousands of purchases is mined for frequent itemsets. The proposed methods are compared against Bonferroni correction and Benjamini‑Hochberg FDR control. At a nominal α = 0.05, the min‑p approach discovers roughly 30 % more significant itemsets than Bonferroni while maintaining the same empirical FWER, and its power is comparable to the FDR method without sacrificing error control. Second, synthetic data are generated to simulate “hypothesis explosion” scenarios where the number of candidate patterns ranges from a few thousand to several hundred thousand. In these stress tests, the resampling‑based procedures continue to keep the observed FWER at or below the target level, whereas Bonferroni becomes excessively conservative, leading to a dramatic loss of power.
A computational‑complexity analysis shows that the overall cost scales linearly with the number of resamples B and the runtime T of the underlying mining algorithm (O(B·T)). The authors demonstrate that a modest number of resamples (B≈1000) suffices to obtain stable critical values, and that the process can be efficiently parallelized across multiple cores or distributed systems, making it practical for large‑scale mining tasks.
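Because the B resamples are mutually independent, the O(B·T) work is embarrassingly parallel. A minimal sketch of the fan-out step, with the hypothetical helper `mine_and_score` standing in for one full mine-and-test run on a surrogate (a thread pool is shown for brevity; a process pool or a distributed scheduler would suit a CPU-bound miner):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_surrogate_stats(surrogates, mine_and_score, n_workers=4):
    """Run the mining-and-scoring step on each surrogate in parallel.

    Total work is O(B * T): B independent surrogates, each costing one
    full run T of the miner, so they parallelize with no coordination.
    mine_and_score(surrogate) -> (min_p, max_stat) for that surrogate.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map preserves input order, so results line up with surrogates.
        results = list(pool.map(mine_and_score, surrogates))
    min_ps = [r[0] for r in results]
    max_stats = [r[1] for r in results]
    return min_ps, max_stats
```

The two returned lists feed directly into the quantile computation that yields the min‑p and max‑stat critical values.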
In conclusion, the paper delivers a theoretically sound and practically viable solution for multiple hypothesis testing in pattern discovery. By embedding the mining algorithm within a resampling framework, it achieves strong‑sense FWER control without the severe loss of power typical of classical corrections. The methodology is broadly applicable to any pattern‑search algorithm and opens avenues for future extensions, such as controlling other error metrics (e.g., k‑FWER, false discovery exceedance) or incorporating adaptive resampling strategies.