Directional replicability: when can the factor of two be omitted
Directional replicability addresses the question of whether an effect studied across $n$ independent studies is present with the same direction in at least $r$ of them, for $r \geq 2$. When the expected direction of the effect is not specified in advance, the state of the art recommends assessing replicability separately by combining one-sided $p$-values for both directions (left and right), and then doubling the smaller of the two resulting combined $p$-values to account for multiple testing. In this work, we show that this multiplicative correction is not always necessary, and give conditions under which it can be safely omitted.
💡 Research Summary
The paper tackles the problem of “directional replicability,” which asks whether an effect observed across n independent studies is present with the same sign (all positive or all negative) in at least r studies, where r ≥ 2. When the expected direction is not pre‑specified, current practice (following Owen 2009 and subsequent work) conducts two one‑sided tests: one for a positive effect in at least r studies (H⁺_{r/n}) and one for a negative effect in at least r studies (H⁻_{r/n}). A combined p‑value is obtained for each direction, typically by a partial‑conjunction method such as Bonferroni, and the smaller of the two is doubled (equivalently, each one‑sided test is run at level α/2) to control the family‑wise error rate of the two‑direction search.
The authors show that the multiplicative factor of two is not universally required. They focus on the Bonferroni partial‑conjunction p‑value, defined as
p⁺_{r/n} = (n − r + 1) · p_{(r)} and p⁻_{r/n} = (n − r + 1) · q_{(r)},
where p_{(r)} is the r‑th smallest right‑tail p‑value (p_i = 1 − Φ(T_i)) and q_{(r)} is the r‑th smallest left‑tail p‑value (q_i = Φ(T_i)).
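These two combined p‑values are straightforward to compute; a minimal sketch in Python, assuming normal z‑statistics T_i (function names are illustrative, not from the paper; values are capped at 1 so they remain valid p‑values):

```python
import math

def norm_sf(x):
    # Right-tail probability of the standard normal: 1 - Phi(x).
    return 0.5 * math.erfc(x / math.sqrt(2))

def bonferroni_pc(T, r):
    """Bonferroni partial-conjunction p-values (p_plus, p_minus) for
    'at least r positive effects' and 'at least r negative effects'."""
    n = len(T)
    p = sorted(norm_sf(t) for t in T)        # right-tail p-values p_i = 1 - Phi(T_i)
    q = sorted(1.0 - norm_sf(t) for t in T)  # left-tail p-values  q_i = Phi(T_i)
    p_plus = min(1.0, (n - r + 1) * p[r - 1])    # (n - r + 1) times r-th smallest
    p_minus = min(1.0, (n - r + 1) * q[r - 1])
    return p_plus, p_minus

# Example: four strongly positive statistics and one near-null statistic
T = [3.0, 2.5, 2.8, 3.2, 0.1]
p_plus, p_minus = bonferroni_pc(T, r=4)   # p_plus small, p_minus uninformative
```

Here the evidence for at least four positive effects is strong, so p_plus is small while p_minus is capped at 1.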
Theorem 1 (main result) states that if r exceeds the halfway point of the study set, i.e. (n + 1)/2 < r ≤ n, then the simple minimum
p_{r/n} = min{p⁺_{r/n}, p⁻_{r/n}}
is already a valid level‑α p‑value for the directional replicability null hypothesis H_{r/n} (which asserts that fewer than r studies have a positive effect and fewer than r studies have a negative effect). The proof relies on ordering the test statistics T_{(1)} ≤ … ≤ T_{(n)} and noting that, under the condition r > n − r + 1, the events {T_{(r)} ≤ −t} and {T_{(n−r+1)} ≥ t} are disjoint. Each event corresponds to a sum of independent Bernoulli indicators (X_i = 1_{T_i ≤ −t} and Y_i = 1_{T_i ≥ t}). By analyzing the worst‑case configuration (θ_i → ∞ for the first r − 1 components), the authors show that the supremum of the Type I error over the null parameter space equals α, establishing validity without the factor‑two correction.
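The counting step behind the disjointness claim can be checked directly: for t > 0, the event {T_{(r)} ≤ −t} requires at least r statistics at or below −t, and {T_{(n−r+1)} ≥ t} requires at least r statistics at or above t, which together would need 2r > n statistics. A small sketch (illustrative names, not the paper's code):

```python
from itertools import product

def both_events(T, r, t):
    # X = #{i : T_i <= -t} and Y = #{i : T_i >= t}; the two rejection
    # events occur together only if both counts reach r.
    x = sum(1 for v in T if v <= -t)
    y = sum(1 for v in T if v >= t)
    return x >= r and y >= r   # impossible when 2r > len(T), since x + y <= n

# Exhaustive check over all +/-1 sign patterns for n = 5, r = 4 (> (n+1)/2):
n, r, t = 5, 4, 0.5
overlap = any(both_events(list(cfg), r, t)
              for cfg in product((-1.0, 1.0), repeat=n))   # stays False
```

By contrast, with a small r such as r = 2 the configuration [-1, -1, 1, 1, 0] puts two statistics in each tail, so both events occur at once.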
When r is smaller, specifically 2 ≤ r ≤ (n + 1)/2, the two events can overlap, and the simple minimum can exceed the nominal level. The paper presents a concrete counterexample with n = 20 and r ranging from 2 to 7. In the “discordant” configuration where the first r − 1 studies have extremely large positive effects and the next r − 1 studies have extremely large negative effects, the Type I error of the unadjusted minimum exceeds the nominal level α and can approach the 2α bound, demonstrating that the factor‑two correction remains necessary in this regime.
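This inflation is easy to reproduce by simulation. A sketch under our own illustrative settings (not the paper's exact experiment): n = 20, r = 2, α = 0.05, with one huge positive and one huge negative effect, so the directional null H_{2/20} holds but the unadjusted minimum rejects far more often than α:

```python
import random
from statistics import NormalDist

random.seed(0)
nd = NormalDist()
n, r, alpha, big = 20, 2, 0.05, 50.0
theta = [big, -big] + [0.0] * (n - 2)   # discordant configuration, null is true

def min_pc_p(T):
    # Unadjusted minimum of the two Bonferroni partial-conjunction p-values.
    p = sorted(1.0 - nd.cdf(t) for t in T)   # right-tail p-values
    q = sorted(nd.cdf(t) for t in T)         # left-tail p-values
    return min((n - r + 1) * p[r - 1], (n - r + 1) * q[r - 1])

nsim = 20000
rej = sum(min_pc_p([random.gauss(m, 1.0) for m in theta]) <= alpha
          for _ in range(nsim))
rate = rej / nsim   # Monte Carlo Type I error: well above alpha = 0.05
```

With these settings the empirical rejection rate lands close to twice the nominal level, illustrating why doubling is still needed for small r.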
Beyond the fixed‑r setting, the authors discuss a data‑adaptive choice of r. They propose starting from the smallest r that guarantees a majority, k = ⌈(n + 2)/2⌉, and sequentially testing H_{k/n}, H_{(k+1)/n}, … at level α, stopping at the first non‑rejection. The index ℓ of the last rejected hypothesis provides a (1 − α)‑level lower confidence bound on the maximum number of effects sharing the same sign, i.e., ℓ ≤ max{n₊, n₋} with probability at least 1 − α. This adaptive scheme avoids the need to select r a priori while preserving error control.
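The adaptive scheme can be sketched as follows (our own illustrative implementation, assuming normal z‑statistics; every r in the search range exceeds (n + 1)/2, so the un‑doubled minimum is valid throughout):

```python
import math
from statistics import NormalDist

nd = NormalDist()

def adaptive_majority_bound(T, alpha=0.05):
    """Largest r, searching upward from k = ceil((n + 2) / 2), whose
    directional hypothesis H_{r/n} is rejected by the un-doubled minimum.
    Returns None if even the first hypothesis is not rejected."""
    n = len(T)
    p = sorted(1.0 - nd.cdf(t) for t in T)   # right-tail p-values
    q = sorted(nd.cdf(t) for t in T)         # left-tail p-values
    k = math.ceil((n + 2) / 2)               # smallest r with r > (n + 1) / 2
    last = None
    for r in range(k, n + 1):
        m = min((n - r + 1) * p[r - 1], (n - r + 1) * q[r - 1])
        if m > alpha:
            break                            # stop at the first non-rejection
        last = r                             # record the last rejected index
    return last

# Example: eight strongly positive studies and two nulls out of n = 10
ell = adaptive_majority_bound([5.0] * 8 + [0.0, 0.0])   # lower bound ell = 8
```

With eight clearly positive studies the procedure rejects H_{6/10}, H_{7/10}, and H_{8/10} before stopping, so ℓ = 8.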
In the discussion, the authors note that the Bonferroni method is a simple, dependence‑robust way to combine p‑values, but when independence holds (as assumed throughout the paper) the Šidák correction can replace Bonferroni without affecting Theorem 1. They also mention other combining functions—Simes’ method and Fisher’s combination—that assume independence, and suggest that extending the “no‑factor‑two” result to these methods is an interesting direction for future work.
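Under the independence assumed throughout, the Šidák replacement mentioned here amounts to a one‑line change; a sketch with illustrative names, taking the sorted right‑tail p‑values as input:

```python
def bonf_pc(p_sorted, r):
    # Bonferroni partial-conjunction p-value: (n - r + 1) * p_(r), capped at 1.
    n = len(p_sorted)
    return min(1.0, (n - r + 1) * p_sorted[r - 1])

def sidak_pc(p_sorted, r):
    # Sidak variant, valid under independence; never larger than Bonferroni,
    # so it can only gain power.
    n = len(p_sorted)
    return 1.0 - (1.0 - p_sorted[r - 1]) ** (n - r + 1)

ps = [0.001, 0.01, 0.02, 0.2, 0.5]
b, s = bonf_pc(ps, 3), sidak_pc(ps, 3)   # s <= b always
```

Since 1 − (1 − x)^m ≤ m·x for x in [0, 1], the Šidák value never exceeds the Bonferroni one, which is why the swap cannot hurt Theorem 1.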
The paper connects its findings to earlier work on multivariate normal mean testing (Sasabuchi 1980) and highlights that the case r = n corresponds to testing whether all components share the same sign, a problem already studied in the likelihood‑ratio framework. Finally, the authors point out that in high‑dimensional applications (e.g., genomics) where many features are examined simultaneously, the usual practice is to apply the two‑direction procedure separately to each feature and then take the union of the two α/2‑level rejection sets. Whether the factor‑two correction can be omitted in that multiple‑testing context remains an open question.
Overall, the paper provides a clear condition—r > (n + 1)/2—under which the conventional doubling of the minimum one‑sided combined p‑value is unnecessary, thereby potentially increasing power in many replication studies while preserving rigorous error control.