On the Correct Use of Statistical Tests: Reply to "Lies, damned lies and statistics (in Geology)"


In a Forum published in Eos, Transactions AGU (2009) entitled “Lies, damned lies and statistics (in Geology)”, Vermeesch (2009) claims that “statistical significance is not the same as geological significance”, in other words, that statistical tests may be misleading. In complete contradiction, we affirm that statistical tests are always informative. We detail the several mistakes made by Vermeesch in his initial paper and in his comments on our reply. The present text is developed in the hope that it can serve as an illuminating pedagogical exercise for students and lecturers to learn more about the subtleties, richness and power of the science of statistics.


💡 Research Summary

The paper is a detailed rebuttal of P. Vermeesch’s 2009 claim that the distribution of global earthquakes over the seven days of the week is statistically non‑uniform, and that p‑values are heavily dependent on sample size, rendering statistical significance meaningless in geology. Vermeesch applied a chi‑square (χ²) goodness‑of‑fit test to the USGS catalog (118 415 events, magnitude 4.0–9.0, 1999‑2009) and obtained χ² = 94, corresponding to an astronomically small p‑value (4.5 × 10⁻¹⁸), which led him to reject the null hypothesis of uniform weekday occurrence. He then “reduced” the sample size by a factor of ten while preserving the observed weekday proportions, obtaining χ² = 9.4 and p = 0.15, and concluded that p‑values are unstable and not interpretable.

The authors identify two fundamental statistical errors in Vermeesch’s approach. First, reducing the sample size while keeping the same relative frequencies is not a legitimate resampling method. The χ² statistic is the sum over bins of (observed − expected)²/expected; dividing all counts by a factor k reduces each squared deviation by k² while each expected count shrinks only by k, so the statistic itself shrinks by the factor k. This is exactly why Vermeesch’s value dropped from 94 to 9.4 when k = 10: the decrease is an algebraic artifact, not a statistical result. Proper resampling would instead draw a random 10 % subset of the original events and re‑bin them into the seven weekdays; when this is done the p‑value remains on the order of 10⁻⁶, still rejecting uniformity. Vermeesch’s “10 % reduction” therefore produced an artificially deflated χ² value and an erroneous conclusion about the dependence of p‑values on sample size.
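
To make the arithmetic concrete, the following Python sketch (NumPy/SciPy) contrasts the two procedures. The weekday counts are hypothetical and deliberately exaggerated so that the effect survives a one‑tenth subsample; the actual per‑day totals of the USGS catalog are not reproduced in this summary.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

# Hypothetical weekday counts summing to 118,415 (illustrative only;
# the deviation from uniformity is exaggerated on purpose).
counts = np.array([18416, 16216, 17417, 16016, 15916, 17117, 17317])
n = counts.sum()

# Full-sample test against the uniform expectation of n/7 per weekday.
full = chisquare(counts)

# Vermeesch-style "reduction": divide every count by 10, keeping the
# proportions.  Since (O/10 - E/10)^2 / (E/10) = (O - E)^2 / (10 E),
# the statistic drops exactly tenfold and the p-value inflates.
scaled = chisquare(counts / 10)

# Proper resampling: draw a random 10% of the events and re-bin them.
days = np.repeat(np.arange(7), counts)              # one entry per event
sub = rng.choice(days, size=n // 10, replace=False)
resampled = chisquare(np.bincount(sub, minlength=7))

for name, res in [("full", full), ("scaled /10", scaled),
                  ("resampled", resampled)]:
    print(f"{name:>10}:  chi2 = {res.statistic:7.2f}   p = {res.pvalue:.3e}")
```

By the algebra in the comments, the scaled statistic is exactly one‑tenth of the full one, whereas the genuine 10 % subsample retains the statistical evidence at the strength the reduced sample size actually warrants.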

Second, the χ² test assumes independent, identically distributed (i.i.d.) observations and a minimum expected count per bin (commonly ≥ 8–10). The global earthquake catalog violates these assumptions because it contains a large proportion of aftershocks, swarm activity, and anthropogenic events (e.g., quarry blasts, fluid‑induced seismicity). To satisfy the independence requirement, the authors apply a standard declustering algorithm (Pisarenko et al., 2008) that removes 80 616 aftershocks (≈ 68 % of the catalog), leaving 37 799 main shocks. For this declustered set the χ² statistic is 36.19 with p ≈ 2.5 × 10⁻⁶, still rejecting uniformity, but the test now respects the independence assumption.
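
The reply relies on the specific declustering algorithm of Pisarenko et al. (2008), whose details are beyond this summary. To convey what declustering does, here is a minimal sketch of a much simpler fixed‑window scheme in the spirit of Gardner and Knopoff; the window sizes are arbitrary placeholders (realistic windows grow with mainshock magnitude), so this is a conceptual stand‑in, not the paper’s method.

```python
import numpy as np

def decluster_fixed_window(t_days, mags, lats, lons,
                           t_win=30.0, r_win_km=50.0):
    """Flag probable aftershocks: any event within t_win days and
    r_win_km of a larger (or equal-size) earlier event is dropped.
    A simplified stand-in, NOT the Pisarenko et al. (2008) algorithm;
    realistic windows grow with the mainshock magnitude."""
    keep = np.ones(len(mags), dtype=bool)
    for i in np.argsort(t_days):                  # sweep in time order
        if not keep[i]:
            continue                              # already an aftershock
        later = (t_days > t_days[i]) & (t_days <= t_days[i] + t_win)
        # Small-angle approximation to epicentral distance in km.
        d_km = 111.0 * np.hypot(
            lats - lats[i],
            (lons - lons[i]) * np.cos(np.radians(lats[i])))
        keep[later & (d_km < r_win_km) & (mags <= mags[i])] = False
    return keep  # True marks retained main shocks
```

On a catalog of 118 415 events, a procedure of this family is what removes the dependent aftershocks so that the i.i.d. assumption of the χ² test is at least approximately satisfied.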

Recognizing that catalog completeness also affects the analysis, the authors further restrict the data to events with magnitude m ≥ 5.0, a range in which the Gutenberg‑Richter frequency‑magnitude law indicates reasonable completeness (most authors cite a completeness threshold around m = 5.5 for the Harvard global catalog). This subset contains 16 308 events, of which 5 636 main shocks remain after declustering. The χ² test on these larger‑magnitude main shocks yields χ² = 5.64 with p = 0.46, a non‑significant result. Consequently, for well‑recorded, independent, larger earthquakes, the null hypothesis of uniform weekday occurrence cannot be rejected, in line with the geological intuition that tectonic processes have no intrinsic weekly cycle.
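
The summary does not detail how such a completeness threshold is verified. One common diagnostic, sketched below, is the maximum‑curvature estimate of the completeness magnitude Mc, paired with the Aki (1965) maximum‑likelihood b‑value above it; the bin width and the +0.2 correction are conventional choices, not values from the paper.

```python
import numpy as np

def completeness_maxc(mags, bin_width=0.1, correction=0.2):
    """Maximum-curvature estimate of the completeness magnitude Mc:
    the modal bin of the frequency-magnitude histogram, plus a small
    empirical correction for the method's known downward bias."""
    edges = np.arange(mags.min(), mags.max() + bin_width, bin_width)
    hist, edges = np.histogram(mags, bins=edges)
    return edges[np.argmax(hist)] + correction

def b_value(mags, mc, bin_width=0.1):
    """Aki (1965) maximum-likelihood b-value for events above Mc; a
    value near 1 is consistent with the Gutenberg-Richter law and
    hence with completeness above the threshold."""
    m = mags[mags >= mc]
    return np.log10(np.e) / (m.mean() - (mc - bin_width / 2))
```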

The paper also clarifies the misconception about p‑value stability. When the sample is large enough that each bin contains the required minimum number of observations, the χ² statistic follows the asymptotic chi‑square distribution established by Pearson (1900), essentially independently of the total sample size; in this regime, p‑values computed under a true null hypothesis are uniformly distributed on [0, 1]. Vermeesch’s claim that p‑values “strongly depend on sample size” therefore stems from his improper resampling and from applying the test to data that violate its assumptions.
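
This is easy to verify by simulation: draw weekday labels from a genuinely uniform distribution, run the χ² test, and observe that the distribution of p‑values does not drift with sample size. A minimal sketch:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)

# Under a true uniform null, p-values are ~Uniform(0, 1) regardless of
# sample size, once the expected count per bin is large enough.
for n in (700, 7_000, 70_000):
    pvals = np.empty(1_000)
    for k in range(pvals.size):
        days = rng.integers(0, 7, size=n)          # i.i.d. uniform weekdays
        pvals[k] = chisquare(np.bincount(days, minlength=7)).pvalue
    print(f"n = {n:6d}   mean p = {pvals.mean():.3f}   "
          f"frac(p < 0.05) = {(pvals < 0.05).mean():.3f}")
```

For every n, the mean p‑value stays near 0.5 and roughly 5 % of p‑values fall below 0.05, exactly as the uniform law predicts.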

Finally, the authors distinguish between the classical significance‑testing framework (which involves only a null hypothesis and a p‑value) and the Neyman‑Pearson hypothesis‑testing paradigm (which requires an explicit alternative hypothesis and power calculations). Vermeesch conflates the two, presenting non‑central χ² calculations without specifying a concrete alternative, thereby blurring the distinction between the expected p‑value and the probability of exceeding the observed statistic under an alternative. The authors argue that, under the null hypothesis, the distribution of the p‑value does not depend on sample size, whereas under a specific alternative the power of the test does increase with sample size, a point correctly illustrated in Vermeesch’s Table 1 but misinterpreted by him.
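
The contrast can be made quantitative. Under a fixed alternative with cell probabilities p₁, the χ² statistic is approximately non‑central chi‑square with noncentrality λ = n Σᵢ (p₁ᵢ − p₀ᵢ)²/p₀ᵢ, which grows linearly with n; power therefore rises toward 1 even though the p‑value law under the null never moves. The sketch below uses a hypothetical weekly modulation, not the alternatives of Vermeesch’s Table 1.

```python
import numpy as np
from scipy.stats import chi2, ncx2

# Null: uniform weekdays.  Alternative: a hypothetical weekly modulation.
p0 = np.full(7, 1 / 7)
delta = np.array([0.01, -0.005, 0.005, -0.01, 0.0, 0.005, -0.005])
p1 = p0 + delta                      # still sums to 1

df, alpha = 6, 0.05
crit = chi2.ppf(1 - alpha, df)       # rejection threshold of the test

for n in (1_000, 10_000, 100_000):
    lam = n * np.sum((p1 - p0) ** 2 / p0)   # noncentrality grows with n
    power = ncx2.sf(crit, df, lam)          # P(reject | alternative true)
    print(f"n = {n:7d}   lambda = {lam:7.2f}   power = {power:.3f}")
```

This is the precise sense in which sample size matters: it buys power against alternatives, not a drift of the p‑value under the null.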

In summary, the authors demonstrate that statistical tests, when applied correctly—respecting independence, completeness, appropriate bin counts, and proper resampling—are always informative. Extremely small p‑values should prompt investigators to scrutinize the underlying assumptions rather than dismiss the test as “lies.” By declustering, imposing magnitude thresholds, and using proper sampling, the authors show that the apparent weekly non‑uniformity of earthquakes disappears for the most reliable data, reaffirming that statistical significance and geological significance can be reconciled when methodology is sound. This work serves as a pedagogical case study for geoscientists on the proper use and interpretation of statistical hypothesis testing.

