A Statistical Significance Simulation Study for the General Scientist

Reading time: 5 minutes

📝 Original Info

  • Title: A Statistical Significance Simulation Study for the General Scientist
  • ArXiv ID: 1109.6565
  • Date: 2011-09-30
  • Authors: Jacob Levman

📝 Abstract

When a scientist performs an experiment, they normally acquire a set of measurements and are expected to demonstrate that their results are "statistically significant", thus confirming whatever hypothesis they are testing. The main method for establishing statistical significance involves demonstrating that there is a low probability that the observed experimental results were the product of random chance. This is typically defined as p < 0.05, which indicates there is less than a 5% chance that the observed results occurred randomly. This research study visually demonstrates that the commonly used definition of "statistical significance" can erroneously imply a significant finding. This is demonstrated by generating random Gaussian noise data and analyzing that data with statistical testing based on the established two-sample t-test. This study demonstrates that insignificant yet "statistically significant" findings are possible at moderately large sample sizes, which are very common in many fields of modern science.

💡 Deep Analysis

Figure 1 (image not reproduced in this extract)

📄 Full Content

Establishing statistical significance is extremely common among scientists. It involves demonstrating that there is a low probability that a scientist's observed measurements were the result of random chance. This is typically defined as p < 0.05, which indicates that there is less than a 5% chance that the observed differences were the result of randomness. Statistical significance can be established using a wide variety of statistical tests, which compare a scientist's measurements against reference distributions in order to determine a p-value (from which statistical significance is established). It is known that as the number of samples increases, the amount of difference needed between two distributions to obtain statistical significance (p < 0.05) gets smaller. The main focus of this research paper is to present data demonstrating that as the number of samples becomes large, the amount of separation between two groups needed to obtain 'statistical significance' becomes negligible. This effect indicates that scientists can face an extremely low threshold for obtaining statistical significance: in its most extreme form, a "statistically significant" effect is in fact qualitatively insignificant.
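The paper's setup can be sketched as a small simulation. The details below (group sizes, the 0.05-standard-deviation mean offset, the pooled-variance form of the t statistic) are illustrative assumptions rather than the paper's exact parameters, and the p-value uses a normal approximation that is only accurate for large samples:

```python
# Sketch of the paper's idea: two groups of Gaussian noise whose means differ
# by a practically negligible amount still reach p < 0.05 once n is large.
import random
import statistics

random.seed(0)

def two_sample_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (statistics.mean(a) - statistics.mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

def p_value(t):
    """Two-sided p-value via the normal approximation (good for large df)."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))

# A tiny, arguably meaningless separation: 0.05 standard deviations.
for n in (10, 100, 1000, 10000):
    a = [random.gauss(0.00, 1) for _ in range(n)]
    b = [random.gauss(0.05, 1) for _ in range(n)]
    print(f"n = {n:>5}: p = {p_value(two_sample_t(a, b)):.4f}")
```

As n grows, the printed p-values trend downward even though the two groups are nearly identical noise, which is the phenomenon the paper visualizes.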

In this study we have elected to perform statistical testing using the widely accepted and established two-sample t-test [1]. It should be noted that the t-test was developed over one hundred years ago, in 1908, at a brewery, by a scientist writing under the pseudonym "Student". This was long before the advent of computers, and thus long before a scientist had the ability to perform statistical testing on groups of data with large numbers of samples. In fact, the original introduction of the t-test [1] provided look-up tables to assist in statistical computations that allowed the researcher to perform analyses on data groups with at most 10 samples. In those days it was unreasonable for someone to manually compute p-values on hundreds or thousands of samples. In the present research environment a journal paper reviewer is likely to require many more than 10 samples from a typical scientist's experiment, thus inadvertently lowering the bar for obtaining the desired "statistical significance". After reading this study it should be clear that the t-test was developed for another era and that alternative techniques would benefit the modern scientist. Or, as Bill Rozeboom wrote in 1960, "The statistical folkways of a more primitive past continue to dominate the local scene" [2]. Rozeboom was writing about problems with statistical testing over 50 years after the t-test was first created; it is now 50 years since Rozeboom wrote this commentary, and his words are as relevant as ever. There have been many critiques of how statistical significance testing and null hypothesis testing are used [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20], yet despite the many shortcomings highlighted, hypothesis testing based on a p-value threshold (p < 0.05) remains one of the most common statistical techniques used by modern scientists.

Standard thought has it that if we increase the number of samples then the computed p-value becomes more and more reliable. In fact, as we add more and more samples, the amount of separation needed between our groups to achieve statistical significance gets smaller. This is because the standard error of the difference between two group means shrinks as the sample size grows, so once the number of samples is very large, even a very small separation between the two distributions yields a p-value below 0.05. Put another way, the threshold for statistical significance is so low that, provided we have an adequate number of samples, all we need are two noisy signals that are ever so marginally dissimilar in order to achieve "statistical significance". This low threshold has the potential to greatly affect a scientist's approach to their experiments. As scientists, our career prospects (and thus our prestige and personal finances) are heavily dependent on our accumulation of peer-reviewed journal papers. This personal motivation biases us towards getting our research accepted for publication. Since it is extremely common for a journal paper reviewer to require that experimental results be tested for statistical significance, we are generally biased towards finding statistical significance in our experiments in order to accumulate journal publications and to succeed in our careers.

The word 'significant' is qualitative and subjective: whether something is 'significant' is in the eye of the beholder. When we add the word 'statistics', we attach a strongly quantitative word to the very qualitative word 'significant'. This lends an appearance of credibility and certainty to any experiment that achieves a p-value below 0.05.


Reference

This content is AI-processed based on open access ArXiv data.
