A Statistical Significance Simulation Study for the General Scientist
Authors: Jacob Levman
Corresponding Author: Jacob Levman, PhD
Author Affiliation: Sunnybrook Research Institute, University of Toronto, Toronto, ON, Canada. 2075 Bayview Avenue, Room S605, Toronto, ON, Canada, M4N 3M5, 416-480-6100 x3390, fax: 416-480-5714, jacob.levman@sri.utoronto.ca

Abstract

When a scientist performs an experiment, they normally acquire a set of measurements and are expected to demonstrate that their results are "statistically significant", thus confirming whatever hypothesis they are testing. The main method for establishing statistical significance involves demonstrating that there is a low probability that the observed experimental results were the product of random chance. This is typically defined as p < 0.05, which indicates there is less than a 5% chance that the observed results occurred randomly. This research study visually demonstrates that the commonly used definition of "statistical significance" can erroneously imply a significant finding. This is demonstrated by generating random Gaussian noise data and analyzing that data with the established two-sample t-test. This study demonstrates that insignificant yet "statistically significant" findings are possible at moderately large sample sizes, which are very common in many fields of modern science.

Keywords: statistical significance; hypothesis testing; sample size; p-value; t-test

Introduction

Establishing statistical significance is extremely common among scientists. It involves demonstrating that there is a low probability that a scientist's observed measurements were the result of random chance. This is typically defined as p < 0.05, which indicates that there is less than a 5% chance that the observed differences were the result of randomness.
Statistical significance can be established using a wide variety of statistical tests, which compare a scientist's measurements against randomly generated distributions to determine a p-value (from which statistical significance is established). It is known that as the number of samples increases, the amount of difference needed between two distributions to obtain statistical significance (p < 0.05) gets smaller. The main focus of this paper is to present data demonstrating that as the number of samples becomes large, the amount of separation between two groups needed to obtain "statistical significance" becomes negligible. This effect indicates that scientists face a potentially extremely low threshold for obtaining statistical significance. In its most extreme form, a "statistically significant" effect is in fact qualitatively insignificant.

In this study we have elected to perform statistical testing using the widely accepted and established two-sample t-test [1]. It should be noted that the t-test was developed in a brewery in 1908, over one hundred years ago, by a scientist writing under a pseudonym (Student). This was long before the advent of computers, and thus long before a scientist had the ability to perform statistical testing on groups of data with large numbers of samples. In fact, the original introduction of the t-test [1] provided look-up tables that allowed the researcher to perform analyses on data groups with up to only 10 samples. In those days it was unreasonable for someone to manually compute p-values on hundreds or thousands of samples. In the present research environment, a journal paper reviewer is likely to require many more than 10 samples from a typical scientist's experiment, thus inadvertently lowering the bar for obtaining the desired "statistical significance".
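The shrinking threshold described above falls directly out of the t-test's arithmetic: the standard error of the difference between two equal-sized groups scales as 1/sqrt(n), so the mean separation needed to reach p = 0.05 shrinks accordingly. The sketch below illustrates this in Python/SciPy (the paper itself used Matlab; the function name and the unit-variance setting are illustrative assumptions, not the authors' code):

```python
# Illustrative sketch (not from the paper): the mean separation at which a
# two-sample t-test just reaches p = 0.05 shrinks roughly as 1/sqrt(n).
# Assumes two groups of equal size n drawn from unit-variance normals.
import math
from scipy import stats

def min_detectable_diff(n, alpha=0.05, sigma=1.0):
    """Mean difference at which a two-sample t-test just reaches p = alpha,
    assuming both groups have n samples and standard deviation sigma."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=2 * n - 2)  # two-sided critical value
    se = sigma * math.sqrt(2.0 / n)                    # standard error of the difference
    return t_crit * se

for n in [4, 64, 1024, 16384, 262144]:
    print(f"n = {n:6d}: difference needed ~ {min_detectable_diff(n):.4f} sigma")
```

At n = 4 the groups must differ by well over one standard deviation to reach p < 0.05; at n = 262144 a separation of a small fraction of a percent of one standard deviation suffices.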
After reading this study it should be clear that the t-test was developed for another era and that alternative techniques would benefit the modern scientist. Or, as Bill Rozeboom wrote in 1960, "The statistical folkways of a more primitive past continue to dominate the local scene" [2]. Rozeboom was writing about problems with statistical testing over 50 years after the t-test was first created. It is now 50 years after Rozeboom wrote this commentary, and his words are as relevant as ever. There have been many critiques of how statistical significance testing and null hypothesis testing are used [2-20], yet despite the many shortcomings highlighted, performing hypothesis testing against a p-value threshold (p < 0.05) is still one of the most common statistical techniques used by modern scientists.

Standard thought has it that if we increase our number of samples, then the computed p-value becomes more and more reliable. In fact, as we add more and more samples, the amount of separation needed between our groups to achieve statistical significance gets smaller. This is because the p-value computations are based on random data. Once the number of samples becomes very large, the amount of overlap observed between large randomly generated distributions will always be large, so very little separation between the two distributions is required to achieve a p-value below 0.05. Put another way, we have a threshold for statistical significance that is so low that, given an adequate number of samples, all we need is two noisy signals that are ever so marginally dissimilar in order to achieve "statistical significance". This low threshold has the potential to greatly affect a scientist's approach to their experiments.
As scientists, our career prospects (and thus our prestige and personal finances) are heavily dependent on our accumulation of peer-reviewed journal papers. This personal motivation biases us towards getting our research accepted for publication. Since it is extremely common for a journal paper reviewer to require that experimental results be tested for statistical significance, we are generally biased towards finding statistical significance in our experiments in order to accumulate journal publications and succeed in our careers.

The word "significant" is qualitative and subjective. Whether something is "significant" is in the eye of the beholder. When we add the word "statistical", we attach a strongly quantitative word to the very qualitative word "significant". This lends an appearance of credibility and certainty to any experiment that achieves a p-value below 0.05, simply because this is the widely accepted threshold for "statistical significance".

Since statistical significance is based on random distributions, performing hypothesis testing on the p-value calculation (p < 0.05) is like asking: did our experiment do better than 95% of randomness? But since the vast majority of scientists are likely to have constructed their experiments in a somewhat logical manner, they are generally liable to do at least a little better than random chance. Thus scientists are highly likely to find statistical significance in their experiments, especially if they perform their experiments with many samples. This study is designed to visually illustrate that achieving statistical significance (p < 0.05) at moderately large sample sizes requires only marginally significant (or possibly even insignificant) experimental data.

Methods

P-values are computed from lookup tables which are themselves created from randomly generated distributions of data.
This research study's methods are designed to visually illustrate how much separation is required between two groups of numbers in order to achieve the standard definition of statistical significance (p < 0.05) at a variety of sample sizes. This is accomplished by generating large numbers of normal (Gaussian) random distributions. We have elected to perform our analysis with two-sample tests, in which two groups of numbers are compared with each other to determine whether they are statistically significantly different. This is one of the most pervasive types of statistical testing, as it is extremely common for a scientist to compare two groups of numbers (for example, an experimental group and a control group).

For this study, 1000 pairs of random distributions were created at each example sample size. Of all the randomly generated cases, the pair exhibiting the highest p-value below 0.05 is selected for presentation as a visual example of how much separation is needed between two groups of data in order to achieve "statistical significance" at the given sample size. Random distributions were generated across a wide variety of group sample sizes, where each image's side length is a power of 2 (4, 16, 64, 256, 1024, 4096, 16384, 65536 and 262144 samples in the example distributions presented). The variation across these noise pairs demonstrates how the amount of separation between two barely statistically significantly different groups changes as the number of samples is varied.

All statistical significance testing was performed using one of the most common statistical tests available, the two-sample t-test. This test was selected so that our statistical testing method matches the type of distributions being randomly generated (Gaussian noise / normal distributions). In addition, for each sample size setting, the number of randomly created distribution pairs with a p-value below 0.05 is counted.
All random normal (Gaussian) distributions were created using the mathematical and statistical package Matlab (MathWorks, Natick, MA, USA). Statistical testing was performed with the established two-sample t-test provided in Matlab.

Results

Pairs of randomly generated distributions with p-values just below 0.05 are included as the main results of this study. Figure 1 demonstrates randomly generated, statistically significantly different normal distributions with 4 samples (2x2, top row), 16 samples (4x4, middle row) and 64 samples (8x8, bottom row). Figure 2 demonstrates randomly generated, statistically significantly different normal distributions with 256 samples (16x16, top row), 1024 samples (32x32, middle row) and 4096 samples (64x64, bottom row). Figure 3 demonstrates randomly generated, statistically significantly different normal distributions with 16384 samples (128x128, top row), 65536 samples (256x256, middle row) and 262144 samples (512x512, bottom row).

Table 1 presents the p-values of each of the pairs selected for viewing in Figures 1, 2 and 3, as computed by Matlab's two-sample t-test. Table 1 also presents the total number of randomly generated distribution pairs that achieved a statistically significant difference as the term is typically defined (p < 0.05), using the popular and well-established two-sample t-test. Since the experiment involves creating 1000 randomized pairs at each sample size, we expect about 50 (5%) of them to be statistically significant (p < 0.05). The results from each trial were confirmed to be close to 50 significant cases out of each 1000 randomly created cases. When examining each noise image pair, a scientist can interpret the two visual distributions as being very close to the threshold for obtaining statistical significance at the given number of samples.
Note that the difference between two statistically significantly different distributions gets smaller as the number of samples increases.

Figure 1: Randomly generated pairs of statistically significantly different (p < 0.05) distributions with 4 samples (top row), 16 samples (middle row) and 64 samples (bottom row).

Figure 2: Randomly generated pairs of statistically significantly different (p < 0.05) distributions with 256 samples (top row), 1024 samples (middle row) and 4096 samples (bottom row).

Figure 3: Randomly generated pairs of statistically significantly different (p < 0.05) distributions with 16384 samples (top row), 65536 samples (middle row) and 262144 samples (bottom row).

Table 1: P-values and associated data for the randomly generated distributions of Figures 1, 2 and 3

Random Distribution Size | P-value of Pair Presented in Figures | Number of Random Cases with p < 0.05
2x2     = 4      | 0.0499 | 51/1000
4x4     = 16     | 0.0459 | 44/1000
8x8     = 64     | 0.0488 | 48/1000
16x16   = 256    | 0.0493 | 49/1000
32x32   = 1024   | 0.0498 | 54/1000
64x64   = 4096   | 0.0499 | 56/1000
128x128 = 16384  | 0.0483 | 46/1000
256x256 = 65536  | 0.0485 | 42/1000
512x512 = 262144 | 0.0496 | 42/1000

Discussion

It can be seen from the results that at a p-value just below 0.05, the two randomly generated groups of 4 samples each are substantially different from each other: the image on the right is clearly darker overall than the image on the left (see Figure 1, top row). At 16 and 64 samples, it is clear that the random image on the left is darker than the one on the right, although the 64-sample images are substantially more similar to each other than the images with 16 or 4 samples. All of the statistically significant pairs presented in Figure 1 appear qualitatively significantly different from each other.
Once the size of the images has been increased to just 256 samples, it becomes challenging to see a significant difference between the two distributions, even though the results displayed are statistically significant (p = 0.0493) as the term is traditionally used (see Figure 2, top row). When comparing two distributions with more than 256 samples, the distributions appear qualitatively insignificantly different from each other despite having obtained "statistical significance" (see Figures 2 and 3).

Data were also included demonstrating that approximately 5% of the randomly generated pairs created for this experiment achieve statistical significance (as the term is typically defined, p < 0.05). This is presented in the final column of Table 1. This information is provided simply to demonstrate that the experiment matches expectations: about 5%, or about 50 out of 1000, randomly created pairs have a p-value below 0.05.

P-value computations from single-sample statistical tests are intuitive: a new single sample can be compared against the pre-existing group, and the 0.05 p-value threshold causes only those samples that fall on the outskirts of the distribution to be considered statistically significantly different. The same does not hold once we move to two-sample statistical testing. If we generate two random groups of data, each containing many samples, then it is inevitable that the two groups will overlap each other substantially. Even in the 5% of cases where the two large random distributions are most dissimilar, we will still find highly overlapping distributions, as demonstrated in this paper's results (Figures 2 and 3).
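The overlap point can be made quantitative. For two unit-variance normal distributions whose means differ by d, the shared area under the two density curves (the overlapping coefficient) is 2*Phi(-d/2), where Phi is the standard normal CDF. A Python/SciPy sketch (not from the paper; the helper name and the borderline-gap formula via the t critical value are illustrative assumptions):

```python
# Illustrative sketch: overlap between two unit-variance normals whose mean
# gap is exactly at the p = 0.05 border for a two-sample t-test with n samples
# per group. At large n the "significantly different" pair is nearly identical.
import math
from scipy import stats

def overlap_coefficient(d):
    """Fraction of shared area between N(0, 1) and N(d, 1)."""
    return 2 * stats.norm.cdf(-abs(d) / 2)

for n in [4, 256, 16384, 262144]:
    t_crit = stats.t.ppf(0.975, df=2 * n - 2)
    d = t_crit * math.sqrt(2.0 / n)     # borderline-significant mean gap
    print(f"n = {n:6d}: gap ~ {d:.4f} sigma, overlap ~ {overlap_coefficient(d):.1%}")
```

At n = 4 the borderline pair shares well under half its area, which is why Figure 1 looks convincing; at n = 262144 the two borderline distributions overlap almost completely, matching the visual impression of Figure 3.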
This has the effect of setting the bar for finding statistical significance in our experiments extremely low (especially when the number of samples is large) and may have led researchers to conclude a significant effect from their experimental results when in fact the effect observed is much smaller or possibly even non-existent. Achieving statistical significance (p < 0.05) merely demonstrates that the experimental results outperformed 95% of the randomly generated noise from which the p-value is computed. Ascribing "significance" to any experiment is a subjective task which should be evaluated by whoever is interested in examining the experiment, not by a single number.

This study's findings are potentially of broad interest to scientists in general. Figure 2 (top row) demonstrates that achieving statistical significance (p < 0.05) on groups with only 256 samples only confirms the existence of an extremely marginal effect. Scientific studies based on at least a couple of hundred samples in each group are extremely common in the literature, and establishing statistical significance is typically a prerequisite for publication of a scientific study. Scientists who find statistical significance in experiments containing thousands of samples have not actually demonstrated that their findings are significant at all (unless they have included well-separated confidence intervals). It is doubtful that anyone would qualitatively describe the pairs of results presented in Figure 3 as significantly different from each other, even though they meet the normal criterion for statistical significance (p < 0.05).

Establishing statistical significance with a p-value answers the question "did we beat 95% of randomness?" But randomness is a very low bar to set for ourselves, thus ensuring that scientists who work with reasonably large sample sizes will be able to go on finding statistically significant (p < 0.05) results (almost) wherever they look for them.
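The confidence-interval remark above is worth making concrete: a 95% CI on the mean difference states the size of the effect, which the bare p-value does not. A hedged Python/SciPy sketch (the injected shift of 0.03 sigma, the seed, and the sample size are assumptions chosen for illustration, not values from the paper):

```python
# Illustrative sketch: a tiny shift between two large Gaussian groups is
# highly "statistically significant", yet the 95% confidence interval on the
# mean difference shows the effect itself is minuscule.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 65536
a = rng.standard_normal(n)
b = rng.standard_normal(n) + 0.03            # assumed tiny injected shift

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
ci = (diff - 1.96 * se, diff + 1.96 * se)    # normal approximation, fine at this n
p = stats.ttest_ind(a, b).pvalue
print(f"p = {p:.2g}, mean difference = {diff:.4f} sigma, "
      f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

The p-value alone would be reported as a significant finding; the interval makes plain that the difference is a few hundredths of a standard deviation.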
Acknowledgments

This work was supported by the Canadian Breast Cancer Foundation.

References

[1] Student. The Probable Error of a Mean. Biometrika, 6(1), 1-25 (1908).
[2] Rozeboom, W. W. The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428 (1960).
[3] Wilhelmus, K. Beyond P: I: Problems with Probability. Journal of Cataract and Refractive Surgery, 30(9), 2005-2006 (2004).
[4] Siegfried, T. Odds Are, It's Wrong. Science News, 177(7), (2010).
[5] Cohen, J. The Earth Is Round (p < .05). American Psychologist, 49(12), 997-1003 (1994).
[6] Bakan, D. The test of significance in psychological research. Psychological Bulletin, 66, 1-29 (1966).
[7] Morrison, D. E. and Henkel, R. E. The Significance Test Controversy. Chicago: Aldine Publishing Company (1970).
[8] Falk, R. and Greenbaum, C. W. Significance Tests Die Hard. Theory & Psychology, 5(1), 75-98 (1995).
[9] Goodman, S. Toward Evidence-Based Medical Statistics 1: The P Value Fallacy. Annals of Internal Medicine, 130(12), 995-1004 (1999).
[10] Wagenmakers, E. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779-804 (2007).
[11] Dixon, P. The p-value fallacy and how to avoid it. Canadian Journal of Experimental Psychology (2003).
[12] Panagiotakos, D. The Value of p-Value in Biomedical Research. Open Cardiovasc Med J, 2, 97-99 (2008).
[13] Goodman, S. p Values, Hypothesis Tests, and Likelihood: Implications for Epidemiology of a Neglected Historical Debate. American Journal of Epidemiology, 137(5), 485-496 (1993).
[14] Sainani, K. Misleading Comparisons: The Fallacy of Comparing Statistical Significance. Physical Medicine and Rehabilitation, 2, 559-562 (2010).
[15] Taylor, K. and Frideres, J. Issues Versus Controversies: Substantive and Statistical Significance. American Sociological Review, 37, 464-472 (1972).
[16] Gliner, J., Leech, N., and Morgan, G. Problems With Null Hypothesis Significance Testing (NHST): What Do The Textbooks Say?. The Journal of Experimental Education, 71(1), 83-92 (2002).
[17] Pocock, S., Hughes, M., and Lee, R. Statistical Problems in the Reporting of Clinical Trials. New England Journal of Medicine, 317, 426-432 (1987).
[18] Kaye, D. Is Proof of Statistical Significance Relevant?. Washington Law Review, 61, (1986).
[19] Nurminen, M. Statistical significance - a misconstrued notion in medical research. Scandinavian Journal of Work Environment and Health, 23(3), 232-235 (1997).
[20] Hopewell, S., Loudon, K., Clarke, M. J., Oxman, A. D., and Dickersin, K. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database of Systematic Reviews, 1, (2009).