Estimating Bernoulli trial probability from a small sample
The standard textbook method for estimating the probability of a biased coin from finitely many tosses implicitly assumes the sample size is large and gives misleading results for small samples. We describe the exact solution, which is correct for any sample size.
💡 Research Summary
The paper addresses a fundamental yet often overlooked problem in elementary statistics: estimating the success probability p of a Bernoulli trial (e.g., a biased coin) when the available data consist of only a few tosses. Traditional textbook treatments present the sample proportion k/n as the point estimate and construct a confidence interval using the normal approximation (the Wald interval). While this approach works reasonably well when the number of trials n is large, the authors demonstrate that it becomes unreliable for small n, producing understated uncertainty, confidence intervals with poor coverage, and pathological behavior when the observed successes number either zero or n.
The authors first quantify the shortcomings of the Wald method. They show analytically that the plug‑in variance estimator p̂(1 − p̂)/n, with p̂ = k/n, underestimates the true variance for small samples, and they provide Monte‑Carlo simulations illustrating that the nominal 95 % coverage can drop below 80 % for n ≤ 10, especially when the true p is near 0 or 1. Moreover, when k = 0 or k = n the estimated standard error vanishes, so the Wald interval collapses to a single point and asserts complete certainty, which is unreasonable when a handful of tosses provides no information about the tails.
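The coverage collapse is easy to reproduce. The sketch below is not from the paper; it is a minimal stdlib‑only Python illustration that builds the Wald interval and estimates its coverage by Monte Carlo:

```python
import math
import random

def wald_interval(k, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = k / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wald_coverage(p, n, trials=20_000, seed=0):
    """Fraction of simulated samples whose Wald interval covers the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        k = sum(rng.random() < p for _ in range(n))
        lo, hi = wald_interval(k, n)
        hits += lo <= p <= hi
    return hits / trials

# For p = 0.1 and n = 10 the analytic coverage is about 0.65, far below 0.95:
# the interval is degenerate whenever k = 0, which happens with probability 0.9^10.
print(wald_coverage(0.1, 10))
```

Note that `wald_interval(0, n)` returns the zero-width interval (0, 0), which is exactly the pathology described above.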
To overcome these issues, the paper presents two exact alternatives. The Bayesian solution assumes a uniform prior Beta(1, 1) and derives the posterior distribution Beta(k+1, n‑k+1). The posterior mean, (k+1)/(n+2), known as Laplace’s rule of succession, never collapses to 0 or 1, even when k = 0 or k = n, and it provides a natural point estimate that is slightly biased toward the centre but has lower mean‑squared error for very small n. Equal‑tailed credible intervals are obtained by taking the 2.5 % and 97.5 % quantiles of the posterior; their Bayesian coverage is exact by construction, and their frequentist coverage stays close to the nominal level even for tiny samples.
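A minimal sketch of the Bayesian recipe, again stdlib‑only and not the paper’s code: because the posterior parameters are integers here, the Beta CDF can be evaluated through the binomial‑tail identity I_x(a, b) = P(Binomial(a+b−1, x) ≥ a), and quantiles recovered by bisection.

```python
import math

def beta_cdf(x, a, b):
    """CDF of Beta(a, b) at x for integer a, b >= 1, via the identity
    I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    m = a + b - 1
    return sum(math.comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

def beta_quantile(q, a, b, tol=1e-9):
    """Invert the Beta CDF by bisection (the CDF is increasing in x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bayes_estimate(k, n, level=0.95):
    """Posterior mean and equal-tailed credible interval under a Beta(1, 1)
    prior: the posterior is Beta(k + 1, n - k + 1)."""
    a, b = k + 1, n - k + 1
    alpha = 1 - level
    mean = a / (a + b)  # Laplace's rule of succession: (k + 1) / (n + 2)
    return mean, beta_quantile(alpha / 2, a, b), beta_quantile(1 - alpha / 2, a, b)
```

For example, `bayes_estimate(0, 8)` still returns a strictly positive point estimate (0.1), unlike the MLE.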
The second alternative is the frequentist Clopper–Pearson exact interval. By inverting the binomial test, lower and upper bounds pₗ and pᵤ are defined such that P(X ≥ k | pₗ) = α/2 and P(X ≤ k | pᵤ) = α/2. This construction yields a conservative interval that always attains at least the stated confidence level, albeit with a modest increase in average width compared with the Bayesian interval.
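The test inversion can be sketched directly with a bisection search (illustrative stdlib‑only Python, not the paper’s implementation):

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); decreasing in p for fixed k < n."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def clopper_pearson(k, n, level=0.95, tol=1e-9):
    """Invert the binomial test: P(X >= k | p_lower) = alpha/2 and
    P(X <= k | p_upper) = alpha/2, each solved by bisection."""
    alpha = 1 - level

    def solve(f, target):
        # f is decreasing in p; find p with f(p) = target.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # P(X >= k | p) = alpha/2  <=>  P(X <= k - 1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper
```

The edge cases k = 0 and k = n pin the corresponding bound at 0 or 1, so the interval is always well defined.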
Beyond interval estimation, the authors examine point‑estimation refinements. While the maximum‑likelihood estimator (MLE) k/n is unbiased, its associated variance estimate is optimistic for small samples. The paper evaluates the Wilson estimator, which shrinks the proportion toward 1/2 by adjusting both numerator and denominator with the z‑score, giving (k + z²/2)/(n + z²), and the Agresti–Coull construction, which adds z²/2 ≈ 2 pseudo‑successes and the same number of pseudo‑failures before applying the Wald formula to the enlarged sample. Simulation results indicate that for n ≤ 10 the Wilson estimator achieves the lowest mean‑squared error, whereas the Agresti–Coull interval provides a robust fallback when k = 0 or k = n.
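The two adjustments can be sketched as follows (illustrative Python, z = 1.96 for 95 %; note that the Agresti–Coull point estimate coincides with the Wilson centre, the two methods differing only in their interval widths):

```python
import math

def wilson(k, n, z=1.96):
    """Wilson score interval; the midpoint (k + z**2/2) / (n + z**2) doubles
    as a point estimate shrunk toward 1/2."""
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre, centre - half, centre + half

def agresti_coull(k, n, z=1.96):
    """Add z**2/2 (about 2 for 95%) pseudo-successes and as many
    pseudo-failures, then apply the Wald formula to the enlarged sample."""
    k_adj = k + z**2 / 2
    n_adj = n + z**2
    p_adj = k_adj / n_adj  # algebraically identical to the Wilson centre
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj, p_adj - half, p_adj + half
```

Unlike the Wald formula, both remain sensible at k = 0 or k = n because the adjusted proportion never reaches 0 or 1.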
A concrete example illustrates the practical impact. With eight tosses yielding two heads, the MLE gives 0.25, the Wilson (and Agresti–Coull) point estimate ≈ 0.33, and the Bayesian posterior mean 0.30. The corresponding 95 % intervals are: Wald (−0.05, 0.55), which spills below zero; Wilson (0.07, 0.59); Bayesian credible (0.08, 0.60); and Clopper–Pearson (0.03, 0.65). Only the Bayesian and Clopper–Pearson intervals achieve the nominal coverage in repeated‑sampling simulations, confirming the inadequacy of the Wald interval for such a small data set.
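The point estimates and the Wald pathology in this example can be checked in a few lines (illustrative Python, not the paper’s code):

```python
import math

# Worked example from the text: n = 8 tosses, k = 2 heads, 95% level.
k, n, z = 2, 8, 1.96
mle = k / n                            # 0.25
laplace = (k + 1) / (n + 2)            # 0.30, Laplace's rule of succession
shrunk = (k + z**2 / 2) / (n + z**2)   # ~0.33, Wilson centre = Agresti-Coull
half = z * math.sqrt(mle * (1 - mle) / n)
print(f"Wald: ({mle - half:.2f}, {mle + half:.2f})")  # lower bound is negative
```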
In conclusion, the paper convincingly argues that the naïve use of k/n and the Wald interval is inappropriate when sample sizes are modest. It recommends the Bayesian posterior mean (or, equivalently, Laplace’s rule of succession) for point estimation and either the Bayesian credible interval or the Clopper–Pearson exact interval for uncertainty quantification. For practitioners who prefer a frequentist point estimate, the Wilson or Agresti–Coull adjustments are shown to reduce mean‑squared error without sacrificing interpretability. These findings have immediate relevance for fields such as clinical trials, quality control, and any domain where decisions must be made on the basis of a handful of binary observations.