Power Analysis is Essential: High-Powered Tests Suggest Minimal to No Effect of Rounded Shapes on Click-Through Rates

Underpowered studies (statistical power below 50%) suffer from the winner's curse: a statistically significant result must exaggerate the true treatment effect to meet the significance threshold. A study by Dipayan Biswas, Annika Abell, and Roger Chacko published in the Journal of Consumer Research (2023) reported that in an A/B test, simply rounding the corners of square buttons increased the online click-through rate by 55% (p-value 0.037), a striking finding with potentially wide-ranging implications for a digital industry seeking to enhance consumer engagement. Drawing on our experience with tens of thousands of A/B tests, many involving similar user-interface modifications, we found this dramatic claim implausibly large. To evaluate the claim and provide a more accurate estimate of the treatment effect, we conducted three high-powered A/B tests, each involving over two thousand times more users than the original study. All three experiments yielded effect-size estimates roughly two orders of magnitude smaller than initially reported, with 95% confidence intervals that include zero; that is, the effects are not statistically significant at the 0.05 level. Two additional independent replications by Evidoo found similarly small effects. These findings underscore the critical importance of power analysis and experimental design in increasing the trustworthiness and reproducibility of results.


💡 Research Summary

The paper critically re-examines the striking claim made by Biswas, Abell, and Chacko (2023) that simply rounding the corners of square CTA buttons yields a 55% lift in click-through rate (CTR). The original evidence rests on two online studies: Study 2, an A/B test with only 919 visits that reported a statistically significant 55% increase (p = 0.037), and Study 1, a "field experiment" run through Google Ads that suffered from non-random assignment and an extreme sample-ratio mismatch (SRM) of 44% vs. 56%. The authors argue that these designs are severely underpowered (power well below 50%), exposing the results to the "winner's curse": the tendency for significant findings in low-power studies to overestimate the true effect.

To address these methodological flaws, the authors conducted three large-scale, high-powered replications. Two were run on Coop's e-commerce sites and one on SeaWorld® Orlando, involving 2.8 million, 2.2 million, and 1.9 million users respectively, over two thousand times the sample size of the original test. Each experiment was pre-registered with the American Economic Association's Registry for Randomized Controlled Trials, used a simple binary treatment (square vs. rounded corners), measured the same primary metric (CTR), and avoided any post-hoc filtering or transformations. Minimum Detectable Effects (MDEs) were set conservatively at 2% for the first two experiments and 0.5% for the third, based on industry benchmarks.

The results were strikingly modest: estimated lifts of 0.16% (p = 0.20), 0.29% (p = 0.60), and 0.73% (p = 0.09). All 95% confidence intervals included zero, indicating no statistically significant effect. Two independent replications performed by Evidoo reported similarly tiny, non-significant lifts (all under 1%). These findings contrast sharply with the original 55% claim, suggesting that the earlier result was an artefact of low power, sampling variability, and design flaws.
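To make the inference behind these intervals concrete, the sketch below computes a normal-approximation 95% confidence interval for the difference in click-through rates between two variants. The summary does not report the underlying click counts, so the inputs here are purely hypothetical; the point is only that, at sample sizes in the millions, a difference of a few hundredths of a percentage point can still yield an interval that straddles zero.

```python
from math import sqrt

def ctr_diff_ci(clicks_ctrl: int, users_ctrl: int,
                clicks_treat: int, users_treat: int,
                z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for the difference in CTR (treatment minus control)."""
    p_c = clicks_ctrl / users_ctrl
    p_t = clicks_treat / users_treat
    diff = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / users_ctrl + p_t * (1 - p_t) / users_treat)
    return diff - z * se, diff + z * se

# Hypothetical counts for illustration only (not the paper's data):
lo, hi = ctr_diff_ci(clicks_ctrl=98_000, users_ctrl=1_400_000,
                     clicks_treat=98_160, users_treat=1_400_000)
print(f"95% CI for CTR difference: [{lo:+.5f}, {hi:+.5f}]")  # interval includes zero
```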

The authors supplement their empirical work with a comprehensive review of industry-wide A/B test repositories (GoodUI.org, Evidoo), large-company experiment archives (Microsoft/Bing, Airbnb, Analytics Toolkit), and expert elicitation from practitioners. Across thousands of experiments, typical CTR lifts are well below 1%, with median lifts around 0.1% and rarely exceeding 0.3%. Even the most successful patterns in the cited repositories achieve at most a 6–7% relative lift, far smaller than the 55% reported by the original study. This contextual evidence underscores how implausible the original effect size is.

A detailed power analysis illustrates why the original study could not have reliably detected the effect it claimed. Using the standard formula n = 16σ²/Δ² for a two-sided test with α = 0.05 and 80% power, the authors show that detecting a 2% lift would require roughly 520k users per variant; detecting a 5% lift still needs over 80k per variant; and even a 10% lift demands more than 20k per variant. The original experiment's 474 vs. 445 visits fall dramatically short of these thresholds, making any claim of a 55% lift statistically untenable.
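As a quick illustration of that rule of thumb, the sketch below applies n = 16σ²/Δ² to a binary metric, where σ² ≈ p(1 − p) and Δ is the absolute lift to be detected. The baseline CTR is not stated in this summary, so the 7% used here is an assumption; under that assumption the formula gives roughly 530k, 85k, and 21k users per variant, broadly in line with the figures quoted above.

```python
# Rule-of-thumb sample size per variant (two-sided alpha = 0.05, 80% power):
#   n = 16 * sigma^2 / delta^2, with sigma^2 ~= p * (1 - p) for a binary metric.

def users_per_variant(baseline_ctr: float, relative_lift: float) -> int:
    """Approximate users needed per variant to detect a given relative CTR lift."""
    delta = baseline_ctr * relative_lift            # absolute CTR difference to detect
    variance = baseline_ctr * (1.0 - baseline_ctr)  # Bernoulli variance at baseline
    return round(16.0 * variance / delta ** 2)

# Assumed baseline CTR of 7% (hypothetical; the summary does not report the actual baseline).
for lift in (0.02, 0.05, 0.10):
    print(f"{lift:.0%} relative lift -> ~{users_per_variant(0.07, lift):,} users per variant")
```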

The paper also highlights methodological nuances specific to online experiments: the need to randomize at the user level, the danger of treating visits as independent observations, and the importance of monitoring SRM in real time to avoid allocation bugs that can invalidate results. The authors note that Google Optimize's "single-user exposure" guarantee does not eliminate the need for rigorous statistical handling of repeated visits.
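A common way to monitor SRM is a chi-square goodness-of-fit test of the observed allocation counts against the intended split; the sketch below shows one minimal version. The absolute counts are hypothetical, chosen only to mirror the 44% vs. 56% allocation attributed to the original Study 1.

```python
from scipy.stats import chisquare

def srm_pvalue(n_control: int, n_treatment: int, expected_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit test for sample-ratio mismatch (SRM).

    A very small p-value (e.g. < 0.001) indicates that the observed allocation
    deviates from the intended split, suggesting a broken randomization."""
    total = n_control + n_treatment
    expected = [total * (1.0 - expected_share), total * expected_share]
    _, p_value = chisquare(f_obs=[n_control, n_treatment], f_exp=expected)
    return p_value

# Hypothetical counts mirroring a 44% / 56% split under an intended 50/50 allocation.
print(srm_pvalue(n_control=4_400, n_treatment=5_600))  # vanishingly small p-value -> SRM flag
```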

In conclusion, the authors demonstrate that when properly powered and rigorously designed, the effect of rounding button corners on CTR is essentially negligible. Their work serves as a cautionary tale about the perils of under‑powered A/B tests, the necessity of pre‑experiment power calculations, and the value of replication in the fast‑moving digital product space. The study advocates for industry‑wide adoption of robust experimental design standards to ensure that actionable insights are both reliable and reproducible.

