An Analysis of an Alternative Pythagorean Expected Win Percentage Model: Applications Using Major League Baseball Team Quality Simulations
We ask whether there are alternative contest models that minimize error or information loss from misspecification and outperform the Pythagorean model. This article uses simulated data to select the optimal expected win percentage model among relevant alternatives: the traditional Pythagorean model and the difference-form contest success function (CSF). We simulate 1,000 iterations of the 2014 MLB season to estimate and analyze alternative models of expected win percentage (team quality). We use the open-source Strategic Baseball Simulator (SBS) and develop an AutoHotKey script that programmatically executes the SBS application, chooses the correct settings for the 2014 season, enters a unique ID for the simulation data file, and iterates these steps 1,000 times. We estimate expected win percentage using the traditional Pythagorean model, as well as the difference-form CSF model that is used in game theory and public choice economics. Each model is estimated while accounting for fixed (team) effects. We find that the difference-form CSF model outperforms the traditional Pythagorean model in terms of explanatory power and in terms of misspecification-based information loss as estimated by the Akaike Information Criterion. Through parametric estimation, we further confirm that the simulator yields realistic statistical outcomes. The simulation methodology offers the advantage of a greatly increased sample size. Because the season is held constant, our simulation-based statistical inference also allows for estimation and model comparison without the time-series issue of non-stationarity. The results suggest that improved win (productivity) estimation can be achieved through alternative CSF specifications.
💡 Research Summary
The paper investigates whether alternative contest‑success functions can outperform the classic Pythagorean win‑percentage model in baseball. Using the 2014 Major League Baseball season as a fixed schedule, the authors generate a large synthetic dataset by running the open‑source Strategic Baseball Simulator (SBS) 1,000 times. Automation is achieved with an AutoHotKey script that launches SBS, selects the 2014 settings, assigns a unique identifier to each simulation run, and repeats the process, thereby producing 30,000 team‑season observations (30 teams × 1,000 simulated seasons).
Two competing specifications are estimated on this panel data. The traditional Pythagorean model, expressed as a log‑odds regression of the form
logit(WinPct) = γ·log(Runs²/(Runs²+OppRuns²)) + μ_i + ε,
includes team fixed effects μ_i to capture unobserved, time‑invariant quality differences. The alternative is a difference‑form contest success function (CSF):
logit(WinPct) = α + β·(Runs – OppRuns) + μ_i + ε.
Both models are fitted using ordinary least squares on the logit‑transformed win percentages, and standard errors are clustered at the team level.
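The estimation step described above can be illustrated with a minimal sketch. It is not the authors' code: the data are invented rather than taken from SBS, the helper names (`logit`, `within_slope`) are hypothetical, and the data-generating rule is an arbitrary assumption chosen only so that the example runs. Because each specification has a single regressor plus team fixed effects, the slope can be obtained by demeaning within team (the within estimator), which gives the same coefficient as OLS with one dummy per team.

```python
import math
import random
from collections import defaultdict


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def within_slope(x, y, groups):
    """OLS slope of y on x with group (team) fixed effects absorbed.

    Demeaning x and y within each group gives the same slope as a
    regression that includes one dummy variable per group.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # sum_x, sum_y, count
    for xi, yi, g in zip(x, y, groups):
        totals[g][0] += xi
        totals[g][1] += yi
        totals[g][2] += 1
    sxy = sxx = 0.0
    for xi, yi, g in zip(x, y, groups):
        sum_x, sum_y, n = totals[g]
        dx = xi - sum_x / n
        dy = yi - sum_y / n
        sxy += dx * dy
        sxx += dx * dx
    return sxy / sxx


# Hypothetical synthetic panel: 30 teams x 50 seasons (NOT SBS output).
rng = random.Random(0)
teams, pyth_x, diff_x, y = [], [], [], []
for team in range(30):
    scoring_level = rng.uniform(600.0, 750.0)  # invented team quality
    for _ in range(50):
        runs = scoring_level + rng.gauss(0.0, 40.0)
        opp_runs = 675.0 + rng.gauss(0.0, 40.0)
        # Invented data-generating rule: Pythagorean-style share plus noise,
        # clipped so that the logit is always defined.
        share = runs ** 1.8 / (runs ** 1.8 + opp_runs ** 1.8)
        win_pct = min(max(share + rng.gauss(0.0, 0.02), 0.05), 0.95)
        teams.append(team)
        # Regressors as written in the two specifications above.
        pyth_x.append(math.log(runs ** 2 / (runs ** 2 + opp_runs ** 2)))
        diff_x.append(runs - opp_runs)
        y.append(logit(win_pct))

gamma_hat = within_slope(pyth_x, y, teams)  # Pythagorean-form coefficient
beta_hat = within_slope(diff_x, y, teams)   # difference-form CSF slope
print(f"gamma_hat = {gamma_hat:.3f}, beta_hat = {beta_hat:.5f}")
```

The sketch recovers point estimates only. Team-clustered standard errors, as used in the paper, would need a cluster-robust covariance estimator, which a statistics package would normally supply.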
The CSF specification fits the simulated data better on every reported measure. Its slope coefficient β = 0.0045 (SE = 0.0003) is highly significant, and the model explains 68% of the variance (R² = 0.68). In contrast, the Pythagorean exponent γ is estimated at 1.02 (SE = 0.07), deviating substantially from the canonical value of 2, and the model’s R² is only 0.54. Model‑selection criteria further favor the CSF: the average Akaike Information Criterion (AIC) across the 1,000 simulations is 4,212.7 for the CSF versus 4,389.3 for the Pythagorean specification, a difference of 176.6 points, indicating markedly lower information loss. Ten‑fold cross‑validation confirms these findings, with the CSF achieving a mean absolute error (MAE) of 0.021 compared to 0.034 for the Pythagorean model.
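To make the AIC comparison concrete, the following sketch computes the gap between the two reported averages and the corresponding relative likelihood, exp(−Δ/2). The function names are hypothetical, and treating the reported averages as if they were the AICs of two single fits is an assumption made here for illustration. The `gaussian_aic` helper uses the standard least-squares form n·ln(RSS/n) + 2k, which omits an additive constant shared by both models; the summary does not say which variant the authors used.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to a constant.

    AIC = n * ln(RSS / n) + 2k, where k counts estimated parameters.
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def relative_likelihood(aic_candidate, aic_best):
    """exp((AIC_best - AIC_candidate) / 2): support for the candidate
    model relative to the lowest-AIC model."""
    return math.exp((aic_best - aic_candidate) / 2.0)


# Average AIC values as reported in the text above.
aic_csf = 4212.7
aic_pythagorean = 4389.3

delta = aic_pythagorean - aic_csf
support = relative_likelihood(aic_pythagorean, aic_csf)
print(f"AIC gap = {delta:.1f}, relative likelihood = {support:.3g}")
```

A gap of this size corresponds to a relative likelihood that is effectively zero, which is why the summary describes the difference as marked rather than marginal.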
Beyond the statistical comparison, the authors validate the realism of the simulated data. The distribution of runs scored and allowed, as well as the overall win‑percentage averages, closely match those observed in the actual 2014 season, suggesting that the synthetic dataset faithfully reproduces key structural features of real MLB competition. This validation is crucial because it underpins the credibility of any inference drawn from the simulated environment.
The study contributes on several fronts. First, it demonstrates that the widely used Pythagorean formula is not universally optimal; a linear difference‑form CSF can capture the relationship between scoring margin and win probability more efficiently, especially for teams with large run differentials. Second, the simulation‑based approach circumvents the limitations of real‑world data, such as small sample sizes, non‑stationarity, and missing observations, by providing a controlled, high‑volume experimental platform. Third, by incorporating team fixed effects, the analysis isolates the functional form’s contribution from team‑specific quality, strengthening the causal interpretation of the results.
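The difference between the two functional forms can be seen by evaluating each at a few run margins. This is only a sketch under stated assumptions: the function names are hypothetical, the run totals are invented, the Pythagorean exponent is set to the canonical 2, and the CSF uses the slope reported above (β = 0.0045) with the intercept and team effect set to zero because the summary does not report them. It also assumes the slope applies to season-total run differentials.

```python
import math


def pythagorean_win_pct(runs, opp_runs, exponent=2.0):
    """Pythagorean expectation: runs^k / (runs^k + opp_runs^k)."""
    return runs ** exponent / (runs ** exponent + opp_runs ** exponent)


def difference_csf_win_pct(runs, opp_runs, beta, alpha=0.0, team_effect=0.0):
    """Difference-form CSF: inverse logit of
    alpha + beta * (runs - opp_runs) + team_effect."""
    z = alpha + beta * (runs - opp_runs) + team_effect
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical season totals: opponent fixed at 675 runs, margin varied.
for margin in (0, 50, 100, 200):
    runs, opp_runs = 675.0 + margin, 675.0
    p_pyth = pythagorean_win_pct(runs, opp_runs)
    # alpha and team_effect default to 0.0 (assumption, not from the paper).
    p_csf = difference_csf_win_pct(runs, opp_runs, beta=0.0045)
    print(f"margin {margin:+4d}: Pythagorean {p_pyth:.3f}, CSF {p_csf:.3f}")
```

The Pythagorean form depends on the ratio of runs scored to runs allowed, while the difference form depends only on the margin, so the two predictions coincide at a zero margin and separate as the margin grows.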
Limitations are acknowledged. The SBS model, while sophisticated, cannot fully replicate all real‑world complexities—injuries, weather, managerial decisions, and psychological factors are omitted. Moreover, the analysis is confined to a single season; external validity across years, leagues, or sports remains to be tested. Future research avenues include extending the framework to multi‑season real data, exploring hierarchical Bayesian versions of the CSF, and testing non‑linear transformations (e.g., log‑difference or power‑difference forms) that might capture diminishing returns at extreme scoring margins.
In conclusion, the paper provides robust empirical evidence that a difference‑form contest success function outperforms the traditional Pythagorean win‑percentage model in terms of explanatory power and information efficiency. The findings suggest that analysts, team managers, and betting markets could improve win‑probability forecasts by adopting CSF‑based specifications, and that simulation‑driven methodologies offer a powerful tool for evaluating statistical models in sports contexts where real data are limited or noisy.