The Impact of Formations on Football Matches Using Double Machine Learning. Is it worth parking the bus?

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

This study addresses a central tactical dilemma for football coaches: whether to employ a defensive strategy, colloquially known as “parking the bus”, or a more offensive one. Using an advanced Double Machine Learning (DML) framework, this project provides a robust and interpretable tool to estimate the causal impact of different formations on key match outcomes such as goal difference, possession, corners, and disciplinary actions. Leveraging a dataset of over 22,000 matches from top European leagues, formations were categorized into six representative types based on tactical structure and expert consultation. A major methodological contribution lies in the adaptation of DML to handle categorical treatments, specifically formation combinations, through a novel matrix-based residualization process, allowing for a detailed estimation of formation-versus-formation effects that can inform a coach’s tactical decision-making. Results show that while offensive formations like 4-3-3 and 4-2-3-1 offer modest statistical advantages in possession and corners, their impact on goals is limited. Furthermore, no evidence supports the idea that defensive formations, commonly associated with parking the bus, increase a team’s winning potential. Additionally, red cards appear unaffected by formation choice, suggesting other behavioral factors dominate. Although this approach does not fully capture all aspects of playing style or team strength, it provides a valuable framework for coaches to analyze tactical efficiency and sets a precedent for future research in sports analytics.


💡 Research Summary

The paper tackles a long‑standing tactical question in football: does adopting a highly defensive “park the bus” formation improve a team’s chances of winning, or are more attacking setups more effective? To answer this, the authors apply Double Machine Learning (DML), a causal inference method, to a dataset of over 22,000 matches from the top five European leagues between 2010 and 2022.

Data and Treatment Definition
Each match is labeled with one of six representative formations (4‑4‑2, 4‑3‑3, 4‑2‑3‑1, 3‑5‑2, 5‑3‑2, 3‑4‑3) based on the starting lineup and expert consultation. The six formations serve as categorical treatment variables, allowing the authors to estimate not only the effect of “defensive” versus “offensive” groups but also pairwise formation‑to‑formation effects. Outcome variables are four key performance indicators: goal difference, possession percentage, number of corners, and number of red cards.
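
To make the categorical treatment concrete, here is a minimal sketch of one plausible encoding of formation matchups as a one‑hot matrix. It assumes the treatment is the pairing of a team's formation with its opponent's formation; the helper names `encode_matchups` and `matchup_label` are hypothetical, and the summary does not confirm that the authors encode their data this way.

```python
import numpy as np

# The six representative formations named in the summary.
FORMATIONS = ["4-4-2", "4-3-3", "4-2-3-1", "3-5-2", "5-3-2", "3-4-3"]
INDEX = {name: i for i, name in enumerate(FORMATIONS)}


def encode_matchups(team, opponent):
    """Return an (n_matches, 36) one-hot matrix of formation pairings.

    Assumption: each pairing (team formation, opponent formation) is one
    treatment level, giving 6 * 6 = 36 columns.
    """
    k = len(FORMATIONS)
    codes = np.array([INDEX[a] * k + INDEX[b] for a, b in zip(team, opponent)])
    return np.eye(k * k)[codes]


def matchup_label(column):
    """Map a column index of the one-hot matrix back to a readable label."""
    k = len(FORMATIONS)
    return f"{FORMATIONS[column // k]} vs {FORMATIONS[column % k]}"


# Two illustrative matches (made-up lineups, not data from the paper).
T = encode_matchups(["4-3-3", "5-3-2"], ["5-3-2", "4-4-2"])
print(T.shape)
print(matchup_label(int(T[0].argmax())))
```

Each row contains a single 1, so the matrix can be residualized column by column against the covariates in the same way as a binary treatment.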

Methodological Innovation
Standard DML is typically applied to continuous or binary treatments; the authors extend it to a multi‑level categorical treatment by encoding the treatment as a one‑hot matrix and residualizing that matrix alongside the outcome. First, a machine‑learning model (gradient boosting) predicts the treatment matrix from a rich, high‑dimensional set of covariates (team market value, player ages, recent form, home/away status, weather, etc.). The treatment residuals (actual minus predicted) capture the part of formation choice that the covariates do not explain. Second, a separate model predicts each outcome from the same covariates, producing outcome residuals. Finally, a linear regression of the outcome residuals on the treatment residuals estimates the average treatment effect (ATE) for each formation comparison. These estimates are free of confounding bias only to the extent that the covariates capture the factors driving both formation choice and match outcomes.
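
The residual‑on‑residual procedure described above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the authors' implementation: a plain least‑squares model stands in for gradient boosting so the example needs only NumPy, the data are simulated with three treatment levels instead of six formations, and the function names (`fit_predict_linear`, `dml_categorical`, `simulate`) are hypothetical.

```python
import numpy as np


def fit_predict_linear(X_train, Y_train, X_test):
    """Least-squares nuisance model with intercept.

    Assumption: a simple stand-in for the gradient-boosting learner.
    """
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, Y_train, rcond=None)
    return A_test @ coef


def dml_categorical(X, t, y, n_levels, n_folds=5, seed=0):
    """Cross-fitted residual-on-residual effect of each level vs. level 0."""
    rng = np.random.default_rng(seed)
    D = np.eye(n_levels)[t][:, 1:]  # one-hot matrix, reference level dropped
    folds = rng.permutation(len(y)) % n_folds
    D_res = np.zeros_like(D)
    y_res = np.zeros(len(y))
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Residualize treatment and outcome using models fit on other folds.
        D_res[test] = D[test] - fit_predict_linear(X[train], D[train], X[test])
        y_res[test] = y[test] - fit_predict_linear(X[train], y[train], X[test])
    # Final stage: linear regression of outcome residuals on treatment residuals.
    theta, *_ = np.linalg.lstsq(D_res, y_res, rcond=None)
    return theta


def simulate(n=20000, seed=1):
    """Synthetic matches in which team strength X[:, 0] drives both the
    treatment choice and the outcome (i.e. it is a confounder)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    logits = np.column_stack([np.zeros(n), X[:, 0], -X[:, 0]])
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    u = rng.random(n)
    t = (u[:, None] > np.cumsum(probs, axis=1)[:, :2]).sum(axis=1)
    true_theta = np.array([0.5, -0.3])  # made-up effects vs. level 0
    y = 2.0 * X[:, 0] + np.eye(3)[t][:, 1:] @ true_theta + rng.normal(size=n)
    return X, t, y, true_theta


X, t, y, true_theta = simulate()
theta_hat = dml_categorical(X, t, y, n_levels=3)
naive = np.array([y[t == j].mean() - y[t == 0].mean() for j in (1, 2)])
print("true effects :", true_theta)
print("DML estimate :", theta_hat.round(3))
print("naive diff   :", naive.round(3))
```

In this simulation the naive difference in means absorbs the effect of team strength, while the residualized estimate should land close to the simulated effects. With real match data, a flexible learner would replace the linear stand‑in, since the relationships between covariates, formation choice, and outcomes are unlikely to be linear.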

Key Findings

  1. Attacking Formations (4‑3‑3, 4‑2‑3‑1) – These increase average possession by roughly 3–4 percentage points and generate about 0.2–0.3 additional corners per match. However, the estimated ATE on goal difference is tiny (0.03–0.05 goals) and not statistically significant, indicating that superior ball control does not automatically translate into more goals.
  2. Defensive Formations (5‑3‑2, 3‑5‑2, etc.) – While they slightly reduce possession and corner counts, their impact on win probability (derived from goal difference) and on the incidence of red cards is essentially zero. In other words, “parking the bus” does not confer a measurable advantage in terms of match results or disciplinary risk.
  3. Red Cards – No formation shows a systematic effect on the likelihood of receiving a red card, suggesting that referee decisions are driven more by player behavior and match intensity than by tactical shape.
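
Statements such as “not statistically significant” rest on an interval around each estimated effect. The summary does not say how the authors compute uncertainty; one common choice, sketched here as an assumption, is a heteroskedasticity‑robust (HC0 sandwich) standard error for the final‑stage regression. The function name `final_stage_inference` and all numbers below are hypothetical, simulated values.

```python
import numpy as np


def final_stage_inference(D_res, y_res, z=1.96):
    """OLS of outcome residuals on treatment residuals with HC0 sandwich
    standard errors and normal-approximation 95% intervals."""
    theta, *_ = np.linalg.lstsq(D_res, y_res, rcond=None)
    eps = y_res - D_res @ theta
    bread = np.linalg.inv(D_res.T @ D_res)
    scores = D_res * eps[:, None]
    cov = bread @ (scores.T @ scores) @ bread
    se = np.sqrt(np.diag(cov))
    return theta, se, theta - z * se, theta + z * se


# Simulated residuals (not data from the paper): a small effect and a null effect.
rng = np.random.default_rng(0)
n = 5000
D_res = rng.normal(scale=0.4, size=(n, 2))
true_effects = np.array([0.04, 0.0])
y_res = D_res @ true_effects + rng.normal(scale=1.5, size=n)

theta, se, lo, hi = final_stage_inference(D_res, y_res)
for j in range(2):
    verdict = "significant" if lo[j] > 0 or hi[j] < 0 else "not significant"
    print(f"effect {j}: {theta[j]:+.3f} [{lo[j]:+.3f}, {hi[j]:+.3f}] -> {verdict}")
```

An effect is flagged as significant at roughly the 5% level only when its interval excludes zero, which is why a small point estimate with a wide interval cannot be distinguished from no effect.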

Limitations
The authors acknowledge several constraints. First, the treatment (formation) is recorded only at kickoff; in‑match tactical switches are not captured, potentially attenuating estimated effects. Second, unobserved factors such as real‑time player fatigue, in‑game coaching instructions, and opponent’s adaptive tactics may remain in the error term, limiting the causal claim to a conditional average effect. Third, the DML framework assumes that the high‑dimensional covariates sufficiently proxy team strength, but residual heterogeneity (e.g., a star striker’s form) could bias results.

Implications and Future Work
Despite these caveats, the study demonstrates that sophisticated causal‑inference tools can be fruitfully applied to sports analytics, offering coaches a quantitative lens for evaluating formation choices beyond simple descriptive statistics. The authors propose several extensions: (i) incorporating time‑varying treatments to model formation changes during a match, (ii) adding player‑level event data (passes, dribbles, pressures) to build multi‑level DML models, and (iii) employing Bayesian DML to quantify uncertainty around ATE estimates. Such developments could eventually enable real‑time tactical decision support systems that balance risk, reward, and opponent behavior.

In sum, the paper finds that while attacking formations modestly improve possession and corner counts, they do not guarantee more goals, and defensive “park‑the‑bus” setups do not increase winning odds. The methodological contribution, adapting DML to categorical, multi‑treatment settings, offers a template for rigorous, data‑driven tactical analysis in football.

