Generalized Optimal Classification Trees: A Mixed-Integer Programming Approach
Global optimization of decision trees is a long-standing challenge in combinatorial optimization, yet such models play an important role in interpretable machine learning. Although the problem has been investigated for several decades, only recent advances in discrete optimization have enabled practical algorithms for solving optimal classification tree problems on real-world datasets. Mixed-integer programming (MIP) offers a high degree of modeling flexibility, and we therefore propose a MIP-based framework for learning optimal classification trees under nonlinear performance metrics, such as the F1-score, that explicitly addresses class imbalance. To improve scalability, we develop problem-specific acceleration techniques, including a tailored branch-and-cut algorithm, an instance-reduction scheme, and warm-start strategies. We evaluate the proposed approach on 50 benchmark datasets. The computational results show that the framework can efficiently optimize nonlinear metrics while achieving strong predictive performance and reduced solution times compared with existing methods.
💡 Research Summary
The paper tackles the long‑standing problem of globally optimal decision‑tree learning, with a particular focus on imbalanced classification tasks where nonlinear performance measures such as the F1‑score, Matthews correlation coefficient (MCC), and the FoM index are more appropriate than simple misclassification error. The authors propose a unified mixed‑integer programming (MIP) framework that can directly optimize any metric that can be expressed as a rational function of the entries of the confusion matrix. By introducing auxiliary variables and linearization tricks (big‑M constraints, piecewise linear approximations, and absolute‑value reformulations), the originally nonlinear objectives are transformed into a set of linear constraints, allowing state‑of‑the‑art MIP solvers to handle them without resorting to surrogate loss functions.
A key scalability innovation is the “weighted flow” formulation (WFlowOCT). The training set is first compressed into a set of unique feature‑label pairs; each unique instance i is assigned an integer weight w_i equal to its multiplicity in the original data. This aggregation dramatically reduces the number of flow variables and constraints, especially for high‑dimensional binary data where many rows are duplicated. The flow network is identical to the one used in FlowOCT, but the objective now incorporates the instance weights, preserving optimality while shrinking the problem size.
To further accelerate solution times, the authors design a problem‑specific branch‑and‑cut algorithm. Feature‑aware conflict cuts prevent contradictory splits at the same node, and path‑based cuts prune infeasible routing patterns early in the branch‑and‑bound tree. A warm‑start strategy feeds solutions from fast heuristics (CART, BinOCT) and from a previous Benders‑OCT run into the MIP solver, sharply reducing the initial optimality gap. The overall algorithm also employs a Benders decomposition: the master problem decides the branching structure, while the subproblem solves a weighted‑flow problem that respects the current structure. This decomposition allows the method to handle datasets with up to 245 057 samples and 200 features.
The experimental evaluation covers 50 publicly available benchmark datasets ranging from 100 to 245 057 observations, with class imbalance ratios up to 1:100. Trees of depth D = 2–4 are learned. Compared with the leading Benders‑OCT method and recent dynamic‑programming approaches that can only handle linear objectives, the proposed framework achieves:
- Speed: Average solution times are reduced by 30–60 %, with the largest instances solved within two hours, whereas competing methods often exceed the time limit.
- Predictive quality: When optimizing F1‑score or MCC, the resulting trees improve these metrics by 2–5 percentage points on average; for highly imbalanced data the MCC gain reaches up to 7 percentage points.
- Model size: The number of leaf nodes required to achieve the reported performance is comparable to or smaller than that of baseline methods, preserving interpretability.
- Ablation: Removing warm‑start or the custom cuts leads to 20–40 % longer runtimes, confirming the importance of each acceleration component.
The authors also discuss how the linearization scheme can be extended to other nonlinear measures such as ROC‑AUC or PR‑AUC, and they note that the framework is compatible with additional constraints (e.g., fairness, monotonicity). In summary, the paper delivers a practically viable, mathematically rigorous approach for globally optimal classification‑tree learning under nonlinear, imbalance‑sensitive objectives, advancing both the theory and application of interpretable machine learning.