Deep-ICE: the first globally optimal algorithm for minimizing 0-1 loss in two-layer ReLU and maxout networks
This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ the number of features. It can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data to a size manageable by our algorithm. This extension enables efficient processing of large-scale datasets and yields significantly improved performance: a 20-30% reduction in misclassifications on both training and test data compared to state-of-the-art approaches (neural networks trained by gradient descent, and support vector machines), when applied to the same model classes (two-layer networks with a fixed number of hidden nodes, and linear models).
💡 Research Summary
The paper introduces Deep‑ICE, the first algorithm that provably finds a globally optimal solution to the empirical risk minimization (ERM) problem for two‑layer ReLU and Maxout networks when the loss is the 0‑1 misclassification count. The authors start by observing that, although minimizing 0‑1 loss is NP‑hard in general, the combinatorial structure of a finite training set imposes a finite number of distinct data partitions that can be induced by hyperplanes. By leveraging the result of He & Little (2023) that every optimal separating hyperplane for a linear classifier can be expressed as the affine span of D data points, they argue that the effective search space for a two‑layer network with K hidden units consists of at most O(N^D) possible hyperplanes per unit, leading to a total of O(N^{DK}) configurations.
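To make the He & Little (2023) observation concrete, the sketch below applies it to a plain linear classifier: every candidate decision boundary is a hyperplane spanned by $D$ of the $N$ training points, so enumerating all $\binom{N}{D} = O(N^D)$ point subsets and scoring each induced partition under 0-1 loss is exhaustive. The function names and the SVD-based fitting step are illustrative assumptions, not the paper's implementation.

```python
import itertools
import numpy as np

def hyperplane_through(points):
    """Fit a hyperplane w.x + b = 0 through D points in R^D.

    Solves w.x_i + b = 0 for all i via the null space of the augmented
    matrix [X | 1]; illustrative only (He & Little, 2023, give the
    exact construction used by the paper).
    """
    X = np.asarray(points, dtype=float)
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    # The last right-singular vector spans the null space: (w, b) up to scale.
    _, _, vh = np.linalg.svd(A)
    return vh[-1, :-1], vh[-1, -1]

def best_linear_01(X, y):
    """Enumerate the O(N^D) hyperplanes spanned by D training points and
    keep the orientation/hyperplane pair minimizing 0-1 loss."""
    n, d = X.shape
    best_errs, best_model = n + 1, None
    for idx in itertools.combinations(range(n), d):
        w, b = hyperplane_through(X[list(idx)])
        margin = X @ w + b
        for s in (1.0, -1.0):  # try both orientations of the normal vector
            pred = np.where(s * margin >= 0, 1, -1)
            errs = int(np.sum(pred != y))
            if errs < best_errs:
                best_errs, best_model = errs, (s * w, s * b)
    return best_errs, best_model
```

On linearly separable data this enumeration recovers a zero-error separator; in general it returns an exact 0-1-loss minimizer over the candidate set, at the stated polynomial-in-$N$ (but exponential-in-$D$) cost.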
To explore this space efficiently, the authors develop a formalism based on list homomorphisms and fusion laws. They construct a recursive “nested combination generator” that enumerates all K‑tuples of hyperplane partitions without redundant recomputation. Two algorithmic variants are presented: (1) a sequential version that memoizes intermediate activation patterns, and (2) a divide‑and‑conquer version that splits the enumeration tree into independent sub‑tasks, enabling embarrassingly parallel execution on GPUs or multi‑core CPUs. The theoretical time complexity is shown to be O(2^{K‑1}·N^{DK+1}+N·D·D³), which improves on earlier work by Arora et al. (2016) and Hertrich (2022) that suffered from ambiguous exponential factors and large hidden constants. Moreover, the method can be extended to any computable loss function without altering the asymptotic bound, simply by plugging in the loss‑specific linear aggregation used in the He & Little framework.
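A minimal sketch of the enumeration idea (not the paper's fused list-homomorphism implementation): given a pool of candidate hidden-unit hyperplanes, recurse over the $K$ unit slots, reuse the running activation sum along each prefix so per-unit activations are computed only once, and branch over $\pm 1$ output-weight signs, which is where a $2^K$-style factor like the one in the stated bound enters. All names here are hypothetical.

```python
import numpy as np

def enumerate_k_tuples(candidates, K, X, y):
    """Recursively enumerate K-tuples of candidate hidden units together
    with +/-1 output weights, reusing the running sum of activations
    along each prefix: a simplified stand-in for the paper's memoized
    nested combination generator.
    """
    # Each candidate's ReLU activations are computed exactly once.
    acts = [np.maximum(X @ w + b, 0.0) for (w, b) in candidates]
    best_loss, best_cfg = len(y) + 1, None

    def rec(start, chosen, total):
        nonlocal best_loss, best_cfg
        if len(chosen) == K:
            pred = np.where(total > 0, 1, -1)  # classify by output sign
            loss = int(np.sum(pred != y))
            if loss < best_loss:
                best_loss, best_cfg = loss, tuple(chosen)
            return
        for i in range(start, len(acts)):  # units chosen in order: no duplicates
            for s in (1.0, -1.0):          # output-layer weight sign
                rec(i, chosen + [(i, s)], total + s * acts[i])

    rec(0, [], np.zeros(len(y)))
    return best_loss, best_cfg
```

The recursion tree also makes the divide-and-conquer variant easy to picture: subtrees rooted at different first-unit choices are independent and can be dispatched to separate workers.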
Because the worst‑case complexity still grows exponentially with the product D·K, the authors introduce a heuristic coreset selection step for large‑scale data. The coreset is a small, representative subset of the original training set (typically <1 % of N) obtained via a greedy coverage algorithm. Deep‑ICE is then applied to the coreset, and the resulting parameters are evaluated on the full dataset. This approach yields practical runtimes on datasets with tens of thousands of points while preserving most of the optimality benefits.
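The summary describes the greedy coverage rule only at a high level; the sketch below uses farthest-point (k-center) greedy selection as one plausible instantiation of picking a small, representative subset, and is an assumption rather than the paper's exact criterion.

```python
import numpy as np

def greedy_coreset(X, m, seed=0):
    """Select m representative points by farthest-point (k-center) greedy
    coverage: repeatedly add the point farthest from the current set.
    One plausible reading of a 'greedy coverage' coreset; the paper's
    actual selection rule may differ.
    """
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # Distance from every point to its nearest selected point.
    dists = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < m:
        nxt = int(np.argmax(dists))  # least-covered point so far
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)
```

With a coreset of size $m \ll N$ (the summary suggests under 1% of $N$), the exact algorithm's $O(m^{DK+1})$ cost becomes tractable, and the learned parameters are then evaluated on the full dataset.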
Empirical evaluation is conducted on synthetic benchmarks (N ≤ 200, D ≤ 5, K ≤ 4) where the exact optimum can be verified, and on real‑world datasets where the coreset technique is used. In the small‑scale regime, Deep‑ICE consistently recovers the true global optimum, confirming the correctness of the implementation. On larger datasets, the method achieves 20‑30 % lower misclassification rates compared to standard stochastic gradient descent training of the same two‑layer architecture and to linear Support Vector Machines. Notably, the authors report that the exact optimization does not lead to severe over‑fitting; test error improvements are comparable to training error gains.
The paper’s contributions are threefold: (i) a rigorously defined, globally optimal algorithm for 0‑1 loss in two‑layer ReLU/Maxout networks, (ii) two concrete implementations that balance memory usage and parallel scalability, and (iii) a practical coreset‑based extension that makes the approach feasible for moderately large problems. However, several limitations remain. The exponential dependence on D and K restricts applicability to shallow, low‑dimensional models. The coreset heuristic lacks theoretical guarantees of preserving the global optimum, and its quality depends on the greedy selection strategy. Implementation details such as CUDA kernel design, memory footprint, and hyper‑parameter settings are only briefly described, which hampers reproducibility. Moreover, the experimental comparison is limited to simple baselines; performance against modern deep architectures or advanced regularization techniques is not explored. Finally, the manuscript contains numerous typographical and formatting errors that detract from readability.
In conclusion, Deep‑ICE represents a significant step toward exact combinatorial optimization of neural networks under discrete loss functions. It provides a clear theoretical foundation, demonstrates practical feasibility through coreset reduction, and opens avenues for future work on scaling exact methods, extending to deeper networks, and integrating with other non‑convex loss landscapes.