DiTOX: Fault Detection and Localization in the ONNX Optimizer
The ONNX Optimizer, part of the official ONNX repository, is widely adopted for graph-level optimization of ONNX models and is applied by default in many deployment pipelines. Despite its popularity, its ability to preserve model correctness has not been systematically evaluated. We present DiTOX, an automated framework for comprehensively assessing the correctness of the ONNX Optimizer using differential testing, fault localization, and evaluation techniques that generalize to other compiler optimizers. DiTOX applies optimization passes to a corpus of ONNX models, executes both original and optimized versions on user-defined inputs, and detects discrepancies in behavior or optimizer failures. When divergences are observed, DiTOX isolates the responsible optimization pass through iterative, fine-grained analysis. We evaluated DiTOX on 130 models from the ONNX Model Hub spanning vision and language tasks. We found that 9.2% of model instances crashed the optimizer or produced invalid models under default settings. Moreover, output discrepancies occurred in 30% of classification models and in 16.6% of object detection and segmentation models, while text-based models were largely robust. Overall, DiTOX uncovered 15 issues, 14 of them previously unknown, affecting 9 of the 47 optimization passes as well as the optimizer infrastructure. All issues were reported to the ONNX Optimizer developers. Our results demonstrate that DiTOX provides a simple and effective approach for validating AI model optimizers and is readily extensible beyond ONNX.
💡 Research Summary
The paper introduces DiTOX, an automated framework designed to assess the functional correctness of the ONNX Optimizer—a widely used graph‑level optimizer for ONNX models. While the optimizer is integrated into the official ONNX repository and applied by default in many deployment pipelines, its impact on model accuracy and runtime stability had not been systematically examined. DiTOX addresses this gap by combining differential testing with fine‑grained fault localization at the level of individual optimization passes.
The authors first assembled a corpus of 130 real‑world ONNX models from the ONNX Model Hub, covering three major domains: image classification (≈70 models), object detection/semantic segmentation (≈30 models), and natural‑language tasks such as text comprehension and generation (≈30 models). For each model, DiTOX runs two inference pipelines in parallel: one on the original model and one on the model processed by the ONNX Optimizer under two configurations—(i) the default "full" optimization (which applies all fuse and eliminate passes) and (ii) a custom configuration that can enable any subset of the 47 available passes. Inference is performed with ONNX Runtime on identical input datasets, and outputs are compared using metrics appropriate to the task: Kendall's τ for ranked classification outputs, IoU, mAP, and F1 for detection/segmentation, and BLEU for text generation.
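The ranked-output comparison for classification models can be sketched in pure Python. The function names, the tau-a variant, and the 0.99 threshold below are illustrative assumptions, not DiTOX's actual API; the paper only states that Kendall's τ is used for ranked classification outputs.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two equal-length rankings (no ties assumed)."""
    assert len(rank_a) == len(rank_b)
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        a = rank_a[i] - rank_a[j]
        b = rank_b[i] - rank_b[j]
        if a * b > 0:
            concordant += 1   # pair ordered the same way in both rankings
        elif a * b < 0:
            discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

def rankings_diverge(logits_orig, logits_opt, tau_threshold=0.99):
    """Flag a discrepancy when the class rankings induced by the original
    and optimized models' logits disagree beyond the threshold."""
    rank = lambda xs: [sorted(xs, reverse=True).index(x) for x in xs]
    tau = kendall_tau(rank(logits_orig), rank(logits_opt))
    return tau < tau_threshold
```

In a full harness, `logits_orig` and `logits_opt` would come from two ONNX Runtime sessions fed the same input; here the metric is isolated so it can be checked on plain lists.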
When a discrepancy is observed—either a crash, an invalid ONNX graph, or a statistically significant deviation in output—DiTOX initiates a fault‑localization loop. It iteratively re‑optimizes the model with a single pass at a time, re‑runs inference, and checks whether the discrepancy persists. The first pass that reproduces the problem is flagged as the faulty transformation. All findings are recorded in structured JSON reports, including the pass name, error type, and quantitative metrics.
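The localization loop described above can be sketched as follows. The `reproduces` oracle stands in for the optimize/re-run/compare cycle, and both function names and the JSON field names are hypothetical stand-ins for DiTOX's internals, modeled on the paper's description.

```python
import json

def localize_faulty_pass(pass_names, reproduces):
    """Re-optimize the model with one pass at a time and return the first
    pass whose isolated application still triggers the discrepancy.

    `reproduces(passes)` is an oracle that applies only the given passes,
    re-runs inference, and reports whether the crash, invalid graph, or
    output deviation persists.
    """
    for name in pass_names:
        if reproduces([name]):
            return name
    return None  # discrepancy may only emerge from a pass interaction

def build_report(model_name, faulty_pass, error_type, metrics):
    """Assemble a structured JSON record of a detected discrepancy."""
    return json.dumps({
        "model": model_name,
        "pass": faulty_pass,
        "error_type": error_type,
        "metrics": metrics,
    })
```

Note that a first-match scan only localizes faults reproducible by a single pass; a bug that appears only when passes interact would need a subset-search strategy such as delta debugging, which the paper lists as future work.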
Experimental results reveal that 9.2% of model instances cause the optimizer to crash or emit malformed graphs. Output inconsistencies are observed in 30% of classification models and 16.6% of detection/segmentation models, while text‑based models remain largely unaffected. Overall, DiTOX uncovered 15 distinct bugs, 14 of which were previously unknown. These bugs affect nine of the 47 optimizer passes, spanning issues such as incorrect value references, mismatched input/output tensor shapes, and graph‑structure violations. The authors reported all bugs to the official ONNX Optimizer repository, providing reproducible steps and detailed diagnostics.
Beyond the empirical findings, the paper contributes a reusable methodology that can be applied to other AI compilers (e.g., TVM, MLIR). The modular architecture—comprising a Model Orchestrator, Optimizer Module, Runner Module, and Metrics Comparison Module—facilitates extension to new model types, additional metrics, and more sophisticated localization techniques such as data‑flow analysis or delta debugging. The authors discuss future work, including automated root‑cause extraction (e.g., graph diff visualizations) and integration of performance regression testing to ensure that optimizations improve speed without sacrificing accuracy.
In summary, DiTOX demonstrates that systematic, automated differential testing combined with per‑pass fault localization is an effective strategy for validating the correctness of deep‑learning model optimizers. The framework not only identified concrete defects in a widely used tool but also provided a blueprint for continuous, community‑driven quality assurance of AI compilation pipelines.