A Selective Quantization Tuner for ONNX Models
Quantization reduces the precision of deep neural networks to lower model size and computational demands, but often at the expense of accuracy. Fully quantized models can suffer significant accuracy degradation, and resource-constrained hardware accelerators may not support all quantized operations. A common workaround is selective quantization, where only some layers are quantized while others remain at full precision. However, determining the optimal balance between accuracy and efficiency is challenging. To this end, we propose SeQTO, a framework that enables selective quantization, deployment, and execution of ONNX models on diverse CPU and GPU devices, combined with profiling and multi-objective optimization. SeQTO generates selectively quantized models, deploys them across hardware accelerators, evaluates performance on metrics such as accuracy and size, applies Pareto front-based objective minimization to identify optimal candidates, and provides visualization of results. We evaluated SeQTO on four ONNX models under two quantization settings across CPU and GPU devices. Our results show that SeQTO effectively identifies high-quality selectively quantized models, achieving up to 54.14% lower accuracy loss while retaining up to 98.18% of the size reduction of fully quantized models.
💡 Research Summary
The paper addresses the well‑known trade‑off in deep‑learning model quantization: while reducing model size and inference latency, quantization—especially full quantization—often incurs significant accuracy loss and may encounter compatibility issues on resource‑constrained accelerators. To tackle these challenges, the authors introduce SeQTO (Selective Quantization Tuner for ONNX), a comprehensive framework that automates selective post‑training quantization, deployment, profiling, and multi‑objective optimization for ONNX models across heterogeneous CPU and GPU platforms.
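The multi-objective step can be illustrated with a small sketch of Pareto-front filtering over quantization candidates, each scored on two objectives to minimize: accuracy loss and model size. The candidate names and numbers below are illustrative, not taken from the paper.

```python
# Minimal sketch of Pareto-front selection over selectively quantized
# candidates. Each candidate is (name, accuracy_loss_pct, size_mb); a
# candidate is kept if no other candidate is at least as good on both
# objectives and strictly better on one. All values here are made up.

def pareto_front(candidates):
    """Return the candidates not dominated on (accuracy_loss, size)."""
    front = []
    for name, loss, size in candidates:
        dominated = any(
            other_loss <= loss and other_size <= size
            and (other_loss < loss or other_size < size)
            for _, other_loss, other_size in candidates
        )
        if not dominated:
            front.append((name, loss, size))
    return front

candidates = [
    ("fp32",        0.0, 98.0),  # baseline: no quantization
    ("full_int8",   4.2, 25.0),  # fully quantized
    ("skip_layer3", 1.1, 27.5),  # selective: one layer kept in FP32
    ("skip_l3_l7",  0.9, 31.0),  # selective: two layers kept in FP32
    ("skip_l7",     4.5, 26.0),  # dominated by full_int8 on both axes
]
front = pareto_front(candidates)
```

Here `skip_l7` is pruned because `full_int8` is both smaller and more accurate; the remaining four candidates each trade accuracy against size and stay on the front for the user to choose from.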
SeQTO’s workflow consists of five modules. First, the Model Orchestration module fetches ONNX models from local storage or the official ONNX Model Hub, handling multiple models in parallel. Second, the Selective Quantization module leverages the ONNX Quantizer to produce partially quantized models by supplying a list of layers to exclude. The quantizer supports both static (weights, biases, and activations quantized at compile time) and dynamic (activations quantized at runtime) modes.
A crucial component is Layer Activation Analysis. After fully quantizing a model, SeQTO runs a small calibration set through both the original and fully quantized versions, computing two per‑layer error metrics: QDQ Error (the deviation of a layer’s output when quantized in isolation) and XModel Error (the layer’s contribution to the overall accuracy drop of the fully quantized model). Both metrics are normalized to
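A rough sketch of a per-layer QDQ-style error is shown below: fake-quantize one layer's weights to int8 (quantize, then dequantize), run calibration inputs through the FP32 and fake-quantized versions, and measure the output deviation. The specific error formula and the normalization across layers are our assumptions for illustration, not the paper's exact definitions.

```python
# Illustrative per-layer "QDQ error": quantize-dequantize a single layer's
# weights in isolation and measure how far its output drifts from FP32 on
# calibration inputs. Layer names, sizes, and the mean-absolute-error
# formula are illustrative assumptions, not the paper's exact metric.
import numpy as np

rng = np.random.default_rng(0)

def fake_quant_int8(w):
    """Symmetric int8 quantize-dequantize of a weight tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

def qdq_error(w, calib_x):
    """Mean absolute output deviation when only this layer is quantized."""
    y_fp32 = calib_x @ w
    y_qdq = calib_x @ fake_quant_int8(w)
    return float(np.abs(y_fp32 - y_qdq).mean())

layers = {f"layer_{i}": rng.normal(size=(16, 16)).astype(np.float32)
          for i in range(4)}
calib = rng.normal(size=(8, 16)).astype(np.float32)

errors = {name: qdq_error(w, calib) for name, w in layers.items()}
max_err = max(errors.values())
normalized = {name: e / max_err for name, e in errors.items()}
```

Layers with the highest normalized error are the natural candidates for the exclusion list, since quantizing them hurts the output most.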