Error Analysis of Matrix Multiplication Emulation Using Ozaki-II Scheme
The Ozaki-II scheme is an emulation method that leverages the Chinese Remainder Theorem to compute high-precision matrix multiplication via a sequence of low-precision matrix multiplications. In this scheme, the attainable numerical accuracy improves as the number of low-precision matrix multiplications increases. Previous numerical studies have shown that single- and double-precision matrix multiplication using the Ozaki-II scheme achieves higher throughput than standard BLAS routines on modern AI hardware equipped with fast matrix multiply-accumulate units that take INT8 inputs and accumulate in INT32. However, the accuracy of the Ozaki-II scheme can degrade when the exponent distribution of the input matrices is wide, in which case a large number of low-precision matrix multiplications is required to obtain high-precision results. In this paper, we present a rigorous deterministic error analysis of the Ozaki-II scheme. The proposed analysis not only clarifies the accuracy behavior of the method but also enables estimation of the number of low-precision matrix multiplications required to achieve a desired level of numerical accuracy.
💡 Research Summary
The paper presents a rigorous deterministic error analysis of the Ozaki‑II scheme, a Chinese Remainder Theorem (CRT)-based method for emulating high‑precision matrix multiplication (SGEMM/DGEMM) using a sequence of low‑precision INT8 matrix‑multiply‑accumulate operations. The authors first review the motivation: modern AI accelerators provide extremely fast INT8/INT32 kernels, while native FP32/FP64 performance has stagnated, prompting the development of emulation techniques that combine many low‑precision products to reconstruct a high‑precision result. Existing long‑multiplication-based schemes (e.g., Ozaki‑I) require many slices when the exponent range of the inputs is wide, and their error behavior has not been fully characterized.
The Ozaki‑II algorithm proceeds in four steps. (1) A set of pairwise‑coprime moduli $p_1, \dots, p_N$ (each $\le 256$) is chosen, and their product $P = \prod_{\ell=1}^{N} p_\ell$ is computed. (2) Diagonal scaling vectors $\mu$ and $\nu$ are determined so that the scaled integer matrices $A' = \operatorname{trunc}(\operatorname{diag}(2^\mu) A)$ and $B' = \operatorname{trunc}(B \operatorname{diag}(2^\nu))$ satisfy $|A'|\,|B'| < P\,E$, guaranteeing a unique CRT reconstruction. (3) For each modulus, the algorithm forms $A'_\ell = A' \bmod p_\ell$ and $B'_\ell = B' \bmod p_\ell$, multiplies them with an INT8 engine (producing exact integer products because the operands are bounded), reduces the product modulo $p_\ell$ to obtain $W_\ell$, and finally accumulates the weighted sum $\sum_\ell s_\ell W_\ell$, where $s_\ell = (P/p_\ell)\, q_\ell$ is approximated by two double‑precision numbers $s_{\ell 1}, s_{\ell 2}$. (4) The accumulated result is divided by $P$ (using a double‑double representation of $P$ and its reciprocal) and inverse‑scaled to produce the final matrix $C \approx AB$.
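The CRT reconstruction at the heart of these steps can be sketched in a few lines of NumPy. This toy version is not the paper's tuned implementation: the moduli, sizes, and value ranges are arbitrary illustrative choices, a plain integer matmul stands in for the INT8 engine, and the scaling step is skipped because the inputs are already integers.

```python
# Toy CRT-based integer matmul, a stand-in for the Ozaki-II pipeline.
import numpy as np

p = [251, 253, 255, 256]          # pairwise-coprime moduli, each <= 256
P = int(np.prod(p))               # P = prod p_l, must exceed 2*max|C| entrywise

rng = np.random.default_rng(0)
A = rng.integers(-50, 50, size=(4, 4))
B = rng.integers(-50, 50, size=(4, 4))

# One residue GEMM per modulus: W_l = ((A mod p_l) @ (B mod p_l)) mod p_l.
W = [((A % pl) @ (B % pl)) % pl for pl in p]

# CRT weights s_l = (P / p_l) * q_l, with q_l = (P / p_l)^{-1} mod p_l.
C = np.zeros(A.shape, dtype=object)     # Python ints avoid any overflow
for pl, Wl in zip(p, W):
    m = P // pl
    q = pow(m, -1, pl)                  # modular inverse (Python 3.8+)
    C = (C + m * q * Wl) % P

# Map residues from [0, P) back to signed integers in (-P/2, P/2].
C = np.where(C > P // 2, C - P, C)

assert np.array_equal(C.astype(np.int64), A @ B)   # exact reconstruction
```

Because each residue matrix has entries below 256, every residue GEMM fits the INT8-input/INT32-accumulate contract exactly, which is what makes the per-modulus products error-free in the real scheme.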
The core contribution is a complete error budget. Two sources are identified: (i) truncation error from the scaling step, bounded by $|A'|\,|B'| / P$; (ii) reconstruction error from the finite‑precision representation of the CRT weights and the floating‑point accumulation. Lemma 1 proves that the scaling vectors computed by Algorithm 2 always satisfy the truncation bound $\le 2 u_{32} P\,E$. Theorems 2–4 bound the reconstruction error in terms of the number of moduli $N$, the bit‑width of the weight approximations, and the unit round‑off $u_{64}$. The final relative error bound is expressed as $\varepsilon(N) = C_1\, 2^{-\beta(N)} + C_2\, u_{64}$, where $\beta(N)$ grows roughly linearly with $N$. Consequently, increasing $N$ reduces both truncation and reconstruction errors exponentially, while the cost grows only linearly because each additional modulus adds one INT8 GEMM.
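As a back-of-envelope illustration of how a bound of this shape translates into a choice of $N$: since each modulus is at most 256, $P$ grows by roughly 8 bits per modulus. The stand-in $\beta(N) \approx 8N - \text{slack}$ and the unit constants below are invented purely for illustration and are not the paper's values.

```python
# Crude N estimator from an error model of the form
#   eps(N) = C1 * 2^{-beta(N)} + C2 * u64,
# assuming (illustratively) C1 = C2 = 1 and beta(N) = 8*N - slack_bits.
U64 = 2.0 ** -53          # double-precision unit round-off

def min_moduli(target_rel_err, bits_per_modulus=8, slack_bits=16):
    """Smallest N whose modeled error eps(N) meets the target."""
    for n in range(1, 64):
        eps = 2.0 ** -(bits_per_modulus * n - slack_bits) + U64
        if eps <= target_rel_err:
            return n
    raise ValueError("target below the u64 floor of the model")

print(min_moduli(1e-7))    # FP32-level accuracy target
print(min_moduli(1e-15))   # FP64-level accuracy target
```

The $u_{64}$ term puts a hard floor on the achievable error: beyond some $N$, adding moduli no longer helps because double-precision accumulation dominates.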
Experimental validation covers matrices of varying sizes and exponent distributions (narrow, wide, random). The authors show that for typical FP32‑level accuracy (relative error ≈ 10⁻⁷) only $N = 6$–$8$ moduli are needed, while FP64‑level accuracy (≈ 10⁻¹⁵) requires $N \approx 12$. Throughput measurements on recent GPUs demonstrate that the accurate mode of Ozaki‑II outperforms native BLAS by 1.5–2.3× and incurs only a modest 10 % overhead relative to the previously proposed fast mode, confirming that the extra CRT reconstruction cost is negligible compared with the gains from using INT8 engines.
A practical outcome of the analysis is an automatic tuning framework. By inspecting the maximum absolute values and exponent ranges of the input matrices, the framework predicts the minimal $N$ needed to meet a user‑specified error target, selects the appropriate subset of pre‑computed moduli (the list in equation (10)), and computes the scaling vectors on the fly. This removes the need for manual trial‑and‑error and enables seamless integration of Ozaki‑II into high‑performance libraries.
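A hypothetical sketch of the on-the-fly scaling such a framework might perform: choose a per-row exponent $\mu_i$ from the row's largest magnitude so that $2^{\mu_i} A_{i,:}$ fills a fixed integer budget before truncation. The `bits` budget and the per-row policy here are assumptions for illustration, not Algorithm 2 from the paper.

```python
# Illustrative per-row scaling: pick mu_i so |2^{mu_i} * A[i, :]| < 2^bits.
import numpy as np

def row_scaling_exponents(A, bits):
    """Exponents mu with trunc(diag(2^mu) A) bounded by 2^bits per entry."""
    amax = np.max(np.abs(A), axis=1)
    exp = np.frexp(amax)[1]            # amax = f * 2^exp with 0.5 <= f < 1
    return bits - exp                  # then 2^mu * amax = f * 2^bits < 2^bits

rng = np.random.default_rng(1)
# Rows with deliberately different magnitudes, mimicking a wide exponent spread.
A = rng.standard_normal((3, 5)) * 10.0 ** rng.integers(-3, 4, size=(3, 1))

mu = row_scaling_exponents(A, bits=31)
A_scaled = np.trunc(np.ldexp(A, mu[:, None]))   # trunc(diag(2^mu) A)
assert np.all(np.abs(A_scaled) < 2.0 ** 31)
```

Scaling per row (and per column for $B$) rather than globally is what keeps the truncation loss small when the exponent spread is wide, which is exactly the regime the error analysis targets.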
In conclusion, the paper delivers a mathematically sound and practically useful error model for CRT‑based matrix‑multiplication emulation. It clarifies how the exponent spread of the inputs dictates the required number of low‑precision products, provides tight deterministic bounds, and validates the model with extensive benchmarks. Future work is suggested on dynamic modulus selection, extension to even lower‑precision engines (e.g., FP8, INT4), and scaling the approach to multi‑GPU or distributed environments.