A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
Research Summary
This technical report, produced by the Exascale Computing Project's Multiprecision Focus Team, surveys state-of-the-art mixed-precision techniques across dense and sparse linear algebra, data-communication compression, preconditioning, and software-ecosystem integration. The motivation stems from the rapid adoption of low-precision arithmetic units, such as half-precision (FP16) and bfloat16 tensor cores, in modern CPUs and GPUs, driven by machine-learning workloads. These units deliver 2-4× higher throughput than single precision and up to an order of magnitude more than double precision, while also halving memory traffic.
The report begins with an overview of low-precision BLAS, focusing on NVIDIA's tensor-core-accelerated half-precision GEMM (HGEMM). Benchmarks show that HGEMM with FP32 output on tensor cores reaches roughly 100 TFLOP/s on a V100 GPU, offering both higher performance and better accuracy than pure FP16 output. Batched HGEMM implementations (e.g., in MAGMA) overcome the fixed-size restrictions of tensor cores and outperform cuBLAS for small-to-moderate matrix sizes.
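The accuracy gap between FP32 and FP16 accumulation can be illustrated outside of actual tensor-core hardware. The NumPy sketch below (not from the report) rounds the inputs to FP16 and then keeps the partial sums either in FP32, mimicking a tensor core with an FP32 accumulator, or rounds every partial sum back to FP16:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
A = rng.random((n, n))
B = rng.random((n, n))
C_ref = A @ B  # double-precision reference

# Round the inputs to FP16, as an HGEMM kernel would receive them.
Ah, Bh = A.astype(np.float16), B.astype(np.float16)

# FP32 accumulation: products of FP16 inputs, partial sums kept in FP32.
C_fp32acc = Ah.astype(np.float32) @ Bh.astype(np.float32)

# Pure FP16 output: every partial sum is rounded back to half precision.
C_fp16acc = np.zeros((n, n), dtype=np.float16)
for k in range(n):
    C_fp16acc += np.outer(Ah[:, k], Bh[k, :])

err32 = np.max(np.abs(C_fp32acc - C_ref)) / np.max(np.abs(C_ref))
err16 = np.max(np.abs(C_fp16acc - C_ref)) / np.max(np.abs(C_ref))
print(f"rel. error, FP32 accumulate: {err32:.2e}")
print(f"rel. error, FP16 accumulate: {err16:.2e}")
```

The FP32-accumulated result is limited only by the initial rounding of the inputs, while the FP16-accumulated result also absorbs a rounding error at every one of the n partial sums, which is why the report recommends the FP32-output variant.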
Mixed-precision iterative refinement is presented as a generic strategy: a single-precision (or half-precision) solve provides an inexpensive initial solution, which is then refined in double precision to reach full double-precision accuracy. Classical iterative refinement and the more robust GMRES-IR are described, together with scaling and shifting techniques that improve convergence for ill-conditioned systems.
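The classical scheme can be sketched in a few lines of NumPy (an illustration, not the report's implementation): solve in FP32, then repeatedly compute the residual in FP64 and solve a correction system in FP32. A production code would factor the matrix once in low precision and reuse the factors for every correction solve; here `np.linalg.solve` is simply called again for brevity.

```python
import numpy as np

def refine_solve(A, b, iters=3):
    """Mixed-precision iterative refinement (sketch): low-precision solves,
    double-precision residuals and updates."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in FP64
        d = np.linalg.solve(A32, r.astype(np.float32))   # cheap FP32 correction
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(1)
n = 200
A = rng.random((n, n)) + n * np.eye(n)   # well-conditioned test matrix
x_true = rng.random(n)
b = A @ x_true

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
x_ir = refine_solve(A, b)
e32 = np.linalg.norm(x32 - x_true) / np.linalg.norm(x_true)
e_ir = np.linalg.norm(x_ir - x_true) / np.linalg.norm(x_true)
print("FP32-only relative error:", e32)
print("refined relative error:  ", e_ir)
```

For a well-conditioned system a handful of corrections recovers double-precision accuracy even though every solve ran in single precision; GMRES-IR replaces the plain correction solve with a preconditioned GMRES iteration to extend this to harder problems.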
The survey continues with mixed-precision factorizations. Dense LU, Cholesky, and QR algorithms exploit FP16 panel updates while retaining double-precision accuracy through high-precision updates of the trailing submatrix. A quantized integer LU method demonstrates that 8-bit integer arithmetic, combined with appropriate scaling, can dramatically reduce memory bandwidth while delivering acceptable accuracy for certain applications.
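The scaling idea behind quantized integer kernels can be shown on a matrix-vector product (a simplified sketch, not the report's LU algorithm): each operand is scaled so its largest entry maps to 127, multiplied in integer arithmetic with a wide accumulator, and rescaled afterwards.

```python
import numpy as np

def quantize_int8(M):
    """Symmetric per-array scaling to INT8: M is approximated by Mq / scale."""
    scale = 127.0 / np.max(np.abs(M))
    return np.round(M * scale).astype(np.int8), scale

rng = np.random.default_rng(2)
A = rng.standard_normal((64, 64))
x = rng.standard_normal(64)

Aq, sA = quantize_int8(A)
xq, sx = quantize_int8(x)

# Integer product accumulated in INT32 (no overflow at these sizes),
# then rescaled back to floating point.
y_approx = (Aq.astype(np.int32) @ xq.astype(np.int32)) / (sA * sx)
y_exact = A @ x
rel_err = np.linalg.norm(y_approx - y_exact) / np.linalg.norm(y_exact)
print(f"relative error of INT8 matvec: {rel_err:.2e}")
```

The result carries roughly two correct digits while the quantized operands occupy an eighth of the FP64 storage, which is the bandwidth trade the report describes.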
Section 3 addresses data and communication compression. Mixed-precision MPI reduces network traffic by transmitting operands in low precision and reconstructing them at the receiver. Approximate FFTs with tunable accuracy-for-speed trade-offs and dynamic splitting strategies further lower communication costs in spectral methods.
Sparse linear algebra receives extensive treatment. Mixed-precision sparse LU and QR factorizations, as well as direct solvers, are shown to achieve substantial speedups by performing the bulk of their matrix-vector products in low precision. Krylov subspace methods, including Lanczos-based CG, Arnoldi-based GMRES, and their mixed-precision variants, benefit from low-precision matrix-vector kernels and high-precision residual corrections. GMRES-IR combined with low-precision preconditioners can be 2-3× faster than traditional double-precision GMRES while preserving convergence.
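One common precision split for Krylov methods, the matrix-vector product in FP32 with all vector recurrences in FP64, can be sketched with a conjugate-gradient loop (an illustrative dense example, not a kernel from the report):

```python
import numpy as np

def cg_mixed(A, b, tol=1e-8, maxit=500):
    """CG sketch: matrix-vector products in FP32, vector updates and
    residual recurrences kept in FP64."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    r = b - (A32 @ x.astype(np.float32)).astype(np.float64)
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = (A32 @ p.astype(np.float32)).astype(np.float64)  # low-precision kernel
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(4)
n = 100
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)   # SPD, well-conditioned test matrix
b = rng.standard_normal(n)
x = cg_mixed(A, b)
rr = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
print("relative residual:", rr)
```

The attainable residual is limited by the FP32 matvec error rather than FP64 roundoff, which is exactly why the surveyed methods pair low-precision kernels with high-precision residual corrections when full accuracy is required.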
Preconditioner design (Section 6) emphasizes decoupling arithmetic precision from memory precision. Multigrid smoothers can run in FP16 while coarse-grid corrections use FP32 or FP64, yielding up to 75% memory savings without sacrificing convergence.
The report also surveys the integration of mixed-precision capabilities into the xSDK ecosystem. Libraries such as Ginkgo, hypre, Kokkos Kernels, MAGMA, PETSc, PLASMA, SLATE, STRUMPACK, SuperLU, and Trilinos now expose APIs for FP16/bfloat16 kernels, automatic precision selection, and mixed-precision MPI.
Finally, the authors discuss IEEE 754 format emulators and rounding-error analysis, providing a theoretical foundation for predicting error growth when switching precisions. The emulator enables systematic testing of precision-mixing strategies across architectures.
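Such analyses are phrased in terms of each format's unit roundoff u (half the machine epsilon). The parameters for the IEEE 754 binary formats that NumPy supports natively can be inspected directly; note that bfloat16 is not a stock NumPy dtype, and its unit roundoff is 2**-8.

```python
import numpy as np

# Unit roundoff u = eps/2 for each IEEE-754 binary format available in NumPy.
for dt in (np.float16, np.float32, np.float64):
    fi = np.finfo(dt)
    print(f"{fi.dtype}: significand bits={fi.nmant + 1}, "
          f"u={fi.eps / 2:.3e}, max={fi.max:.3e}")
```

These constants are what the rounding-error bounds plug in when predicting, e.g., how far an FP16 smoother or FP32 matvec can be trusted before a high-precision correction is needed.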
Overall, the document argues that mixed-precision algorithms are essential for reaching exascale performance: they alleviate the growing memory-bandwidth bottleneck, exploit the massive throughput of tensor cores, and retain double-precision accuracy where needed. Future work includes developing automated precision-selection frameworks, extending portability across emerging hardware (e.g., ARM-based GPUs and future tensor-core designs), and conducting large-scale application studies to validate the reported speedups in real scientific workloads.