Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, we improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, which has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, on both the forward and backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST-DASLab/Quartet-II.


💡 Research Summary

The paper introduces Quartet II, a novel training scheme that enables fully quantized pre‑training of large language models (LLMs) using NVIDIA’s NVFP4 4‑bit micro‑scaling floating‑point format, which is natively supported on Blackwell GPUs. Existing NVFP4 recipes rely on element‑wise stochastic rounding (SR) to keep the backward pass unbiased, but SR incurs high quantization error at 4‑bit precision, leading to noticeable accuracy gaps compared to FP16 or BF16 training. To address this, the authors propose MS‑EDEN (Micro‑Scaling EDEN), an unbiased quantization primitive that moves stochasticity from individual FP4 values to the micro‑scale factors. MS‑EDEN first applies a randomized Hadamard transform (RHT) to chunks of the input vector, then quantizes the rotated values using round‑to‑nearest (RTN) instead of SR, which dramatically reduces mean‑square error. The EDEN‑style scaling correction factor S, which would normally require high‑precision storage, is merged into the NVFP4 group scales via stochastic rounding, preserving unbiasedness while fitting within the coarse FP8 scale representation. The authors prove (Corollary 3.1) that the overall operator remains unbiased in expectation, and empirically demonstrate more than a two‑fold reduction in quantization error relative to SR (MSE drops from 23.5 × 10⁻³ to 9.8 × 10⁻³).

Quartet II integrates MS‑EDEN into a complete NVFP4 linear‑layer computation graph. In the forward pass, it employs the “Four‑Over‑Six” scale‑selection heuristic to maximize the representation capacity of activations and weights. In the backward pass, it combines inner‑dimension RHT with MS‑EDEN to produce unbiased gradient estimates for the gradient GEMMs (the input‑gradient and weight‑gradient products). The scheme is implemented as GPU kernels that process 128‑element chunks, leveraging tensor‑core acceleration on Blackwell GPUs. Benchmarks show up to a 4.2× speedup over BF16 while maintaining comparable loss trajectories.
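Why unbiasedness of the quantizer matters for the backward pass can be seen with plain stochastic rounding: if both operands of a GEMM are quantized unbiasedly and independently, the expected quantized product equals the exact product, so gradient noise averages out over training steps. A small NumPy demonstration (uniform grid rather than NVFP4; all names illustrative):

```python
import numpy as np


def sr_quantize(x, step, rng):
    """Stochastic rounding onto a uniform grid of the given step: E[Q(x)] = x."""
    y = x / step
    lo = np.floor(y)
    return (lo + (rng.random(x.shape) < (y - lo))) * step


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 4))
exact = A @ B

# Average many independently quantized products: because each operand is
# quantized unbiasedly and independently, the mean converges to A @ B.
est = np.mean(
    [sr_quantize(A, 0.25, rng) @ sr_quantize(B, 0.25, rng) for _ in range(20000)],
    axis=0)
assert np.max(np.abs(est - exact)) < 0.05
```

The trade-off motivating MS-EDEN is visible here as well: each individual stochastically rounded product is noisy, whereas moving the randomness into a shared scale keeps the estimator unbiased at much lower per-sample error.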

The authors validate Quartet II by training a Llama‑2‑like model with 1.9 B parameters on 38 B tokens. Compared to a BF16 baseline, the NVFP4 model incurs less than 0.3 % increase in validation loss, effectively closing the accuracy gap that prior NVFP4 methods exhibited. Ablation studies across different group‑size configurations (1×16 vs. 16×16) and with/without the Four‑Over‑Six heuristic confirm that MS‑EDEN provides the most significant gains.

Limitations are acknowledged: the current implementation focuses on linear layers; extending MS‑EDEN to non‑linear operations such as GELU or layer normalization will require additional engineering. Moreover, the approach has been evaluated on models up to a few billion parameters; scaling to tens of billions remains future work. Nonetheless, the paper offers a solid theoretical foundation, practical kernel implementations, and compelling empirical evidence that unbiased gradient estimation via MS‑EDEN can unlock the efficiency potential of NVFP4 for large‑scale LLM training.

