M2XFP: A Metadata-Augmented Microscaling Data Format for Efficient Low-bit Quantization
Existing low-bit Microscaling (MX) formats, such as MXFP4, often suffer from substantial accuracy degradation due to their use of a shared power-of-two scaling factor. In this work, we explore strategies that introduce minimal metadata to recover accuracy lost during quantization while maintaining high bit efficiency across a wide range of large language models. We propose a complete algorithm-hardware co-design based on flexible metadata, featuring online quantization with a simple encoding scheme. To support the proposed method efficiently, we implement a lightweight hardware unit and integrate it into the accelerator. Evaluation results demonstrate that our method substantially narrows the accuracy gap, achieving on average a 70.63% reduction in accuracy loss compared to MXFP4 and a 37.30% reduction relative to the latest NVFP4 on LLM benchmarks. Furthermore, our design delivers up to 1.91$\times$ speedup and 1.75$\times$ energy savings over state-of-the-art accelerators. Our code is available at https://github.com/SJTU-ReArch-Group/M2XFP_ASPLOS26.
💡 Research Summary
The paper addresses the persistent accuracy gap of 4-bit quantization for large language models (LLMs) under existing Microscaling (MX) formats such as MXFP4. MX formats rely on a shared scaling factor per block, typically represented as an 8-bit power-of-two (E8M0) exponent. While this design is extremely hardware-friendly, requiring only shift operations for dequantization, it suffers from coarse granularity: the block's maximum value often falls between two exponent bins, causing large rounding errors that propagate through the entire block. The authors empirically demonstrate that preserving the block maximum in higher precision (FP16) dramatically reduces perplexity, confirming that mishandling of the maximum is the primary source of error.
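To make the failure mode concrete, the following sketch (a simplified illustration, not the paper's implementation) quantizes a block with one shared scale that is either restricted to a power of two, mimicking MXFP4's E8M0 exponent, or chosen freely, mimicking a finer-grained scale. A uniform integer grid stands in for the true E2M1 value set:

```python
import math

FP4_MAX = 6.0  # largest E2M1 magnitude; a uniform grid is used for simplicity

def quantize_block(block, po2_scale):
    """Quantize a nonzero block with a single shared scale (simplified).

    po2_scale=True: the scale is rounded up to a power of two (E8M0-style),
    so the block maximum can land between grid points.
    po2_scale=False: a finer scale maps the block maximum exactly onto
    the top grid point.
    """
    amax = max(abs(x) for x in block)
    if po2_scale:
        scale = 2.0 ** math.ceil(math.log2(amax / FP4_MAX))
    else:
        scale = amax / FP4_MAX

    def quantize(x):
        code = round(x / scale)                       # round to nearest grid point
        code = max(-FP4_MAX, min(FP4_MAX, code))      # clamp to representable range
        return code * scale                           # dequantize

    return [quantize(x) for x in block]

po2 = quantize_block([0.2, -1.3, 3.1, 5.5], po2_scale=True)
fine = quantize_block([0.2, -1.3, 3.1, 5.5], po2_scale=False)
```

With the power-of-two scale, the block maximum 5.5 is pulled to 6.0 (error 0.5), while the unconstrained scale reproduces it exactly, mirroring the paper's observation that the shared exponent's coarse granularity concentrates error on the block maximum.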
Existing attempts to mitigate this problem fall into three categories: (1) refining the scaling factor (e.g., the FP8-based scale of NVFP4), which improves precision but narrows the exponent range and adds an extra tensor-level rescaling step; (2) redesigning the base data type (custom floating-point formats such as Flint, M-ANT, and BlockDialect), which offers expressive power but incurs prohibitive hardware cost, especially for dynamic activations; and (3) adding metadata (e.g., OliVe, MicroScopiQ), which either targets only static tensors or introduces excessive overhead (often more than 40 bits per block). The authors argue that the metadata dimension remains under-explored and holds the key to reconciling accuracy with bit-efficiency.
Through a systematic design-space exploration, the paper identifies an asymmetry between weights and activations. Weights are static after training, allowing offline optimization; activations are dynamic and must be processed online. Consequently, the most effective metadata allocation is element-level for activations (tiny per-element "extra mantissa" bits) and subgroup-level for weights (shared extra exponent bits combined with an offline scale search). By allocating only 0.25 bits of metadata per element, implemented as an extra mantissa fragment, the effective precision reaches roughly 4.5 bits per value (4 payload bits plus the amortized shared scale and the metadata) while preserving the hardware-friendly 4-bit data path.
The proposed format, named M2XFP (Metadata-Augmented Microscaling Format), integrates this hybrid metadata scheme. For activations, each element carries 0.25 bits of metadata (amortized) that refine its mantissa; for weights, subgroups of 16–32 elements share a 1-bit extra exponent and an adaptively chosen scale. The overall bit budget remains near 4.5 bits per element, delivering near-FP16 accuracy at close to 4-bit storage cost.
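The bit-budget arithmetic can be sketched as follows. The 32-element block with an 8-bit shared scale is the standard MX configuration; the 16-element weight subgroup is one point in the 16–32 range stated above, and the exact accounting in the paper may differ:

```python
def bits_per_element(data_bits=4.0, scale_bits=8.0, block_size=32,
                     meta_bits_per_elem=0.25):
    # The shared block scale is amortized across the block; per-element
    # metadata is added on top of the 4-bit payload.
    return data_bits + scale_bits / block_size + meta_bits_per_elem

# Activation side: 4-bit data + 8-bit scale over 32 elements
# + 0.25 metadata bits per element = 4.5 bits per value.
act_bits = bits_per_element()

# Weight side (assumed subgroup size of 16): one extra exponent bit
# shared by 16 elements adds 1/16 = 0.0625 bits per element.
wt_bits = bits_per_element(meta_bits_per_elem=1.0 / 16)
```

This shows why the format stays "near 4.5 bits": the metadata overhead is the same order as the amortized block scale that MXFP4 already pays.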
Hardware support is added with minimal extensions to a systolic array accelerator. Three key units are introduced: (1) a top‑1 decode unit that extracts and restores metadata on‑the‑fly; (2) an augmented FP4×FP4 processing element that incorporates the extra mantissa bits directly into the multiplication pipeline; and (3) a streaming quantization engine that simultaneously handles block scales and metadata without stalling the GEMM pipeline. These additions increase area by less than 2 % and power by under 15 %, while keeping the memory interface identical to existing MXFP4 designs.
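One plausible reading of the top-1 decode unit (an assumed sketch, not the paper's exact scheme) is that a per-subgroup metadata bit refines the mantissa of the subgroup's largest-magnitude element, extending its magnitude by half a quantization step. In simplified integer-code form:

```python
def dequantize_subgroup(q_codes, extra_bit, scale):
    """Sketch of top-1 metadata decode (assumed semantics).

    q_codes: signed integer codes standing in for FP4 codes.
    extra_bit: one metadata bit for the subgroup; when set, the
    largest-magnitude element gains half an LSB of magnitude.
    scale: the shared block scale.
    """
    top = max(range(len(q_codes)), key=lambda i: abs(q_codes[i]))
    out = []
    for i, q in enumerate(q_codes):
        value = float(q)
        if i == top and extra_bit:
            # Half-LSB refinement; assumed to extend magnitude, since the
            # block maximum is typically rounded down by the coarse scale.
            value += 0.5 if q >= 0 else -0.5
        out.append(value * scale)
    return out

decoded = dequantize_subgroup([1, -2, 5, 0], extra_bit=1, scale=0.25)
```

Because only the argmax per subgroup is touched, the decode fits naturally in front of a systolic array's operand fetch without widening the 4-bit data path.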
Evaluation spans multiple LLMs (LLaMA-3.1 variants from 8 B to 70 B, GPT-NeoX, etc.). Compared with MXFP4, M2XFP reduces average perplexity loss by 70.63%; compared with the latest NVFP4, the loss is reduced by 37.30%. Performance-wise, the modified accelerator achieves up to 1.91× speedup and 1.75× energy savings over state-of-the-art MX accelerators. The metadata adds only 0.25 bits per element, resulting in a negligible increase (<3%) in total memory footprint and no additional bandwidth pressure.
In conclusion, the work demonstrates that a carefully designed, lightweight metadata augmentation—co‑optimized with hardware—can substantially close the accuracy gap of 4‑bit quantization without sacrificing the bit‑efficiency and throughput that make MX formats attractive. The authors suggest future directions such as automated metadata allocation, support for irregular tensor shapes, and porting the design to other hardware platforms (FPGA, ASIC) to further broaden the impact of M2XFP on efficient LLM deployment.