BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization
Ji-Fu Li¹, Manyi Zhang¹⋆, Xiaobo Xia², Han Bao¹, Haoli Bai¹, Zhenhua Dong¹, and Xianzhi Yu¹
¹ Huawei Technologies   ² University of Science and Technology of China
{lijifu4, zhangmanyi6}@huawei.com
⋆ Corresponding author. Preprint.

Abstract. Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that under-utilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce the Global and Private Kronecker (GPK) decomposition, which effectively reduces storage and runtime overhead, and we incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

Keywords: Multi-modal Large Language Models · Large Language Models · Quantization · MXFP

1 Introduction

Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) have recently revolutionized artificial intelligence, demonstrating remarkable capabilities in bridging visual perception with linguistic reasoning [4, 14, 27, 34, 35, 48, 52, 56, 64]. From autonomous driving to medical image analysis, these models are increasingly deployed in real-world scenarios where low latency and memory efficiency are paramount [28, 31, 53, 60, 68]. However, the ever-growing scale of MLLMs and LLMs, often comprising billions of parameters, imposes prohibitive costs on memory bandwidth and computational resources, hindering their deployment on edge devices and resource-constrained platforms.

Post-Training Quantization (PTQ) has emerged as a key solution to mitigate these costs. While integer quantization has been widely studied, the recent emergence of microscaling floating-point (MXFP) formats offers a promising alternative [2, 43]. Supported by next-generation hardware [1, 7, 49], MXFP4 utilizes block-wise scaling to better accommodate the long-tailed distributions inherent in activations, theoretically offering superior dynamic range compared to fixed-point formats. Despite this hardware readiness, achieving accurate 4-bit quantization for MLLMs under the MXFP format remains an unsolved challenge [66, 67].
While existing state-of-the-art PTQ methods are predominantly designed for INT formats [22, 23, 30, 36, 41, 45, 51, 54], their applicability to MXFP formats is contested. Specifically, popular rotation-based techniques (e.g., QuaRot [3] and SpinQuant [33]), which excel in INT4 by spreading outliers via orthogonal transformations, suffer from severe performance collapse when applied to MXFP4 [11, 38]. Recent studies [11, 46] have attributed this failure to the incompatibility between global rotations and the fine-grained quantization settings of MXFP, and further propose block-wise rotation transformation methods. However, these approaches still fail to mitigate extreme outliers within certain blocks, and the Hadamard transform further introduces a bimodal distribution problem (see Figure 2a).

To bridge this gap, in this paper we introduce BATQuant. The core of our method is the Block-wise Affine Transformation (BAT). Unlike global rotations, BAT restricts the transformation scope to align strictly with the MXFP quantization granularity (e.g., 32 elements). This design prevents the cross-block energy transfer of outliers, ensuring that each block's scaling factor accurately captures its local dynamic range. Moreover, we relax the orthogonality constraint and learn the optimal affine matrices tailored to the MXFP format to minimize quantization error. To address the storage overhead caused by learnable block-wise affine transformations, we further introduce the Global and Private Kronecker (GPK) decomposition, which drastically reduces parameter counts by sharing a global transformation basis across blocks while retaining block-specific private components. Finally, we incorporate Block-wise Learnable Clipping, which dynamically adapts thresholds to suppress residual outliers within quantization blocks.

We validate BATQuant extensively on both MLLMs and LLMs. Our method achieves near-lossless performance on W4A8KV16 with an accuracy recovery rate exceeding 99%. Furthermore, it establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% on multimodal benchmarks and significantly outperforming existing methods (see Figure 1).

Fig. 1: Quantization performance on Qwen3-VL-8B-Instruct across various methods. Our method yields superior results compared to baselines across all bit-width settings. The advantage is particularly substantial in the W4A4 setting, where our method clearly outperforms existing methods.

Our main contributions are summarized as follows:

– We propose BATQuant, featuring a Block-wise Affine Transformation that aligns with MXFP granularity to prevent energy transfer across blocks and address the bimodal distribution problem for effective quantization. Additionally, we incorporate the Global and Private Kronecker decomposition for parameter efficiency.
– We evaluate BATQuant on both MLLMs and LLMs, such as Qwen3-VL-8B-Instruct [4] and Qwen3-8B [61], covering a wide range of challenging settings. Its effectiveness is validated on benchmarks ranging from knowledge understanding to complex reasoning, setting new state-of-the-art results in most scenarios.

2 Preliminary
Microscaling Floating-Point Definition. MXFP, proposed by OCP [43], is a family of floating-point formats that employ block-wise quantization. An MXFP format is defined by three components: a sign bit (S), an exponent (E), and a mantissa (M). Each MXFP format uses a fixed block size of 32 elements, with all values in a block sharing a common scaling factor represented in UE8M0 format (8-bit exponent, no mantissa). The standard MXFP4 (E2M1) format uses 1 sign bit, 2 exponent bits, and 1 mantissa bit. This configuration represents 7 distinct positive values, {0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}, along with their negatives and zero. MXFP8 offers two variants, E4M3 and E5M2. Here, we adopt E4M3 for MXFP8, as a larger mantissa width is more crucial for the performance of fine-grained quantization [6, 39].
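To make the block-wise scaling in this definition concrete, the following is a minimal PyTorch sketch of MXFP4 (E2M1) fake quantization over 32-element blocks. The E2M1 value set and the shared power-of-two (UE8M0-style) scale follow the description above; the function name and the max-abs scale selection rule are illustrative assumptions rather than the OCP reference implementation.

```python
import torch

# E2M1 representable magnitudes (plus zero); the sign is handled separately.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Illustrative MXFP4 fake quantization along the last dimension.

    Each 32-element block shares one power-of-two (UE8M0-style) scale; values
    are divided by the scale, rounded to the nearest E2M1 grid point, and
    rescaled. Mapping the block max-abs onto 6.0 is an assumed scale rule.
    """
    orig_shape = x.shape
    xb = x.reshape(-1, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 6.0)))   # power-of-two scale
    scaled = xb / scale
    # Snap each |value| to the closest E2M1 grid point, then restore the sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    q = E2M1_GRID[idx] * scaled.sign()
    return (q * scale).reshape(orig_shape)

x = torch.randn(4, 128)
print((x - mxfp4_fake_quant(x)).abs().mean())   # mean block-wise quantization error
```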
Related Work. Initial research on LLM quantization primarily explored integer-based formats [12, 16, 25, 26, 58, 62]. As NVFP and MXFP formats gain hardware support, quantization accuracy under these formats is also drawing increasing attention [15, 20, 29, 59, 65]. Prior work has shown that MXFP8 achieves lossless quantization, whereas MXFP4 suffers from significant accuracy degradation [66]. For low-bit scenarios, e.g., 4-bit quantization, outliers are considered a severe impediment. The primary methods for suppressing outliers include rotation transformations [17, 23, 50] and affine transformations [36, 47]. Rotation-based methods, such as QuaRot [3] and SpinQuant [33], unlike their success in INT4 quantization, underperform even basic RTN when applied to MXFP4. Such global rotations mix dimensional information to suppress outliers and kurtosis, thus disrupting the local statistical properties that fine-grained formats rely on. To address the incompatibility between rotation-based techniques and MXFP4, BRQ [46] utilizes block-wise rotation quantization to mitigate outliers and prevent amplifying small-value blocks. MR-GPTQ [11], a GPTQ variant optimized for FP4, similarly employs block-wise Hadamard transforms and format-specific adjustments to accommodate FP4's unique properties. Affine-transformation-based methods, such as FlatQuant [47], overcome the energy-conservation constraints inherent in rotation-based transformations and enhance quantization accuracy by employing affine transformations. Nevertheless, previous methods still suffer from significant accuracy degradation on MXFP4 quantization, particularly on complex reasoning tasks [66].

Fig. 2: Activation distributions for the down_proj module in layer 35 of Qwen3-8B. The central 3D plots illustrate the activations after transformation. We specifically extract Block 5 (without outliers) and Block 295 (with extreme outliers), and visualize the values after scaling-factor division but prior to rounding. (a) After applying the block Hadamard transform (BRQ), block 295 exhibits a bimodal distribution, leading to inefficient utilization of the bit width. (b) After the block affine transformation (BATQuant), block 295 shows reduced magnitude compared to subplot (a) while effectively leveraging the floating-point quantization grids.

Observations and Motivation. We find that block-wise rotation still struggles to suppress extreme outliers in specific blocks, and the Hadamard transform further introduces a bimodal distribution problem. Specifically, we visualize the activations after block-wise Hadamard transformation on Qwen3-8B in Figure 2a. We observe that although the block-wise Hadamard transform reduces the magnitude for the vast majority of blocks, certain blocks with extreme outliers exhibit a bimodal distribution, since Hadamard matrices are composed of {+1, −1} values. This results in wasted bit width and introduces larger quantization errors [10]. Therefore, to address these challenges, we propose BATQuant. As shown in Figure 2b, BATQuant effectively alleviates outliers while ensuring that the post-transformation data distribution remains amenable to floating-point quantization.

Fig. 3: The overall framework of BATQuant. Bottom: integration of BATQuant into the Transformer architecture. Weight-side transformations are fused offline into the linear layers, while activation-side transformations are applied online. Top: exemplary view of the Block-wise Affine Transformation, where inputs are partitioned into MXFP-aligned blocks. Each block transformation is decomposed via the Global and Private Kronecker.

3 Method

In this section, we present BATQuant with the framework illustrated in Figure 3. We first introduce learning optimal block-wise affine transformations in Section 3.1. Afterward, we discuss its integration with the Transformer architecture in Section 3.2. Note that we provide a detailed algorithm flow of BATQuant in Appendix B.

3.1 Block-wise Affine Transformation

Consider a standard linear layer computation Y = XW^\top, where X \in \mathbb{R}^{S \times N} represents activations and W \in \mathbb{R}^{M \times N} denotes weights. The primary objective is to find the best affine transformation P^\star \in \mathbb{R}^{N \times N} for each linear layer to quantize:

P^\star = \arg\min_{P} \| Y - Q(XP)\, Q(P^{-1} W^\top) \|_F^2 .

Instead of learning a single global matrix, we partition the transformation matrix into k disjoint blocks aligned with the MXFP quantization granularity g (e.g., g = 32). We then construct a block-diagonal affine matrix:

P = \mathrm{diag}(P_1, P_2, \ldots, P_k), \quad \text{where } P_i \in \mathbb{R}^{g \times g}, \; N = k \cdot g. \tag{1}

Here, each P_i is an independent and learnable affine transformation applied solely within the i-th quantization block. By restricting the transformation scope to the size of the MXFP block, our method ensures that outlier redistribution occurs only locally. This preserves the statistical independence of each quantization block, allowing the MXFP scaling factors to accurately capture the dynamic range of each block without interference from outliers of other blocks.
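For reference, here is a minimal sketch of how the block-diagonal transformation in Eq. (1) can be applied to activations without ever materializing the dense N × N matrix: the hidden dimension is reshaped into k blocks of g elements and each block is multiplied by its own P_i via a batched matmul. The module name and the identity initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BlockwiseAffine(nn.Module):
    """Illustrative block-diagonal affine transform P = diag(P_1, ..., P_k).

    Each P_i acts only on its own g-element block, so applying P reduces to a
    batched (k, g, g) matmul instead of a dense (N, N) one.
    """

    def __init__(self, hidden_dim: int, g: int = 32):
        super().__init__()
        assert hidden_dim % g == 0
        self.g, self.k = g, hidden_dim // g
        # Initialize every P_i to the identity so calibration starts from a no-op.
        self.P = nn.Parameter(torch.eye(g).expand(self.k, g, g).clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., N) -> (..., k, 1, g) @ (k, g, g) -> (..., N)
        lead = x.shape[:-1]
        xb = x.reshape(*lead, self.k, 1, self.g)
        yb = xb @ self.P                      # broadcasted batched matmul
        return yb.reshape(*lead, self.k * self.g)

x = torch.randn(2, 16, 4096)
print(BlockwiseAffine(4096)(x).shape)         # torch.Size([2, 16, 4096])
```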
Global and Private Kronecker. Although the block-diagonal structure of P introduces inherent sparsity, the total number of learnable parameters remains N · g. For large-scale models, storing such a matrix for every layer still incurs a significant memory cost. A straightforward approach to mitigate this is to apply Kronecker product decomposition to each P_i, factorizing it into two smaller matrices B_i \otimes A_i, where A_i \in \mathbb{R}^{g_1 \times g_1} and B_i \in \mathbb{R}^{g_2 \times g_2}. Here g_1 and g_2 respectively denote the sizes of A_i and B_i, and the MXFP quantization granularity satisfies g = g_1 \cdot g_2. We refer to this as Naive Kronecker. However, since the block size g is typically small (e.g., 32 in MXFP formats), the reduction in parameter count is marginal. To address this limitation, we propose the Global and Private Kronecker (GPK). GPK decomposes each P_i into the Kronecker product of a global shared matrix A and a block-specific private matrix B_i:

P_i = B_i \otimes A, \quad \forall i \in \{1, \ldots, k\}, \tag{2}

where A is shared across all k blocks and B_i is unique to the i-th block. This design drastically reduces the storage requirement from k \cdot (g_1^2 + g_2^2) to g_1^2 + k \cdot g_2^2.

Table 1: Comparison of decomposition methods on parameter counts and computational cost. For the example parameter count, we set the hidden dimension N = 4096 and the MXFP quantization granularity g = 32. The sizes of the decomposed matrices A_i and B_i are set to g_1 = 8 and g_2 = 4. The reported MatMul complexity refers to the computational cost of the activation transformation XP.

Method     Decomposition      MatMul Complexity      # Params of P           Example Count
FlatQuant  Kronecker          O(S N^{3/2})           2N                      8,192
Ours       w/o                O(S N g)               N · g                   131,072
Ours       Naive Kronecker    O(S N (g_1 + g_2))     k · (g_1^2 + g_2^2)     10,240
Ours       GPK                O(S N (g_1 + g_2))     g_1^2 + k · g_2^2       2,112

As shown in Table 1, GPK significantly reduces the storage overhead, lowering the parameter count by more than 74% and 79% compared to FlatQuant and Naive Kronecker, respectively. Additionally, by leveraging the vectorization trick of the Kronecker product, i.e., \mathrm{vec}(V)(B_i \otimes A) = \mathrm{vec}(B_i^\top V A) for V \in \mathbb{R}^{g_2 \times g_1}, GPK maintains efficient inference by preserving a low matrix multiplication complexity. We provide the PyTorch-style pseudocode of the forward pass with GPK in Appendix B.

Block-wise Learnable Clipping. While the block-wise affine transformation effectively smooths activation distributions, residual outliers may still persist within the quantization blocks, potentially dominating the quantization range of MXFP formats. To mitigate this, we introduce Block-wise Learnable Clipping, a fine-grained strategy that adapts clipping thresholds to the local statistics of each quantization block. For the i-th block, the clipped values \hat{x}_i (and similarly for weights \hat{w}_i) are computed as:

\hat{x}_i = \mathrm{clip}(x_i, \beta_i^{\min}, \beta_i^{\max}), \tag{3}

where the dynamic bounds \beta_i^{\min} and \beta_i^{\max} are:

\beta_i^{\min} = \sigma(\alpha_i^{\min}) \cdot \min(x_i), \quad \beta_i^{\max} = \sigma(\alpha_i^{\max}) \cdot \max(x_i). \tag{4}

Here, \min(x_i) and \max(x_i) denote the minimum and maximum values within the i-th block, respectively, and \sigma(\cdot) is the sigmoid function constraining the clipping ratios to (0, 1). \alpha_i^{\min} and \alpha_i^{\max} are learnable parameters specific to block i.
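A minimal sketch of Eqs. (3)–(4), assuming activations laid out so that every g consecutive elements form one quantization block; the parameter shapes and the initialization in the usage example are assumptions.

```python
import torch

def blockwise_learnable_clip(x: torch.Tensor,
                             alpha_min: torch.Tensor,
                             alpha_max: torch.Tensor,
                             g: int = 32) -> torch.Tensor:
    """Illustrative block-wise learnable clipping (Eqs. 3-4).

    alpha_min / alpha_max hold one learnable scalar per g-element block; the
    sigmoid keeps the clipping ratios in (0, 1), so the clipped range never
    exceeds the block's own [min, max].
    """
    xb = x.reshape(-1, g)
    lo = torch.sigmoid(alpha_min).reshape(-1, 1) * xb.min(dim=-1, keepdim=True).values
    hi = torch.sigmoid(alpha_max).reshape(-1, 1) * xb.max(dim=-1, keepdim=True).values
    return torch.clamp(xb, min=lo, max=hi).reshape(x.shape)

x = torch.randn(8, 128)                  # 8 * 128 / 32 = 32 blocks
alpha = torch.full((32,), 4.0)           # sigmoid(4) ~ 0.98: mild clipping at init
print(blockwise_learnable_clip(x, alpha, alpha).shape)
```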
The Training Objective. Following previous work [47], we optimize the block-wise affine transformations and clipping factors by minimizing the layer-wise quantization errors between the full-precision and quantized outputs over a small calibration set \mathcal{D}_{\mathrm{cal}}:

\Theta_l^{*} = \arg\min_{\Theta_l} \mathbb{E}_{X \sim \mathcal{D}_{\mathrm{cal}}} \big\| F_l(X) - \hat{F}_l(X; \Theta_l) \big\|_2^2, \tag{5}

where F_l(\cdot) and \hat{F}_l(\cdot) denote the full-precision and quantized layer l, respectively, and \Theta_l collects all learnable parameters (block-wise transformations and clipping factors) of layer l.

3.2 Integration with the Transformer Architecture

We integrate BATQuant into both the LLM (Qwen3) and MLLM (Qwen3-VL) architectures by inserting block-wise affine transformations into the transformer block, where the weight-side transformations are merged into the linear layers offline, while the activation-side transformations are applied online during inference. Following conventional practice, we employ low-bit matrix multiplications for all linear layers, while keeping layer normalization layers, pre-quantization transformations, RoPE embeddings, and attention scores in BF16.

MLP Module. In the LLM and the text model of the MLLM, the MLP module employs two transformation sets, P_up and P_down. P_up flattens the activation distribution after LayerNorm before the up_proj and gate_proj layers, and P_down smooths the input to the down_proj layer. In the ViT model of the MLLM, the MLP module also employs two transformation sets, P_fc1 and P_fc2. P_fc1 flattens the activation distribution after LayerNorm before the linear_fc1 layer, and P_fc2 smooths the input to the linear_fc2 layer. All matrices utilize the GPK decomposition to minimize storage.

Self-Attention Module. In the LLM and the text model of the MLLM, the Self-Attention module employs four transformations: P_qkv, P_o, P_k, and P_v. P_qkv and P_o flatten the activation distribution before the qkv_proj layer and o_proj layer, respectively. P_k and P_v are used to transform the key and value cache head by head, respectively. In the ViT model of the MLLM, only P_qkv and P_o are employed, because the ViT does not require an autoregressive KV cache mechanism; consequently, there is no need to store, transform, and quantize the key and value states across generation steps.
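To illustrate the offline weight-side fusion mentioned above (storing W̃ with W̃^\top = P^{-1} W^\top so that only the activation-side transform runs online), here is a hedged sketch under the block-diagonal P of Section 3.1; the function name and the (k, g, g) layout of P are assumptions.

```python
import torch

def fuse_weight_side(W: torch.Tensor, P_blocks: torch.Tensor, g: int = 32) -> torch.Tensor:
    """Fold the inverse block-diagonal transform into the weights offline.

    Y = X W^T is rewritten as (X P)(P^{-1} W^T), so the linear layer can store
    W_tilde with W_tilde^T = P^{-1} W^T, i.e. W_tilde = W P^{-T}. With a
    block-diagonal P this is a per-block matmul over the input dimension.
    P_blocks: (k, g, g) learnable blocks; W: (out_features, in_features).
    """
    out_dim, in_dim = W.shape
    k = in_dim // g
    P_inv = torch.linalg.inv(P_blocks)                    # (k, g, g)
    Wb = W.reshape(out_dim, k, g)                         # split input dim into blocks
    # W_tilde[:, i, :] = W[:, i, :] @ P_inv[i].T  (i.e. W P^{-T}, block by block)
    Wb_fused = torch.einsum('okg,khg->okh', Wb, P_inv)
    return Wb_fused.reshape(out_dim, in_dim)

W = torch.randn(1024, 4096)
P = torch.eye(32).expand(128, 32, 32).clone() + 0.01 * torch.randn(128, 32, 32)
X = torch.randn(4, 4096)
# Without quantization, the fused computation matches the original layer.
Xp = (X.reshape(4, 128, 1, 32) @ P).reshape(4, 4096)
print((Xp @ fuse_weight_side(W, P).T - X @ W.T).abs().max())  # ~0 up to float error
```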
4 Experiments

4.1 Settings

Evaluation and Baselines. We evaluate BATQuant on Qwen3-VL-8B-Instruct (MLLM) [4] and Qwen3-8B (LLM) [61]. We assess quantized models on the following benchmarks: (1) multimodal benchmarks, including MME [13], OCRBench [32], DocVQA [37], RealWorldQA [57], and VLMBlind; (2) non-reasoning tasks, including PIQA [5], Winogrande [44], HellaSwag [63], ARC-Easy [8], and ARC-Challenge [8]; (3) reasoning benchmarks, including GSM8K [9], MATH-500 [24], AIME24, AIME25, and GPQA-D [42]. We compare BATQuant against popular post-training quantization methods, including QuaRot [3], SpinQuant [33], BRQ [46], FlatQuant [47], SmoothQuant [58], and GPTQ [12]. More details about benchmarks and baseline methods are provided in Appendix A.

Implementation Details. We implement BATQuant based on Huggingface [55] and PyTorch [40]. We adopt the AdamW optimizer with an initial learning rate of 2e-3 and employ a cosine annealing learning rate decay schedule. BATQuant is trained for 5 epochs, and the batch size is set to 4. For GPK, we set the sizes of the global shared matrix g_1 and the block-specific private matrix g_2 to 8 and 4, respectively. For the LLM, we use the BF16 model to self-generate data on the Numina-Math-1.5 [21] dataset and randomly sample 128 text sequences of length 2048 to construct the calibration set. For the MLLM, we randomly sample 128 image-text pairs from the GQA [18] dataset to construct the calibration set. Further details about the implementation are provided in Appendix A.

Quantization Settings. We evaluate the proposed method on several MXFP-based quantization configurations, including weight-activation quantization and KV cache quantization. For clarity, we denote each configuration using the format W{bits}A{bits}KV{bits}. For example, W4A8KV8 indicates quantizing weights to 4-bit, activations to 8-bit, and the KV cache to 8-bit. We empirically observe that combining different methods with GPTQ universally enhances performance. Consequently, unless otherwise specified, the reported results refer to the GPTQ-integrated variants of each method. Detailed comparisons between the GPTQ and RTN weight quantizers are provided in Appendix C.

Fig. 4: Performance comparison of different methods on Qwen3-8B across LLM benchmarks under various quantization configurations: (a) Non-Reasoning tasks (left) and (b) Reasoning tasks (right).

4.2 Main Results

Here, we present a comprehensive empirical evaluation of BATQuant. Our experiments are designed to answer the following critical questions: (1) Can BATQuant maintain satisfactory performance under aggressive MXFP-based quantization configurations where existing methods fail? (2) How does our approach generalize across modalities (MLLMs vs. LLMs) and task domains, specifically spanning multimodal understanding (including document understanding, STEM puzzles, and general VQA) in MLLMs and linguistic tasks (covering non-reasoning and reasoning tasks) in LLMs?

Results on Multimodal Benchmarks. Table 2 summarizes the performance of different post-training quantization methods on the Qwen3-VL-8B-Instruct model across five multimodal benchmarks. As shown in the table, BATQuant consistently establishes state-of-the-art results across all bit-width configurations. Notably, in the aggressive W4A4KV16 regime, BATQuant achieves an average recovery rate of 96.43%, significantly outperforming the strongest baseline FlatQuant by a margin of 1.64%. Under the W4A8KV16 scenario, BATQuant achieves an average recovery rate of 99.29% and is the only approach exhibiting a performance degradation of under 1%. This superiority extends to KV cache quantization as well: under W4A8KV8 and W4A8KV4, our method maintains superior performance with recovery rates of 98.89% and 97.51%, respectively. Such a consistent performance gain is also widely observed across different types of benchmarks, including document understanding, STEM puzzles, and general VQA. We attribute this success to our method's unique capability to mitigate inter-block energy transfer, thereby effectively capturing diverse outlier patterns that conventional methods fail to address.

Results on LLM Benchmarks. To comprehensively evaluate the generalization capability of BATQuant beyond multimodal tasks, we conduct extensive experiments on Qwen3-8B. The overall performance trends across all configurations are shown in Figure 4, and the detailed results for reasoning benchmarks are summarized in Table 3. More detailed results can be found in Appendix C.
Table 2: Performance comparison of various quantization methods on multimodal benchmarks across different bit-width configurations (W4A8KV16, W4A4KV16, W4A8KV8, and W4A8KV4). The recovery rate relative to the BF16 baseline is also provided, and the best result in each case is marked in bold.

Bits       Method       MME   OCRBench  DocVQA  RealWorldQA  VLMBlind  Recovery(%)
BF16       –            2377  906       95.81   70.98        73.98     100.00
W4A8KV16   RTN          2294  883       94.72   69.80        70.99     97.43
           QuaRot       2327  870       95.07   69.80        71.12     97.53
           SpinQuant    2321  872       94.79   70.46        69.82     97.29
           BRQ          2329  865       94.72   70.19        67.18     96.40
           FlatQuant    2351  886       95.31   69.02        73.90     98.66
           SmoothQuant  2349  885       94.81   70.06        69.46     97.61
           GPTQ         2346  891       95.03   69.15        72.62     98.36
           BATQuant     2386  893       95.55   70.20        73.14     99.29
W4A4KV16   RTN          2243  838       92.70   65.23        66.47     93.07
           QuaRot       2189  810       93.47   64.97        57.62     89.69
           SpinQuant    1994  801       91.79   65.36        60.23     88.32
           BRQ          2147  805       92.94   66.14        62.14     90.74
           FlatQuant    2231  873       94.10   65.62        68.86     94.79
           SmoothQuant  2264  862       93.93   68.89        66.26     95.01
           GPTQ         2286  849       93.98   66.93        67.29     94.64
           BATQuant     2360  864       94.31   67.32        69.70     96.43
W4A8KV8    RTN          2208  878       94.64   69.54        71.01     96.51
           QuaRot       2296  868       95.11   69.02        70.26     96.77
           SpinQuant    2217  832       94.41   68.10        69.04     94.58
           BRQ          2283  867       94.63   69.80        67.36     95.98
           FlatQuant    2353  888       95.12   69.14        72.77     98.41
           SmoothQuant  2317  884       94.72   70.19        68.91     97.19
           GPTQ         2340  885       95.14   71.11        71.79     98.53
           BATQuant     2368  890       95.47   69.93        72.82     98.89
W4A8KV4    RTN          2220  856       94.05   68.50        67.50     94.76
           QuaRot       2280  857       94.66   68.52        68.36     95.65
           SpinQuant    2248  829       94.18   68.63        64.50     93.65
           BRQ          2236  841       94.07   68.63        66.03     94.20
           FlatQuant    2293  884       94.88   68.76        70.75     97.11
           SmoothQuant  2283  871       94.39   67.02        66.99     95.13
           GPTQ         2328  867       94.15   68.10        70.81     96.71
           BATQuant     2332  885       95.07   68.63        70.92     97.51

Non-Reasoning Tasks. As shown in Figure 4, under the W4A8KV16 configuration, our method achieves near-lossless accuracy compared to the BF16 baseline. As the quantization difficulty intensifies in the aggressive W4A4KV16 and W4A8KV4 regimes, rotation-based methods (e.g., SpinQuant, QuaRot) suffer from severe performance degradation while our method maintains a robust level of accuracy. This suggests that our block-wise affine transformation effectively mitigates the distortion of activation distributions caused by extreme quantization, ensuring that fundamental linguistic patterns remain intact.
Table 3: Performance comparison of various quantization methods on reasoning benchmarks across different bit-width configurations (W4A8KV16, W4A4KV16, W4A8KV8, and W4A8KV4). The recovery rate relative to the BF16 baseline is also provided, and the best result in each case is marked in bold.

Bits       Method       GSM8K  MATH-500  AIME24  AIME25  GPQA-D  Avg.   Recovery(%)
BF16       –            95.15  96.87     71.46   63.12   58.13   76.95  100.00
W4A8KV16   RTN          93.71  95.53     64.58   55.00   54.39   72.64  93.64
           QuaRot       94.47  95.67     64.17   55.63   54.39   72.87  93.91
           SpinQuant    94.69  95.53     60.42   51.46   54.58   71.34  91.62
           BRQ          93.71  95.80     63.96   53.33   55.40   72.39  93.26
           FlatQuant    94.62  95.93     69.17   57.08   54.80   74.32  95.99
           SmoothQuant  94.92  96.27     65.62   56.04   54.80   73.53  94.80
           GPTQ         94.39  96.33     68.02   59.38   55.10   74.64  96.54
           BATQuant     94.84  96.40     68.33   59.38   57.22   75.23  97.46
W4A4KV16   RTN          93.10  94.53     53.33   47.08   49.80   67.57  86.06
           QuaRot       94.09  92.47     47.50   39.37   48.13   64.31  81.20
           SpinQuant    93.40  91.67     38.57   35.63   45.66   60.99  76.35
           BRQ          92.27  91.73     37.29   34.58   48.03   60.78  76.25
           FlatQuant    93.40  94.33     58.96   43.54   50.51   68.15  86.78
           SmoothQuant  94.69  95.33     60.71   47.29   52.42   70.09  89.60
           GPTQ         94.24  95.73     57.50   52.08   52.12   70.33  90.10
           BATQuant     94.77  95.60     62.08   52.92   54.19   71.91  92.45
W4A8KV8    RTN          93.78  95.00     60.21   54.79   53.54   71.46  91.96
           QuaRot       94.09  95.73     64.79   55.83   54.49   72.99  94.11
           SpinQuant    94.47  95.47     59.38   53.96   55.86   71.87  92.56
           BRQ          94.69  95.33     63.75   52.71   54.04   72.10  92.72
           FlatQuant    94.54  96.00     65.42   53.96   54.55   72.89  93.87
           SmoothQuant  94.39  96.13     66.04   54.79   54.29   73.13  94.21
           GPTQ         94.47  96.13     65.00   57.08   53.94   73.32  94.54
           BATQuant     94.62  96.27     69.37   55.21   56.82   74.46  96.22
W4A8KV4    RTN          92.12  91.13     43.54   38.75   46.97   62.50  78.80
           QuaRot       94.01  94.80     62.08   52.50   51.82   71.04  91.17
           SpinQuant    93.25  94.33     57.71   49.58   52.12   69.40  88.87
           BRQ          93.56  95.13     62.08   49.17   53.54   70.70  90.68
           FlatQuant    94.09  95.40     63.33   53.54   54.95   72.26  93.07
           SmoothQuant  93.03  92.73     46.67   40.33   49.19   63.39  81.46
           GPTQ         93.40  93.07     47.92   39.58   49.75   64.74  81.92
           BATQuant     94.77  95.27     66.04   54.48   54.24   72.96  94.00

Reasoning Tasks. The disparity between BATQuant and the baselines becomes even more pronounced on complex reasoning benchmarks requiring multi-step logical deduction and mathematical computation. As detailed in Table 3, reasoning tasks are inherently more sensitive to quantization noise due to the compounding effect of errors across long reasoning chains. In the W4A8KV16 scenario, BATQuant achieves a recovery rate of 97.46%, surpassing GPTQ by a substantial margin of 0.92%. Notably, under the W4A4KV16 scenario, competing methods suffer from severe performance collapse on GSM8K and MATH-500, while BATQuant maintains stable performance. In the W4A8KV8 and W4A8KV4 scenarios, our method outperforms the strong baselines GPTQ and FlatQuant by 1.68% and 0.93%, respectively.

Fig. 5: Activation distributions of the q_proj module in layer 6 of Qwen3-8B with different quantization methods: (a) SpinQuant, (b) FlatQuant, (c) BRQ, (d) BATQuant (Ours).

The consistent superiority of BATQuant across both multimodal tasks and complex linguistic reasoning underscores its remarkable cross-modality generalization. Our method maintains stable performance even under aggressive low-bit configurations where baselines fail. This broad effectiveness stems from the fundamental nature of our block-wise affine transformation, which dynamically aligns activation outliers and mitigates quantization noise at a granular level, independent of specific data modalities or task semantics.
Qualitative Results. To provide insights into the mechanism behind our performance gains, we visualize the activation distributions across different quantization methods in Figure 5. As shown in Figure 5a, rotation-based methods (e.g., SpinQuant) tend to smooth the entire tensor. While this preserves the global energy, it may transfer energy from outlier-rich blocks to smoother blocks, amplifying quantization errors in those blocks. While FlatQuant (Figure 5b) effectively suppresses global energy, it fails to prevent this inter-block energy transfer. Furthermore, although BRQ (Figure 5c and Figure 2a) introduces block-wise rotation to smooth within local blocks, our visualization reveals that it often induces a bimodal distribution within quantization blocks. Our method (Figure 5d and Figure 2b) effectively prevents cross-block energy transfer while reshaping activations within blocks into a compact, unimodal distribution. More visualization results are provided in Appendix C.

4.3 Ablation Study

To validate the effectiveness of our core designs, we conduct ablation studies on both Qwen3-8B (LLM) and Qwen3-VL-8B-Instruct (MLLM) under the W4A4KV16 configuration. Here, we first study the effect of the block-wise affine transformation and block-wise learnable clipping.

Table 4: Ablation study of the block-wise affine transformation (Block Trans) and block-wise learnable clipping (Block Clip). We conduct the experiments under W4A4KV16.

Model      Block Trans  Block Clip  ARC-C  ARC-E  HellaSwag  PIQA   Winogrande  Avg.
Qwen3-8B   ✓            –           53.16  76.36  71.02      74.27  67.72       68.51
           –            ✓           52.35  77.44  71.71      76.01  63.69       68.24
           ✓            ✓           53.33  77.53  71.12      75.30  66.22       68.70

Model                 Block Trans  Block Clip  MME   OCRBench  DocVQA  RealWorldQA  VLMBlind  Recovery(%)
Qwen3-VL-8B-Instruct  ✓            –           2235  861       94.63   67.19        69.99     96.18
                      –            ✓           2249  865       94.04   67.21        70.28     95.59
                      ✓            ✓           2360  864       94.31   67.32        69.70     96.43

Fig. 6: Performance of Qwen3-8B (LLM) and Qwen3-VL-8B-Instruct (MLLM) with different transformation block sizes.

Fig. 7: Performance of Qwen3-8B (LLM) and Qwen3-VL-8B-Instruct (MLLM) with different sizes of the global shared matrix.

Effect of Block-wise Components. The baseline settings without the block-wise affine transformation or the block-wise learnable clipping refer to the use of their global-wise counterparts. As shown in Table 4, replacing the global transformation with our block-wise variant yields significant improvements. For Qwen3-8B, applying the block-wise transformation improves the average accuracy from 68.24% to 68.70%. Similarly, for Qwen3-VL-8B-Instruct, it boosts the recovery rate from 95.59% to 96.43%. Applying block-wise clipping also provides competitive gains: for Qwen3-8B, the average accuracy improves from 68.51% to 68.70%, and for Qwen3-VL-8B-Instruct, the recovery rate is boosted from 96.18% to 96.43%. These results confirm that using the block-wise affine transformation and block-wise learnable clipping under MXFP quantization is crucial.

Block Size of Affine Matrix. BATQuant aligns the block size of the affine transformation to the MXFP quantization granularity. To investigate the effect of the transformation scope, we vary the size of the affine transformation P_i while keeping the MXFP quantization block size fixed at g = 32.
As illustrated in Figure 6, for both Qwen3-VL-8B-Instruct and Qwen3-8B, the best performance is achieved when the transformation block size exactly matches the quantization block size (g = 32). This allows the affine transformations to precisely reshape local distributions, isolated from cross-block outlier interference. We also observe that deviating from this alignment leads to performance degradation. When the block size of the affine matrix is smaller than g (e.g., 16), the transformation scope is too narrow to smooth outliers across a quantization block. Additionally, distinct transformations lead to uneven energy (ℓ2-norm) suppression within quantization blocks, creating imbalanced distributions and inducing new local outliers. When the block size of the affine matrix is greater than g (e.g., 128), the transformation mixes elements across multiple quantization blocks. This transfers energy between blocks, which can increase quantization error. These findings suggest that strictly coupling the affine transformation granularity with the hardware quantization block size is an effective design choice.

Effect of GPK. To investigate the impact of the Global and Private Kronecker (GPK) module, we analyze the size of the global shared matrix A (denoted as g_1). Recall that g = g_1 · g_2 = 32; thus, varying g_1 inherently changes the capacity of both the shared global basis and the block-specific private components. We evaluate configurations with g_1 ∈ {1, 2, 4, 8, 16, 32}. The results are shown in Figure 7. Contrary to the intuition that increasing the number of learnable parameters (i.e., decreasing g_1) monotonically improves performance, our experiments reveal a non-monotonic trend with an optimal point at g_1 = 8 or g_1 = 4. When g_1 is large (e.g., 16 or 32), the dimension of the private matrix B_i becomes small (g_2 ≤ 2), severely limiting the ability of each block to adapt its local distribution independently and leading to a performance drop. Conversely, when g_1 is small (e.g., 1 or 2), the number of private parameters increases significantly, theoretically offering higher capacity. However, the search space is also expanded, and the optimizer may struggle to converge to a robust solution without more calibration data or hyperparameter tuning, leading to sub-optimal performance or instability. Therefore, to strike an optimal balance between accuracy and efficiency, we recommend the configuration with g_1 = 8 as the default setting.

5 Conclusion

In this paper, we present BATQuant, a robust framework for outlier-resilient MXFP4 quantization that leverages learnable block-wise optimization. By restricting affine transformations to align strictly with the hardware quantization granularity, our method effectively eliminates the cross-block energy transfer and bimodal distributions inherent in global rotation techniques. This targeted optimization, enhanced by the efficient Global and Private Kronecker (GPK) decomposition and block-wise learnable clipping, ensures precise outlier suppression with minimal overhead. Extensive experiments on MLLMs and LLMs validate that BATQuant sets new state-of-the-art results, achieving near-lossless results under W4A8KV16 and recovering up to 96.43% of full-precision performance under aggressive W4A4KV16 settings. We hope this work offers a practical solution for deploying large models on emerging microscaling architectures.
References

1. Advanced Micro Devices, Inc.: AMD CDNA 4 Architecture Whitepaper. White paper, Advanced Micro Devices, Inc. (2025)
2. Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R.K., Bai, Y., Baker, B., Bao, H., et al.: gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025)
3. Ashkboos, S., Mohtashami, A., Croci, M.L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., Hensman, J.: QuaRot: Outlier-free 4-bit inference in rotated LLMs. In: NeurIPS. pp. 100213–100240 (2024)
4. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint (2025)
5. Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al.: PIQA: Reasoning about physical commonsense in natural language. In: AAAI. vol. 34, pp. 7432–7439 (2020)
6. Chen, M., Wu, M., Jin, H., Yuan, Z., Liu, J., Zhang, C., Li, Y., Huang, J., Ma, J., Xue, Z., et al.: INT vs FP: A comprehensive study of fine-grained low-bit quantization formats. arXiv preprint arXiv:2510.25602 (2025)
7. Choquette, J.: NVIDIA Hopper H100 GPU: Scaling performance. IEEE Micro 43(3), 9–17 (2023)
8. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)
9. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
10. Cook, J., Guo, J., Xiao, G., Lin, Y., Han, S.: Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010 (2025)
11. Egiazarian, V., Castro, R.L., Kuznedelev, D., Panferov, A., Kurtic, E., Pandit, S., Marques, A., Kurtz, M., Ashkboos, S., Hoefler, T., et al.: Bridging the gap between promise and performance for microscaling FP4 quantization. arXiv preprint arXiv:2509.23202 (2025)
12. Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022)
13. Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. In: NeurIPS Datasets and Benchmarks Track (2025)
14. Hong, W., Yu, W., Gu, X., Wang, G., Gan, G., Tang, H., Cheng, J., Qi, J., Ji, J., Pan, L., et al.: GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint (2025)
15. Hu, W., Zhang, Z., Zhang, H., Zhang, C., Guo, C., Feng, Y., Hu, T., Li, G., Hu, G., Wang, J., et al.: M2XFP: A metadata-augmented microscaling data format for efficient low-bit quantization. arXiv e-prints pp. arXiv–2601 (2026)
16. Hu, X., Chen, Z., Yang, D., Xu, Z., Xu, C., Yuan, Z., Zhou, S., Yu, J.: MoEQuant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804 (2025)
17. Huang, X., Liu, Z., Liu, S.Y., Cheng, K.T.: RoLoRA: Fine-tuning rotated outlier-free LLMs for effective weight-activation quantization. In: Findings of EMNLP. pp. 7563–7576 (2024)
18. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR. pp. 6700–6709 (2019)
19. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: SOSP. pp. 611–626 (2023)
20. Lee, J., Park, J., Cha, S., Cho, J., Sim, J.: MX+: Pushing the limits of microscaling formats for efficient large language model serving. In: Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. pp. 869–883 (2025)
21. Li, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Huang, S.C., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., Polu, S.: NuminaMath (2024)
22. Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.Y., Han, S.: SVDQuant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007 (2024)
23. Li, S., Hu, Y., Ning, X., Liu, X., Hong, K., Jia, X., Li, X., Yan, Y., Ran, P., Dai, G., et al.: MBQ: Modality-balanced quantization for large vision-language models. In: CVPR. pp. 4167–4177 (2025)
24. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let's verify step by step. In: ICLR (2023)
25. Lin, H., Xu, H., Wu, Y., Cui, J., Zhang, Y., Mou, L., Song, L., Sun, Z., Wei, Y.: DuQuant: Distributing outliers via dual transformation makes stronger quantized LLMs. In: NeurIPS. pp. 87766–87800 (2024)
26. Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.M., Wang, W.C., Xiao, G., Dang, X., Gan, C., Han, S.: AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems 6, 87–100 (2024)
27. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS. pp. 34892–34916 (2023)
28. Liu, R., Sun, Y., Zhang, M., Bai, H., Yu, X., Yu, T., Yuan, C., Hou, L.: Quantization hurts reasoning? An empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823 (2025)
29. Liu, W., Meng, H., Luo, Y., Zhang, P., Ma, X.: MicroMix: Efficient mixed-precision quantization with microscaling formats for large language models. arXiv preprint arXiv:2508.02343 (2025)
30. Liu, X., Xia, X., Zhang, M., Li, J.F., Yu, X., Shen, F., Su, X., Ng, S.K., Chua, T.S.: FreeAct: Freeing activations for LLM quantization. arXiv preprint (2026)
31. Liu, X., Xia, X., Zhao, W., Zhang, M., Yu, X., Su, X., Yang, S., Ng, S.K., Chua, T.S.: L-MTP: Leap multi-token prediction beyond adjacent context for large language models. arXiv preprint arXiv:2505.17505 (2025)
32. Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences 67(12), 220102 (2024)
33. Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., Chandra, V., Tian, Y., Blankevoort, T.: SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406 (2024)
34. Luo, R., Wang, L., He, W., Chen, L., Li, J., Xia, X.: GUI-R1: A generalist R1-style vision-language action model for GUI agents. arXiv preprint (2025)
35. Luo, R., Xia, X., Wang, L., Chen, L., Shan, R., Luo, J., Yang, M., Chua, T.S.: NExT-OMNI: Towards any-to-any omnimodal foundation models with discrete flow matching. arXiv preprint arXiv:2510.13721 (2025)
36. Ma, Y., Li, H., Zheng, X., Ling, F., Xiao, X., Wang, R., Wen, S., Chao, F., Ji, R.: AffineQuant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544 (2024)
37. Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2200–2209 (2021)
38. Meng, H., Luo, Y., Zhao, Y., Liu, W., Zhang, P., Ma, X.: ARCQuant: Boosting NVFP4 quantization with augmented residual channels for LLMs. arXiv preprint arXiv:2601.07475 (2026)
39. Mishra, A., Stosic, D., Layton, S., Micikevicius, P.: Recipes for pre-training LLMs with MXFP8. arXiv preprint arXiv:2506.08027 (2025)
40. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: NeurIPS (2019)
41. Qin, G., Li, Z., Chen, Z., Zhang, W., Kong, L., Zhang, Y.: VEQ: Modality-adaptive quantization for MoE vision-language models. arXiv preprint (2026)
42. Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: GPQA: A graduate-level Google-proof Q&A benchmark. In: COLM (2024)
43. Rouhani, B.D., Zhao, R., More, A., Hall, M., Khodamoradi, A., Deng, S., Choudhary, D., Cornea, M., Dellinger, E., Denolf, K., et al.: Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537 (2023)
44. Sakaguchi, K., Bras, R.L., Bhagavatula, C., Choi, Y.: WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM 64(9), 99–106 (2021)
45. Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., Luo, P.: OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137 (2023)
46. Shao, Y., Wang, P., Chen, Y., Xu, C., Wei, Z., Cheng, J.: Block rotation is all you need for MXFP4 quantization. arXiv preprint arXiv:2511.04214 (2025)
47. Sun, Y., Liu, R., Bai, H., Bao, H., Zhao, K., Li, Y., Hu, J., Yu, X., Hou, L., Yuan, C., et al.: FlatQuant: Flatness matters for LLM quantization. arXiv preprint arXiv:2410.09426 (2024)
48. Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al.: Kimi-VL technical report. arXiv preprint (2025)
49. Tirumala, A., Wong, R.: NVIDIA Blackwell platform: Advancing generative AI and accelerated computing. In: HCS. pp. 1–33 (2024)
50. Tseng, A., Chee, J., Sun, Q., Kuleshov, V., De Sa, C.: QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. ICML (2024)
51. Wang, H., Ma, S., Wei, F.: BitNet v2: Native 4-bit activations with Hadamard transformation for 1-bit LLMs. arXiv preprint arXiv:2504.18415 (2025)
52. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
53. Wang, W., Chen, W., Luo, Y., Long, Y., Lin, Z., Zhang, L., Lin, B., Cai, D., He, X.: Model compression and efficient inference for large language models: A survey. arXiv preprint arXiv:2402.09748 (2024)
54. Wei, X., Zhang, Y., Li, Y., Zhang, X., Gong, R., Guo, J., Liu, X.: Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In: EMNLP. pp. 1648–1665 (2023)
55. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45 (2020)
56. Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)
57. xAI: RealWorldQA: A benchmark for real-world spatial understanding (2024)
58. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., Han, S.: SmoothQuant: Accurate and efficient post-training quantization for large language models. In: ICML. pp. 38087–38099 (2023)
59. Xin, M., Priyadarshi, S., Xin, J., Kartal, B., Vavre, A., Thekkumpate, A.K., Chen, Z., Mahabaleshwarkar, A.S., Shahaf, I., Bercovich, A., et al.: Quantization-aware distillation for NVFP4 inference accuracy recovery. arXiv preprint (2026)
60. Xu, M., Yin, W., Cai, D., Yi, R., Xu, D., Wang, Q., Wu, B., Zhao, Y., Yang, C., Wang, S., et al.: A survey of resource-efficient LLM and multimodal foundation models. arXiv preprint arXiv:2401.08092 (2024)
61. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
62. Yu, J., Zhou, S., Yang, D., Li, S., Wang, S., Hu, X., Xu, C., Xu, Z., Shu, C., Yuan, Z.: MQuant: Unleashing the inference potential of multimodal large language models via static quantization. In: ACM MM. pp. 1783–1792 (2025)
63. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 (2019)
64. Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al.: GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471 (2025)
65. Zhang, J., Wei, J., Zhang, P., Xu, X., Huang, H., Wang, H., Jiang, K., Chen, J., Zhu, J.: SageAttention3: Microscaling FP4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594 (2025)
66. Zhang, M., Li, J.F., Sun, Z., Bai, H., Zhen, H.L., Dong, Z., Yu, X.: Benchmarking post-training quantization of large language models under microscaling floating point formats. arXiv preprint arXiv:2601.09555 (2026)
67. Zhao, P., Zhen, H.L., Li, X., Bao, H., Lin, W., Yang, Z., Yu, Z., Wang, X., Yuan, M., Yu, X., et al.: Unleashing low-bit inference on Ascend NPUs: A comprehensive evaluation of HiFloat formats. arXiv preprint arXiv:2602.12635 (2026)
68. Zhu, X., Li, J., Liu, Y., Ma, C., Wang, W.: A survey on model compression for large language models. Transactions of the Association for Computational Linguistics 12, 1556–1577 (2024)

A Implementation Details

A.1 Multimodal Benchmarks

– MME. It is a collection of benchmarks to evaluate the multimodal understanding capability of large vision-language models (LVLMs).
– OCRBench. OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of large multimodal models. It contains 1,000 question-answer pairs covering Text Recognition, Scene-Text-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition.
– DocVQA. DocVQA is a benchmark for Visual Question Answering (VQA) on document images. The dataset consists of 50,000 questions defined on more than 12,000 document images.
– RealWorldQA. It is a benchmark designed to test spatial and physical reasoning. It features high-quality images taken from vehicles and egocentric views, challenging models to answer questions about object relations and environmental context in unconstrained, realistic settings.
– VLMBlind. It is a benchmark of seven novel low-level visual tasks for testing VLM ability to "see" simple geometric primitives (such as lines, circles, squares, and intersections) that are the basic building blocks of image tasks.

For all multimodal benchmarks, we use the vllm [19] backend for evaluation with a sampling temperature of 0.7, a top-p value of 0.8, a top-k value of 20, and a presence penalty of 2.0. The maximum sequence length of the model is limited to 32,768.

A.2 Non-reasoning Benchmarks

– PIQA. It is a physical commonsense reasoning benchmark designed to investigate the physical knowledge of existing models.
– Winogrande. Winogrande is a collection of 44k problems formulated as a fill-in-the-blank task with binary options; the goal is to choose the right option for a given sentence, which requires commonsense reasoning.
– HellaSwag. It is a commonsense inference benchmark designed to challenge language models with adversarially filtered multiple-choice questions.
– ARC-Easy & ARC-Challenge. The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources. Each question has a multiple-choice structure (typically 4 answer options). ARC-Easy contains 5,197 easy questions, and ARC-Challenge contains 2,590 hard questions.

A.3 Reasoning Benchmarks

– GSM8K. GSM8K is a dataset of approximately 8,500 high-quality, linguistically diverse grade school math word problems created by human writers. We employ its test split, which contains 1,319 examples in total. We evaluate model performance using Avg@1 (i.e., the accuracy of the first generated answer).
– MATH-500. A benchmark that contains a mix of easy and hard mathematical problems designed to test comprehensive reasoning abilities. We evaluate model performance using Avg@3, which averages accuracy over 3 independently sampled reasoning traces per problem.
– AIME24. It contains 30 problems from the American Invitational Mathematics Examination (AIME) 2024. We report results using Avg@16, which averages accuracy over 16 independently sampled reasoning traces per problem.
– AIME25. It contains 30 problems from the American Invitational Mathematics Examination (AIME) 2025. We report results using Avg@16, which averages accuracy over 16 independently sampled reasoning traces per problem.
– GPQA-D. GPQA is a benchmark of graduate-level questions authored and validated by PhD experts. It is designed to be "Google-proof": highly skilled non-experts with unrestricted web access achieve only 34% accuracy, while domain experts reach 65% (74% after error correction). We report results using Avg@10, which averages accuracy over 10 independently sampled reasoning traces per problem.

For all reasoning benchmarks, we use the vllm [19] backend for evaluation with a sampling temperature of 0.6, a top-p value of 0.95, and a top-k value of 20. The maximum sequence length of the model is limited to 38,912.
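As a concrete illustration of the decoding setup above, a hedged vLLM sketch is given below; the model path, prompt, and generation budget are placeholders, and the exact evaluation harness wrapping these calls is not specified in the paper.

```python
# Hedged sketch of the vLLM decoding setup for the reasoning benchmarks.
# Model path, prompt, and max_tokens split are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/quantized-qwen3-8b", max_model_len=38912)
params = SamplingParams(
    temperature=0.6,   # reasoning benchmarks; multimodal ones use 0.7 / top_p 0.8
    top_p=0.95,
    top_k=20,
    max_tokens=32768,  # assumed generation budget within the 38,912 context
)
outputs = llm.generate(["Solve: 12 * 13 = ?"], params)
print(outputs[0].outputs[0].text)
```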
A.4 Baseline Methods

– RTN. It is the straightforward quantization strategy that maps original floating-point values to the nearest grid point, without additional optimization or calibration.
– QuaRot. It uses randomized Hadamard transforms to rotate weights and activations into a space where outliers are suppressed, enabling outlier-free 4-bit quantization.
– SpinQuant. It employs orthogonal matrices optimized via the Cayley optimizer to rotate weights and activations into a space where outliers are suppressed.
– BRQ. It is equipped with block-wise rotation to prevent energy transfer when rotating weights and activations.
– FlatQuant. It is designed to improve low-bit quantization by flattening the activation distributions using global affine matrices, specifically optimized for efficient deployment on hardware.
– SmoothQuant. It uses diagonal scales to smooth activation outliers by migrating the quantization difficulty from activations to weights.
– GPTQ. It is a layer-wise post-training quantization method that leverages approximate second-order information (Hessian) to minimize quantization errors, achieving high accuracy for weight-only low-bit quantization.

A.5 Hyperparameter Settings

We implement BATQuant based on Huggingface [55] and PyTorch [40]. We adopt the AdamW optimizer with an initial learning rate of 2e-3 and employ a cosine annealing learning rate decay schedule. BATQuant is trained for 5 epochs, and the batch size is set to 4. For GPK, we set the sizes of the global shared matrix g_1 and the block-specific private matrix g_2 to 8 and 4, respectively. To simulate quantization with the MXFP format, we use the microxcaling library (https://github.com/microsoft/microxcaling) for all experiments.
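A minimal sketch of the calibration optimizer setup described above (AdamW at 2e-3 with cosine annealing over 5 epochs and batch size 4); the parameter list, the per-epoch step count, and the stand-in loss are placeholders.

```python
import torch

# Hedged sketch of the A.5 optimizer setup; `learnable_params` and the number
# of steps per epoch are placeholders (e.g. 128 calibration samples / batch 4).
learnable_params = [torch.nn.Parameter(torch.eye(8))]     # e.g. GPK matrices, clip ratios
steps_per_epoch, epochs = 32, 5

optimizer = torch.optim.AdamW(learnable_params, lr=2e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=steps_per_epoch * epochs)

for _ in range(steps_per_epoch * epochs):
    optimizer.zero_grad()
    loss = learnable_params[0].square().mean()             # stand-in for the layer-wise MSE
    loss.backward()
    optimizer.step()
    scheduler.step()
```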
A.5 Hyperparameter Settings
We implement BATQuant based on Huggingface [55] and PyTorch [40]. We adopt the AdamW optimizer with an initial learning rate of 2e-3 and employ a cosine annealing learning rate decay schedule. BATQuant is trained for 5 epochs, and the batch size is set to 4. For GPK, we set the size of the global shared matrix g1 and the block-specific private matrix g2 to 8 and 4, respectively. To simulate quantization with the MXFP format, we use the microxcaling library³ for all experiments.

B Detailed Algorithm Flow
In this section, we provide the detailed algorithmic implementation of BATQuant. We first formalize the efficient forward pass of the Global and Private Kronecker (GPK) decomposition, followed by the complete calibration procedure for learning the block-wise affine transformations and clipping parameters.

B.1 Efficient Inference via GPK Forward Pass
To minimize runtime overhead during inference, the block-wise affine transformation P_i = B_i ⊗ A is not materialized as a full dense matrix. Instead, we leverage the Kronecker product structure to perform the transformation efficiently without explicit matrix construction. Specifically, for the i-th block input vector of size g = g1 · g2, the operation proceeds in three steps. First, the input vector is reshaped into a 3D matrix of dimensions 1 × g2 × g1. Second, this matrix is multiplied by the global shared matrix A ∈ R^{g1×g1} from the right and by the block-specific private matrix B_i ∈ R^{g2×g2} from the left. Finally, the resulting matrix is reshaped back to its original shape. Algorithm 1 details the vectorized implementation of this operation for a batch of inputs across all blocks.

³ https://github.com/microsoft/microxcaling

Algorithm 2 BATQuant Algorithm Flow
Require: Full-precision weights W ∈ R^{M×N}, layer input X ∈ R^{B×S×N}, global matrix A ∈ R^{g1×g1}, private matrices {B_i}_{i=1}^{k}, quantization block size g, epochs E.
Ensure: Calibrated parameters Θ = {A, {B_i}, {α_i^min, α_i^max}} for each layer.
1: for epoch = 1 to E do
2:   for each batch in X do
       1. Transformation
3:     Obtain transformed activations X̃ using X, A and {B_i} based on Alg. 1.
4:     Obtain transformed weights W̃ using W, A^{-1} and {B_i^{-1}} based on Alg. 1.
5:     Apply block-wise clipping to weights W̃ and activations X̃.
       2. Quantization
6:     X̃ ← Q(X̃), W̃ ← Q(W̃)
       3. Loss Computation & Optimization
7:     Ỹ ← X̃ W̃⊤, Y ← X W⊤
8:     L ← ∥Y − Ỹ∥²₂
9:     Update Θ_l using ∇_{Θ_l} L
10:  end for
11: end for
     4. Offline Fusion for Deployment
12: Obtain transformed weights W̃ using W, A^{-1} and {B_i^{-1}} based on Alg. 1.
13: Apply block-wise clipping to weights W̃.
14: W̃ ← Q(W̃)
15: Store Θ = {A, {B_i}, {α_i^min, α_i^max}} for online activation transformation.

B.2 BATQuant Calibration Procedure
The calibration process aims to learn the optimal parameters Θ that minimize the difference between the full-precision layer output and the quantized output. Algorithm 2 outlines the end-to-end training flow. For each layer in the network, we iterate over a small calibration dataset:
1. In each iteration, we apply the GPK-based affine transformation to weights and activations (Lines 3-4).
2. We apply the block-wise learnable clipping to weights and activations (Line 5).
3. The transformed activations and the corresponding inverse-transformed weights are quantized to the target MXFP format (Line 6).
4. The loss is computed as the Mean Squared Error (MSE) between the full-precision output and the quantized output (Lines 7-8).
5. Parameters are updated via backpropagation using the AdamW optimizer (Line 9).
After calibration, the weight-side transformation P^{-1} is fused into the original weights W offline, while the activation-side transformation P and the clipping parameters are retained for online inference; a sketch of this per-layer loop is given below.
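The following PyTorch-style sketch ties Algorithm 2 to the hyperparameters in Sec. A.5 (AdamW, learning rate 2e-3, cosine annealing, 5 epochs). The function name and the placeholder callables (gpk_act, gpk_weight, clip_fn, quantize_fn) are ours; quantize_fn stands in for the MXFP simulation via microxcaling and is assumed to use a straight-through estimator so gradients reach Θ.

import torch

def calibrate_layer(W, calib_batches, params, gpk_act, gpk_weight,
                    clip_fn, quantize_fn, epochs=5, lr=2e-3):
    """Sketch of Algorithm 2 for a single linear layer.

    W:             [M, N] full-precision weight
    calib_batches: list of activation batches X with shape [B, S, N]
    params:        learnable tensors (A, {B_i}, block-wise clipping bounds)
    gpk_act:       applies P to activations (Alg. 1)
    gpk_weight:    applies P^{-1} along the weight's input dimension (Alg. 1)
    clip_fn:       block-wise learnable clipping (placeholder)
    quantize_fn:   MXFP quantization simulator with STE (placeholder)
    """
    opt = torch.optim.AdamW(params, lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs * len(calib_batches))

    for _ in range(epochs):
        for X in calib_batches:                 # batch size 4 in Sec. A.5
            # 1. Transformation and block-wise clipping
            X_t = clip_fn(gpk_act(X))
            W_t = clip_fn(gpk_weight(W))
            # 2. Quantization to the target MXFP format
            X_q, W_q = quantize_fn(X_t), quantize_fn(W_t)
            # 3. Output-reconstruction loss against the full-precision layer
            loss = torch.sum((X @ W.T - X_q @ W_q.T) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()

    # 4. Offline fusion for deployment: bake P^{-1} and clipping into the weight once.
    return quantize_fn(clip_fn(gpk_weight(W)))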
Table 5: Performance comparison of various quantization methods on non-reasoning benchmarks across different bit-width configurations (W4A8KV16, W4A4KV16, W4A8KV8, and W4A8KV4). The recovery rate relative to the BF16 baseline is also provided, and the best result in each case is marked in bold.

Columns: ARC-C  ARC-E  HellaSwag  PIQA  Winogrande  Avg.  Recovery(%)

BF16 (baseline)  56.48  81.06  74.96  77.69  68.03  71.64  100.00

W4A8KV16
  RTN          55.72  80.81  73.29  77.09  66.93  70.77  98.75
  QuaRot       55.20  78.70  72.77  76.88  65.11  69.73  97.31
  SpinQuant    54.69  76.98  72.76  78.07  66.85  69.87  97.52
  BRQ          53.67  78.87  73.27  76.66  66.93  69.88  97.43
  FlatQuant    55.72  79.63  72.66  76.82  66.22  70.21  98.01
  SmoothQuant  55.80  79.04  72.38  76.55  66.85  70.12  97.93
  GPTQ         55.89  80.60  73.16  77.31  67.09  70.81  98.82
  Ours         56.14  79.92  73.10  77.97  68.59  71.14  99.34

W4A4KV16
  RTN          52.47  76.89  70.44  74.16  64.80  67.75  94.49
  QuaRot       50.43  74.28  67.55  73.67  63.38  65.86  91.81
  SpinQuant    45.65  68.18  67.41  74.21  62.19  63.53  88.36
  BRQ          48.55  74.71  68.79  75.24  63.93  66.24  92.14
  FlatQuant    50.60  78.20  70.36  75.63  63.54  67.67  94.13
  SmoothQuant  50.09  75.72  70.15  74.37  64.64  66.99  93.29
  GPTQ         51.28  76.98  70.47  75.79  64.56  67.82  94.44
  Ours         53.33  77.53  71.12  75.30  66.22  68.70  95.84

W4A8KV8
  RTN          55.72  80.51  72.86  76.55  66.93  70.51  98.42
  QuaRot       55.38  79.84  72.54  76.88  66.22  70.17  97.92
  SpinQuant    53.50  77.65  72.56  77.53  65.90  69.43  96.80
  BRQ          52.99  78.11  73.09  76.88  67.80  69.77  97.26
  FlatQuant    52.56  77.10  72.46  77.09  68.19  69.48  96.86
  SmoothQuant  55.03  79.21  72.76  76.99  67.40  70.28  98.08
  GPTQ         56.06  80.68  72.95  77.53  66.46  70.74  98.72
  Ours         55.63  79.80  73.15  77.09  67.17  70.57  98.50

W4A8KV4
  RTN          51.96  76.89  70.54  75.08  63.61  67.62  94.22
  QuaRot       52.73  76.47  70.15  74.81  62.27  67.29  93.82
  SpinQuant    49.32  74.07  69.82  75.95  63.30  66.49  92.53
  BRQ          50.68  75.97  70.38  74.65  62.43  66.82  93.04
  FlatQuant    52.13  77.90  69.60  75.14  62.51  67.46  93.97
  SmoothQuant  49.74  73.23  69.61  75.24  66.85  66.93  93.28
  GPTQ         52.39  76.52  71.25  75.73  65.35  68.25  95.15
  Ours         53.33  78.54  69.53  76.66  65.19  68.65  95.71

C Additional Results

C.1 Results of Non-Reasoning Tasks
Table 5 presents the comprehensive performance comparison on non-reasoning benchmarks (ARC-C, ARC-E, HellaSwag, PIQA, and Winogrande) under four distinct quantization configurations. In the most challenging W4A4KV16 configuration, BATQuant achieves an average accuracy of 68.70%, corresponding to a 95.84% recovery rate relative to the BF16 baseline. This significantly outperforms the strongest competing methods, including GPTQ (67.82%, 94.44%) and FlatQuant (67.67%, 94.13%). Notably, rotation-based methods like SpinQuant suffer from catastrophic failure in this regime, dropping to only 63.53% accuracy. Similarly, under the W4A8KV4 setting with aggressive KV cache quantization, BATQuant secures the highest average accuracy (68.65%) and recovery rate (95.71%), surpassing GPTQ by a margin of 0.40%.

Under the W4A8KV16 configuration, BATQuant achieves a near-lossless recovery rate of 99.34% (Avg. 71.14%), establishing a new state-of-the-art result that exceeds GPTQ (98.82%) and RTN (98.75%). In the W4A8KV8 setting, the performance gap narrows as the quantization difficulty decreases. Here, GPTQ achieves the highest average score (70.74%), while BATQuant remains highly competitive at 70.57%, outperforming all other transformation-based methods (e.g., FlatQuant at 69.48%).

C.2 Results of GPTQ and RTN Weight Quantizers
Table 6 and Table 7 compare GPTQ and RTN as weight quantizers across various MXFP configurations. The results show that GPTQ consistently outperforms RTN in all evaluated settings. This improvement is attributed to GPTQ's approximate second-order optimization, which minimizes quantization error by accounting for inter-channel weight correlations.
In contrast, RTN applies per-element rounding independently, without leveraging the structural redundancy that GPTQ utilizes for error compensation. Given these consistent results, GPTQ serves as a more effective weight quantization strategy than RTN.
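To make the contrast concrete, the snippet below sketches plain block-wise RTN for an MXFP4-style format. It is only an illustration under assumed format details (32-element blocks, one shared power-of-two scale per block, and an FP4 E2M1 magnitude grid of {0, 0.5, 1, 1.5, 2, 3, 4, 6}); the actual experiments rely on the microxcaling library, and the scale-selection rule here is a simple choice of ours, not the library's.

import torch

# FP4 (E2M1) magnitude grid assumed for illustration.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_mxfp4(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Toy block-wise round-to-nearest for an MXFP4-style format.

    Each block shares one power-of-two scale; every element is then rounded to
    the FP4 grid independently of its neighbours. This is the "per-element
    rounding" contrasted with GPTQ's error compensation above.
    """
    orig_shape = w.shape
    assert w.numel() % block == 0
    wb = w.reshape(-1, block)
    amax = wb.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # One simple power-of-two scale choice: map the block maximum near the FP4 max (6).
    scale = torch.exp2(torch.floor(torch.log2(amax / FP4_GRID[-1])))
    # Independent per-element round-to-nearest onto the magnitude grid.
    idx = ((wb.abs() / scale).unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    deq = torch.sign(wb) * FP4_GRID[idx] * scale
    return deq.reshape(orig_shape)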
C.3 Activation Visualization
Here, we provide the full details of the activation distributions within different quantization blocks, as shown in Figure 8, Figure 9, Figure 10, and Figure 11.

C.4 Case Studies
We qualitatively compare BATQuant against BRQ on geometric reasoning and OCR tasks under the W4A4KV16 setting. As shown in Figures 12 and 13, while BRQ suffers from feature distortion leading to hallucinations, BATQuant preserves critical visual details, matching the BF16 baseline. In Figure 12, the task requires counting line intersections. The BRQ baseline incorrectly hallucinates an intersection point ({1}), likely due to quantization noise distorting edge continuity. In contrast, BATQuant correctly identifies zero intersections ({0}), demonstrating superior preservation of spatial structures. Figure 13 presents a challenging train number recognition task. BRQ fails to capture the full sequence, truncating the answer to "055". Conversely, BATQuant accurately recovers the complete number "055 05995", proving its effectiveness in retaining high-frequency details essential for dense text recognition. These cases highlight that, unlike BRQ, which struggles with subtle visual cues under aggressive quantization, BATQuant robustly maintains semantic fidelity.

[Figure 8: a grid of per-block activation histograms (values roughly in [−6, 6], with a per-block statistic annotated on each panel).]
Fig. 8: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with RTN.

[Figure 9: per-block activation histograms in the same layout as Figure 8.]
Fig. 9: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with BRQ.

[Figure 10: per-block activation histograms in the same layout as Figure 8.]
Fig. 10: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with QuaRot.
[Figure 11: per-block activation histograms in the same layout as Figure 8.]
Fig. 11: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with BATQuant (Ours).

Table 6: Performance comparison of different quantization methods on multimodal benchmarks using RTN and GPTQ as weight quantizers. Bold indicates the best result within each quantizer setting (RTN or GPTQ) for a specific bit configuration.

Columns: MME  OCRBench  DocVQA  RealWorldQA  VLMBlind  Recovery(%)

W4A8KV16 (RTN weight quantizer)
  QuaRot     2201  814  93.11  65.36  63.21  91.43
  BRQ        2272  831  93.66  69.80  62.11  93.47
  FlatQuant  2311  880  94.65  66.14  67.96  95.64
  BATQuant   2312  877  94.58  66.80  69.27  96.11
W4A8KV16 (GPTQ weight quantizer)
  QuaRot     2327  870  95.07  69.80  71.12  97.53
  BRQ        2329  865  94.72  70.19  67.18  96.40
  FlatQuant  2351  886  95.31  69.02  73.90  98.66
  BATQuant   2386  893  95.55  70.20  73.14  99.29

W4A4KV16 (RTN weight quantizer)
  QuaRot     1965  710  90.91  60.48  55.31  83.18
  BRQ        2096  749  91.09  61.83  56.65  85.92
  FlatQuant  2147  846  93.14  62.48  65.49  91.49
  BATQuant   2255  838  93.68  64.71  66.84  93.33
W4A4KV16 (GPTQ weight quantizer)
  QuaRot     2189  810  93.47  64.97  57.62  89.69
  BRQ        2147  805  92.94  66.14  62.41  90.75
  FlatQuant  2231  873  94.10  65.62  68.86  94.79
  BATQuant   2360  864  94.31  67.32  69.70  96.43

W4A8KV8 (RTN weight quantizer)
  QuaRot     2143  816  93.27  65.36  62.49  90.82
  BRQ        2277  815  93.55  69.93  60.24  92.67
  FlatQuant  2285  871  94.11  60.52  70.04  94.09
  BATQuant   2301  867  94.72  66.67  72.71  96.71
W4A8KV8 (GPTQ weight quantizer)
  QuaRot     2296  868  95.11  69.02  70.26  96.78
  BRQ        2283  867  94.63  69.80  67.36  95.98
  FlatQuant  2353  888  95.12  69.14  72.77  98.41
  BATQuant   2368  890  95.47  69.93  72.82  98.89

W4A8KV4 (RTN weight quantizer)
  QuaRot     2112  781  92.67  62.48  60.34  88.27
  BRQ        2194  807  92.75  66.27  57.31  89.80
  FlatQuant  2257  867  94.05  59.87  71.05  93.84
  BATQuant   2289  874  94.64  64.97  71.06  95.83
W4A8KV4 (GPTQ weight quantizer)
  QuaRot     2280  857  94.66  68.52  68.36  95.65
  BRQ        2236  841  94.07  68.63  66.03  94.21
  FlatQuant  2293  884  94.88  68.76  70.75  97.11
  BATQuant   2332  885  95.07  68.63  70.92  97.51
Table 7: Performance comparison of different quantization methods on LLM non-reasoning benchmarks using RTN and GPTQ as weight quantizers. Bold indicates the best result within each quantizer setting (RTN or GPTQ) for a specific bit configuration.

Columns: ARC-C  ARC-E  HellaSwag  PIQA  Winogrande  Avg.

W4A8KV16 (RTN weight quantizer)
  QuaRot     51.37  75.76  70.04  76.61  65.67  67.89
  BRQ        47.44  72.87  71.37  75.84  65.19  66.54
  FlatQuant  55.63  78.83  72.46  76.22  66.85  70.00
  BATQuant   54.33  77.48  72.23  76.61  68.25  69.78
W4A8KV16 (GPTQ weight quantizer)
  QuaRot     55.20  78.70  72.77  76.88  65.11  69.73
  BRQ        53.67  78.87  73.27  76.66  66.93  69.88
  FlatQuant  55.72  79.63  72.66  76.82  66.22  70.21
  BATQuant   56.14  79.92  73.10  77.97  68.59  71.14

W4A4KV16 (RTN weight quantizer)
  QuaRot     44.88  70.37  65.09  74.54  62.51  63.48
  BRQ        45.90  67.51  68.47  74.16  61.40  63.49
  FlatQuant  51.11  75.93  69.02  74.92  62.83  66.76
  BATQuant   50.09  75.55  71.00  75.19  66.85  67.74
W4A4KV16 (GPTQ weight quantizer)
  QuaRot     50.43  74.28  67.55  73.67  63.38  65.86
  BRQ        48.55  74.71  68.79  75.24  63.93  66.24
  FlatQuant  50.60  78.20  70.36  75.63  63.54  67.67
  BATQuant   53.33  77.53  71.12  75.30  66.22  68.70

W4A8KV4 (RTN weight quantizer)
  QuaRot     47.18  72.64  67.43  74.32  60.06  64.33
  BRQ        45.82  69.82  69.71  74.21  62.43  64.40
  FlatQuant  48.12  73.23  68.96  74.37  63.30  65.60
  BATQuant   50.85  75.97  70.07  76.50  64.56  67.59
W4A8KV4 (GPTQ weight quantizer)
  QuaRot     52.73  76.47  70.15  74.81  62.27  67.29
  BRQ        50.68  75.97  70.38  74.65  62.43  66.82
  FlatQuant  52.13  77.90  69.60  75.14  62.51  67.46
  BATQuant   53.33  78.54  69.53  76.66  65.19  68.65

W4A8KV8 (RTN weight quantizer)
  QuaRot     52.30  76.47  69.68  77.04  65.67  68.23
  BRQ        48.55  72.47  71.84  76.66  64.96  66.90
  FlatQuant  52.73  77.09  72.18  76.71  64.25  68.59
  BATQuant   54.52  79.55  72.20  76.50  65.59  69.67
W4A8KV8 (GPTQ weight quantizer)
  QuaRot     55.38  79.84  72.54  76.88  66.22  70.17
  BRQ        52.99  78.11  73.09  76.88  67.80  69.77
  FlatQuant  52.56  77.10  72.46  77.09  68.19  69.48
  BATQuant   55.63  79.80  73.15  77.09  67.17  70.57

Prompt: Question: Count the intersection points where the blue and red lines meet. Put your answer in curly brackets, e.g., {2}.
Response (BF16): {0}
Response (BRQ, W4A4KV16): There is one intersection point where the blue and red lines meet. The red lines form a "V" shape, and the blue line runs diagonally across the image. They intersect at a single point. {1}
Response (BATQuant, W4A4KV16): {0}
Fig. 12: Case study of Qwen3-VL-8B-Instruct on VLMBlind. The input includes a real image (shown above) and a text question asking to count intersection points. Compared with the BRQ method, which fails by hallucinating an intersection ({1}), BATQuant correctly identifies that there are no intersections ({0}), matching the BF16 baseline.

Prompt: What is the train number?
Response (BF16): 055 05995
Response (BRQ, W4A4KV16): Based on the image provided, the train number is **055**. This number is clearly visible on the side of the train car in the image. It appears as a white or light-colored number (likely "055")... The train appears to be a passenger train, likely operating within Vietnam given the Vietnamese language used in the markings.
Response (BATQuant, W4A4KV16): Based on the image provided, the train number is **055 05995**.
This number is clearly visible on the side of the train car, just below the window. The "055" appears to be the train's route or service number, and "05995" is likely its specific car or vehicle number within that service.
Fig. 13: Case study of Qwen3-VL-8B-Instruct on OCRBench. The input includes a real image of a train and a question asking for the train number. Compared with the BRQ method, which fails by only recognizing partial information ("055"), BATQuant correctly identifies the full train number ("055 05995"), matching the BF16 baseline.