IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression


Authors: Zhongping Ji

Abstract

Orthogonal feature decorrelation is a useful primitive for low-bit online vector quantization, but the standard approach of applying a dense random orthogonal matrix incurs prohibitive $O(d^2)$ storage and compute. Recent work such as RotorQuant reduces this cost by replacing the global transform with blockwise 3D Clifford rotors. While effective, the resulting 3D partition is not naturally aligned with modern hardware, since transformer feature widths are typically powers of two and thus lead to awkward tails, irregular memory access, and limited local mixing. We propose IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. Each block of four coordinates is identified with a quaternion and transformed by the closed-form map $T(v) = q_L v \bar{q}_R$, parameterized by a pair of unit quaternions. This yields two main variants: IsoQuant-Full, which realizes the full six degrees of freedom of $SO(4)$, and IsoQuant-Fast, which retains a single isoclinic factor for lower overhead. The same framework also admits a lightweight 2D planar special case, which we treat as an auxiliary operating point rather than the primary method. At $d = 128$, IsoQuant-Full reduces forward rotation cost from approximately 2,408 FMAs in RotorQuant to 1,024, while IsoQuant-Fast reduces it further to 512. Across 18 fused CUDA benchmark settings spanning $d \in \{128, 256, 512\}$, bit widths $\{2, 3, 4\}$, and both FP16 and FP32 execution, IsoQuant-Full, IsoQuant-Fast, and the 2D special case achieve mean kernel-level speedups of 4.49×, 4.66×, and 4.66× over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above 6×.
Our current validation is limited to the stage-1 quantize–dequantize path on synthetic normalized vectors; full end-to-end KV-cache evaluation on real model activations remains future work. Code: https://github.com/ParaMind2025/isoquant

1 Introduction

KV-cache compression has emerged as a central systems bottleneck for long-context large language model inference. A core insight behind online vector quantization methods such as TurboQuant [1] is that decorrelating features before scalar quantization substantially improves rate–distortion behavior. In particular, a random orthogonal transform spreads information across coordinates, making per-coordinate Lloyd–Max quantization far more effective than quantization in the original basis.

The main drawback is cost. For a head dimension $d$, a dense orthogonal transform requires $O(d^2)$ parameters and arithmetic, which is difficult to justify in latency-sensitive settings such as autoregressive decoding. RotorQuant [2] addresses this issue by replacing the dense transform with sparse blockwise rotations in the geometric algebra Cl(3,0). This preserves the qualitative benefit of local decorrelation while reducing complexity to linear in $d$.

Despite this progress, the 3D block structure leaves performance on the table. First, it is not hardware aligned: common head dimensions such as 64, 128, and 256 are powers of two, so partitioning into triples creates awkward remainder handling and a non-ideal memory layout. For example, $d = 128$ yields forty-two full 3D blocks plus one 2D tail. Second, a 3D rotation has only three intrinsic degrees of freedom, which limits how aggressively each local subspace can mix correlated coordinates.

This paper studies a different operating point.
We move from 3D Clifford blocks to 4D quaternion blocks and leverage the Lie-theoretic decomposition

$\mathfrak{so}(4) \cong \mathfrak{su}(2) \oplus \mathfrak{su}(2)$,  (1)

which implies that every 4D rotation can be represented by a pair of unit quaternions acting from the left and right. This yields a closed-form, low-overhead parameterization of $SO(4)$ that is both mathematically clean and implementation-friendly. We therefore name the method IsoQuant, emphasizing the isoclinic structure underlying its blockwise $SO(4)$ transform.

Contributions. Our contributions are as follows.

1. We introduce IsoQuant, a 4D block rotation scheme for online vector quantization based on quaternion sandwich products.
2. We derive two practical 4D instantiations, IsoQuant-Full and IsoQuant-Fast, and show that the same framework also admits a lightweight 2D planar special case as an additional operating point.
3. We analyze why 4D blocks are attractive for systems deployment: they improve arithmetic efficiency, eliminate most boundary handling, and align naturally with SIMD and fused-kernel execution.
4. We provide a fused CUDA implementation and show, in fair kernel-level comparisons against RotorQuant fused CUDA kernels, that IsoQuant achieves consistent speedups while preserving reconstruction quality.

2 Related Work

Online vector quantization and KV-cache compression. TurboQuant [1] frames online vector quantization as a dense-rotation-plus-scalar-quantization pipeline and combines it with QJL residual correction [3]. RotorQuant [2] reduces the heavy dense transform by replacing it with sparse 3D Clifford rotors. Other KV-cache compression methods such as KIVI [4] and KVQuant [5] focus on calibration, asymmetric quantization, or hardware-aware packing rather than explicit geometric decorrelation.

Quaternion and geometric representations. Unit quaternions are a classical tool for representing rotations [6].
In machine learning, algebraically structured representations have appeared in equivariant models, geometric neural networks, and efficient transformations on structured features. RiemannFormer [7] provides a recent geometric treatment that explicitly discusses the isoclinic decomposition of $SO(4)$ in the setting of curved-space attention. Our use of this structure is narrower and more systems-oriented: we exploit the same $SO(4)$ factorization to design a cheap blockwise decorrelating transform for low-bit quantization.

3 Problem Setup

We consider the stage-1 mean-squared-error component of online vector quantization. Given an input vector $x \in \mathbb{R}^d$, we seek an encoder $E$, scalar quantizer $Q$, and decoder $D$ such that

$\hat{x} = D(Q(E(x)))$  (2)

minimizes reconstruction error while keeping both parameter count and online compute small. As in prior work [1, 2], we factor the problem into three pieces:

1. an orthogonal or approximately orthogonal transform that decorrelates coordinates;
2. coordinate-wise scalar quantization in the transformed basis;
3. an inverse transform to reconstruct the vector.

To stabilize the scalar quantizer, we additionally separate norm and direction. Writing

$x = \rho \bar{x}, \quad \rho = \|x\|_2, \quad \|\bar{x}\|_2 = 1$,  (3)

we quantize the normalized direction $\bar{x}$ and store or transmit the norm $\rho$ separately. This follows the implementation pattern used by efficient quantizers and keeps the transformed coordinates within a predictable dynamic range.

4 Quaternion and $SO(4)$ Preliminaries

We begin by identifying each 4D block with a quaternion

$v = x_0 + x_1 i + x_2 j + x_3 k \in \mathbb{H}$,  (4)

where $\mathbb{H}$ denotes the quaternion algebra and $i^2 = j^2 = k^2 = ijk = -1$ [6]. For

$q = a + b i + c j + d k$,  (5)

its quaternion conjugate is

$\bar{q} = a - b i - c j - d k$.  (6)

Unit quaternions form the 3-sphere $S^3$ and provide a nonsingular parameterization of rotational factors that will be central to our construction.
The relevant geometric structure is the special orthogonal group in four dimensions. At the Lie-algebra level one has the classical decomposition

$\mathfrak{so}(4) \cong \mathfrak{su}(2)_L \oplus \mathfrak{su}(2)_R$,  (7)

which underlies the isoclinic decomposition of $SO(4)$ discussed, for example, in RiemannFormer [7]. This splitting implies that every infinitesimal 4D rotation can be written as the sum of two commuting components. More precisely, if

$X = X_A + X_B, \quad X_A \in \mathfrak{su}(2)_L, \; X_B \in \mathfrak{su}(2)_R$,  (8)

then $[X_A, X_B] = 0$, and therefore the corresponding group element factors as

$R = \exp(X) = \exp(X_A + X_B) = \exp(X_A)\exp(X_B)$,  (9)

where the exponential map is given by

$\exp : \mathfrak{so}(4) \to SO(4), \quad X \mapsto \exp(X) = \sum_{k=0}^{\infty} \frac{X^k}{k!}$.  (10)

This commuting factorization is the Lie-theoretic counterpart of the left/right isoclinic decomposition.

Proposition 1. Let $q_L, q_R \in S^3$ be unit quaternions. Then the map

$T(v) = q_L v \bar{q}_R$  (11)

defines an orthogonal transformation of $\mathbb{R}^4$. Its inverse is

$T^{-1}(v) = \bar{q}_L v q_R$,  (12)

and the pairs $(q_L, q_R)$ and $(-q_L, -q_R)$ induce the same element of $SO(4)$.

Proof sketch. Quaternion multiplication by a unit quaternion preserves the Euclidean norm on $\mathbb{H} \cong \mathbb{R}^4$, both for left multiplication and for right multiplication by the conjugate. Hence the composition $v \mapsto q_L v \bar{q}_R$ is norm-preserving and therefore orthogonal. The inverse follows immediately from associativity of quaternion multiplication together with $\bar{q}_L q_L = q_R \bar{q}_R = 1$. Finally, replacing both $q_L$ and $q_R$ by their negatives leaves the map unchanged because

$(-q_L)\, v\, \overline{(-q_R)} = (-q_L)\, v\, (-\bar{q}_R) = q_L v \bar{q}_R$.  (13)

The left and right quaternion factors correspond to the left-isoclinic and right-isoclinic components, respectively. Hence a pair of unit quaternions provides a compact, closed-form parameterization of a general 4D rotation, up to the standard double-cover ambiguity.
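To make Proposition 1 concrete, here is a minimal numpy sketch (our own illustration, not the paper's CUDA kernel) of the map $T(v) = q_L v \bar{q}_R$ and its inverse. The Hamilton-product convention and the $(w, x, y, z)$ component order are assumptions of this sketch.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def qconj(q):
    """Quaternion conjugate: negate the vector part."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def T(v, qL, qR):
    """Double-sided isoclinic action T(v) = qL v conj(qR)."""
    return qmul(qmul(qL, v), qconj(qR))

def T_inv(v, qL, qR):
    """Inverse rotation conj(qL) v qR."""
    return qmul(qconj(qL), qmul(v, qR))
```

A quick numerical check then confirms all three claims of the proposition: norm preservation, exact inversion, and the $(q_L, q_R) \sim (-q_L, -q_R)$ double-cover ambiguity.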
This parameterization is especially attractive for quantization because it realizes the full six degrees of freedom of $SO(4)$ without ever requiring explicit storage of a dense $4 \times 4$ orthogonal matrix.

4.1 Block-Diagonal Rotations in $SO(4g)$

Let

$g = \lceil d/4 \rceil$.  (14)

After zero padding when necessary, the input is naturally viewed as an element of $\mathbb{R}^{4g}$. In a fully dense formulation one would apply an unrestricted orthogonal transform in $SO(4g)$, which plays the role of a padded surrogate for the original $d \times d$ orthogonal rotation. IsoQuant instead restricts this transform to a structured Lie subgroup obtained from independent 4D rotational blocks. Formally, define the Lie subalgebra

$\mathfrak{g} = \bigoplus_{i=1}^{g} \mathfrak{so}(4)^{(i)} \subset \mathfrak{so}(4g)$.  (15)

Any element $H \in \mathfrak{g}$ has block-diagonal form

$H = \mathrm{diag}(X_1, X_2, \ldots, X_g)$,  (16)

where each $X_i \in \mathfrak{so}(4)$ and all off-diagonal blocks vanish. Since the exponential map preserves block-diagonal structure, one obtains

$\exp(H) = \mathrm{diag}(\exp(X_1), \exp(X_2), \ldots, \exp(X_g))$.  (17)

Because $\exp(X_i) \in SO(4)$ for each block, the resulting matrix $\exp(H)$ belongs to $SO(4g)$. The set of all such matrices forms an embedded Lie subgroup isomorphic to

$SO(4) \times SO(4) \times \cdots \times SO(4)$,  (18)

with $g$ factors. In this sense, IsoQuant replaces a dense high-dimensional orthogonal transform by a block-diagonal subgroup of $SO(4g)$ composed of local $SO(4)$ actions.

The matrix-exponential viewpoint is primarily conceptual: it makes clear which subgroup of $SO(4g)$ is being used and how this subgroup arises from a direct-sum Lie algebra. In the actual computation, however, we do not construct the skew-symmetric blocks $X_i$ or evaluate the corresponding matrix exponentials. Each local $SO(4)$ action is instead realized more efficiently by a quaternion pair $(q_L^{(i)}, q_R^{(i)})$, which yields the same block rotation in closed form while avoiding dense matrix materialization.
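The subgroup structure can be checked numerically: building the explicit $4 \times 4$ matrix of each quaternion-pair action (its columns are the images of the basis vectors) and placing those matrices on the diagonal yields an orthogonal matrix of determinant one in $SO(4g)$. This is purely a conceptual illustration, written by us; as noted above, the actual kernels never materialize these matrices.

```python
import numpy as np

def qmul(a, b):
    # Hamilton product, components ordered (w, x, y, z)
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def qconj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def block_matrix(qL, qR):
    """4x4 matrix of v -> qL v conj(qR), built column by column."""
    return np.column_stack([qmul(qmul(qL, e), qconj(qR)) for e in np.eye(4)])

def blockdiag_rotation(pairs):
    """Assemble diag(B_1, ..., B_g) in SO(4g) from quaternion pairs."""
    g = len(pairs)
    M = np.zeros((4 * g, 4 * g))
    for i, (qL, qR) in enumerate(pairs):
        M[4*i:4*i+4, 4*i:4*i+4] = block_matrix(qL, qR)
    return M
```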
5 Method: IsoQuant

5.1 Blockwise 4D Rotation

We now instantiate the subgroup construction above as a stage-1 quantization transform. Let $\bar{x} \in \mathbb{R}^d$ denote the normalized input direction. We partition $\bar{x}$ into

$g = \lceil d/4 \rceil$  (19)

blocks,

$\bar{x} = [v^{(1)}, v^{(2)}, \ldots, v^{(g)}]$,  (20)

where each $v^{(i)} \in \mathbb{R}^4$ is identified with an element of $\mathbb{H}$. If $d$ is not divisible by 4, the final block is zero padded, so the full transformed object lies in $\mathbb{R}^{4g}$.

From the perspective of Section 4.1, the encoder acts on the padded vector by an element of the block-diagonal subgroup $(SO(4))^g \subset SO(4g)$. The algorithm does not materialize this action as a dense matrix. Instead, each local $SO(4)$ factor is represented by one or two unit quaternions and applied directly in closed form. Accordingly, each block undergoes the sequence

$v^{(i)} \mapsto \tilde{v}^{(i)} \mapsto \hat{v}^{(i)} \mapsto v^{(i)}_{\mathrm{rec}}$.  (21)

That is, one first applies a local $SO(4)$ rotation, then performs coordinate-wise scalar quantization in the rotated basis, and finally applies the inverse local rotation. The recovered blocks are concatenated and the original norm $\rho$ is restored.

5.2 IsoQuant-Full

IsoQuant-Full uses the complete double-sided action of $SO(4)$. For block $i$, we maintain a pair of unit quaternions $(q_L^{(i)}, q_R^{(i)})$ and compute

$\tilde{v}^{(i)} = q_L^{(i)} v^{(i)} \bar{q}_R^{(i)}$,  (22)
$\hat{v}^{(i)} = Q(\tilde{v}^{(i)})$,  (23)
$v^{(i)}_{\mathrm{rec}} = \bar{q}_L^{(i)} \hat{v}^{(i)} q_R^{(i)}$.  (24)

This uses the full six-dimensional rotational freedom of $SO(4)$ and offers the strongest local mixing.

5.3 IsoQuant-Fast

IsoQuant-Fast restricts the transform to a single isoclinic factor:

$\tilde{v}^{(i)} = q_L^{(i)} v^{(i)}$,  (25)
$\hat{v}^{(i)} = Q(\tilde{v}^{(i)})$,  (26)
$v^{(i)}_{\mathrm{rec}} = \bar{q}_L^{(i)} \hat{v}^{(i)}$.  (27)

Geometrically, this corresponds to a 3-dimensional subgroup of $SO(4)$, the left-isoclinic rotations, isomorphic to $SU(2)$. It sacrifices expressivity for lower parameter count, lower arithmetic cost, and simpler kernels.
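The Full and Fast recurrences in Eqs. (22)–(27) can be sketched end to end in a few lines of numpy. This is an illustrative reference path of our own, not the fused CUDA kernel: the uniform scalar quantizer below is a stand-in for the Lloyd–Max quantizer $Q$ used in the paper, and all function names are ours.

```python
import numpy as np

def qmul(a, b):
    # Hamilton product, components (w, x, y, z)
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def qconj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def uniform_quantize(v, bits, lo=-1.0, hi=1.0):
    # stand-in for the Lloyd-Max scalar quantizer Q
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    idx = np.clip(np.round((v - lo) / step), 0, levels - 1)
    return lo + idx * step

def isoquant_stage1(x, qLs, qRs, bits=4, mode="full"):
    """Blockwise rotate -> scalar-quantize -> inverse-rotate -> rescale."""
    rho = float(np.linalg.norm(x))
    xbar = x / max(rho, 1e-12)
    d = x.shape[0]
    g = -(-d // 4)                      # ceil(d / 4)
    padded = np.zeros(4 * g)
    padded[:d] = xbar
    rec = np.empty_like(padded)
    for i in range(g):
        v = padded[4*i:4*i+4]
        if mode == "full":              # Eq. (22): qL v conj(qR)
            vt = qmul(qmul(qLs[i], v), qconj(qRs[i]))
        else:                           # Eq. (25): single isoclinic factor
            vt = qmul(qLs[i], v)
        vh = uniform_quantize(vt, bits)
        if mode == "full":              # Eq. (24): conj(qL) v qR
            rec[4*i:4*i+4] = qmul(qconj(qLs[i]), qmul(vh, qRs[i]))
        else:                           # Eq. (27)
            rec[4*i:4*i+4] = qmul(qconj(qLs[i]), vh)
    return rho * rec[:d]
```

With the rotation pairs fixed, raising the bit width drives reconstruction error toward zero, confirming that the rotations themselves are exactly invertible and quantization is the only noise source.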
5.4 A Lightweight 2D Special Case

Although IsoQuant is fundamentally motivated by blockwise $SO(4)$ structure, the same implementation philosophy admits a degenerate planar special case on coordinate pairs. We partition the normalized input into 2D blocks

$u^{(j)} \in \mathbb{R}^2$,  (28)

and apply a standard planar rotation

$\tilde{u}^{(j)} = R(\theta^{(j)}) u^{(j)}$,  (29)
$\hat{u}^{(j)} = Q(\tilde{u}^{(j)})$,  (30)
$u^{(j)}_{\mathrm{rec}} = R(-\theta^{(j)}) \hat{u}^{(j)}$,  (31)

where

$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$.  (32)

We do not present this 2D case as a separate method family; rather, we treat it as an extremely lightweight IsoQuant operating point. Its local mixing capacity is weaker than that of the 4D formulations, so IsoQuant-Full remains the primary and most expressive variant. Nevertheless, the 2D special case is useful experimentally because it clarifies how much accuracy is lost, if any, when blockwise rotation cost is pushed even lower.

5.5 Parameterization and Learning

To avoid constrained optimization directly on the manifold, we parameterize each unit quaternion by an unconstrained vector and normalize it on the fly:

$q_L^{(i)} = \frac{u_L^{(i)}}{\|u_L^{(i)}\|_2}, \quad q_R^{(i)} = \frac{u_R^{(i)}}{\|u_R^{(i)}\|_2}$,  (33)

where $u_L^{(i)}, u_R^{(i)} \in \mathbb{R}^4$ are free parameters. This keeps optimization in Euclidean space while enforcing the unit-quaternion constraint implicitly in the computation graph.

A practical lightweight variant is to sample the initial $u$ vectors from a Gaussian distribution and keep them fixed, yielding random block rotations analogous to the randomized transform used by TurboQuant. When randomized rotations are desired, we sample from the Haar distribution on the corresponding rotation group: this reduces to uniform angle sampling for the 2D special case and to Gaussian-normalize sampling on $S^3$ for the quaternion factors used in the 4D variants.

5.6 Quantization Pipeline

Algorithm 1 summarizes the stage-1 pipeline.
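The randomized initialization of Section 5.5 is easy to state in code: uniform angles for the planar case and normalized Gaussian 4-vectors on $S^3$ for the quaternion factors. The sketch below (our illustration; the function names are ours) also shows the 2D rotate–quantize–unrotate path of Eqs. (29)–(31), which recovers its input exactly when the quantizer is the identity.

```python
import numpy as np

def sample_planar_angles(g, rng):
    """Haar measure on SO(2) reduces to a uniform angle in [0, 2*pi)."""
    return rng.uniform(0.0, 2.0 * np.pi, size=g)

def sample_unit_quaternions(g, rng):
    """Haar-uniform points on S^3: normalize i.i.d. Gaussian 4-vectors."""
    u = rng.standard_normal((g, 4))
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def planar_stage1(x, thetas, quantize):
    """2D special case: rotate coordinate pairs, quantize, rotate back."""
    rho = float(np.linalg.norm(x))
    xbar = x / max(rho, 1e-12)
    rec = np.empty_like(xbar)
    for j, th in enumerate(thetas):
        c, s = np.cos(th), np.sin(th)
        R = np.array([[c, -s], [s, c]])          # Eq. (32)
        u = xbar[2*j:2*j+2]
        rec[2*j:2*j+2] = R.T @ quantize(R @ u)   # R(-theta) = R(theta)^T
    return rho * rec
```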
Algorithm 1: IsoQuant Stage-1 Quantization

Require: Input vector $x \in \mathbb{R}^d$, scalar quantizer $Q$, mode ∈ {Full, Fast, 2D}
1: Compute norm $\rho \leftarrow \|x\|_2$ and normalized vector $\bar{x} \leftarrow x / \max(\rho, \varepsilon)$
2: Partition $\bar{x}$ into zero-padded local blocks according to the mode (4D for Full/Fast, 2D for 2D)
3: for $i = 1$ to $g$ do
4:   if mode = Full then
5:     $\tilde{v}^{(i)} \leftarrow q_L^{(i)} v^{(i)} \bar{q}_R^{(i)}$
6:     $\hat{v}^{(i)} \leftarrow Q(\tilde{v}^{(i)})$
7:     $v^{(i)}_{\mathrm{rec}} \leftarrow \bar{q}_L^{(i)} \hat{v}^{(i)} q_R^{(i)}$
8:   else if mode = Fast then
9:     $\tilde{v}^{(i)} \leftarrow q_L^{(i)} v^{(i)}$
10:    $\hat{v}^{(i)} \leftarrow Q(\tilde{v}^{(i)})$
11:    $v^{(i)}_{\mathrm{rec}} \leftarrow \bar{q}_L^{(i)} \hat{v}^{(i)}$
12:  else
13:    $\tilde{u}^{(i)} \leftarrow R(\theta^{(i)}) u^{(i)}$
14:    $\hat{u}^{(i)} \leftarrow Q(\tilde{u}^{(i)})$
15:    $u^{(i)}_{\mathrm{rec}} \leftarrow R(-\theta^{(i)}) \hat{u}^{(i)}$
16:  end if
17: end for
18: Concatenate reconstructed blocks and drop padded coordinates
19: return $\hat{x} \leftarrow \rho$ times the concatenated reconstructed blocks

5.7 Probabilistic Intuition for Random Subspace Rotations

A natural question is why low-dimensional random block rotations can still improve quantization in a high-dimensional vector. The key point is that quantization does not require full global mixing; it often suffices to isotropize energy locally within each block.

Let $x^{(b)} \in \mathbb{R}^k$ denote a fixed block with radius $r_b = \|x^{(b)}\|_2$, and let $R_b \sim \mathrm{Haar}(SO(k))$ be a Haar-distributed random rotation. Then

$y^{(b)} = R_b x^{(b)}$  (34)

is distributed as a point with fixed norm $r_b$ and uniformly random direction on the sphere $S^{k-1}$. Hence, for any coordinate $j$,

$\mathbb{E}[y_j^{(b)} \mid x^{(b)}] = 0, \quad \mathbb{E}[(y_j^{(b)})^2 \mid x^{(b)}] = \frac{r_b^2}{k}$.  (35)

Thus, random block rotation redistributes the energy of a fixed block evenly across coordinates in expectation.

More can be said about the marginal law of each rotated coordinate. If $u \sim \mathrm{Unif}(S^{k-1})$, then $y^{(b)} = r_b u$, and the marginal density of one normalized coordinate $z = u_j$ is

$f_k(z) \propto (1 - z^2)^{\frac{k-3}{2}}, \quad |z| \leq 1$.  (36)
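The energy-redistribution identity of Eq. (35) is easy to verify by simulation. The sketch below draws Haar rotations via the standard QR-with-sign-correction recipe (an implementation choice of ours, not taken from the paper) and checks that a block whose energy sits entirely in one coordinate is spread to $r_b^2/k$ per coordinate on average.

```python
import numpy as np

def haar_rotation(k, rng):
    """Haar-distributed orthogonal matrix via QR with sign correction,
    reflected into SO(k) if the determinant is negative."""
    A = rng.standard_normal((k, k))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))       # fix column signs
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]            # land in SO(k)
    return Q

def mean_coordinate_energy(x, n_trials, rng):
    """Monte Carlo estimate of E[(R x)_j^2] for each coordinate j."""
    k = x.shape[0]
    acc = np.zeros(k)
    for _ in range(n_trials):
        y = haar_rotation(k, rng) @ x
        acc += y * y
    return acc / n_trials
```

For the second-moment claim the distinction between Haar on $O(k)$ and on $SO(k)$ is immaterial, since either way $R x$ is uniform on the sphere of radius $r_b$.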
This highlights an important difference between $k = 2$ and $k = 4$. For $k = 2$, the marginal follows an arcsine law,

$f_2(z) = \frac{1}{\pi\sqrt{1 - z^2}}$,  (37)

which places relatively more mass near the extremes. For $k = 4$, the marginal becomes

$f_4(z) = \frac{2}{\pi}\sqrt{1 - z^2}$,  (38)

which is more concentrated near the center and vanishes at the boundaries. Consequently, the 4D case is structurally more favorable for scalar quantization, since individual coordinates are less likely to attain extreme values.

A complementary covariance-level view leads to the same intuition. Let $x \in \mathbb{R}^d$ have covariance $\Sigma$, partitioned into $k \times k$ blocks, and let

$R = \mathrm{diag}(R_1, \ldots, R_m), \quad R_i \sim \mathrm{Haar}(SO(k))$  (39)

be an independent block-diagonal random rotation. Then

$\mathbb{E}_R[R \Sigma R^\top] = \mathrm{diag}\!\left(\frac{\mathrm{tr}(\Sigma_{11})}{k} I_k, \ldots, \frac{\mathrm{tr}(\Sigma_{mm})}{k} I_k\right)$.  (40)

Therefore, each block becomes isotropic in expectation, while cross-block correlations vanish in expectation under independent random rotations. For any fixed realization of $R$, cross-block correlation energy is not literally removed; rather, its directions are scrambled, which mitigates worst-case structured dependencies seen by coordinate-wise quantization.

6 Complexity Analysis

A quaternion multiplication uses 16 scalar multiplications and 12 scalar additions. To match common systems reporting conventions, we count this as approximately 16 fused multiply-add operations. For $d = 128$, the forward rotation cost becomes straightforward to compare. Table 1 summarizes the forward stage-1 rotation cost at $d = 128$. The comparison highlights three regimes.

Table 1: Forward rotation complexity at $d = 128$.

Method         | Block Structure | Params | FMAs
TurboQuant [1] | dense 128 × 128 | 16,384 | 16,384
RotorQuant [2] | 43 × 3D blocks  | 172    | ≈ 2,408
IsoQuant-2D    | 64 × 2D blocks  | 128    | ≈ 256
IsoQuant-Full  | 32 × 4D blocks  | 256    | 1,024
IsoQuant-Fast  | 32 × 4D blocks  | 128    | 512
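The Table 1 entries for the IsoQuant variants follow from simple block counts, with one quaternion product counted as roughly 16 FMAs per the convention above. A few lines of arithmetic (the dictionary layout is ours) reproduce them for any $d$:

```python
import math

def isoquant_costs(d):
    """Parameter and forward-rotation FMA counts for the IsoQuant variants
    (one quaternion product counted as ~16 FMAs)."""
    g4 = math.ceil(d / 4)   # number of 4D blocks
    g2 = math.ceil(d / 2)   # number of 2D blocks
    return {
        "IsoQuant-Full": {"params": 8 * g4, "fmas": 32 * g4},  # 2 quaternions, 2 products
        "IsoQuant-Fast": {"params": 4 * g4, "fmas": 16 * g4},  # 1 quaternion, 1 product
        "IsoQuant-2D":   {"params": 2 * g2, "fmas": 4 * g2},   # 1 angle per pair
    }
```

At $d = 128$ this yields exactly the 256/1,024, 128/512, and 128/≈256 parameter/FMA entries of Table 1.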
IsoQuant-Full uses somewhat more parameters than RotorQuant because each block stores two quaternions, but it still remains tiny relative to dense rotations and cuts rotation arithmetic by more than 2×. IsoQuant-Fast goes further and is cheaper than RotorQuant in both parameters and compute. The 2D special case is cheaper still, but also has the weakest local mixing and should therefore be viewed as a low-cost operating point rather than the main target of the paper.

More generally, if $g_4 = \lceil d/4 \rceil$ and $g_2 = \lceil d/2 \rceil$, then the parameter count is $8 g_4$ for Full, $4 g_4$ for Fast, and $2 g_2$ for the 2D special case. The forward rotation cost is $32 g_4$ FMAs for Full, $16 g_4$ FMAs for Fast, and approximately $4 g_2$ FMAs for the 2D case. All variants therefore scale linearly in $d$.

7 Systems Considerations

7.1 Why 4D Blocks Are Hardware Friendly

The choice of block size is not purely algebraic. It directly shapes memory layout, vectorization efficiency, and kernel fusion opportunities.

Alignment. Most transformer head dimensions are powers of two. A 4D partition therefore avoids the pathological tails induced by 3D chunking in almost every common setting. At $d = 128$, IsoQuant uses exactly 32 blocks with no remainder, whereas a 3D design requires 42 full blocks plus a leftover fragment.

Vectorization. Four-wide blocks fit naturally into SIMD-friendly load and store patterns such as float4. This reduces boundary checks and helps both CPU SIMD backends and GPU kernels maintain regular control flow.

Kernel fusion. The per-block transform is a closed-form sequence of one or two quaternion products, coordinate-wise scalar quantization, and an inverse transform. This structure is especially suitable for fused kernels in online quantization pipelines because the entire block can often remain in registers from input load through output store.
7.2 Prototype Mapping

Our current prototype follows exactly this design pattern: it packs feature vectors into 4D blocks, applies either double-sided or single-sided quaternion rotation, performs scalar Lloyd–Max quantization, and reconstructs the result with a fused CUDA kernel. In addition, we include a lightweight 2D planar special case that serves as an even cheaper operating point within the same implementation family. We built a benchmark harness that directly compares these fused IsoQuant kernels against the fused RotorQuant CUDA kernel under matched tensor shapes, bit widths, and data types.

8 Compatibility with Residual Correction

IsoQuant is intended as a replacement for the stage-1 decorrelation transform, not as a rejection of later residual correction. In two-stage pipelines such as TurboQuant [1], one can keep the residual inner-product correction unchanged:

$r = x - \hat{x}_{\mathrm{mse}}$.  (41)

The residual can still be projected with a quantized Johnson–Lindenstrauss transform [3] or a related low-bit correction mechanism. In this sense, IsoQuant is complementary to existing stage-2 estimators: it reduces the cost of the orthogonalization step while remaining compatible with unbiased inner-product correction.

9 Experimental Results

9.1 Evaluation Protocol

This section evaluates the stage-1 quantize–dequantize path of IsoQuant. The purpose is to isolate the contribution of the decorrelation transform itself before introducing stage-2 residual correction or downstream task effects. We therefore focus on two directly attributable quantities: reconstruction MSE on normalized synthetic vectors and fused CUDA kernel latency for the blockwise stage-1 transform.

All experiments in this section use synthetic normalized vectors with batch size 8192. We evaluate all combinations of

$d \in \{128, 256, 512\}, \quad b \in \{2, 3, 4\}, \quad \mathrm{dtype} \in \{\mathrm{fp16}, \mathrm{fp32}\}$,  (42)

for a total of 18 fused-kernel benchmark settings.
We focus on these dimensions because they cover common per-head or grouped-KV widths in practical LLM deployments, while avoiding the large-dimension kernel-specialization regime that would otherwise dominate the discussion.

On the IsoQuant side, we evaluate IsoQuant-Full, IsoQuant-Fast, and the lightweight IsoQuant-2D special case. On the baseline side, we compare against RotorQuant [2], including its fused CUDA kernel in rotor_fused_kernel.cu. TurboQuant [1] is used as a conceptual dense-rotation reference in the complexity analysis, not as a runtime baseline. All fused CUDA benchmarks were conducted on a single NVIDIA RTX 4090 GPU. For every configuration, RotorQuant and IsoQuant are benchmarked under the same tensor shape, bit width, and execution dtype. In addition to kernel latency, we report reconstruction MSE, parameter count, and estimated forward arithmetic cost.

9.2 Main Results

Table 2 reports the full fused CUDA sweep. The main takeaway is that the entire IsoQuant family is consistently faster than RotorQuant under an apples-to-apples kernel comparison while preserving essentially identical reconstruction quality. Across all 18 settings, IsoQuant-Full achieves an average speedup of 4.49× over RotorQuant fused CUDA, while IsoQuant-Fast and the lightweight IsoQuant-2D special case achieve nearly identical averages of 4.66× and 4.66×, respectively. The strongest settings occur in low-bit and medium-width regimes, where speedups exceed 6× while MSE remains unchanged up to the reported precision.

The quality story is equally encouraging. In every tested setting, the MSE of all IsoQuant variants is either indistinguishable from RotorQuant or slightly lower. For example, in the largest tested FP32 configuration ($d = 512$, $b = 4$), all three IsoQuant variants remain at the same numerical scale as RotorQuant, and in many settings the printed MSE values are exactly identical.
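As a consistency check (ours, recomputed from the printed speedup columns of Table 2, so it matches the quoted averages only up to the rounding of those columns):

```python
# Per-setting speedup columns transcribed from Table 2 (18 settings each):
# fp16 rows first, then fp32 rows, in table order.
full = [3.86, 5.92, 4.62, 3.72, 3.80, 5.76, 4.27, 4.05, 3.63,
        4.60, 5.07, 5.49, 4.59, 5.29, 5.92, 3.11, 3.54, 3.56]
fast = [3.98, 6.00, 4.71, 3.92, 3.88, 6.24, 4.23, 4.73, 4.02,
        4.68, 4.59, 5.48, 4.66, 5.45, 6.31, 3.32, 3.85, 3.91]
two_d = [3.85, 5.92, 4.70, 3.89, 3.99, 6.20, 4.31, 4.15, 3.77,
         4.77, 4.91, 5.40, 4.66, 5.47, 6.39, 3.60, 4.02, 3.79]

# Means over all 18 settings; agrees with the quoted 4.49x / 4.66x / 4.66x.
means = [sum(col) / len(col) for col in (full, fast, two_d)]
```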
We therefore do not observe a tradeoff in which IsoQuant wins speed by degrading reconstruction quality.

Table 2: Fused CUDA comparisons against RotorQuant on normalized synthetic vectors with batch size 8192. Latencies are in µs.

dtype | bits | dim | RotorQuant | IsoQuant-Full | IsoQuant-Fast | IsoQuant-2D | Full speedup | Fast speedup | 2D speedup
fp16 | 2 | 128 | 32.7 | 8.5 | 8.2 | 8.5 | 3.86 | 3.98 | 3.85
fp16 | 3 | 128 | 36.4 | 6.2 | 6.1 | 6.2 | 5.92 | 6.00 | 5.92
fp16 | 4 | 128 | 44.2 | 9.6 | 9.4 | 9.4 | 4.62 | 4.71 | 4.70
fp16 | 2 | 256 | 32.6 | 8.8 | 8.3 | 8.4 | 3.72 | 3.92 | 3.89
fp16 | 3 | 256 | 36.8 | 9.7 | 9.5 | 9.2 | 3.80 | 3.88 | 3.99
fp16 | 4 | 256 | 46.7 | 8.1 | 7.5 | 7.5 | 5.76 | 6.24 | 6.20
fp16 | 2 | 512 | 36.0 | 8.4 | 8.5 | 8.3 | 4.27 | 4.23 | 4.31
fp16 | 3 | 512 | 39.9 | 9.9 | 8.4 | 9.6 | 4.05 | 4.73 | 4.15
fp16 | 4 | 512 | 50.4 | 13.9 | 12.5 | 13.4 | 3.63 | 4.02 | 3.77
fp32 | 2 | 128 | 33.4 | 7.3 | 7.1 | 7.0 | 4.60 | 4.68 | 4.77
fp32 | 3 | 128 | 35.1 | 6.9 | 7.6 | 7.1 | 5.07 | 4.59 | 4.91
fp32 | 4 | 128 | 44.7 | 8.2 | 8.2 | 8.3 | 5.49 | 5.48 | 5.40
fp32 | 2 | 256 | 33.8 | 7.4 | 7.2 | 7.2 | 4.59 | 4.66 | 4.66
fp32 | 3 | 256 | 37.9 | 7.2 | 7.0 | 6.9 | 5.29 | 5.45 | 5.47
fp32 | 4 | 256 | 47.9 | 8.1 | 7.6 | 7.5 | 5.92 | 6.31 | 6.39
fp32 | 2 | 512 | 37.8 | 12.1 | 11.4 | 10.5 | 3.11 | 3.32 | 3.60
fp32 | 3 | 512 | 44.1 | 12.5 | 11.5 | 11.0 | 3.54 | 3.85 | 4.02
fp32 | 4 | 512 | 52.9 | 14.8 | 13.5 | 13.9 | 3.56 | 3.91 | 3.79

9.3 Result Interpretation

The full sweep reveals three robust trends.

1. IsoQuant-Fast and IsoQuant-2D are the lowest-latency operating points. Their mean speedups are nearly identical, and each variant attains the best latency in a subset of the tested settings. We therefore view the 2D case as a useful lightweight special case rather than a replacement for the 4D family.
2. The gain is strong in both FP16 and FP32. Averaged over all tested FP16 settings, IsoQuant-Full, IsoQuant-Fast, and IsoQuant-2D achieve 4.40×, 4.63×, and 4.53× speedups, respectively; the corresponding FP32 averages are 4.57×, 4.69×, and 4.78×.
3. Low-bit and medium-width settings are especially favorable.
Several configurations in the $d \in \{128, 256\}$ range exceed 6× speedup, showing that the compact blockwise formulations amortize per-block overhead particularly well in practical KV-cache regimes.

These trends are consistent with the systems argument developed earlier. Compared with RotorQuant's 3D Clifford blocks, IsoQuant avoids the expansion to an 8-component multivector representation, keeps the per-block state smaller, and aligns naturally with two-wide or four-wide memory and register organization. The result is not a change in asymptotic complexity (all methods remain linear in $d$) but a real and repeatable reduction in constant factors.

9.4 Module-Level vs Kernel-Level Measurements

In addition to fused-kernel measurements, we also benchmarked higher-level PyTorch module paths. There, the apparent speedups can reach roughly 4×–10×, because the IsoQuant prototype currently enjoys a more streamlined execution path than the baseline RotorQuant module. However, we regard the fused CUDA comparison as the fairest measure of method-intrinsic systems advantage. The fused results therefore serve as our primary hardware claim, while the larger module-level gains should be interpreted as an implementation-dependent systems outcome.

9.5 Ablation Coverage

Table 3 summarizes the experimental axes covered by the current draft. The completed experiments already establish a strong stage-1 result across rotation type, bit width, and backend, but they do not yet include residual correction or downstream KV-cache task metrics.

Table 3: Ablation axes. Completed settings in this work are marked by checkmarks.
Axis | Values
Rotation type | Full, Fast, 2D special case ✓
Bit width | 2, 3, 4 bits ✓
Block size | 3D, 4D, optionally 8D grouped variants
Quaternion parameters | random fixed ✓, learned normalized
Residual correction | off ✓, QJL-style correction on
Deployment backend | PyTorch ✓, fused CUDA ✓

9.6 What Remains for a Full Submission

The current experiments establish a strong stage-1 result, but a full submission should still add downstream KV-cache metrics:

1. integration with a stage-2 residual correction module such as QJL;
2. attention-logit preservation and inner-product error under the complete two-stage pipeline;
3. retrieval and perplexity experiments on real KV tensors extracted from deployed LLMs.

These follow naturally from the present implementation because IsoQuant only replaces the stage-1 transform and remains compatible with the same residual correction machinery used by TurboQuant and RotorQuant.

10 Limitations and Future Work

IsoQuant is not a full solution by itself.

1. Block locality. Like other block-diagonal transforms, IsoQuant does not mix information across blocks, so global correlations remain unaddressed.
2. Evaluation gap. This draft now includes fused CUDA benchmarks and reconstruction experiments, but the final conference version still needs end-to-end KV-cache validation on real models and tasks.
3. Learning dynamics. Although normalized quaternion parameters are simple to optimize, the relative value of learned versus random rotations remains an empirical question.
4. Stage-2 interaction. The best coupling between 4D block decorrelation and residual correction may depend on the downstream attention estimator.

Promising next steps include hierarchical cross-block mixing, jointly learned codebooks and quaternion parameters, and specialized kernels for fused KV-cache compression during autoregressive decoding.
11 Conclusion

IsoQuant replaces awkward 3D block rotations with a hardware-aligned 4D quaternion formulation grounded in the isoclinic decomposition of $SO(4)$. The resulting design retains the spirit of blockwise decorrelation while offering stronger local mixing, cleaner memory alignment, and lower arithmetic cost. Our fused CUDA experiments on practical dimensions $d \in \{128, 256, 512\}$ show that this design advantage is measurable in practice: across 18 matched settings, IsoQuant-Full, IsoQuant-Fast, and the lightweight 2D special case deliver average speedups of 4.49×, 4.66×, and 4.66× over RotorQuant fused CUDA while preserving essentially identical reconstruction MSE. We therefore view the 2D case as a useful low-cost operating point, while the full 4D construction remains the primary and most expressive form of the method.

These results make a strong case that blockwise isoclinic rotations are a compelling stage-1 replacement for online vector quantization, and they motivate the next step of integrating the method with residual correction and full KV-cache evaluation. An additional advantage of the quaternion-pair parameterization is that it admits smooth interpolation on the underlying rotation manifold, which may be useful for shared or adaptive block rotations in future work.

References

[1] A. Zandieh, M. Daliri, M. Hadian, and V. Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. In International Conference on Learning Representations, 2026.
[2] J. D. Pope. RotorQuant: Clifford algebra vector quantization for LLM KV cache compression. 2026. https://www.scrya.com/rotorquant/. Code: https://github.com/scrya-com/rotorquant.
[3] A. Zandieh, M. Daliri, and I. Han. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead. arXiv preprint arXiv:2406.03482, 2024.
[4] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024.
[5] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.
[6] J. B. Kuipers. Quaternions and Rotation Sequences. Princeton University Press, 1999.
[7] Z. Ji. RiemannFormer: A framework for attention in curved spaces. arXiv preprint arXiv:2506.07405, 2025.
