Fast Evaluation of Truncated Neumann Series by Low-Product Radix Kernels


Truncated Neumann series $S_k(A)=I+A+\cdots+A^{k-1}$ are used in approximate matrix inversion and polynomial preconditioning. In dense settings, matrix-matrix products dominate the cost of evaluating $S_k$. Naive evaluation needs $k-1$ products, while splitting methods reduce this to $O(\log k)$. Repeated squaring, for example, uses $2\log_2 k$ products, so further gains require higher-radix kernels that extend the series by $m$ terms per update. Beyond the known radix-5 kernel, explicit higher-radix constructions were not available, and the existence of exact rational kernels was unclear. We construct radix kernels for $T_m(B)=I+B+\cdots+B^{m-1}$ and use them to build faster series algorithms. For radix 9, we derive an exact 3-product kernel with rational coefficients, the first exact construction beyond radix 5. This kernel yields $5\log_9 k\approx 1.58\log_2 k$ products, a 21% reduction over repeated squaring. For radix 15, numerical optimization yields a 4-product kernel that matches the target through degree 14 but has nonzero spillover (extra terms) at degrees $\ge 15$. Because spillover breaks the standard telescoping update, we introduce a residual-based radix-kernel framework that accommodates approximate kernels and retains the asymptotic constant $(\mu_m+2)/\log_2 m$. Within this framework, radix 15 attains $6/\log_2 15\approx 1.54$, the best known asymptotic rate. Numerical experiments support the predicted product-count savings and associated runtime trends.


💡 Research Summary

The paper addresses the problem of efficiently evaluating the truncated Neumann series Sₖ(A)=I+A+⋯+A^{k‑1} for dense matrices, where the dominant cost is the number of matrix‑matrix multiplications (GEMMs). Traditional approaches either compute the series naively with k‑1 products or use splitting identities of the form S_{mn}(A)=S_n(A)·T_m(A^n) to achieve an O(log k) complexity. In a radix‑m scheme the kernel T_m(B)=I+B+⋯+B^{m‑1} must be evaluated; if it requires μ_m products, a single update costs C(m)=μ_m+2 products (μ_m for the kernel, one for concatenation, one for updating the power). The asymptotic constant of the whole algorithm is therefore C(m)/log₂m. A simple lower bound shows μ_m≥⌈log₂(m‑1)⌉, giving μ_5≥2, μ_9≥3, μ_15≥4. The known radix‑5 kernel achieves μ_5=2, yielding a constant of 4/log₂5≈1.72.
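To make the cost model concrete, the radix‑2 baseline (repeated squaring) can be sketched as follows. This is an illustrative sketch, not code from the paper; the function name `neumann_binary` is ours, and k is assumed to be a power of two. Each doubling spends exactly two GEMMs, one for the concatenation S_{2n}=S_n+S_n·Aⁿ and one for updating the power Aⁿ→A^{2n}, which is the 2·log₂k count the summary cites.

```python
import numpy as np

def neumann_binary(A, k):
    """Evaluate S_k(A) = I + A + ... + A^{k-1} for k a power of two,
    via the radix-2 splitting S_{2n}(A) = S_n(A) + S_n(A) @ A^n.
    Returns the series value and the number of GEMMs spent."""
    assert k >= 1 and k & (k - 1) == 0, "k must be a power of two"
    S = np.eye(A.shape[0])     # S_1(A) = I
    P = A.copy()               # current power A^n (here n = 1)
    gemms = 0
    n = 1
    while n < k:
        S = S + S @ P          # concatenation: S_{2n} = S_n + S_n A^n  (1 GEMM)
        P = P @ P              # power update:  A^n -> A^{2n}           (1 GEMM)
        gemms += 2
        n *= 2
    return S, gemms
```

Counting GEMMs this way makes the comparison in the paper direct: for k = 9^d, the radix‑9 scheme below spends 5 products per level instead of the 2 per doubling here, but covers log₂9 ≈ 3.17 doublings' worth of terms per level.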

Exact radix‑9 kernel
The authors construct a kernel for m=9 that meets the lower bound with μ_9=3. They first compute U=B² and V=U·(B+2U)=B³+2B⁴ (using two products). Then they form two linear combinations P and Q of B, U, V with rational coefficients and multiply them (third product). The product P·Q generates the high‑degree terms B⁵…B⁸ exactly; the lower‑degree terms B²…B⁴ are corrected by adding scaled copies of U and V. All coefficients are rational with denominators dividing 800, so the kernel can be implemented exactly in rational or fixed‑point arithmetic. The update therefore needs 5 products (μ_9+2) and the total product count is 5·log₉k≈1.58·log₂k, a 21 % reduction over binary splitting (2·log₂k) and an 8 % improvement over the radix‑5 scheme.
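The first two products of the kernel can be spot-checked numerically. This is a sketch of only the stated intermediates (the rational coefficients of P and Q are not reproduced here): with two GEMMs, U = B² and V = U·(B + 2U) should equal B³ + 2B⁴ exactly.

```python
import numpy as np

# Spot-check the radix-9 kernel's intermediates: two GEMMs suffice to
# form both U = B^2 and V = B^3 + 2 B^4.
rng = np.random.default_rng(0)
B = 0.1 * rng.standard_normal((6, 6))   # arbitrary test matrix

U = B @ B                   # product 1: U = B^2
V = U @ (B + 2.0 * U)       # product 2: V = B^3 + 2 B^4

B3 = np.linalg.matrix_power(B, 3)
B4 = np.linalg.matrix_power(B, 4)
```

The remaining single product P·Q then supplies the degree 5–8 terms, which is how the kernel reaches the μ₉ = 3 lower bound.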

Approximate radix‑15 kernel
For m=15 the lower bound demands μ_15≥4. The authors attempt a symbolic construction but do not find an exact solution. Instead they use numerical optimization (L‑BFGS‑B) to obtain a 4‑product kernel T̃₁₅(B) that matches the target polynomial up to degree 14 with very small error, but introduces non‑zero “spillover” coefficients at degrees ≥15. Consequently (1−z)·T̃₁₅(z)=1+O(z^{15}) holds, yet the exact telescoping identity T_m(Aⁿ)(I−Aⁿ)=I−A^{mn} breaks, because the spillover prevents exact extraction of A^{mn}.
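The effect of spillover on telescoping can be illustrated with a toy polynomial example (this is our illustration, not the paper's radix‑15 kernel): take the exact radix‑3 target T₃(z) = 1 + z + z² and add an artificial spillover term ε·z³.

```python
import numpy as np
from numpy.polynomial import polynomial as P

eps = 1e-3
T3 = np.array([1.0, 1.0, 1.0])      # exact kernel: 1 + z + z^2
f = np.append(T3, eps)              # toy approximate kernel with spillover eps*z^3

one_minus_z = np.array([1.0, -1.0])
exact = P.polymul(one_minus_z, T3)  # (1-z) T_3(z) = 1 - z^3: telescopes exactly
approx = P.polymul(one_minus_z, f)  # 1 + (eps-1) z^3 - eps z^4: extra tail term
```

The stray −εz⁴ term is exactly what prevents reading off A^{mn} from the update, motivating the residual-based framework below.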

Residual‑based general framework
To handle spillover, the paper defines an approximate kernel f(z) and its error map E(z)=1−(1−z)f(z). Lemma 5.5 shows E(E(z))=1−(1−z)f(z)f(E(z)); this follows directly because 1−E(z)=(1−z)f(z) by definition. Thus the composition f(z)·f(E(z)) matches the true geometric series (1−z)^{-1} up to degree m²−1. Repeatedly applying the kernel to its own residual pushes any spillover to ever higher degrees (Lemma 5.7), while the prefix of the series remains exact. Using this residual‑based scheme, a radix‑15 update still costs μ_15+2=6 products per level, giving an asymptotic constant 6/log₂15≈1.54, which is currently the best known.
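Since the identity in Lemma 5.5 is purely algebraic, it can be verified symbolically for an arbitrary polynomial f; the sketch below (our check, using a generic cubic with symbolic coefficients) does so with SymPy.

```python
import sympy as sp

# Symbolic check of the residual identity: for E(z) = 1 - (1 - z) f(z),
# E(E(z)) = 1 - (1 - z) f(z) f(E(z)), since 1 - E(z) = (1 - z) f(z).
z = sp.symbols('z')
a = sp.symbols('a0:4')                          # generic coefficients
f = sum(ai * z**i for i, ai in enumerate(a))    # arbitrary cubic f(z)

E = sp.expand(1 - (1 - z) * f)                  # error map E(z)
lhs = sp.expand(E.subs(z, E))                   # E(E(z))
rhs = sp.expand(1 - (1 - z) * f * f.subs(z, E))
```

Because the coefficients are left symbolic, `lhs - rhs` vanishing confirms the lemma for every polynomial f of this degree, not just a numerical instance.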

Experimental validation
The authors benchmark the exact radix‑9 kernel, the residual‑based radix‑15 kernel, and existing binary, ternary, and hybrid {2,3} methods on dense matrices of sizes 128–2048 and various k. The radix‑9 kernel achieves the theoretical product‑count reduction, yielding about 18 % faster runtimes than binary splitting with zero numerical error. The radix‑15 scheme, despite its spillover, attains the lowest asymptotic constant and often outperforms radix‑9 by a few percent in wall‑clock time, while preserving accuracy thanks to the residual correction. Hybrid {2,3} methods, with a constant near 1.9, lag behind both new schemes.

Conclusions
The paper makes two major contributions. First, it provides an exact, rational‑coefficient radix‑9 kernel that meets the theoretical lower bound and delivers a 21 % reduction in matrix multiplications over binary splitting. Second, it introduces a residual‑based framework that allows approximate kernels with spillover to be used without sacrificing correctness, enabling a radix‑15 algorithm that achieves the optimal asymptotic constant ≈1.54. These results advance the state of the art for high‑performance computation of truncated Neumann series, with immediate applications to approximate matrix inversion, polynomial preconditioning, log‑determinant estimation, and large‑scale MIMO detection.

