RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication


Authors: Mohsen Dehghankar, Abolfazl Asudeh

Mohsen Dehghankar, University of Illinois Chicago, mdehgh2@uic.edu
Abolfazl Asudeh, University of Illinois Chicago, asudeh@uic.edu

ABSTRACT

Matrix–vector multiplication is a fundamental building block in neural networks, vector databases, and large language models, particularly during inference. As a result, efficient matrix–vector multiplication engines directly translate into more efficient inference. Recent work has explored low-bit quantization of model weights, where matrices are represented using binary (1-bit) or ternary (1.58-bit) values while activations are kept in higher precision. These representations enable efficient hardware-level computation. In parallel, algorithms such as Redundant Segment Reduction (RSR) provide theoretical guarantees for accelerating low-bit matrix–vector multiplication. However, existing implementations operate at the application level and cannot be efficiently integrated into hardware kernels, limiting practical performance. To bridge this gap, we present RSR-core, a high-performance engine that implements the RSR algorithm as optimized low-level kernels for both CPU and CUDA environments. RSR-core supports efficient matrix–vector multiplication for binary and ternary weight matrices and general vectors, enabling practical deployment of the RSR algorithm in real inference pipelines.

RSR-core is provided as a production-ready engine with HuggingFace integration for preprocessing low-bit models and running accelerated inference. Experimental results demonstrate significant performance improvements over baseline HuggingFace PyTorch multiplication, achieving up to 62x speedup on CPU and up to 1.9x speedup for token generation on CUDA for popular ternary LLMs. The source code is publicly available in the RSR-core repository (https://github.com/UIC-InDeXLab/RSR-core).
1 INTRODUCTION

Inference is a critical stage in neural networks, particularly for large language models (LLMs), where it dominates both energy consumption and resource allocation [7, 12]. A key computational bottleneck during inference is matrix–vector multiplication Mv, where the matrix M represents fixed model weights after training and the vector v corresponds to input activations generated during execution. Improving the efficiency of matrix–vector multiplication therefore directly translates into faster and more resource-efficient neural model inference.

Efficient matrix–vector multiplication is also central to many data management and retrieval workloads beyond neural inference [1, 3]. For example, vector database systems compute inner products between a query vector and indexed data points for nearest neighbor retrieval, where the indexed dataset can be viewed as a matrix and the query as the input vector [4]. Similarly, ranking over tabular datasets applies user-defined weight vectors to attribute columns to compute scores for sorting and filtering, where the table rows form the matrix and the weight vector defines the ranking function [9, 10].

¹ https://github.com/UIC-InDeXLab/RSR-core

Figure 1: The layered architecture of the RSR-core system.

One common approach to accelerating matrix–vector multiplication, specifically in model inference, is low-bit quantization of weight matrices [6, 11]. In low-bit quantized LLMs, matrices are represented using binary or ternary values while activation vectors remain in higher precision, enabling specialized hardware-efficient implementations that reduce computation cost during inference. Orthogonal to hardware-aware quantization approaches, algorithmic methods aim to further accelerate matrix–vector multiplication.
In particular, Redundant Segment Reduction (RSR) [2] reduces the computational cost of low-bit matrix–vector multiplication by a logarithmic factor by exploiting redundancy in matrix structure. Specifically, identical column patterns² contribute identical partial results and can therefore be aggregated before multiplication, reducing the total number of required operations without affecting inference accuracy. However, existing implementations of RSR operate at the application level and cannot be efficiently integrated into hardware kernel-level execution, limiting their practical performance benefits in real inference systems.

In this work, we address this gap by providing a kernel-level implementation of the RSR algorithm and enabling its deployment in practical inference pipelines. To the best of our knowledge, this is the first implementation of RSR optimized for both CPU and CUDA inference environments.

We present RSR-core, a high-performance engine implementing RSR as optimized CPU and CUDA kernels for efficient low-bit matrix–vector multiplication. RSR-core provides a production-ready Python interface integrated with HuggingFace workflows [5], enabling users to preprocess quantized models once and execute accelerated inference efficiently through an end-to-end workflow.

We benchmark RSR-core against the widely used hardware-optimized matrix multiplication implementations provided by the PyTorch [8] backend and used within the HuggingFace inference interface, as well as against BitNet [11]. Experimental results show substantial performance improvements, achieving up to 62x speedup on CPU and up to 1.9x speedup on CUDA for popular ternary LLM inference workloads compared to PyTorch bfloat16 computation, while preserving exact inference accuracy.
2 SYSTEM OVERVIEW

We consider matrix–vector multiplication M·v, where M is a binary matrix with entries in {0, 1} or a ternary matrix with entries in {-1, 0, +1}. The matrix corresponds to quantized model weights and remains fixed after preprocessing, while the vector v represents input activations arriving during inference, as discussed in the Introduction. The goal is to accelerate this multiplication in practical inference pipelines while preserving exact computation.

Algorithm overview. The Redundant Segment Reduction (RSR) algorithm [2] consists of two phases: an offline preprocessing phase applied once to the matrix M, followed by a fast online multiplication phase applied repeatedly during inference. During preprocessing, RSR splits the matrix into horizontal blocks of k rows and identifies identical column segments within each block, grouping columns that share the same bit pattern. The preprocessing stage produces auxiliary data structures (including column permutations, group boundaries, and scatter indices) that enable efficient reuse of intermediate results during inference. During inference, given an input vector v, RSR aggregates vector entries corresponding to identical column segments before multiplication and distributes contributions only once per unique segment. This reduces the number of arithmetic operations required for computing M·v while preserving exact multiplication results and without increasing space usage.

² In the original paper, the multiplication v·M is considered instead of Mv, and therefore rows (rather than columns) are grouped together.

2.1 System Architecture

RSR-core is an optimized RSR execution engine with a layered architecture, illustrated in Figure 1. At its core, the system provides kernel-level implementations of RSR multiplication on both CPU (C kernels) and CUDA (GPU kernels).
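The two RSR phases described above can be sketched in a few lines of Python for the binary case. This is a hypothetical reference implementation for illustration, not the engine's optimized kernels; it folds preprocessing and inference together for clarity. Within each k-row block, columns sharing the same bit pattern are grouped, the matching entries of v are aggregated once per pattern, and each unique pattern is scattered to the block's output rows exactly once.

```python
def rsr_matvec_binary(M, v, k=4):
    """Reference sketch of RSR for a binary matrix M (nested lists of 0/1).

    Within each k-row block, columns sharing the same bit pattern are
    aggregated, then each unique pattern contributes to the output once.
    """
    m, n = len(M), len(M[0])
    out = [0.0] * m
    for base in range(0, m, k):
        rows = M[base:base + k]                 # current block (k rows or fewer)
        agg = {}                                # column pattern -> aggregated v sum
        for j in range(n):
            pattern = tuple(row[j] for row in rows)
            agg[pattern] = agg.get(pattern, 0.0) + v[j]
        for pattern, s in agg.items():          # one scatter per unique pattern
            for r, bit in enumerate(pattern):
                if bit:
                    out[base + r] += s
    return out
```

With n columns but at most 2^k distinct patterns per block, the scatter loop runs at most 2^k times per block, which is where RSR's savings come from when k is on the order of log n.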
These kernels are exposed through intermediate multiplier abstractions and integrated into a Python interface that supports end-to-end preprocessing and inference workflows. The architecture enables users to preprocess quantized models once and reuse the generated artifacts across repeated inference calls. The full source code is publicly available in the project repository.

2.2 Kernel Implementations

While the RSR algorithm is conceptually straightforward, a direct Python implementation does not yield practical speedups due to interpreter overhead, general-purpose sorting, and inefficient memory access patterns during the gather, aggregate, and scatter phases. RSR-core therefore provides native kernel implementations in C (CPU) and CUDA (GPU) for both binary (1-bit) and ternary (1.58-bit) matrices.

A key design choice spans all backends: preprocessing uses counting sort over the discrete pattern space (2^k buckets for binary, 4^k for ternary), achieving O(n + buckets) complexity per block instead of the O(n log n) of comparison-based sorting. Metadata is uniformly shrunk to compact types: permutation indices and group boundaries are stored as 16-bit integers, and scatter targets are encoded as fixed-size bitmasks rather than variable-length index arrays, reducing bandwidth pressure in the inference hot path.

On the CPU, the C kernels fuse the gather and aggregate phases into a single pass over the input vector, accumulating partial sums directly without materializing intermediate arrays. The binary kernel uses scalar-unrolled loads with software prefetch hints, which outperform hardware vector gather instructions on the random-access pattern inherent to RSR.
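The counting-sort preprocessing step can be illustrated as follows (an illustrative sketch in Python; the function name is ours, not the engine's): column-pattern codes in [0, 2^k) are bucketed in a single pass, and the prefix sums of the bucket counts directly yield the group boundaries consumed at inference time.

```python
def group_columns_by_pattern(codes, k):
    """Counting sort of binary column-pattern codes in [0, 2**k).

    Returns a permutation ordering columns by pattern, plus group
    boundaries: columns perm[starts[p]:starts[p+1]] all share pattern p.
    Runs in O(n + 2**k), avoiding comparison-based O(n log n) sorting.
    """
    buckets = 1 << k
    counts = [0] * buckets
    for c in codes:                       # one pass: histogram of pattern codes
        counts[c] += 1
    starts = [0] * (buckets + 1)          # prefix sums = group boundaries
    for b in range(buckets):
        starts[b + 1] = starts[b] + counts[b]
    perm = [0] * len(codes)
    cursor = starts[:buckets]             # next free slot per bucket
    for j, c in enumerate(codes):         # stable placement of column indices
        perm[cursor[c]] = j
        cursor[c] += 1
    return perm, starts
```

The ternary variant would be analogous with 4^k buckets (two bits per entry), matching the bucket counts stated above.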
For ternary matrices, where each group must scatter both a positive and a negative contribution, the kernel replaces variable-length signed scatter lists with two compact bitmasks per group and iterates over their set bits directly, making the per-group scatter cost independent of how many output rows are active. All CPU kernels are parallelized across row blocks.

For end-to-end model inference on CPU, RSR-core further fuses activation quantization with the RSR matrix–vector multiply into a single native call, and batches sibling linear layers that share the same input vector (e.g., W_q, W_k, W_v) into one combined operation. This eliminates repeated Python dispatch and redundant quantization, and exposes a larger pool of row blocks to distribute across cores, a layer of optimization that contributes substantially to the observed wall-clock speedups beyond the kernel itself.

On CUDA, each row block is assigned to one thread block, and warps within a block process groups in parallel. Each group's metadata is packed into a single 64-bit word encoding the permutation range, length, and scatter masks, so the kernel loads one value per group and begins work immediately. Permutation indices are sorted within each group during preprocessing, which does not affect the computed sum but improves cache locality of the gather reads. Per-warp shared-memory partial buffers avoid contention on output writes, and zero-contribution groups are dropped during preprocessing to eliminate useless work. Compile-time specializations on k allow the compiler to fully unroll scatter loops for each supported block height³.

³ See the code repository for more technical implementation details.

Figure 2: Benchmarking matrix multiplication on CUDA and CPU. The matrix is binary and the vector is float32.
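The bitmask-driven ternary scatter can be sketched as follows (a hypothetical helper mirroring the idea, not the C kernel itself): each group carries one mask of rows receiving +s and one of rows receiving -s, and the loop visits only set bits, so the cost scales with the number of active rows rather than the block height.

```python
def scatter_ternary_group(out, base, s, pos_mask, neg_mask):
    """Scatter one group's aggregated sum s into a k-row block of `out`.

    Bit r of pos_mask (resp. neg_mask) means output row base+r receives
    +s (resp. -s). Iterating only over set bits makes the cost
    proportional to the number of active rows, independent of k.
    """
    m = pos_mask
    while m:
        low = m & -m                          # isolate lowest set bit
        out[base + low.bit_length() - 1] += s
        m ^= low                              # clear it
    m = neg_mask
    while m:
        low = m & -m
        out[base + low.bit_length() - 1] -= s
        m ^= low
```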
2.3 Multipliers

On top of the kernel implementations, RSR-core provides a set of Python classes called Multipliers. These classes encapsulate multiplier-specific logic, covering both RSR-based multiplication and baseline multipliers such as BitNet. Each multiplier exposes two primary operations: preprocessing and multiplication. The preprocessing stage operates on a given low-bit matrix and produces reusable artifacts that encode segment-level structure for efficient inference-time execution. The multiplication stage then uses these artifacts together with an input vector v to compute the exact matrix–vector product M·v.⁴

2.4 Integration

RSR-core provides an interface for executing inference with ternary large language models and integrates directly with models available through the HuggingFace model hub⁵. Models that use ternary (BitLinear) layers, such as BitNet [11] and its variants, can be used with RSR-core without modifying existing inference pipelines⁶. Users apply preprocessing once to a selected model downloaded from the HuggingFace model hub. The generated preprocessing artifacts are stored locally, with at most the same space usage as the models themselves, and are reused for subsequent inference executions. During inference, users can interact with the model through standard prompting workflows and obtain responses using the accelerated matrix–vector multiplication provided by RSR-core (see the demonstration scenarios). RSR-core also includes a lightweight user interface that simplifies model selection, preprocessing management, storage monitoring of generated artifacts, and interactive prompting of the models.

Integration with PyTorch-based inference pipelines is provided through a custom PyTorch module called RSRLinear, which replaces ternary linear layers with RSR-enabled implementations. Any system using PyTorch for neural network inference can incorporate RSR-core through this module with minimal changes.
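The two-operation Multiplier contract from Section 2.3 can be illustrated with a small numpy sketch. The class and method names here are illustrative, not RSR-core's actual API: preprocess derives per-block pattern groupings once, and multiply reuses those stored artifacts for every incoming vector.

```python
import numpy as np

class BinaryRSRMultiplier:
    """Illustrative Multiplier: preprocess a binary matrix once,
    then compute the exact product M @ v repeatedly from artifacts."""

    def __init__(self, k=8):
        self.k = k
        self.blocks = []   # per-block artifacts: (inverse index, unique patterns)

    def preprocess(self, M):
        M = np.asarray(M)
        for base in range(0, M.shape[0], self.k):
            blk = M[base:base + self.k]                      # kk x n block
            kk = blk.shape[0]
            codes = (1 << np.arange(kk)) @ blk               # pattern code per column
            uniq, inv = np.unique(codes, return_inverse=True)
            patterns = (uniq[:, None] >> np.arange(kk)) & 1  # one row per unique code
            self.blocks.append((inv, patterns))
        return self

    def multiply(self, v):
        v = np.asarray(v, dtype=np.float64)
        parts = []
        for inv, patterns in self.blocks:
            # aggregate v entries per unique column pattern, then apply each once
            agg = np.bincount(inv, weights=v, minlength=patterns.shape[0])
            parts.append(patterns.T @ agg)
        return np.concatenate(parts)
```

Because preprocess depends only on M, its artifacts can be built once per model and reused across every inference call, which is the workflow RSR-core's Python interface automates.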
2.5 Benchmarking Results

To evaluate the performance of RSR-core, we compare multiplication time against baseline implementations provided by PyTorch, which already use hardware-optimized kernels on both CPU and CUDA platforms. We also compare against BitNet [11] as an additional baseline for low-bit matrix multiplication. The results (Figure 2) show substantial speedups for matrix–vector multiplication on both CPU and CUDA.

We further evaluate end-to-end LLM inference performance during the decoding (token generation) phase and compare against HuggingFace PyTorch inference on both CPU and CUDA platforms. The results (Table 1) demonstrate significant improvements in token generation throughput using RSR-core. Additional benchmarking results are available in the project repository.

Table 1: Ternary (1.58-bit) LLM inference speedup over HuggingFace PyTorch bfloat16.

  Device  Model               HF (Tok/s)  RSR (Tok/s)  Speedup
  CPU     Falcon3-10B-1.58b         0.2         11.3    62.0x
  CPU     Llama3-8B-1.58b           0.2         13.4    53.8x
  CPU     BitNet-2B-4T-bf16         2.1         28.8    13.9x
  CPU     BitNet-2B-4T             14.2         29.3     2.1x
  CUDA    Falcon3-10B-1.58b        25.2         47.4     1.9x
  CUDA    Llama3-8B-1.58b          31.9         59.3     1.9x
  CUDA    BitNet-2B-4T-bf16        33.1         57.4     1.7x
  CUDA    BitNet-2B-4T             41.6         57.1     1.4x

⁴ Note that preprocessing is required only for RSR-based multipliers.
⁵ https://huggingface.co/models
⁶ For example, the microsoft/bitnet-b1.58-xxx models.

3 DEMONSTRATION SCENARIOS

During the demonstration, participants interact with RSR-core through an integrated web-based interface that supports model preprocessing, accelerated inference, and kernel-level performance exploration workflows.

Scenario 1: Model Onboarding. Participants search for a ternary model on the HuggingFace Hub, select it, and configure RSR preprocessing parameters, including the block height k, the multiplier version, and the target device. Upon launching preprocessing, the UI displays real-time progress as each layer is processed.
Once complete, the model appears in the model list with metadata such as size, layer count, and the selected k. RSR preprocessing is a one-time cost, and the UI makes it accessible to non-experts. Figure 3a illustrates the preprocessing interface.

Scenario 2: Side-by-Side Inference. Participants select a preprocessed model, enter a prompt, and run text generation with both the RSR backend and the HuggingFace bfloat16 baseline. The UI displays the generated text alongside performance metrics (tokens per second, generation latency, and model load time) and computes the end-to-end speedup factor. This scenario demonstrates RSR's practical acceleration on real LLM inference. Figure 3b shows the inference comparison interface.

Scenario 3: Kernel-Level Exploration. Participants navigate to the Matvec Benchmark tab, enter custom matrix shapes (e.g., matching a specific model's weight dimensions), select a range of k values and a bit-width (1-bit or 1.58-bit), and launch kernel-level micro-benchmarks. The UI produces a bar chart comparing RSR against FP32 and BF16 baselines at the best k per shape, and a line chart showing how RSR latency varies across k values. Participants can explore the parameter sensitivity of RSR on the demonstration hardware, validating that redundancy structure varies across shapes. Figure 3c illustrates the kernel-level benchmarking interface.

Figure 3: RSR-core web interface demonstration scenarios. (a) Model preprocessing. (b) Model inference. (c) Kernel-level matvec benchmarking. (d) Cross-configuration comparison.

Scenario 4: Cross-Configuration Comparison. Participants switch between CPU and CUDA using the device selector and compare pre-computed shape benchmark results across bit-widths (1-bit vs. 1.58-bit). The Best K tab reveals how the optimal block height shifts across devices and matrix shapes, demonstrating that RSR's advantage is hardware-dependent.
The dashboard enables rapid exploration across configurations without re-running experiments. Figure 3d shows the cross-configuration comparison interface.

REFERENCES

[1] Tingyang Chen, Cong Fu, Kun Wang, Xiangyu Ke, Yunjun Gao, Wenchao Zhou, Yabo Ni, and Anxiang Zeng. 2025. Maximum inner product is query-scaled nearest neighbor. arXiv preprint arXiv:2503.06882 (2025).
[2] Mohsen Dehghankar, Mahdi Erfanian, and Abolfazl Asudeh. 2024. An efficient matrix multiplication algorithm for accelerating inference in binary and ternary neural networks. arXiv preprint arXiv:2411.06360 (2024).
[3] Matthijs Douze. 2025. Machine learning and high dimensional vector search. arXiv preprint arXiv:2502.16931 (2025).
[4] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. arXiv:2401.08281 [cs.LG] (2024).
[5] Hugging Face, Inc. 2024. Hugging Face Model Hub and Transformers Library. https://huggingface.co.
[6] Shuming Ma et al. 2024. The era of 1-bit LLMs: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764 (2024).
[7] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. 2025. Towards efficient generative large language model serving: A survey from algorithms to systems. Comput. Surveys 58, 1 (2025), 1–37.
[8] Adam Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
[9] Thomas Rölleke, Theodora Tsikrika, and Gabriella Kazai. 2006. A general matrix framework for modelling information retrieval. Information Processing & Management 42, 1 (2006), 4–30.
[10] Mohamed Trabelsi, Zhiyu Chen, Brian D. Davison, and Jeff Heflin. 2021. Neural ranking models for document retrieval. Information Retrieval Journal 24, 6 (2021), 400–444.
[11] Hongyu Wang et al. 2023. BitNet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453 (2023).
[12] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. 2024. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294 (2024).
