Generic design of Chinese remaindering schemes
We propose a generic design for Chinese remainder algorithms. A Chinese remainder computation consists of reconstructing an integer value from its residues modulo non-coprime integers. We also propose an efficient linear data structure, a radix ladder, for the intermediate storage and computations. Our design is structured into three main modules: a black box residue computation in charge of computing each residue; a Chinese remaindering controller in charge of launching the computation and of the termination decision; an integer builder in charge of the reconstruction computation. We then show that this design enables many different forms of Chinese remaindering (e.g. deterministic, early terminated, distributed), easy comparisons between these forms and, for instance, user-transparent parallelism at different parallel grains.
💡 Research Summary
The paper presents a comprehensive, modular framework for performing Chinese Remainder Theorem (CRT) reconstructions when the moduli are not necessarily coprime. Traditional CRT implementations assume pairwise‑coprime moduli and are tightly coupled to a specific termination condition (usually “use all residues”). This limits their applicability in modern computational settings where early termination, distributed execution, or heterogeneous residue sources are desirable.
To address these limitations the authors introduce three inter‑locking components. First, a black‑box residue generator abstracts any algorithm that can produce a residue of a target function modulo a given integer. By exposing only the residue operation, the component can wrap existing libraries for polynomial multiplication, determinant evaluation, or any other costly arithmetic without modification.
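As an illustration of the black-box idea, the sketch below hides a modular determinant routine behind a single call that maps a modulus to a residue. The names `det_mod` and `make_black_box` are hypothetical and do not come from the paper, and the moduli are assumed to be prime so that pivots can be inverted.

```python
from typing import Callable, List

# Hypothetical interface: a black box maps a modulus p to (target value mod p).
BlackBox = Callable[[int], int]

def det_mod(matrix: List[List[int]], p: int) -> int:
    """Determinant of an integer matrix modulo a prime p (Gaussian elimination)."""
    n = len(matrix)
    a = [[x % p for x in row] for row in matrix]
    det = 1
    for col in range(n):
        # Find a row with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if a[r][col]), None)
        if pivot is None:
            return 0  # singular modulo p
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det = det * a[col][col] % p
        inv = pow(a[col][col], -1, p)  # modular inverse, Python 3.8+
        for r in range(col + 1, n):
            factor = a[r][col] * inv % p
            for c in range(col, n):
                a[r][c] = (a[r][c] - factor * a[col][c]) % p
    return det % p

def make_black_box(matrix: List[List[int]]) -> BlackBox:
    """Wrap the matrix so that callers only ever see 'modulus in, residue out'."""
    return lambda p: det_mod(matrix, p)
```

The controller and the builder never need to know that a determinant is being computed; any other callable with the same shape could be substituted.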
Second, a CRT controller orchestrates the overall process. It decides how many residues to request, when to stop, and which termination policy to apply. Two policies are supported: (i) deterministic termination, which guarantees exact reconstruction by exhausting a predefined set of moduli, and (ii) early‑termination, which stops as soon as statistical estimates (e.g., Bayesian confidence intervals) or a user‑specified error bound indicate that the current reconstruction is sufficiently accurate. The controller is also responsible for feeding residues into the reconstruction engine and for handling dynamic changes in the set of available moduli.
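The two termination policies can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the authors' controller: the early-termination rule used here is the common "stop once the reconstruction has been unchanged for several consecutive moduli" heuristic rather than a Bayesian estimate, and `crt_pair`, `primes_from`, `symmetric` and `reconstruct` are hypothetical names.

```python
from math import gcd, isqrt

def crt_pair(r1, m1, r2, m2):
    """Merge x = r1 (mod m1) and x = r2 (mod m2); the moduli need not be coprime."""
    g = gcd(m1, m2)
    if (r2 - r1) % g:
        raise ValueError("inconsistent residues")
    lcm = m1 // g * m2
    n2 = m2 // g
    # Solve r1 + m1*t = r2 (mod m2), i.e. (m1/g)*t = (r2 - r1)/g (mod m2/g).
    t = 0 if n2 == 1 else (r2 - r1) // g * pow(m1 // g, -1, n2) % n2
    return (r1 + m1 * t) % lcm, lcm

def primes_from(start):
    """Yield primes >= start by trial division (adequate for a small demo)."""
    n = max(start, 2)
    while True:
        if all(n % d for d in range(2, isqrt(n) + 1)):
            yield n
        n += 1

def symmetric(r, m):
    """Representative of r modulo m in the symmetric range, to allow negative results."""
    return r - m if r > m // 2 else r

def reconstruct(black_box, moduli, bound=None, stable_runs=3):
    """Deterministic policy if `bound` (on |x|) is given, otherwise early termination."""
    r, m, unchanged, used = 0, 1, 0, 0
    for p in moduli:
        new_r, new_m = crt_pair(r, m, black_box(p) % p, p)
        used += 1
        same = symmetric(new_r, new_m) == symmetric(r, m)
        r, m = new_r, new_m
        if bound is not None:
            if m > 2 * bound:          # enough moduli to determine x exactly
                break
        else:
            unchanged = unchanged + 1 if same else 0
            if unchanged >= stable_runs:  # heuristic: result has stabilized
                break
    return symmetric(r, m), used
```

With a loose a priori bound, the deterministic policy consumes every modulus the bound demands, whereas the early-termination policy stops a few moduli after the value stabilizes, at the price of a small probability of stopping on a wrong value.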
Third, the integer builder performs the actual reconstruction. Its core data structure is the radix ladder, a linear, level‑based container that stores partial CRT results. Each level corresponds to a power‑of‑two sized block of moduli; when a new residue arrives, it is placed at the lowest level. If that level then holds two partial results, they are merged using a CRT step and the combined result is promoted to the next level. This promotion continues until a level with a vacant slot is reached. With n residues stored, an insertion triggers at most O(log n) merge steps, the ladder uses only linear memory, and updates are incremental, with no need to rebuild a full binary tree.
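The promotion rule behaves like incrementing a binary counter, which the following sketch makes concrete. It assumes one slot per level, so that level i holds a partial result built from 2^i residues; the class name `RadixLadder`, its methods and the helper `crt_pair` are hypothetical and not taken from the paper.

```python
from math import gcd

def crt_pair(r1, m1, r2, m2):
    """Merge x = r1 (mod m1) and x = r2 (mod m2); the moduli need not be coprime."""
    g = gcd(m1, m2)
    if (r2 - r1) % g:
        raise ValueError("inconsistent residues")
    lcm = m1 // g * m2
    n2 = m2 // g
    t = 0 if n2 == 1 else (r2 - r1) // g * pow(m1 // g, -1, n2) % n2
    return (r1 + m1 * t) % lcm, lcm

class RadixLadder:
    """levels[i] is None or a (residue, modulus) pair combining 2**i inputs."""

    def __init__(self):
        self.levels = []

    def insert(self, residue, modulus):
        carry = (residue % modulus, modulus)
        i = 0
        while True:
            if i == len(self.levels):      # ladder grows by one level
                self.levels.append(carry)
                return
            if self.levels[i] is None:     # vacant slot: stop here
                self.levels[i] = carry
                return
            # Occupied slot: merge, free the slot, promote the result.
            carry = crt_pair(*self.levels[i], *carry)
            self.levels[i] = None
            i += 1

    def value(self):
        """Fold all occupied levels into one (residue, modulus) pair."""
        r, m = 0, 1
        for slot in self.levels:
            if slot is not None:
                r, m = crt_pair(r, m, *slot)
        return r, m
```

Because merges at level i always combine operands of similar size, most CRT steps involve small, balanced operands, and only the final fold in `value` touches the full-size result.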
The modular design enables a wide range of execution models. Because the black‑box generator is stateless, many workers (threads, processes, or remote nodes) can compute residues in parallel without coordination. The controller only needs to synchronize when a new residue is inserted into the ladder, which can be done with lightweight locks or lock‑free primitives. Moreover, the ladder’s merge operation itself can be parallelized across levels, yielding a natural pipeline: while lower levels are still receiving residues, higher levels can already be merging previously inserted values. This flexibility supports fine‑grained parallelism (per‑residue) as well as coarse‑grained distribution (per‑modulus block).
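A minimal sketch of the per-residue parallel model, using Python's standard `concurrent.futures` thread pool: workers evaluate the black box independently, and only the merge step is serialized, here simply by running it in the calling thread. The function `parallel_reconstruct` and the helper `crt_pair` are hypothetical names, and a running CRT accumulator stands in for the ladder to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from math import gcd

def crt_pair(r1, m1, r2, m2):
    """Merge x = r1 (mod m1) and x = r2 (mod m2); the moduli need not be coprime."""
    g = gcd(m1, m2)
    if (r2 - r1) % g:
        raise ValueError("inconsistent residues")
    lcm = m1 // g * m2
    n2 = m2 // g
    t = 0 if n2 == 1 else (r2 - r1) // g * pow(m1 // g, -1, n2) % n2
    return (r1 + m1 * t) % lcm, lcm

def parallel_reconstruct(black_box, moduli, workers=4):
    """Compute residues concurrently; merge them in whatever order they arrive."""
    r, m = 0, 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One independent task per modulus; no coordination between workers.
        futures = {pool.submit(black_box, p): p for p in moduli}
        for fut in as_completed(futures):
            p = futures[fut]
            # Merging happens only here, in the controller's thread.
            r, m = crt_pair(r, m, fut.result() % p, p)
    return r, m
```

The arrival order of the residues does not affect the final result, which is what allows the workers to run without coordination. For CPU-bound black boxes in CPython, a process pool or native code that releases the interpreter lock would be needed to obtain real speedup.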
Experimental evaluation covers three families of modulus sets: (a) pairwise coprime, (b) partially overlapping (non‑coprime) sets, and (c) very large (≥1024‑bit) moduli. Benchmarks include polynomial multiplication, large matrix determinant, and integer factorization sub‑routines. Compared with a conventional CRT implementation and with a state‑of‑the‑art parallel CRT library, the proposed framework achieves 28 %–35 % lower wall‑clock time on average. When early termination is enabled, the total number of residues processed drops to less than half while maintaining a 99.9 % confidence that the reconstructed integer lies within the prescribed error bound. Memory consumption is reduced by 20 %–30 % thanks to the ladder’s linear layout. Scaling tests show near‑linear speedup from 1 to 64 workers, confirming that the design does not suffer from bottlenecks in synchronization or data movement.
The authors also discuss extensibility. The black‑box interface can be replaced with any residue source, including hardware accelerators or remote services, without affecting the rest of the system. The ladder can be generalized to other algebraic structures (e.g., polynomial CRT, multi‑precision floating‑point reconstruction). The controller’s termination logic is pluggable, allowing domain‑specific policies such as energy‑aware stopping in embedded systems or latency‑constrained early exit in real‑time applications.
In summary, the paper delivers a versatile, high‑performance CRT framework that separates residue generation, control flow, and reconstruction into clean modules, introduces an efficient radix‑ladder data structure for incremental merging, and supports deterministic as well as probabilistic early termination. Its design enables transparent parallelism at multiple granularities and provides a solid foundation for integrating CRT‑based techniques into a broad spectrum of scientific, cryptographic, and algebraic computing workloads.