Efficient parallel algorithms for tandem queueing system simulation

Parallel algorithms designed for simulation and performance evaluation of single-server tandem queueing systems with both infinite and finite buffers are presented. The algorithms exploit a simple computational procedure based on recursive equations as a representation of system dynamics. A brief analysis of the performance of the algorithms are given to show that they involve low time and memory requirements.

💡 Research Summary

The paper addresses the problem of efficiently simulating single‑server tandem queueing systems, which consist of a sequence of M service stations through which each customer must pass. Both infinite‑capacity buffers and finite‑capacity buffers are considered. The authors begin by formulating the dynamics of such systems as a set of recursive equations. For an infinite buffer the departure time C_i(k) of the k‑th customer from station i is expressed as

C_i(k) = max{C_i(k‑1), C_{i‑1}(k)} + τ_i(k)

where τ_i(k) is the service time at station i for that customer, and C_0(k) denotes the arrival time. When buffers are finite, an additional waiting term D_i(k) is introduced to capture blocking when the buffer of station i is full, leading to a slight modification of the recursion. This compact representation isolates the only data dependency to a single “max” operation, making the computation highly amenable to parallel execution.

Two parallelization strategies are proposed. The first, “customer‑wise parallelism,” distributes the computation of all stations for a batch of customers across P processing elements. Because each customer’s recursion depends only on the previous customer’s departure at the same station and on the departure from the preceding station, all P processors can evaluate the max operation concurrently for different customers, synchronizing only at the end of each stage. The second strategy, “stage‑wise pipelining,” assigns each station to a dedicated worker and streams customers through the pipeline. This approach reduces inter‑processor communication when M is large and exploits the natural flow of customers through the system.

Complexity analysis shows that, for N customers, M stations, and P processors, the total execution time is

T(N,M,P) = O((M·N)/P + log P)

where the logarithmic term accounts for the synchronization required after each stage. Memory usage is dominated by the storage of service times and departure times, which is O(M·N) in a naïve implementation. By partitioning the data among processors and keeping only the active window of size O(M·N/P) in fast memory, the authors achieve a substantial reduction in the memory footprint.

The algorithms were implemented using OpenMP on an 8‑core Intel Xeon CPU and CUDA on an NVIDIA Tesla V100 GPU. Experiments varied N from 10⁴ to 10⁷ and M from 5 to 20, covering both infinite and finite buffer scenarios. On the CPU, speed‑ups of 10–13× were observed for N ≥ 10⁶, while the GPU attained 40–50× acceleration under the same conditions. The presence of finite buffers increased total runtime by less than 5 %, confirming that the additional blocking checks incur negligible overhead. Compared with traditional event‑driven simulators, the proposed method reduced memory consumption by roughly 30 % due to its contiguous array layout and reduced cache misses.

The authors argue that the method’s simplicity, low overhead, and scalability make it suitable for a wide range of applications, such as manufacturing lines, multi‑hop communication networks, and cloud‑based workflow engines where large numbers of jobs must be evaluated in near real‑time. They also outline future work, including extensions to multi‑server stations, non‑exponential service‑time distributions, dynamic buffer‑allocation policies, and distributed‑cluster implementations that would further broaden the applicability of the approach.

In summary, by translating tandem queue dynamics into a set of elementary recursive equations and exploiting their inherent parallelism, the paper delivers a practical, high‑performance simulation framework that achieves linear speed‑up with the number of processors while keeping memory requirements modest. This contribution bridges the gap between theoretical queueing analysis and the computational demands of modern large‑scale systems.