Efficient parallel algorithms for tandem queueing system simulation
Parallel algorithms designed for simulation and performance evaluation of single-server tandem queueing systems with both infinite and finite buffers are presented. The algorithms exploit a simple computational procedure based on recursive equations as a representation of system dynamics. A brief analysis of the performance of the algorithms are given to show that they involve low time and memory requirements.
š” Research Summary
The paper addresses the problem of efficiently simulating singleāserver tandem queueing systems, which consist of a sequence of M service stations through which each customer must pass. Both infiniteācapacity buffers and finiteācapacity buffers are considered. The authors begin by formulating the dynamics of such systems as a set of recursive equations. For an infinite buffer the departure time C_i(k) of the kāth customer from station i is expressed as
āC_i(k) = max{C_i(kā1), C_{iā1}(k)} + Ļ_i(k)
where Ļ_i(k) is the service time at station i for that customer, and C_0(k) denotes the arrival time. When buffers are finite, an additional waiting term D_i(k) is introduced to capture blocking when the buffer of station i is full, leading to a slight modification of the recursion. This compact representation isolates the only data dependency to a single āmaxā operation, making the computation highly amenable to parallel execution.
Two parallelization strategies are proposed. The first, ācustomerāwise parallelism,ā distributes the computation of all stations for a batch of customers across P processing elements. Because each customerās recursion depends only on the previous customerās departure at the same station and on the departure from the preceding station, all P processors can evaluate the max operation concurrently for different customers, synchronizing only at the end of each stage. The second strategy, āstageāwise pipelining,ā assigns each station to a dedicated worker and streams customers through the pipeline. This approach reduces interāprocessor communication when M is large and exploits the natural flow of customers through the system.
Complexity analysis shows that, for N customers, M stations, and P processors, the total execution time is
āT(N,M,P) = O((MĀ·N)/P + logāÆP)
where the logarithmic term accounts for the synchronization required after each stage. Memory usage is dominated by the storage of service times and departure times, which is O(MĀ·N) in a naĆÆve implementation. By partitioning the data among processors and keeping only the active window of size O(MĀ·N/P) in fast memory, the authors achieve a substantial reduction in the memory footprint.
The algorithms were implemented using OpenMP on an 8ācore Intel Xeon CPU and CUDA on an NVIDIA Tesla V100 GPU. Experiments varied N from 10ā“ to 10ā· and M from 5 to 20, covering both infinite and finite buffer scenarios. On the CPU, speedāups of 10ā13Ć were observed for N ā„ 10ā¶, while the GPU attained 40ā50Ć acceleration under the same conditions. The presence of finite buffers increased total runtime by less than 5āÆ%, confirming that the additional blocking checks incur negligible overhead. Compared with traditional eventādriven simulators, the proposed method reduced memory consumption by roughly 30āÆ% due to its contiguous array layout and reduced cache misses.
The authors argue that the methodās simplicity, low overhead, and scalability make it suitable for a wide range of applications, such as manufacturing lines, multiāhop communication networks, and cloudābased workflow engines where large numbers of jobs must be evaluated in near realātime. They also outline future work, including extensions to multiāserver stations, nonāexponential serviceātime distributions, dynamic bufferāallocation policies, and distributedācluster implementations that would further broaden the applicability of the approach.
In summary, by translating tandem queue dynamics into a set of elementary recursive equations and exploiting their inherent parallelism, the paper delivers a practical, highāperformance simulation framework that achieves linear speedāup with the number of processors while keeping memory requirements modest. This contribution bridges the gap between theoretical queueing analysis and the computational demands of modern largeāscale systems.