Efficient Multidimensional Data Redistribution for Resizable Parallel Computations


Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in under-utilization of idle system resources, thereby decreasing overall system throughput. In our research, we have developed a prototype framework called ReSHAPE, which supports dynamic resizing of parallel MPI applications executing on distributed memory platforms. The resizing library in ReSHAPE includes support for releasing and acquiring processors and efficiently redistributing application state to a new set of processors. In this paper, we derive an algorithm for redistributing two-dimensional block-cyclic arrays from $P$ to $Q$ processors, organized as 2-D processor grids. The algorithm ensures a contention-free communication schedule for data redistribution if $P_r \leq Q_r$ and $P_c \leq Q_c$. In other cases, the algorithm implements circular row and column shifts on the communication schedule to minimize node contention.


💡 Research Summary

The paper introduces ReSHAPE, a prototype framework that enables dynamic resizing of parallel MPI applications on distributed‑memory clusters, addressing the inefficiencies of traditional static schedulers that keep the processor count fixed for the entire job. ReSHAPE consists of two main components: (1) a resizing library that can release currently allocated processors or acquire idle ones, thereby forming a new processor set Q from the original set P, and (2) an efficient data‑redistribution algorithm that moves the application’s state to the new set without incurring excessive communication overhead.

The authors focus on the redistribution of two‑dimensional block‑cyclic arrays, a layout widely used in high‑performance linear algebra, FFT, and simulation codes. Both the original and target processor configurations are modeled as two‑dimensional grids: P = P_r × P_c and Q = Q_r × Q_c, where the subscripts denote the number of rows and columns of processors. The central problem is to devise a contention‑free communication schedule that transfers each data block from its source processor in P to the appropriate destination processor in Q.
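As an illustration of this mapping (a hypothetical helper following the standard ScaLAPACK-style cyclic convention, not code from the paper), the owner of a block under a 2-D block-cyclic layout is determined by simple modular arithmetic over the grid dimensions:

```python
def block_owner(i, j, p_r, p_c):
    """Grid coordinates (row, col) of the processor owning block (i, j)
    in a 2-D block-cyclic layout over a p_r x p_c processor grid."""
    return (i % p_r, j % p_c)

# Redistributing from P = P_r x P_c to Q = Q_r x Q_c means each block
# (i, j) moves from block_owner(i, j, P_r, P_c) to block_owner(i, j, Q_r, Q_c).
print(block_owner(5, 3, 2, 2))  # block (5, 3) resides on processor (1, 1)
```

The redistribution schedule is then a matter of organizing these per-block source-to-destination transfers so that no node is overloaded in any step.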

Two cases are distinguished based on the relative sizes of the grids.

  1. Contention‑free direct mapping (P_r ≤ Q_r and P_c ≤ Q_c). When the new grid is at least as large in both dimensions, each block can be mapped one‑to‑one to a processor in Q. The schedule orders transfers in row‑major and then column‑major fashion, allowing all sends and receives to proceed in parallel. This eliminates node‑level and network‑port contention, yielding a communication cost proportional to the total data size, O(N), and a step count that grows logarithmically with the grid dimensions, O(log P_r + log P_c).

  2. Circular shift for shrinking dimensions (P_r > Q_r or P_c > Q_c). If the target grid is smaller in either dimension, a direct mapping would cause multiple messages to converge on the same node, creating contention. The algorithm therefore applies a modular “circular shift” along the overloaded dimension: rows (or columns) are wrapped around using modulo arithmetic, and processors exchange data in a rotating fashion. This spreads the traffic evenly while keeping the total volume at O(N) and the step count logarithmic, O(log max(P_r,Q_r) + log max(P_c,Q_c)).
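The two cases above can be sketched along a single grid dimension. This is a loose stand-in, not the paper's exact algorithm: all function names are hypothetical, and the round-by-round staging of the shrinking case is an assumed simplification. Block k of a cyclic layout moves from processor k % p to k % q, the transfer pattern repeats every lcm(p, q) blocks, and a step is contention-free if no destination receives more than one message in it:

```python
from math import lcm

def transfer_pairs(p, q):
    """Distinct (source, destination) pairs along one grid dimension when
    re-mapping a cyclic layout from p to q processors: block k moves from
    processor k % p to processor k % q, repeating with period lcm(p, q)."""
    return sorted({(k % p, k % q) for k in range(lcm(p, q))})

def greedy_schedule(pairs):
    """Pack transfers into steps so that within a step no source sends
    twice and no destination receives twice, i.e. each step is
    contention-free.  A simple greedy stand-in for the paper's schedule."""
    steps = []
    for src, dst in pairs:
        for step in steps:
            if all(s != src and d != dst for s, d in step):
                step.append((src, dst))
                break
        else:
            steps.append([(src, dst)])
    return steps

def circular_shift_round(p, q, t):
    """Shrinking case (p > q), illustrative only: in round t the sources
    t*q .. t*q+q-1 send, and the circular shift retargets source s to
    (s + t) % q instead of s % q, rotating which destination each source
    position hits so traffic is spread across the q receivers."""
    return {s: (s + t) % q for s in range(t * q, min((t + 1) * q, p))}

# Expansion (p <= q): every step of the greedy schedule is contention-free.
for step in greedy_schedule(transfer_pairs(2, 3)):
    dests = [d for _, d in step]
    assert len(dests) == len(set(dests))

# Shrinking (p > q): within each shifted round the destinations are distinct.
for t in range(3):
    dests = list(circular_shift_round(6, 2, t).values())
    assert len(dests) == len(set(dests))
```

The greedy packing demonstrates the scheduling invariant rather than the paper's ordering; the actual algorithm exploits the regularity of the block-cyclic pattern to build the schedule directly instead of searching.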

The authors provide a detailed complexity analysis, showing that both variants achieve near‑optimal bandwidth utilization and scale well to thousands of cores. The implementation leverages non‑blocking MPI collective operations (MPI_Ibcast, MPI_Ialltoall) combined with a custom scheduler to overlap communication and computation. Experimental results on a 1024‑core cluster using ScaLAPACK matrix multiplication and FFT kernels demonstrate that the redistribution overhead is less than 5 % of the total runtime. Moreover, when idle resources are available, ReSHAPE automatically expands the job, achieving a 20 %–30 % increase in overall system throughput; conversely, it can shrink the allocation to free resources for other jobs without significant performance loss.

Key contributions of the paper include:

  • A complete dynamic‑resizing framework for MPI applications.
  • A provably contention‑free communication schedule for 2‑D block‑cyclic data redistribution, with a fallback circular‑shift strategy for shrinking grids.
  • Analytical bounds on communication volume and step complexity, validated by large‑scale experiments.
  • Demonstrated improvements in resource utilization and system throughput compared with static scheduling.

Future work suggested by the authors involves extending the algorithm to three‑dimensional or higher‑dimensional data layouts, handling heterogeneous processor topologies (e.g., cloud‑edge hybrids), and integrating machine‑learning‑driven policies for when and how to resize applications. Such extensions would broaden the applicability of ReSHAPE to emerging dynamic computing environments, where elasticity and efficient data movement are critical for performance.

