A Novel Process Mapping Strategy in Clustered Environments
The number of processing cores per node in modern clustered environments is growing at a rapid rate. Despite this trend, the number of network interfaces in such computing nodes has remained almost unchanged. This imbalance can lead to heavy network-interface usage in many workloads, especially communication-intensive ones. As a result, the network interface can become a performance bottleneck and drastically degrade performance. The goal of this paper is to introduce a new process mapping strategy for multi-core clusters aimed at reducing network-interface contention and improving the inter-node communication performance of parallel applications. Performance evaluation of the new mapping algorithm on synthetic and real workloads indicates that the new strategy achieves 5% to 90% performance improvement on heavy-communicating workloads, compared to other well-known methods.
💡 Research Summary
The paper addresses a growing performance bottleneck in modern high‑performance computing (HPC) clusters: while the number of CPU cores per node has exploded, the number of network interface cards (NICs) has remained essentially constant. In communication‑intensive parallel applications, especially those using MPI, multiple processes contend for the same NIC, leading to severe degradation of inter‑node communication bandwidth and overall application runtime. Existing solutions focus on hardware upgrades, NIC virtualization, or routing optimizations, but they overlook the potential of process‑to‑core mapping as a software‑level lever to alleviate NIC contention.
The authors propose a “Communication‑Aware Mapping” (CAM) strategy that explicitly incorporates inter‑process communication patterns into the placement decision. The methodology consists of four main phases:
- Communication Profiling – Before or at the start of execution, the system measures the volume and frequency of messages exchanged between each pair of processes, constructing a weighted communication matrix C_{ij}. Both message size and frequency contribute to the weight, providing a fine‑grained view of each process's communication intensity.
- Clustering of High‑Communication Processes – Using the matrix C_{ij}, the algorithm groups processes with strong mutual communication into clusters. Standard clustering techniques such as K‑means or hierarchical agglomerative clustering are employed, with the number of clusters automatically tuned based on the number of nodes and NICs available.
- NIC‑Core Mapping Optimization – For each node, the relationship between its NIC(s) and its cores is modeled with binary variables x_{p,n,c} indicating whether process p runs on core c of node n. The objective function minimizes the variance of NIC load across all NICs, while respecting constraints that each process is assigned to exactly one core, each core runs at most one process, and the total communication load assigned to any NIC stays within a balanced range. The resulting integer linear program (or a relaxed linear program with rounding) yields a placement that spreads communication‑heavy clusters across different NICs whenever possible.
- Dynamic Re‑mapping – During long‑running jobs, communication patterns may evolve. A lightweight runtime monitor detects significant deviations from the initial profile and triggers a re‑mapping step. The re‑mapping algorithm seeks to move the smallest possible set of processes to restore load balance, thereby limiting migration overhead.
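The first two phases can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the record format `(src, dst, bytes, count)`, the blending knobs `alpha`/`beta`, and the naive pairwise agglomerative merge are all assumptions made for clarity.

```python
from collections import defaultdict
from itertools import combinations

def comm_matrix(msg_log, alpha=1.0, beta=1.0):
    """Build a weighted communication matrix C_{ij} from a message trace.

    Each record is (src, dst, total_bytes, msg_count); the weight blends
    volume and frequency (alpha/beta are assumed tuning knobs).
    """
    C = defaultdict(float)
    for src, dst, nbytes, count in msg_log:
        key = (min(src, dst), max(src, dst))  # treat traffic as symmetric
        C[key] += alpha * nbytes + beta * count
    return dict(C)

def agglomerate(n_procs, C, k):
    """Hierarchical agglomeration: repeatedly merge the two clusters with
    the strongest mutual communication until only k clusters remain."""
    clusters = [{p} for p in range(n_procs)]
    def weight(a, b):
        return sum(C.get((min(i, j), max(i, j)), 0.0) for i in a for j in b)
    while len(clusters) > k:
        ia, ib = max(combinations(range(len(clusters)), 2),
                     key=lambda ab: weight(clusters[ab[0]], clusters[ab[1]]))
        clusters[ia] |= clusters[ib]
        del clusters[ib]
    return clusters
```

In practice, k would be derived from the node and NIC counts, as the paper describes.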
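For the third phase, the paper formulates an integer linear program; a common lightweight stand-in (used here purely for illustration, not taken from the paper) is a greedy longest-processing-time assignment that places the heaviest-communicating clusters first, each on the node whose NIC currently carries the least load.

```python
def nic_demand(cluster, C):
    """Estimated NIC load of a cluster: total weight of edges that cross
    the cluster boundary (traffic that must leave the node)."""
    inside = set(cluster)
    return sum(w for (i, j), w in C.items()
               if (i in inside) != (j in inside))

def place_clusters(clusters, n_nodes, C):
    """Greedy stand-in for the ILP: rank clusters by NIC demand and assign
    each to the least-loaded node, spreading heavy traffic across NICs."""
    load = [0.0] * n_nodes
    placement = {}  # process -> node
    ranked = sorted(clusters, key=lambda cl: nic_demand(cl, C), reverse=True)
    for cl in ranked:
        node = min(range(n_nodes), key=lambda n: load[n])
        load[node] += nic_demand(cl, C)
        for p in cl:
            placement[p] = node
    return placement, load
```

Unlike the ILP, this sketch ignores per-node core counts and balance-range constraints; it only conveys the "spread heavy clusters across NICs" objective.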
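The re-mapping trigger in the fourth phase can be approximated by comparing the current communication matrix against the initial profile. The L1 drift metric and the 0.25 threshold below are assumptions for the sketch, not values from the paper.

```python
def comm_drift(old, new):
    """Relative L1 distance between two communication matrices."""
    keys = set(old) | set(new)
    diff = sum(abs(old.get(k, 0.0) - new.get(k, 0.0)) for k in keys)
    total = sum(old.values()) or 1.0
    return diff / total

def should_remap(old, new, threshold=0.25):
    """Trigger re-mapping only when the pattern shifts past the threshold,
    keeping migration overhead low for stable workloads."""
    return comm_drift(old, new) > threshold
```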
The authors evaluate CAM on a 64‑core cluster (8 nodes × 8 cores, one 1 GbE NIC per node) using both synthetic benchmarks (All‑to‑All, Scatter‑Gather) and real scientific applications (LAMMPS molecular dynamics, High‑Performance Linpack, Graph500). CAM is compared against four baseline mapping schemes: simple Round‑Robin, Block, Hierarchical, and a recent Topology‑Aware method.
Key findings include:
- NIC Load Reduction – CAM reduces average NIC utilization by roughly 30 % and peak utilization by up to 45 % in communication‑intensive workloads.
- Execution‑Time Improvements – For synthetic All‑to‑All tests, overall runtime improves between 5 % and 90 % relative to baselines, with the largest gains (≈70 %) observed in the most communication‑dense scenarios. Real applications see modest but consistent speed‑ups: LAMMPS (≈12 %), HPL (≈8 %), Graph500 (≈15 %).
- Dynamic Adaptation – In multi‑hour runs where communication patterns shift, the dynamic re‑mapping mechanism restores balance with only a 0.5 %–2 % overhead, preventing the performance degradation that static mappings suffer.
- Overhead Assessment – Initial profiling consumes 2 %–5 % of total runtime, while re‑mapping incurs an additional 0.5 %–2 % depending on the number of migrated processes. These costs are outweighed by the gains in most tested scenarios.
The paper also discusses limitations. Accurate profiling is essential; workloads with highly unpredictable communication may incur higher re‑mapping frequency, increasing overhead. The current model assumes homogeneous NIC bandwidth; heterogeneous NICs (e.g., a mix of 10 GbE and 100 GbE) would require extended weighting schemes. Future work is outlined to include lightweight sampling‑based profiling, bandwidth‑aware weighting for heterogeneous NICs, and extensions to cloud environments where virtual NICs coexist with physical ones.
In conclusion, the authors demonstrate that a software‑only, communication‑aware process placement strategy can substantially mitigate NIC contention in multi‑core clusters, delivering up to 90 % performance improvement for the most communication‑heavy applications without any additional hardware investment. This work highlights the importance of integrating network‑level considerations into the classic scheduling problem and opens avenues for further research on adaptive, topology‑sensitive resource management in next‑generation HPC systems.