Mapping Strategies for the PERCS Architecture
The PERCS system was designed by IBM in response to a DARPA challenge that called for a high-productivity, high-performance computing system. The IBM PERCS architecture is a two-level direct network with low diameter and high bisection bandwidth. Mapping and routing strategies play an important role in the performance of applications on such a topology. In this paper, we study mapping strategies for the PERCS architecture, which examine how to map the tasks of a given job onto physical processing nodes. We develop and present fundamental principles for designing good mapping strategies that minimize congestion. This is achieved via a theoretical study of some common communication patterns under both the direct and indirect routing mechanisms supported by the architecture.
💡 Research Summary
The paper investigates task‑to‑node mapping strategies for IBM’s PERCS (Productive, Easy-to-use, Reliable Computing System) architecture, a two‑level direct network designed to meet a DARPA challenge for high‑productivity HPC systems. PERCS features a hierarchical topology composed of chips, modules, and lines, providing low diameter and high bisection bandwidth. Because communication performance on such a network is highly sensitive to where logical tasks are placed on physical nodes, the authors formulate the mapping problem as a combinatorial optimization problem that seeks to minimize link congestion and overall communication latency under both the direct (shortest‑path) and indirect (multi‑hop) routing schemes supported by the hardware.
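As a rough illustration of that objective, the sketch below evaluates the congestion of a task-to-node mapping on a toy two-level network (groups of nodes joined by second-level links). The topology, link naming, and traffic matrix are hypothetical simplifications for exposition, not the paper's exact model.

```python
from collections import defaultdict

NODES_PER_GROUP = 4  # assumed fan-out of the toy two-level network

def links_on_path(src, dst):
    """Links a direct route crosses in the toy model."""
    gs, gd = src // NODES_PER_GROUP, dst // NODES_PER_GROUP
    if gs == gd:
        return [(1, (src, dst))]          # one first-level hop
    return [(1, (src, gs)),               # hop to the group gateway
            (2, (gs, gd)),                # second-level inter-group link
            (1, (gd, dst))]               # hop to the destination node

def max_link_load(traffic, mapping):
    """traffic[(a, b)] -> volume task a sends task b; mapping[task] -> node."""
    load = defaultdict(float)
    for (a, b), vol in traffic.items():
        for link in links_on_path(mapping[a], mapping[b]):
            load[link] += vol
    return max(load.values())

# Four tasks doing all-to-all: a locality-preserving mapping keeps every
# message on its own first-level link, while scattering the tasks across
# groups piles three messages onto task 0's gateway link.
traffic = {(a, b): 1.0 for a in range(4) for b in range(4) if a < b}
local = {i: i for i in range(4)}          # all four tasks in group 0
scattered = {i: 4 * i for i in range(4)}  # one task per group
```

In this model `max_link_load(traffic, local)` is 1.0 while the scattered mapping reaches 3.0, which is the congestion gap the optimization targets.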
The authors first describe the PERCS interconnect in detail, emphasizing the three hierarchical dimensions (intra‑chip, intra‑module, inter‑module/line) and the bandwidth characteristics of each. They then contrast the two routing mechanisms: direct routing guarantees minimal hop count but can concentrate traffic on a small set of links, while indirect routing spreads traffic across additional intermediate links at the cost of increased hop count and latency.
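The contrast between the two mechanisms can be sketched at the group level alone. The model below (a fixed number of groups with one link per ordered group pair, and a Valiant-style random intermediate for the indirect mode) is an assumption for illustration, not the actual PERCS wiring.

```python
import random

NUM_GROUPS = 8  # assumed group count for the toy model

def direct_route(src_group, dst_group):
    """Minimal route: at most one inter-group link."""
    return [] if src_group == dst_group else [(src_group, dst_group)]

def indirect_route(src_group, dst_group, rng):
    """Valiant-style route through a random intermediate group: up to
    twice the hops, but successive messages land on different links,
    spreading the load."""
    mid = rng.randrange(NUM_GROUPS)
    return direct_route(src_group, mid) + direct_route(mid, dst_group)
```

Every direct route between distinct groups is exactly one link; an indirect route is one or two, and which pair of links it uses varies with the random intermediate, which is precisely the hop-count-for-load-spreading trade the summary describes.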
A theoretical analysis follows, focusing on three canonical communication patterns that dominate HPC workloads: (1) all‑to‑all, where every process exchanges data with every other; (2) stencil (or nearest‑neighbor) patterns typical of PDE solvers; and (3) pipeline or asymmetric patterns where data flows predominantly in one direction. For each pattern the paper derives closed‑form expressions for the maximum link load under both routing modes, revealing that naïve placement can cause severe hotspot formation even on a high‑bisection network.
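The hotspot effect can be checked deterministically on a toy version of the analysis: a maximally skewed pattern in which every message travels from group 0 to group 1, with round-robin intermediates standing in for indirect routing. The group count, message count, and link model are assumptions, and the numbers below are illustrative rather than the paper's closed-form expressions.

```python
from collections import defaultdict

G = 8      # groups, one link per ordered group pair (assumed)
MSGS = 32  # messages, all sent from group 0 to group 1

def direct(_msg):
    return [(0, 1)]             # every message shares the same link

def indirect(msg):
    mid = msg % G               # round-robin intermediate group
    hops = []
    if mid != 0:
        hops.append((0, mid))   # first leg, unless mid is the source
    if mid != 1:
        hops.append((mid, 1))   # second leg, unless mid is the target
    return hops

def max_load(route):
    load = defaultdict(int)
    for msg in range(MSGS):
        for link in route(msg):
            load[link] += 1
    return max(load.values())
```

Direct routing piles all 32 messages onto link (0, 1), for a max load of 32; the round-robin indirect scheme caps any single link at 8, roughly 2·MSGS/G, mirroring the finding that indirect routing relieves hotspots for concentrated patterns.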
From these analyses the authors distill three design principles for good mappings:
- Preserve locality – co‑communicating tasks should be placed on physically adjacent nodes to keep hop counts low;
- Balance load across dimensions – traffic should be distributed evenly among chip, module, and line links to avoid saturation;
- Exploit bisection bandwidth – when possible, map communicating tasks across different modules so that the high‑capacity inter‑module links are utilized.
Guided by these principles, the authors propose three concrete mapping algorithms:
- Hierarchical Index Mapping – clusters tasks based on communication intensity, then assigns clusters to the hierarchy (chip → module → line) in a way that respects locality while spreading load.
- Static Hash Mapping – uses a deterministic hash of task identifiers to achieve a baseline uniform distribution, useful when communication patterns are unknown a priori.
- Mixed Routing Optimization – jointly selects a mapping and, for each communication pair, the routing mode (direct or indirect) that yields the lowest contribution to link load.
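As a hedged illustration, the first two algorithms might look like the sketch below. The node count, hash choice, and hierarchy fan-outs are assumptions, and the communication-intensity clustering step of the hierarchical scheme is omitted.

```python
import hashlib

NUM_NODES = 1024  # assumed machine size

def static_hash_map(task_id: int) -> int:
    """Deterministic hash of the task id: a uniform baseline placement
    when the communication pattern is unknown a priori."""
    digest = hashlib.sha256(str(task_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

def hierarchical_index_map(task_id: int,
                           tasks_per_chip: int = 8,
                           chips_per_module: int = 4) -> tuple:
    """Blocked walk of the chip -> module -> line hierarchy: consecutive
    tasks share a chip and consecutive chips share a module, preserving
    locality (the clustering-by-intensity step is left out here)."""
    chip = task_id // tasks_per_chip
    module = chip // chips_per_module
    return (module, chip % chips_per_module, task_id % tasks_per_chip)
```

Under the hierarchical scheme, tasks 0 and 1 land on the same chip, whereas the hash map scatters tasks with no regard for locality; mixed routing optimization would additionally pick, per communicating pair, whichever routing mode contributes less to link load.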
The authors evaluate the algorithms using a cycle‑accurate PERCS simulator configured with 1,024 nodes and realistic link speeds. Benchmarks include the HPCC suite, NAS Parallel Benchmarks, and synthetic workloads that emulate the three patterns. Results show that the hierarchical mapping combined with indirect routing reduces average communication latency by 15–30 % relative to random placement and cuts peak link utilization by up to 40 %, translating into a 25 % increase in overall network throughput. Stencil workloads benefit most from locality‑preserving direct routing, whereas all‑to‑all workloads achieve the greatest gains when indirect routing spreads traffic across the high‑bandwidth inter‑module links.
In conclusion, the study demonstrates that careful task mapping, informed by an understanding of PERCS’s hierarchical topology and routing options, can substantially improve performance without any hardware changes. The paper suggests future directions such as runtime‑adaptive remapping, energy‑aware placement, and machine‑learning‑driven traffic prediction to further exploit the flexibility of the PERCS architecture.