A Study of Network Congestion in Two Supercomputing High-Speed Interconnects


Tool Demonstration

[Figure: Congested link durations vs. PTS threshold for Blue Waters (Gemini) and Edison (Aries). Panels: (a) Gemini, (b) Aries.]

In this section, we describe the following two results obtained by applying the Monet tool to field congestion data:

  • impact of routing algorithms on congestion (see Subsection 4.1)

  • impact of heterogeneity in link-bandwidth on congestion (see Subsection 4.2)

Impact of Routing Algorithms

Figure 3 shows the quantile values for different congested link durations, i.e., durations for which the PTS value on the link is above a fixed threshold ($`PTS_{th}`$). The figure leads to the following insights:

  • Use of the dragonfly topology and adaptive routing has improved congestion control between two generations of Cray interconnects. The dragonfly topology used in Aries has a low global diameter of one hop, which helps to contain the back pressure of congested links. Furthermore, adaptive routing allows packets to take a longer but less congested path, which alleviates congestion on the minimal path. Figure 3 provides empirical evidence for this observation: for every $`PTS_{th}`$, the congested link duration in Aries is an order of magnitude shorter than in Gemini. For example, with the congestion threshold fixed at 15% PTS, the median duration is close to zero in both systems, but the 99.9th percentile duration is approximately 1 minute for Edison and 400 minutes for Blue Waters. However, although Aries manages long bouts of congestion better than Gemini does, application runtime variability due to network performance remains a concern.

  • *Detection of long-duration congestion using traffic measurements can facilitate interventions such as rank remapping or rescheduling of bully jobs*. The 99.9th percentile congested link duration observed in both systems for $`PTS_{th} \le 20\%`$ is greater than a minute. Such long-duration congestion allows us to tolerate greater latency for detection and diagnosis in real time. Moreover, a diagnosis can be converted into actionable feedback for tools such as TopoMesh, which can remap MPI ranks, or for the scheduler, which can reschedule bully jobs.
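The notion of a congested link duration used above can be made concrete with a small sketch. It assumes that PTS is sampled on each link at a fixed interval and that a congested duration is a maximal run of consecutive samples with PTS strictly above $`PTS_{th}`$. The function names, the 60-second sampling interval, the nearest-rank quantile rule, and the sample values are illustrative assumptions, not details taken from Monet.

```python
import math
from itertools import groupby


def congested_durations(pts_samples, pts_th, sample_interval_s=60.0):
    """Return the length, in seconds, of each maximal run of samples with PTS > pts_th.

    pts_samples: PTS percentages for one link, in time order (assumed evenly spaced).
    sample_interval_s: assumed spacing between samples; 60 s is a placeholder.
    """
    durations = []
    for is_congested, run in groupby(pts_samples, key=lambda pts: pts > pts_th):
        if is_congested:
            durations.append(sum(1 for _ in run) * sample_interval_s)
    return durations


def nearest_rank_quantile(values, q):
    """Nearest-rank quantile for 0 < q <= 1; returns 0.0 when there are no values."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical PTS trace for a single link (made-up numbers, not measured data).
trace = [5, 18, 22, 30, 10, 16, 3, 40, 41, 42, 43, 2]
durations = congested_durations(trace, pts_th=15)
median_s = nearest_rank_quantile(durations, 0.5)
tail_s = nearest_rank_quantile(durations, 0.999)
```

Repeating the computation for each threshold and pooling the durations over all links would give one quantile curve per percentile, which is the kind of summary Figure 3 reports.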

[Figure: Congested link durations for different link types in Gemini. Panels: (a) X+ and X-, (b) Y+ and Y-, (c) Z+ and Z-.]

[Figure: Congested link durations for different link types in Aries. Panels: (a) Green, (b) Black, (c) Blue.]

Heterogeneity in link bandwidth across different link types (electrical and optical links) increases the susceptibility to congestion. Figures 7(a), 7(b), and 7(c) show congested link durations at different quantile values for the X, Y, and Z directional links, respectively, of the Cray Gemini interconnect in Blue Waters. Figures 11(a), 11(b), and 11(c) show congested link durations at different quantile values for the Green, Black, and Blue links, respectively, of the Cray Aries interconnect in Edison. In Gemini, for higher thresholds ($`PTS_{th} \ge 20\%`$), links along the X direction have longer-lasting congestion than links along the Y and Z directions. Similarly, in Aries, optical links (Blue) have shorter and less severe bursts of congestion than the electrical links (Green and Black). Thus, mismatch and heterogeneity in link bandwidth lead to varying levels of congestion along network paths.
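A per-link-type comparison of this kind can be sketched as a grouping step followed by a quantile computation. The sketch below assumes that each congestion episode has already been reduced to a (link type, duration) record. The function name, the record layout, the nearest-rank quantile rule, and all of the durations are hypothetical and serve only to illustrate the grouping.

```python
import math
from collections import defaultdict


def quantile_by_link_type(records, q):
    """Group (link_type, duration_minutes) records and return a nearest-rank quantile per type."""
    grouped = defaultdict(list)
    for link_type, duration_minutes in records:
        grouped[link_type].append(duration_minutes)

    result = {}
    for link_type, durations in grouped.items():
        ordered = sorted(durations)
        rank = max(1, math.ceil(q * len(ordered)))  # nearest-rank rule, 0 < q <= 1
        result[link_type] = ordered[rank - 1]
    return result


# Made-up durations in minutes for Gemini-style link directions (not measured data).
records = [
    ("X", 12.0), ("X", 95.0), ("X", 30.0),
    ("Y", 4.0), ("Y", 9.0),
    ("Z", 3.0), ("Z", 7.0),
]
tail_by_type = quantile_by_link_type(records, 0.99)
```

With real data, the same grouping key could be the Aries link color (Green, Black, Blue) instead of the Gemini direction.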