Bandwidth in the Cloud

The quest for the best quality of service has led Cloud infrastructure clients to spread their services, content, and data over multiple cloud data-centers, often involving several Cloud Service Providers (CSPs). As a consequence, a large volume of data must be transmitted across the public Cloud, yet very little is known about its bandwidth dynamics. To address this, we conducted a bandwidth measurement campaign between eighteen data-centers of four major CSPs. This extensive campaign allowed us to characterize the resulting bandwidth time series as the sum of a stationary component and infrequent excursions (typically, downtimes). While the former describes the bandwidth users can expect in the Cloud, the latter is closely related to the Cloud's robustness (i.e., whether downtime occurrences are correlated across paths). Both components were studied further through a factor analysis, specifically ANOVA, as a mechanism to formally compare data-centers' behaviors and extract generalities. The results show that the stationary process is closely tied to the data-center locations and the CSPs involved in the transfers, which makes the Cloud more predictable and allows the reported measurements to be extrapolated to other paths. On the other hand, although correlation in the Cloud is low (only 10% of the measured pairs of paths showed some correlation), we found evidence that such correlation depends on the particular relationship between pairs of data-centers, with little link to more general factors. On the positive side, this implies that data-centers in the same area or belonging to the same CSP do not show qualitatively more correlation than other data-centers, which eases the deployment of robust infrastructures. On the downside, this metric is barely generalizable and, consequently, calls for exhaustive monitoring.


💡 Research Summary

The paper investigates the dynamics of bandwidth across public cloud infrastructures, a topic that has received little quantitative attention despite the growing practice of distributing services, content, and data across multiple data‑centers and cloud service providers (CSPs). To fill this gap, the authors performed a large‑scale measurement campaign spanning three months, during which they collected bidirectional TCP throughput data every five minutes between 18 data‑centers belonging to four major CSPs (Amazon Web Services, Microsoft Azure, Google Cloud, and IBM Cloud). The resulting time series were decomposed into two distinct components: a relatively stable “stationary” component that captures the typical bandwidth a user can expect, and occasional “excursions,” primarily downtimes, that represent abrupt performance degradations.
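To make the decomposition concrete, the following minimal Python sketch splits a per-path throughput series into a stationary part and excursions by flagging samples that fall below a fraction of the series median. The threshold, function name, and sample values are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def split_series(throughput_mbps, downtime_fraction=0.5):
    """Split a bandwidth time series into a stationary part and excursions.

    A sample is labeled an excursion (e.g., a downtime) when it falls below
    a fraction of the series median; this rule is illustrative only.
    """
    series = np.asarray(throughput_mbps, dtype=float)
    threshold = downtime_fraction * np.median(series)
    excursion_mask = series < threshold
    stationary = series[~excursion_mask]   # typical, predictable bandwidth
    excursions = series[excursion_mask]    # infrequent drops / downtimes
    return stationary, excursions, excursion_mask

# Example: five-minute samples on one inter-data-center path (made-up values)
samples = [940, 955, 930, 12, 0, 948, 960, 951]      # Mbps
stationary, excursions, mask = split_series(samples)
print(f"stationary mean: {stationary.mean():.0f} Mbps, "
      f"downtime share: {mask.mean():.1%}")
```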

Statistical analysis was carried out using a factorial approach (ANOVA) to identify which factors explain the variability of each component. For the stationary component, the analysis revealed that geographic location (country/continent) and the CSP involved are the dominant factors. Data‑centers located in the same region but belonging to different CSPs still exhibit markedly different average bandwidths, indicating that each provider’s internal routing policies and backbone infrastructure have a measurable impact. The authors report, for example, that intra‑US East‑West transfers differ by roughly 1.2 Gbps, while inter‑regional transfers (e.g., Europe‑Asia) show average gaps of about 800 Mbps. This strong dependence on location and provider makes the stationary bandwidth highly predictable and suggests that measurements taken from a subset of paths can be extrapolated to a broader set of similar paths.
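As an illustration of this factorial approach, the sketch below runs a one-way ANOVA over hypothetical per-path mean throughputs grouped by CSP using scipy.stats.f_oneway. The group names and values are invented for the example, and the paper's actual analysis may use a richer multi-factor design (e.g., location and CSP jointly).

```python
from scipy.stats import f_oneway

# Hypothetical per-path mean throughputs (Mbps), grouped by the CSP that
# hosts the destination data-center; names and values are illustrative only.
groups = {
    "csp_a": [950, 910, 880, 970, 940],
    "csp_b": [640, 600, 700, 660, 620],
    "csp_c": [810, 790, 850, 770, 830],
}

# One-way ANOVA: does the factor "CSP" explain a significant share of the
# variability of the stationary bandwidth component?
f_stat, p_value = f_oneway(*groups.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The CSP factor explains a significant part of the variability.")
```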

The excursion component, representing downtimes, accounts for only 2–3 % of all measurements but is crucial for assessing cloud robustness. Downtime events typically last between five and thirty minutes and tend to cluster around scheduled maintenance windows, yet they do not follow a clear global pattern. Pairwise correlation analysis showed that only about 10 % of the examined path pairs exhibit statistically significant positive correlation in their downtime occurrences. Importantly, this correlation is not driven by high‑level attributes such as “same region” or “same CSP.” Instead, it is linked to specific physical interconnections—paths that share the same backbone link, Internet exchange point (IX), or fiber conduit are more likely to experience simultaneous degradations. Consequently, while the overall low correlation supports the idea that multi‑data‑center deployments provide independent failure domains, the presence of correlated subsets warns operators to monitor those particular links closely and to design redundancy that avoids shared physical infrastructure.
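A simple way to reproduce this kind of pairwise analysis is sketched below: binary downtime indicators per path are correlated with scipy.stats.pearsonr, and pairs with a significant positive coefficient are flagged. The path names, event probability, and significance level are assumptions for illustration only, not the paper's data or exact test.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical binary downtime indicators per path (1 = downtime observed
# in that five-minute slot); path names and series are illustrative only.
rng = np.random.default_rng(0)
paths = {f"path_{i}": rng.binomial(1, 0.03, size=2000) for i in range(6)}

correlated_pairs = []
for (name_a, a), (name_b, b) in combinations(paths.items(), 2):
    r, p = pearsonr(a, b)
    if p < 0.05 and r > 0:          # significant positive co-occurrence
        correlated_pairs.append((name_a, name_b, r))

share = len(correlated_pairs) / len(list(combinations(paths, 2)))
print(f"{share:.0%} of path pairs show significant downtime correlation")
```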

Methodologically, the study’s reliance on TCP throughput measurements means that the findings may not directly translate to UDP‑based or application‑layer traffic patterns. Moreover, the three‑month observation window, although extensive, may not capture seasonal or long‑term trends. The authors acknowledge these limitations and propose future work that includes longer measurement periods, a broader set of protocols, and real‑time monitoring frameworks to better characterize the dynamic aspects of cloud bandwidth.

In summary, the paper makes three key contributions: (1) it empirically demonstrates that cloud bandwidth can be modeled as the sum of a predictable stationary component and rare, unpredictable excursions; (2) it identifies geographic location and CSP as the primary determinants of the stationary component, enabling extrapolation of measurements to untested paths; and (3) it shows that downtime correlation is low overall but concentrated among paths sharing specific physical infrastructure, implying that robust multi‑cloud architectures should diversify not only across providers and regions but also across underlying network routes. These insights provide cloud providers, enterprises, and researchers with a solid quantitative foundation for designing cost‑effective, high‑availability cloud deployments and for prioritizing monitoring resources where they are most needed.