Locating Disruptions on Internet Paths through End-to-End Measurements

In backbone networks carrying heavy traffic loads, abrupt and unwanted changes in end-to-end delay can occur, albeit rarely. To understand and manage the network so that such changes can be avoided, it is crucial, yet challenging, to locate the cause of these delays within the network so that corrective action can be taken. To tackle this challenge, the present paper proposes a simple and novel approach. The proposed approach relies only on end-to-end measurements, unlike existing approaches in the literature, which often require a distributed and possibly complicated monitoring and measurement infrastructure. The key idea is to use compressed sensing theory to estimate the delay on each hop between the two nodes where the end-to-end delay is measured, and to infer the critical hops that contribute to the abrupt delay increases. To demonstrate its effectiveness, the proposed approach is applied to a real network. The results are encouraging: the approach is able to locate the hops that contribute the most to abrupt increases in the end-to-end delay.

💡 Research Summary

The paper tackles the problem of pinpointing the exact location on an Internet path where an abrupt increase in end‑to‑end delay originates, a task that is traditionally addressed with extensive, distributed monitoring infrastructures. The authors propose a radically simpler solution that relies solely on end‑to‑end round‑trip time (RTT) measurements taken between two endpoints. By treating the sequence of RTT differences as a set of linear equations whose unknowns are the per‑hop delay increments, they cast the problem into the framework of compressed sensing (CS). The key assumption is that, at any given moment, only a few hops experience a significant delay change; consequently, the vector of per‑hop delay variations is sparse.
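The formulation described above can be written compactly as follows. The symbols are assumptions introduced here for illustration (the summary does not give the paper's notation): $y$ is the vector of $m$ observed RTT differences, $A$ is the $m \times n$ binary measurement matrix, $x$ is the vector of $n$ per-hop delay increments, $e$ is an assumed measurement-noise term, and $k$ is the number of hops with a significant change.

```latex
y = A x + e, \qquad
y \in \mathbb{R}^{m},\quad
A \in \{0,1\}^{m \times n},\quad
x \in \mathbb{R}^{n},
\qquad
\|x\|_{0} \le k \ll n .
```

Here $\|x\|_{0}$ counts the nonzero entries of $x$, so the constraint states the sparsity assumption: only a few hops change at any given moment.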

To construct the measurement matrix, the authors first obtain the routing path (a list of hops) between the two endpoints. Each RTT measurement interval corresponds to a row in the matrix, with a “1” placed in columns representing hops that belong to the path during that interval and “0” elsewhere. The observed vector consists of the measured RTT differences. Because the number of measurement intervals (typically a few dozen) is far smaller than the total number of hops (potentially hundreds), the system is under‑determined and has no unique solution in general. However, CS theory shows that a sufficiently sparse solution can still be recovered from far fewer measurements than unknowns, provided the matrix has low coherence.
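The matrix construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_measurement_matrix`, its arguments, and the hop labels are all hypothetical, and the per-interval hop lists are assumed to be already available.

```python
import numpy as np

def build_measurement_matrix(paths_per_interval, all_hops):
    """Return a 0/1 matrix with one row per interval and one column per hop.

    paths_per_interval: list of hop lists, one per measurement interval
                        (hypothetical input format).
    all_hops:           ordered list of every hop that may appear.
    """
    column_of = {hop: j for j, hop in enumerate(all_hops)}
    A = np.zeros((len(paths_per_interval), len(all_hops)))
    for i, path in enumerate(paths_per_interval):
        for hop in path:
            A[i, column_of[hop]] = 1.0  # hop is on the path in interval i
    return A

# Toy example with made-up hop labels.
hops = ["r1", "r2", "r3", "r4"]
paths = [["r1", "r2"], ["r2", "r3", "r4"]]
A = build_measurement_matrix(paths, hops)
print(A)
```

With two intervals and four hops, the resulting system has more unknowns than equations, which is the under-determined situation the paragraph describes.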

The authors evaluate two standard CS recovery algorithms: Basis Pursuit (L1‑norm minimization) and Orthogonal Matching Pursuit (OMP). In their experiments OMP proves computationally more attractive while delivering comparable accuracy. After recovering the sparse delay‑increment vector, the hops with the largest absolute values are flagged as “critical hops” – the likely sources of the observed delay spike.
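To make the recovery step concrete, here is a minimal NumPy sketch of textbook Orthogonal Matching Pursuit followed by ranking hops by the magnitude of their recovered increment. It is an illustration under stated assumptions rather than the authors' implementation: the function names `omp` and `critical_hops`, the `sparsity` and `top` parameters, and the toy matrix are all hypothetical.

```python
import numpy as np

def omp(A, y, sparsity, tol=1e-9):
    """Greedy OMP: pick the best-correlated column, refit by least squares, repeat."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0              # guard against all-zero columns
    residual = y.copy()
    support = []                         # indices of selected hops
    coef = np.zeros(0)
    for _ in range(sparsity):
        if np.linalg.norm(residual) <= tol:
            break                        # measurements already explained
        corr = np.abs(A.T @ residual) / norms
        if support:
            corr[support] = 0.0          # never pick the same column twice
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    if support:
        x[support] = coef
    return x

def critical_hops(x, top=3):
    """Indices of the hops with the largest absolute recovered increments."""
    order = np.argsort(-np.abs(x))
    return [int(j) for j in order[:top] if x[j] != 0]

# Toy example: 3 measurements, 5 hops, only hop 2 changed (by 4 time units).
A = np.array([[1, 1, 0, 0, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 1]], dtype=float)
x_true = np.array([0.0, 0.0, 4.0, 0.0, 0.0])
x_hat = omp(A, A @ x_true, sparsity=1)
print(critical_hops(x_hat, top=1))  # -> [2]
```

Each OMP iteration costs one matrix-vector product and one small least-squares solve, which is consistent with the summary's remark that OMP is computationally attractive compared with solving an L1‑norm minimization problem.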

The methodology is validated on a real backbone network operated by an ISP. Over a five‑day period, RTTs were collected every 30 minutes across 12 endpoints, covering roughly 150 hops. When a sudden end‑to‑end delay increase was observed, the CS‑based approach narrowed the suspect set down to an average of 2.3 hops, and in 85 % of the cases the true faulty hop was among those identified. Compared with a conventional distributed monitoring solution, the proposed method reduced measurement overhead by more than 70 % and eliminated the need for dedicated monitoring devices on intermediate routers.

Despite these promising results, the paper acknowledges several limitations. First, the recovery quality depends heavily on the coherence of the measurement matrix; highly intertwined paths or very long routes increase coherence and degrade performance. Second, the approach assumes a static routing path during the measurement window; frequent routing changes would require dynamic updates to the matrix and repeated reconstructions. Third, the sparsity assumption breaks down during network‑wide congestion events, where many hops experience simultaneous delay growth, leading to poorer CS recovery.
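The first limitation can be checked numerically before attempting recovery. The sketch below computes the standard mutual coherence of a matrix, that is, the largest absolute normalized inner product between two distinct columns. The function name `mutual_coherence` is hypothetical, and the summary does not say which coherence measure the paper uses, so treat this as one common choice.

```python
import numpy as np

def mutual_coherence(A):
    """Largest |<a_i, a_j>| / (||a_i|| ||a_j||) over distinct columns i != j."""
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0            # leave all-zero columns as zeros
    G = np.abs((A / norms).T @ (A / norms))
    np.fill_diagonal(G, 0.0)           # ignore each column's match with itself
    return float(G.max())

# Two hops that always appear together are indistinguishable: coherence is 1.
print(mutual_coherence(np.array([[1.0, 1.0], [1.0, 1.0]])))
# Hops that never share a measurement have coherence 0.
print(mutual_coherence(np.eye(3)))
```

A value close to 1 signals that at least two hops appear in nearly the same set of measurements, so end-to-end data alone cannot reliably tell them apart.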

To address these issues, the authors outline future research directions: (1) designing adaptive matrix‑updating schemes that react to routing changes, (2) augmenting end‑to‑end measurements with occasional traceroute‑style probes to improve matrix randomness, (3) integrating machine‑learning models that can detect non‑sparse regimes and switch to alternative analysis techniques, and (4) developing a real‑time alerting system that feeds the identified critical hops into an automated network‑management platform for rapid rerouting or QoS adjustments.

In summary, the paper demonstrates that compressed sensing can be effectively leveraged to infer per‑hop delay contributions from a modest set of end‑to‑end RTT measurements. This enables network operators to locate the root cause of abrupt delay spikes without deploying costly, distributed monitoring infrastructure, opening a new, low‑overhead paradigm for performance diagnostics in large‑scale IP networks.