Precise, Scalable and Online Request Tracing for Multi-tier Services of Black Boxes

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

As more and more multi-tier services are developed from commercial off-the-shelf components or heterogeneous middleware without source code available, both developers and administrators need a request tracing tool to (1) know exactly how a user request of interest travels through services of black boxes; and (2) obtain macro-level information on user request behavior without being inundated by massive logs. Previous research efforts either accept the imprecision of probabilistic correlation methods or present precise but unscalable tracing approaches that have to collect and analyze large amounts of logs. Moreover, previous precise request tracing approaches for black boxes fail to propose macro-level abstractions that enable debugging performance-in-the-large, and hence users have to interpret massive logs manually. This paper introduces a precise, scalable and online request tracing tool, named PreciseTracer, for multi-tier services of black boxes. Our contributions are four-fold: first, we propose a precise request tracing algorithm for multi-tier services of black boxes, which uses only application-independent knowledge; second, we present micro-level and macro-level abstractions, component activity graphs and dominated causal path patterns, which respectively represent the causal path of each individual request and the repeatedly executed causal paths that account for significant fractions of requests; third, we present two mechanisms, tracing on demand and sampling, to significantly increase system scalability; fourth, we design and implement an online request tracing tool. PreciseTracer’s fast response, low overhead and scalability make it a promising tracing tool for large-scale production systems.

💡 Research Summary

The paper addresses the growing need for request‑level tracing in multi‑tier services that are built from commercial off‑the‑shelf components or heterogeneous middleware for which source code is unavailable. In such “black‑box” environments, developers and operators must understand exactly how an individual user request traverses the system and also obtain high‑level performance insights without drowning in massive log files. Existing approaches fall into two unsatisfactory categories: probabilistic correlation methods that sacrifice precision, and precise techniques that require exhaustive log collection and thus do not scale. Moreover, prior precise solutions lack macro‑level abstractions, forcing users to manually sift through huge amounts of data to spot systemic bottlenecks.

PreciseTracer is introduced as a precise, scalable, and online tracing tool that overcomes these limitations. Its contributions are fourfold. First, it proposes a request‑tracing algorithm that relies only on application‑independent information—socket open/close events, thread identifiers, timestamps—captured at the operating‑system level. By ordering events chronologically and matching them via session identifiers, the algorithm reconstructs the exact causal chain of each request across all tiers, preserving both inter‑service communication and intra‑service processing times.
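The correlation idea described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `Event` record, the `correlate` function, and the edge labels are hypothetical names, and the sketch assumes that clocks on all hosts are synchronized, that each SEND is matched by exactly one RECV on the same session, and that a thread serves one request at a time.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical OS-level record; field names are assumptions."""
    ts: float        # timestamp (assumed comparable across hosts)
    host: str        # machine where the event was observed
    thread: int      # thread identifier on that machine
    kind: str        # "SEND" or "RECV"
    session: tuple   # (src_ip, src_port, dst_ip, dst_port)


def correlate(events):
    """Return causal edges as (parent, child, relation) triples."""
    edges = []
    pending_sends = defaultdict(deque)  # session -> SENDs not yet received
    last_in_context = {}                # (host, thread) -> latest event
    for ev in sorted(events, key=lambda e: e.ts):
        ctx = (ev.host, ev.thread)
        # Intra-service link: the previous event in the same thread.
        if ctx in last_in_context:
            edges.append((last_in_context[ctx], ev, "context"))
        # Inter-service link: a RECV pairs with the oldest unmatched SEND
        # on the same session.
        if ev.kind == "RECV" and pending_sends[ev.session]:
            edges.append((pending_sends[ev.session].popleft(), ev, "message"))
        elif ev.kind == "SEND":
            pending_sends[ev.session].append(ev)
        last_in_context[ctx] = ev
    return edges
```

Each "message" edge spans two tiers and each "context" edge stays inside one thread, so the timestamp difference along an edge gives a communication or processing delay, respectively.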

Second, the authors define two complementary abstractions. The Component Activity Graph (CAG) represents the fine‑grained causal path of a single request as a directed graph whose nodes are service components and whose edges encode communication or processing steps. The CAG enables developers to pinpoint the exact component and phase where latency or errors arise. The Dominated Causal Path Pattern (DCPP) aggregates repeatedly observed CAGs that account for a significant fraction of overall traffic, thereby providing a macro‑level view of the most common execution patterns. DCPPs serve as “performance‑in‑the‑large” signatures that can be monitored continuously.
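As a rough illustration of the macro-level aggregation, the sketch below groups per-request paths by a signature and keeps those that exceed a traffic-share threshold. Reducing a CAG to a tuple of component names is a simplifying assumption made here, and `dominated_patterns` and `min_fraction` are hypothetical names rather than the paper's API.

```python
from collections import Counter


def dominated_patterns(paths, min_fraction=0.1):
    """Return (pattern, share) pairs for paths seen often enough.

    paths: iterable of per-request component sequences, e.g.
           ["web", "app", "db", "app", "web"]
    min_fraction: minimum share of all requests a pattern must cover
                  (an illustrative threshold, not a value from the paper).
    """
    counts = Counter(tuple(p) for p in paths)
    total = sum(counts.values())
    return [
        (pattern, n / total)
        for pattern, n in counts.most_common()
        if n / total >= min_fraction
    ]
```

For example, if six of ten requests follow web, app, db, app, web and three follow web, app, web, a threshold of 0.2 would report those two patterns and drop the remaining one-off path.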

Third, to achieve scalability, PreciseTracer incorporates two mechanisms: tracing on demand and sampling. Tracing on demand lets administrators enable instrumentation only for selected services or time windows, eliminating unnecessary data collection during normal operation. Sampling selects a configurable subset of requests (e.g., 5‑20 %) for full tracing, dramatically reducing log volume while still preserving the statistical characteristics of the workload. The authors demonstrate that even with a 10 % sampling rate, the dominant DCPPs remain stable and the system can still detect injected latency faults.
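One common way to implement request sampling is a deterministic, hash-based decision, sketched below. This is an assumption about how such a mechanism could work, not a description of PreciseTracer's implementation; `should_trace` and the idea of keying on an entry-connection string are hypothetical.

```python
import zlib


def should_trace(request_key: str, rate: float) -> bool:
    """Decide whether to fully trace a request.

    request_key: any stable identifier for the request, for example the
                 client address and port of the entry connection
                 (a hypothetical choice).
    rate: target sampling rate between 0.0 and 1.0.
    """
    # CRC32 is deterministic, so the same key always gets the same answer.
    bucket = zlib.crc32(request_key.encode("utf-8")) % 10_000
    return bucket < int(rate * 10_000)
```

Because the decision depends only on the key, raising the rate never drops a request that was already being traced, which keeps traces comparable when the rate is tuned.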

Fourth, the paper details the design and implementation of an online system. Low‑level event capture is performed by kernel‑space instrumentation, which keeps perturbation of the traced services small. Captured events are streamed to a user‑space analysis engine that builds CAGs in real time, extracts DCPPs, and feeds a web‑based dashboard for visualization. The dashboard allows operators to drill down from a high‑level pattern view to the detailed activity graph of any individual request.

The evaluation consists of two major experiments. In a three‑tier web application (frontend, application server, database) driven at 10 000 transactions per second, PreciseTracer incurs an average CPU overhead of 2.3 % and consumes less than 200 MB of memory. When the sampling rate is reduced to 10 %, the overhead drops further while the tool still correctly identifies the top‑5 DCPPs that cover 68 % of traffic. A large‑scale simulation involving thousands of servers and millions of requests per second shows that PreciseTracer reduces storage and network transmission costs by more than 20× compared with full‑log approaches, yet still captures the essential performance patterns. Artificially injected delays (e.g., 120 ms network latency, 250 ms database query latency) appear precisely in the corresponding CAG nodes, confirming the algorithm’s accuracy.
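To show how an injected delay would surface in a reconstructed path, the sketch below computes the latency of each hop from timestamped activities and picks the slowest one. The function names and the activity labels are hypothetical, and any timestamps fed to it in an example would be illustrative values, not measurements from the paper.

```python
def hop_latencies(activities):
    """Latency of each consecutive hop along one causal path.

    activities: list of (label, timestamp_ms) pairs in causal order.
    Returns a list of ("a -> b", latency_ms) pairs.
    """
    return [
        (f"{a} -> {b}", t2 - t1)
        for (a, t1), (b, t2) in zip(activities, activities[1:])
    ]


def slowest_hop(activities):
    """Return the hop with the largest latency on the path."""
    return max(hop_latencies(activities), key=lambda hop: hop[1])
```

Given a path whose database tier receives a query at 6 ms and replies at 256 ms, `slowest_hop` would point at that database hop, which is the kind of localization the injected-delay experiment relies on.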

In summary, PreciseTracer delivers precise, end‑to‑end request tracing for black‑box multi‑tier services while remaining lightweight enough for production deployment. Its dual abstraction (CAG for micro‑level debugging, DCPP for macro‑level monitoring) bridges the gap between detailed root‑cause analysis and large‑scale performance management. The on‑demand and sampling strategies provide the scalability needed for modern data‑center workloads. The authors suggest future work on automated anomaly detection and machine‑learning‑driven pattern prediction to evolve PreciseTracer into a fully intelligent performance‑management platform.
