Decreasing log data of multi-tier services for effective request tracing


Previous work shows that request-tracing systems help developers understand and debug performance problems in multi-tier services. However, in large-scale data centers, hundreds of thousands of service instances provide online services at the same time. Existing white-box and black-box tracing systems produce large amounts of log data, which are correlated into large quantities of causal paths for performance debugging. In this paper, we propose an innovative algorithm to eliminate valueless logs of multi-tier services. Our experiment shows that our method filters out 84% of valueless causal paths and is promising for use in large-scale data centers.


💡 Research Summary

The paper addresses the severe scalability problem of request‑tracing systems in large‑scale data centers that host multi‑tier services. Traditional tracing approaches (both white‑box and black‑box) record every activity—BEGIN, END, SEND, RECEIVE—producing massive logs that must be shipped to a central correlation server and assembled into causal paths. In a realistic setting, a modest e‑commerce benchmark can generate 150 K events per minute, which translates to roughly 10 MB of log data per minute per node. With thousands of nodes, the volume quickly reaches 0.1 TB per minute, making storage, network bandwidth, and analysis costs prohibitive.
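The volume figures quoted above can be re-derived with a quick back-of-the-envelope calculation (the node count is an assumption for illustration; the summary only says "thousands of nodes"):

```python
# Back-of-the-envelope check of the log-volume figures quoted above.
# Assumption: 10,000 nodes stands in for "thousands of nodes".
mb_per_min_per_node = 10          # ~10 MB of log data per minute per node
nodes = 10_000                    # hypothetical data-center size
total_mb_per_min = mb_per_min_per_node * nodes
total_tb_per_min = total_mb_per_min / 1_000_000  # MB -> TB (decimal units)
print(f"{total_tb_per_min:.1f} TB per minute")
```

At that scale the log stream alone would saturate storage and network budgets, which motivates discarding valueless logs at the source rather than centrally.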

The authors observe that not all causal paths are equally valuable for performance debugging. A typical three‑tier web service (web server → application server → database) yields two categories of paths: (1) Simple causal paths, which involve only the client and the first tier (static content requests); and (2) Complex causal paths, which traverse at least two tiers and therefore contain richer interaction information. Empirical data shows that about 80 % of requests are simple, and in their experiments more than 60 % of observed paths are simple. Consequently, eliminating simple paths could dramatically reduce the amount of data without sacrificing diagnostic power.

To achieve this, the authors extend their previous PreciseTracer framework with an elimination algorithm that operates at the transformation stage on each service node. The algorithm works as follows:

  1. Feature extraction – The first RECEIVE activity on port 80 (the web server) is examined. Its payload size is taken as a distinguishing feature because simple requests (static files) are typically much smaller than complex ones (which carry parameters, session IDs, etc.).
  2. Threshold determination – Using k‑means clustering on a sample of message sizes, two clusters are identified (small vs. large). The midpoint between the cluster centroids becomes the size threshold.
  3. Thread‑level state tracking – A map <Tid, state> records the current state of each thread (start, simple, complex, end). When a BEGIN log arrives, the thread is marked “start”. Upon the first RECEIVE, if the message size exceeds the threshold the thread transitions to “complex”; otherwise it becomes “simple”. END logs set the state to “end”.
  4. Selective logging – Only activities belonging to threads in the “complex” state are transformed into tuple records and forwarded to the correlation server. All other logs (simple‑state threads) are discarded locally.
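The four steps above can be sketched as follows. This is a minimal illustration, not the PreciseTracer implementation: the function names, the tuple layout `(tid, activity, size)`, and the buffering of pre-classification logs are assumptions made for the sketch.

```python
# Sketch of the elimination algorithm described above (illustrative, not the
# authors' code). Logs are (tid, activity, size) tuples, where activity is
# one of BEGIN / RECEIVE / SEND / END and size is the message payload size.

def size_threshold(sizes, iters=20):
    """Steps 1-2: 1-D k-means with k=2 on sampled message sizes;
    the threshold is the midpoint between the two cluster centroids."""
    lo, hi = min(sizes), max(sizes)
    for _ in range(iters):
        small = [s for s in sizes if abs(s - lo) <= abs(s - hi)]
        large = [s for s in sizes if abs(s - lo) > abs(s - hi)]
        lo = sum(small) / len(small)
        hi = sum(large) / len(large)
    return (lo + hi) / 2

def eliminate(logs, threshold):
    """Steps 3-4: per-thread state tracking plus selective logging.
    Only activities of threads classified as "complex" are forwarded."""
    state = {}      # tid -> "start" | "simple" | "complex" | "end"
    pending = {}    # buffers a thread's logs until it is classified
    forwarded = []
    for log in logs:
        tid, activity, size = log
        if activity == "BEGIN":
            state[tid] = "start"
            pending[tid] = [log]
        elif state.get(tid) == "start":
            pending[tid].append(log)
            if activity == "RECEIVE":
                # The first RECEIVE decides the class by payload size.
                if size > threshold:
                    state[tid] = "complex"
                    forwarded.extend(pending.pop(tid))  # flush buffer
                else:
                    state[tid] = "simple"
                    pending.pop(tid)                    # discard locally
        elif state.get(tid) == "complex":
            forwarded.append(log)
            if activity == "END":
                state[tid] = "end"
        elif activity == "END":        # simple-state thread finishing
            state[tid] = "end"
    return forwarded
```

One design point worth noting: a thread's BEGIN arrives before its first RECEIVE, so the sketch buffers pre-classification activities and flushes them only once the thread turns out to be complex; discarding them eagerly would lose the head of every complex path.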

The algorithm offers three practical benefits: (a) it dramatically cuts the volume of logs transmitted over the network; (b) it reduces the storage and processing burden on the central correlation server; and (c) it shrinks the final set of causal paths, focusing analysis on the most informative ones.

Evaluation was conducted on a RUBiS deployment with three physical nodes: Apache web server, JBoss application server, and MySQL database. Each node runs on a dual‑Pentium III (1 GB RAM) system under RedHat 4.1.1‑30, with SystemTap providing kernel‑level instrumentation. The experiment consisted of a 1.5‑minute sampling period, followed by transformation and elimination, and finally correlation of the remaining logs. Results are summarized in Table 1:

  • Original logs: 9.5 M records → transformed to 11 M tuple records.
  • After elimination: tuple records reduced to 2.5 M (‑77 %).
  • Causal paths dropped from 12,373 to 1,997 (‑84 %).
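The percentage reductions above can be re-derived from the raw counts as a quick consistency check, using only the numbers quoted from Table 1:

```python
# Re-deriving the reduction percentages from the counts in Table 1.
tuples_before, tuples_after = 11_000_000, 2_500_000
paths_before, paths_after = 12_373, 1_997

tuple_cut = 1 - tuples_after / tuples_before   # ~0.77
path_cut = 1 - paths_after / paths_before      # ~0.84
print(f"tuple records cut by {tuple_cut:.0%}")
print(f"causal paths cut by {path_cut:.0%}")
```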

All surviving causal paths were complex, confirming that the elimination step preserved the diagnostically valuable information while discarding the bulk of irrelevant data.

In conclusion, the study demonstrates that a lightweight, size‑based classification combined with per‑thread state tracking can effectively prune unnecessary tracing data in multi‑tier services. This makes request tracing feasible even in environments with hundreds of thousands of instances. The authors acknowledge that the fixed size threshold may not generalize across all workloads; future work will explore adaptive thresholding, richer feature sets, and machine‑learning classifiers to broaden applicability and improve robustness.

