A Method for the Characterisation of Observer Effects and its Application to OML
In all measurement campaigns, one needs to assert that the instrumentation tools do not significantly impact the system being monitored. This is critical to future claims based on the collected data, yet is sometimes overlooked in experimental studies. We propose a method to evaluate the potential “observer effect” of an instrumentation system, and apply it to the OMF Measurement Library (OML). OML allows the instrumentation of almost any software to collect any type of measurement. As it is increasingly used in networking research, it is important to characterise possible biases it may introduce in the collected metrics. Thus, we study its effect on multiple types of reports from various applications commonly used in wireless research. To this end, we designed experiments comparing OML-instrumented software with the original versions. Our analyses of the results from these experiments show that, with an appropriate reporting setup, OML has no significant impact on the instrumented applications, and may even improve their performance in specific cases. We discuss our methodology and the implications of using OML, and provide guidelines on instrumenting off-the-shelf software.


💡 Research Summary

The paper addresses a fundamental but often overlooked problem in experimental networking research: the “observer effect,” i.e., the impact that measurement instrumentation itself may have on the system under study. To quantify this effect, the authors propose a systematic methodology consisting of three stages. First, a baseline performance is established by running the uninstrumented version of an application under a controlled workload. Second, the same workload is executed on a version of the application that has been instrumented with the OMF Measurement Library (OML). Third, statistical tests (t‑tests or non‑parametric equivalents) are applied to determine whether any observed differences are statistically significant. The methodology emphasizes randomisation of inputs, repeated runs to capture variability, and the use of confidence intervals to guard against spurious conclusions.
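The baseline-versus-instrumented comparison at the heart of this methodology can be sketched in a few lines. The following is a minimal Python illustration, not the authors' exact statistical machinery: it uses a normal-approximation confidence interval on the difference of means (appropriate for the repeated-run counts described above), and the throughput samples are invented example data.

```python
import math
import statistics

def diff_ci(baseline, instrumented, z=1.96):
    """95% normal-approximation CI for mean(instrumented) - mean(baseline).

    For small sample counts, a t-distribution quantile should replace
    the fixed z value; this sketch keeps to the standard library.
    """
    d = statistics.mean(instrumented) - statistics.mean(baseline)
    se = math.sqrt(statistics.variance(baseline) / len(baseline)
                   + statistics.variance(instrumented) / len(instrumented))
    return d - z * se, d + z * se

# Invented example data: throughput (Mbit/s) over repeated runs.
baseline = [94.1, 93.8, 94.3, 94.0, 93.9, 94.2, 94.1, 93.7, 94.0, 94.2]
instrumented = [94.0, 93.9, 94.1, 93.8, 94.2, 94.0, 93.9, 94.1, 93.8, 94.0]

lo, hi = diff_ci(baseline, instrumented)
if lo <= 0.0 <= hi:
    print("no significant observer effect detected")
else:
    print(f"significant difference: CI = ({lo:.3f}, {hi:.3f})")
```

If the interval straddles zero, the instrumented and uninstrumented runs are statistically indistinguishable at the chosen confidence level, which is exactly the outcome the methodology is designed to test for.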

With this framework in place, the authors evaluate OML—a flexible library that can be embedded in virtually any software to emit arbitrary measurements to a collection server. Because OML is increasingly adopted in wireless and networking experiments, understanding any biases it may introduce is critical. The study selects a representative set of tools commonly used in wireless research: iperf3 for throughput measurement, ping for latency, the ns‑3 network simulator for event logging, and an OpenWrt‑based router firmware for real‑world packet handling. For each tool, two binaries are built: the original “vanilla” version and an OML‑instrumented version. Both are executed on identical hardware and network conditions, while varying OML configuration parameters such as transmission mode (synchronous vs. asynchronous), batch size, sampling interval, and optional compression.
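Sweeping the OML configuration space described above amounts to enumerating the cross product of the varied parameters. A sketch of such an experiment grid follows; the parameter names and candidate values here paraphrase the summary and are not OML's actual option names:

```python
import itertools

# Hypothetical experiment grid mirroring the parameters varied in the study.
modes = ["synchronous", "asynchronous"]
batch_sizes = [1, 50, 100, 200]
sampling_intervals_ms = [1, 10, 100]
compression = [False, True]

grid = list(itertools.product(modes, batch_sizes,
                              sampling_intervals_ms, compression))
print(len(grid))  # 2 * 4 * 3 * 2 = 48 configurations

# Each configuration would be run twice: once with the vanilla binary
# (where the OML parameters are simply ignored) and once instrumented.
for mode, batch, interval, comp in grid[:2]:
    print(mode, batch, interval, comp)
```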

The experimental results reveal three key insights. First, when OML is configured with sensible defaults—namely asynchronous transmission, batch sizes between 50 and 200 samples, and a sampling interval of at least 10 ms—the library introduces negligible overhead. Across iperf3 and ping, the average increase in CPU usage is under 1.2 %, and statistical analysis yields p‑values well above the conventional 0.05 threshold, indicating no significant performance degradation. Second, in certain scenarios OML can actually improve performance. For example, iperf3’s default behaviour of writing detailed logs to local disk is replaced by remote streaming via OML, reducing disk I/O contention and resulting in a modest (≈3 %) increase in achieved throughput. Third, the library’s overhead becomes pronounced when measurements are taken at very high frequencies (≤1 ms). In this regime, the internal queuing and thread‑synchronisation mechanisms of OML generate a noticeable number of context switches, leading to CPU overheads of 8–10 % and occasional packet loss if the network path cannot keep up. Consequently, the authors recommend limiting high‑frequency sampling to cases where hardware‑level timers or kernel‑space tracing tools (e.g., eBPF, perf) are also employed.
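The batching trade-off behind these numbers can be illustrated with a back-of-the-envelope cost model: each sample pays a fixed injection cost, while the flush cost of a reporting round is amortised across the batch. The constants below are invented for illustration only, not measured values from the paper:

```python
def per_sample_overhead_us(batch_size, inject_us=2.0, flush_us=150.0):
    """Amortised per-sample cost in microseconds: a fixed injection cost
    plus a batch flush cost shared across the batch (invented constants)."""
    return inject_us + flush_us / batch_size

for b in (1, 10, 50, 200):
    print(b, round(per_sample_overhead_us(b), 2))
```

With a batch size of 1 the flush cost dominates every sample; by a batch size of 50 it is nearly fully amortised, which is consistent with the 50-200 range found to introduce negligible overhead.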

Based on these findings, the paper proposes a set of practical guidelines for researchers who wish to adopt OML. (1) Choose a sampling interval that matches the processing capacity of the target system; intervals of 10 ms or longer are generally safe. (2) Prefer asynchronous transmission (TCP or UDP) to avoid blocking the instrumented application, and keep batch sizes in the 50–200 range to balance latency and bandwidth usage. (3) Enable compression only when the CPU budget permits, as it can add extra processing overhead. (4) Always record a baseline measurement without OML before conducting the instrumented experiment, enabling a direct before‑and‑after comparison. These recommendations aim to minimise the observer effect while preserving the flexibility and richness of data that OML provides.
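Taken together, the guidelines lend themselves to a simple pre-flight check before launching an instrumented run. The helper below is hypothetical (the parameter names are ours, not OML's); the thresholds are those stated in the guidelines above:

```python
def check_reporting_config(interval_ms, batch_size, asynchronous,
                           compress, cpu_headroom):
    """Return a list of warnings for a reporting setup, following the
    guidelines summarised above (thresholds from the text, names invented)."""
    warnings = []
    if interval_ms < 10:
        warnings.append("sampling interval below 10 ms: expect measurable overhead")
    if not asynchronous:
        warnings.append("synchronous transmission may block the application")
    if not 50 <= batch_size <= 200:
        warnings.append("batch size outside the 50-200 range")
    if compress and not cpu_headroom:
        warnings.append("compression enabled without spare CPU budget")
    return warnings

print(check_reporting_config(10, 100, True, False, False))  # []
```

A configuration that passes all four checks matches the "sensible defaults" regime in which the study observed no significant observer effect.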

Finally, the authors argue that the presented methodology is not limited to OML. It can be applied to any measurement framework, such as DTrace, eBPF, or perf, to rigorously assess its impact on target applications. Future work will extend the evaluation to additional hardware platforms (ARM, RISC‑V) and cloud‑native environments, and will explore automated tuning tools that can dynamically adjust OML parameters to keep overhead below a user‑defined threshold. By providing both a robust evaluation method and concrete best‑practice advice, the paper contributes a valuable resource for the networking research community, ensuring that the data collected in experiments remain trustworthy and that the instrumentation itself does not become a hidden source of bias.