Characterization of Performance Anomalies in Hadoop
With the huge variety of data and equally large-scale systems, there is no single execution setting that can guarantee the best performance for every query. In this project, we tried to study the impact of different execution settings on the execution time of workloads by varying them one at a time. Using the data from these experiments, a decision tree was built in which each internal node represents an execution parameter, each branch represents a value chosen for that parameter, and each leaf node represents a range of execution time in minutes. The attribute to split the dataset on is selected based on the maximum information gain, i.e., the lowest resulting entropy. Once the tree is trained with the training samples, it can be used to obtain an approximate range for the expected execution time. When the actual execution time differs from this expected value, a performance anomaly can be detected. For a test dataset with 400 samples, 99% of samples had an actual execution time within the range predicted by the decision tree. By analyzing the constructed tree, one can also gain an idea of which configuration gives better performance for a given workload. Initial experiments suggest that the impact an execution parameter has on the target attribute (here, execution time) is related to the distance of that feature node from the root of the constructed decision tree. In initial results, the percent change in the target attribute across the values of a feature node close to the root is six times larger than when the same feature node is farther from the root. This observation depends on how well the decision tree was trained and may not hold in every case.
💡 Research Summary
The paper addresses the problem of performance tuning and anomaly detection in Hadoop clusters by adopting a data‑driven approach based on decision‑tree modeling. Recognizing that there is no single configuration that guarantees optimal execution time for every query across diverse data sets and large‑scale systems, the authors conduct a systematic experimental campaign in which they vary one Hadoop execution parameter at a time while keeping all others fixed. The parameters examined include memory allocation per task, number of map and reduce slots, HDFS block size, disk I/O scheduler, network buffer sizes, and several other knobs that are known to influence MapReduce performance.
For each configuration the authors run a representative set of workloads—classic benchmarks such as WordCount, Sort, Join, as well as a real‑world log‑analysis job—and record the total execution time in minutes. In total, 400 distinct samples are collected, each consisting of a vector of parameter values and the corresponding observed execution time. The dataset is then pre‑processed: categorical values are encoded, missing entries are eliminated, and the target variable (execution time) is discretized into intervals that will later become leaf labels.
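The pre-processing step above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the bin edges, parameter names (`block_size`, `mem_per_task`), and sample values are hypothetical, chosen only to show how a continuous execution time is discretized into the interval labels that later become leaf classes.

```python
# Hypothetical sketch: bin continuous execution times (minutes) into
# interval labels, and separate parameter vectors from target labels.
# Bin edges and parameter values are illustrative, not from the paper.

def discretize(time_min, edges=(10, 12, 15, 20)):
    """Map an execution time in minutes to an interval label."""
    lo = 0
    for hi in edges:
        if time_min < hi:
            return f"{lo}-{hi} min"
        lo = hi
    return f">{lo} min"

# Each raw sample pairs a configuration vector with an observed time.
samples = [
    {"block_size": "128MB", "mem_per_task": "2GB", "time": 11.4},
    {"block_size": "256MB", "mem_per_task": "4GB", "time": 9.7},
]

# Drop the raw time and keep (configuration, interval-label) pairs.
dataset = [
    ({k: v for k, v in s.items() if k != "time"}, discretize(s["time"]))
    for s in samples
]
```

With this encoding, the learning problem becomes ordinary multi-class classification over the interval labels.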
The core of the methodology is the construction of a classification decision tree using the CART algorithm (or an equivalent information‑gain based splitter). At each internal node the algorithm selects the parameter that yields the highest information gain (i.e., the greatest reduction in entropy) with respect to the target. Each branch corresponds to a concrete value of that parameter, and the recursion continues until a stopping criterion—maximum depth, minimum number of samples per leaf, or negligible entropy gain—is met. The resulting tree has a hierarchical structure where the root node represents the most influential parameter, and leaves are labeled with execution‑time ranges such as “10–12 min”, “12–15 min”, etc.
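The split criterion described above can be sketched with a short information-gain computation. This is a toy reconstruction under stated assumptions, not the paper's implementation: the feature names (`mem`, `sched`), their values, and the labels are invented to show why a parameter that cleanly separates the time classes ends up near the root.

```python
# Minimal sketch of information-gain split selection (entropy reduction),
# as used by CART-style splitters. Features and labels are hypothetical.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from partitioning the dataset on `feature`."""
    base = entropy(labels)
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(row[feature], []).append(lab)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - weighted

rows = [
    {"mem": "2GB", "sched": "cfq"},
    {"mem": "2GB", "sched": "deadline"},
    {"mem": "4GB", "sched": "cfq"},
    {"mem": "4GB", "sched": "deadline"},
]
labels = ["12-15 min", "12-15 min", "10-12 min", "10-12 min"]

# `mem` separates the two time classes perfectly (gain = 1 bit),
# while `sched` carries no information (gain = 0), so `mem` is
# chosen as the root split.
best = max(rows[0], key=lambda f: information_gain(rows, labels, f))
```

The recursion then repeats this selection on each partition until the stopping criteria mentioned above are met.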
Once trained, the tree serves two purposes. First, it provides a quick estimate of the expected execution‑time interval for any new configuration: one simply traverses the tree according to the configuration’s parameter values and reads the leaf label. Second, it acts as an anomaly detector. If the actual execution time of a job falls outside the predicted interval (especially far outside), the system flags the run as a performance anomaly, prompting further investigation. In the authors’ evaluation, 99% of the 400 test samples fell within the predicted interval, demonstrating high predictive fidelity.
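Both uses of the trained tree can be sketched together. The hard-coded tree, parameter names, and thresholds below are purely illustrative stand-ins for a learned model; they show only the mechanics of traversing to a leaf and flagging a run whose measured time falls outside the predicted interval.

```python
# Hedged sketch: predict an execution-time interval by tree traversal,
# then flag an anomaly when the observed time leaves that interval.
# The tree structure and values are hypothetical, not the learned model.

tree = {  # nested dict: feature -> {value -> subtree or leaf label}
    "mem_per_task": {
        "2GB": {"io_sched": {"cfq": "12-15 min", "deadline": "10-12 min"}},
        "4GB": "10-12 min",
    }
}

def predict(node, config):
    """Follow the configuration's parameter values down to a leaf label."""
    while isinstance(node, dict):
        feature = next(iter(node))       # the parameter tested at this node
        node = node[feature][config[feature]]
    return node

def is_anomaly(interval, actual_min):
    """True if the measured time (minutes) lies outside 'lo-hi min'."""
    lo, hi = interval.split(" ")[0].split("-")
    return not (float(lo) <= actual_min <= float(hi))

cfg = {"mem_per_task": "2GB", "io_sched": "cfq"}
expected = predict(tree, cfg)            # "12-15 min"
flagged = is_anomaly(expected, 27.0)     # True: far above the interval
```

In practice one would also tolerate small excursions near the interval boundaries before raising an alert, since the leaf ranges are coarse by construction.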
Beyond prediction, the tree itself is interpreted as a visual map of parameter importance. The distance of a parameter node from the root correlates with its impact on the target attribute. Empirically, the authors observe that the percentage change in execution time associated with a parameter located near the root is roughly six times larger than the change associated with the same parameter when it appears deeper in the tree. For example, increasing the per‑task memory from 2 GB to 4 GB (a root‑level split) reduced execution time by about 20 %, whereas altering the same memory setting at a lower‑level node (e.g., after a split on the I/O scheduler) produced only a 3 % effect. This finding suggests that the tree’s topology can guide practitioners toward the most “bang‑for‑the‑buck” tuning actions.
The paper also discusses limitations and future work. The experimental dataset is confined to a specific cluster size, network bandwidth, and a limited set of workloads; thus, the learned tree may not generalize to dramatically different environments without retraining. Decision trees, while interpretable, may miss complex non‑linear interactions among parameters that ensemble methods (Random Forests, Gradient Boosted Trees) could capture more effectively. Moreover, the current model is static; in a production setting where node failures, resource contention, or workload mixes evolve over time, an online learning or incremental update mechanism would be desirable.
In conclusion, the study demonstrates that a simple, interpretable decision‑tree model can accurately capture the relationship between Hadoop configuration parameters and job execution time, enable high‑precision performance‑range predictions, and serve as a practical tool for automatic anomaly detection. The hierarchical representation also yields actionable insights for system administrators, indicating which configuration knobs merit immediate attention during performance tuning. The authors propose extending the approach to larger, more heterogeneous datasets, exploring ensemble learners for improved accuracy, and integrating real‑time model updates to handle dynamic cluster conditions.