Exploring Non-Homogeneity and Dynamicity of High Scale Cloud through Hive and Pig
Cloud computing deals with heterogeneity and dynamicity at all levels, and there is therefore a need to manage resources in such an environment and allocate them properly. Resource planning and scheduling require a proper understanding of arrival patterns and of how resources are scheduled. Studying workloads can aid in understanding their associated environment. Google released the latest version of its cluster trace, version 2.1, in November 2014. The trace consists of cell information covering about 29 days and spanning roughly 700k jobs. This paper presents a statistical analysis of this cluster trace. Since the trace is very large, Hive, a platform built on the Hadoop Distributed File System (HDFS) for querying and analysis of big data, has been used. Hive was accessed through its Beeswax interface, and the data was imported into HDFS through HCatalog. Apart from Hive, Pig, a scripting language that provides an abstraction on top of Hadoop, was used. To the best of our knowledge, the analytical method adopted by us is novel and has helped in gaining several useful insights. Clustering of jobs and arrival times has been done in this paper using K-means++ clustering, followed by analysis of the distribution of job arrival times, which revealed a Weibull distribution, while resource usage was close to a Zipf-like distribution and process runtimes revealed a heavy-tailed distribution.
💡 Research Summary
The paper addresses the fundamental challenge of managing resources in a cloud environment that is intrinsically heterogeneous and dynamic at every layer. To devise effective planning and scheduling policies, a deep statistical understanding of workload arrival patterns, resource consumption, and execution characteristics is required. The authors focus on Google’s publicly released Cluster Trace version 2.1 (November 2014), which captures 29 days of activity from a production cell, covering roughly 700,000 jobs. Because the dataset is massive, the study leverages the Hadoop ecosystem: raw logs are ingested into HDFS, metadata is handled through HCatalog, and Hive (accessed via the Beeswax web UI) is used for large‑scale SQL‑like querying. Hive’s strength lies in its ability to perform aggregations, joins, and filters over terabytes of data, enabling the extraction of per‑job attributes such as CPU demand, memory usage, disk I/O, start and end timestamps, and job class identifiers.
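The per‑job aggregation that the study expresses in HiveQL (a GROUP BY over task‑usage rows) can be sketched in plain Python. The field names `job_id`, `cpu`, and `mem` are hypothetical stand‑ins for the trace’s actual schema, and the records are invented; this is only a minimal illustration of the kind of query involved.

```python
from collections import defaultdict

# Illustrative task-usage rows; field names are hypothetical stand-ins
# for the trace's real schema, and the values are invented.
records = [
    {"job_id": 1, "cpu": 0.12, "mem": 0.05},
    {"job_id": 1, "cpu": 0.30, "mem": 0.07},
    {"job_id": 2, "cpu": 0.05, "mem": 0.01},
]

def aggregate_per_job(rows):
    """Group usage rows by job and sum resources -- the shape of the
    GROUP BY aggregation a HiveQL query would express over the trace."""
    totals = defaultdict(lambda: {"cpu": 0.0, "mem": 0.0})
    for r in rows:
        totals[r["job_id"]]["cpu"] += r["cpu"]
        totals[r["job_id"]]["mem"] += r["mem"]
    return dict(totals)

agg = aggregate_per_job(records)
```

In Hive the same aggregation runs as distributed MapReduce jobs over HDFS, which is what makes it feasible at terabyte scale.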
In parallel, the authors employ Pig Latin scripts to conduct more intricate preprocessing steps that are cumbersome in pure HiveQL. Pig’s procedural abstraction allows the definition of data‑flow pipelines, custom UDFs, and iterative transformations, complementing Hive’s declarative approach and improving overall development productivity.
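Pig Latin’s data‑flow style (LOAD, FILTER, FOREACH … GENERATE) can be mimicked with chained Python generators. The field names and threshold below are illustrative, not from the trace schema; the sketch only shows the pipeline shape that makes Pig convenient for preprocessing.

```python
def load(rows):
    # Analogue of Pig's LOAD: yield raw records one at a time.
    for r in rows:
        yield r

def filter_long(rows, min_runtime):
    # Analogue of FILTER: keep jobs running at least min_runtime seconds.
    for r in rows:
        if r["runtime"] >= min_runtime:
            yield r

def project(rows, fields):
    # Analogue of FOREACH ... GENERATE: keep only selected fields.
    for r in rows:
        yield {f: r[f] for f in fields}

data = [{"job": "a", "runtime": 5, "cpu": 0.1},
        {"job": "b", "runtime": 120, "cpu": 0.4}]
pipeline = project(filter_long(load(data), 60), ["job", "cpu"])
result = list(pipeline)
```

Each stage is lazy and composable, which is the productivity advantage the authors attribute to Pig over writing the same transformations as nested HiveQL subqueries.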
The core analytical workflow consists of two stages. First, the authors apply K‑means++ clustering to the multidimensional feature space (arrival time, requested resources, runtime). K‑means++ improves upon classic K‑means by selecting initial centroids in a probabilistically informed manner, reducing the risk of poor local minima in a dataset of this scale. The algorithm yields five to seven coherent clusters, each representing a distinct workload class—for example, short‑interval, low‑resource batch jobs; long‑running, high‑resource data‑processing jobs; and medium‑size interactive tasks.
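The distinctive step of K‑means++ is its seeding rule: after the first centroid is drawn uniformly, each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal one‑dimensional sketch (the paper clusters multidimensional features; the points here are invented):

```python
import random

def kmeanspp_seed(points, k, rng=None):
    """K-means++ seeding: first centroid uniform at random; each later
    centroid sampled with probability proportional to squared distance
    from the nearest centroid already chosen."""
    rng = rng or random.Random(0)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance of every point to its nearest centroid.
        d2 = [min((p - c) ** 2 for c in centroids) for p in points]
        r = rng.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids

# Three well-separated groups of 1-D points (illustrative data).
points = [0.1, 0.2, 0.15, 5.0, 5.1, 9.8, 10.0]
seeds = kmeanspp_seed(points, 3)
```

Because far‑away points are preferred as seeds, the three initial centroids tend to land in distinct groups, which is exactly the property that reduces the risk of poor local minima on a large trace.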
Second, the authors fit probability distributions to three key temporal and resource dimensions. Inter‑arrival times of jobs follow a Weibull distribution, capturing the early burstiness and later tapering typical of cloud workloads. Resource usage (CPU and memory) exhibits a Zipf‑like heavy‑tail, indicating that a small fraction of jobs consume a disproportionate share of the cluster’s capacity—a manifestation of the Pareto principle. Finally, job runtimes are best described by a heavy‑tailed (power‑law) distribution, confirming that a minority of long‑running jobs dominate overall system latency.
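A Weibull fit of inter‑arrival times can be illustrated end to end: draw samples by inverse‑transform sampling, then estimate the shape parameter by least squares on the linearized CDF, ln(−ln(1−F)) = k·ln(x) − k·ln(λ). The paper does not specify its fitting method; this is just one simple, self‑contained way to do it on synthetic data.

```python
import math
import random

def weibull_sample(n, shape, scale, seed=1):
    """Inverse-transform sampling from a Weibull(shape, scale)."""
    rng = random.Random(seed)
    return [scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)
            for _ in range(n)]

def weibull_shape_estimate(xs):
    """Least-squares slope of ln(-ln(1-F)) against ln(x); for Weibull
    data this slope converges to the shape parameter k."""
    xs = sorted(xs)
    n = len(xs)
    pts = []
    for i, x in enumerate(xs):
        F = (i + 0.5) / n  # plotting-position estimate of the CDF
        pts.append((math.log(x), math.log(-math.log(1.0 - F))))
    mx = sum(px for px, _ in pts) / n
    my = sum(py for _, py in pts) / n
    num = sum((px - mx) * (py - my) for px, py in pts)
    den = sum((px - mx) ** 2 for px, _ in pts)
    return num / den

samples = weibull_sample(5000, shape=0.8, scale=10.0)
k_hat = weibull_shape_estimate(samples)  # should be close to 0.8
```

A shape parameter below 1, as recovered here, corresponds to bursty arrivals (decreasing hazard rate), consistent with the burstiness the authors observe.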
These statistical findings have direct implications for scheduler design. A Weibull‑based arrival model can inform predictive scaling policies that provision extra capacity ahead of anticipated spikes. Recognizing Zipf‑like resource consumption suggests that naïve “big‑job‑first” allocation may lead to resource fragmentation; instead, schedulers should incorporate fairness or throttling mechanisms to prevent a few dominant jobs from starving the rest. The heavy‑tailed runtime distribution underscores the necessity of preemptive or priority‑based scheduling to mitigate the impact of straggler tasks on overall job completion times.
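The straggler effect of a heavy‑tailed runtime distribution on a naive scheduler is easy to quantify: under FIFO a single long job delays everything queued behind it, while ordering by runtime (shortest‑job‑first, a stand‑in for the priority‑based policies the summary suggests; the job runtimes are invented) sharply lowers mean completion time.

```python
def avg_completion(runtimes):
    """Mean completion time when jobs run back-to-back on one machine
    in the given order."""
    t, total = 0.0, 0.0
    for r in runtimes:
        t += r       # job finishes at cumulative time t
        total += t
    return total / len(runtimes)

# One heavy-tailed straggler among short jobs (illustrative values).
jobs = [100.0, 1.0, 1.0, 1.0, 1.0]
fifo = avg_completion(jobs)          # straggler runs first
sjf = avg_completion(sorted(jobs))   # shortest-job-first
```

Here FIFO yields a mean completion time of 102 while shortest‑job‑first yields 22.8, which is why priority or preemption matters so much when a minority of long jobs dominates.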
The paper also acknowledges several limitations. It analyzes only a single trace from Google’s data center, so the generalizability to other cloud providers or workload mixes remains uncertain. The methodology is retrospective; applying the insights in real‑time would require streaming ingestion and online clustering capabilities not covered in the study. Moreover, K‑means++ assumes roughly spherical clusters in a linear feature space; alternative techniques such as DBSCAN, Gaussian Mixture Models, or hierarchical clustering could capture non‑linear relationships more effectively.
In summary, the work demonstrates a practical pipeline that combines Hive and Pig to process a massive cloud workload trace, applies K‑means++ clustering to uncover distinct job classes, and rigorously models arrival, resource, and runtime characteristics using Weibull, Zipf‑like, and heavy‑tailed distributions respectively. By translating raw trace data into actionable statistical models, the study provides valuable guidance for designing more adaptive, fair, and efficient resource management and scheduling algorithms in large‑scale, heterogeneous cloud environments.