Analysis and Clustering of Workload in Google Cluster Trace based on Resource Usage
Cloud computing has gained interest among commercial organizations, research communities, developers, and other individuals during the past few years. To move research in data management and large-scale data processing forward, we need benchmark datasets that are freely and publicly accessible. In May 2011, Google released a trace of a cluster of about 11,000 machines, referred to as the Google Cluster Trace. This trace contains cell information spanning about 29 days. This paper analyzes resource usage and requirements in the trace and attempts to give insight into this kind of production trace, similar to those found in cloud environments. The major contributions of this paper include a statistical profile of jobs based on resource usage, clustering of workload patterns, and classification of jobs into different types based on k-means clustering. Although there have been earlier analyses of this trace, our analysis provides several new findings, such as that jobs in a production trace exhibit a trimodal distribution and that symmetry occurs among the tasks within the long job type.
💡 Research Summary
The paper presents a comprehensive analysis of the publicly released Google Cluster Trace—a dataset capturing 29 days of activity from a production‑scale cluster of roughly 11,000 machines. The authors begin by preprocessing the raw logs, cleaning missing and anomalous entries, and aggregating per‑task measurements of four primary resources: CPU cycles, memory consumption, disk I/O, and network bandwidth. From these time‑series aggregates they compute per‑job statistics (mean, max, variance) and construct a high‑dimensional feature vector for each job.
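The aggregation step described above can be sketched in a few lines. This is a minimal, illustrative version, not the authors' code: the task identifiers and sample values are hypothetical, and only one resource (CPU) is shown, where the paper builds the same (mean, max, variance) triple per resource.

```python
import statistics

# Hypothetical per-task CPU time series, keyed by (job_id, task_id).
# Identifiers and values are illustrative, not from the actual trace schema.
task_cpu = {
    ("job_a", 0): [0.10, 0.12, 0.11, 0.40],
    ("job_a", 1): [0.09, 0.13, 0.10, 0.38],
    ("job_b", 0): [0.50, 0.55, 0.52, 0.51],
}

def job_features(task_series):
    """Pool each job's task samples, then compute (mean, max, variance)."""
    per_job = {}
    for (job_id, _task_id), samples in task_series.items():
        per_job.setdefault(job_id, []).extend(samples)
    return {
        job_id: (statistics.mean(vals), max(vals), statistics.pvariance(vals))
        for job_id, vals in per_job.items()
    }

features = job_features(task_cpu)
print(features["job_b"])  # (mean, max, variance) of job_b's CPU samples
```

Concatenating one such triple per resource yields the high-dimensional feature vector that the clustering step consumes.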
Statistical profiling is the first analytical layer. Jobs are divided into three duration‑based categories—short, medium, and long—and the distribution of each resource metric is examined using histograms and kernel density estimates. A striking finding is that CPU usage across all three categories exhibits a trimodal distribution: a sharp peak for short, bursty jobs, a moderate plateau for medium‑length jobs, and a lower, relatively flat region for long‑running jobs. Memory usage shows high variance with occasional spikes, especially among long jobs that likely correspond to batch analytics or machine‑learning workloads. Disk and network metrics display clear diurnal patterns, with higher demand during typical business hours (09:00–18:00) and reduced activity at night, mirroring real‑world enterprise traffic.
The core contribution lies in the application of k‑means clustering to the resource‑usage vectors. The authors determine the optimal number of clusters (k = 4) through silhouette analysis and the elbow method. The resulting clusters are interpreted as: (1) CPU‑intensive jobs, dominated by short, interactive services; (2) Memory‑intensive jobs, largely long‑running batch pipelines that require large in‑memory data structures; (3) I/O‑intensive jobs, characterized by high disk and network throughput (e.g., backups, large file transfers); and (4) Balanced jobs, where CPU, memory, and I/O are used in comparable proportions. Each cluster’s composition is quantified, revealing that CPU‑intensive jobs constitute the majority of short jobs, while memory‑intensive jobs dominate the long‑job cohort.
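The clustering step amounts to Lloyd's k-means algorithm over the per-job feature vectors. The sketch below is a simplified stand-in, not the authors' implementation: it uses toy 2-D vectors (e.g. normalized CPU vs. memory) and k = 3 for brevity, where the paper clusters higher-dimensional vectors with k = 4 chosen via silhouette analysis and the elbow method.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means: assign points to nearest center, recompute centers."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)  # naive initialization from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Recompute each center as its cluster mean; keep old center if empty.
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Toy (cpu, mem) job vectors; illustrative only.
jobs = [(0.9, 0.1), (0.85, 0.15), (0.1, 0.9), (0.15, 0.85), (0.5, 0.5), (0.55, 0.45)]
centers, clusters = kmeans(jobs, k=3)
print(centers)
```

In practice one would run this for a range of k values and plot the within-cluster sum of squares (the "elbow") alongside silhouette scores, which is how the summary says k = 4 was selected.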
Beyond clustering, the study uncovers two novel phenomena not previously reported. First, within the long‑job category, many tasks share almost identical resource‑usage profiles, indicating a symmetry that likely stems from the scheduler reusing a common template or from multiple replicas of the same service running concurrently. This symmetry suggests that workload prediction models could treat groups of tasks as a single statistical entity, simplifying forecasting. Second, the trimodal CPU distribution itself is a new statistical insight into production traces, challenging the common assumption of a unimodal or heavy‑tailed distribution in cloud environments.
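The observation that many long-job tasks share near-identical profiles can be operationalized by grouping tasks whose feature vectors fall within a small distance of a group representative. The greedy grouping below is an illustrative sketch under that assumption, not the paper's method; the task names, profiles, and the `eps` threshold are all hypothetical.

```python
import math

def group_similar_tasks(profiles, eps=0.05):
    """Greedily group tasks whose vectors lie within eps of a group's first member."""
    groups = []
    for name, vec in profiles.items():
        for g in groups:
            if math.dist(vec, profiles[g[0]]) <= eps:
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical per-task (cpu, mem) profiles within one long job.
profiles = {
    "t0": (0.30, 0.60),
    "t1": (0.31, 0.59),  # near-identical replica of t0
    "t2": (0.70, 0.20),  # a distinct task
    "t3": (0.29, 0.61),  # another replica of t0
}
print(group_similar_tasks(profiles))  # → [['t0', 't1', 't3'], ['t2']]
```

A forecasting model could then treat each group as a single statistical entity, which is the simplification the symmetry finding suggests.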
The authors compare their findings with earlier analyses of the same trace, emphasizing that prior work focused mainly on overall utilization trends or simple classification schemes. By integrating detailed statistical profiling with unsupervised learning, this paper provides a richer, multi‑dimensional view of workload behavior. The practical implications are significant: (i) cloud operators can design more nuanced autoscaling policies that react differently to CPU‑intensive bursts versus memory‑heavy batch phases; (ii) the identified symmetry among long‑job tasks enables more efficient scheduling, as a single resource reservation can cover multiple similar tasks; and (iii) the diurnal patterns support time‑of‑day provisioning strategies to reduce energy costs.
Finally, the paper contributes a reproducible research artifact: the processed dataset and the clustering code are made publicly available, encouraging further exploration and validation. In sum, the work advances our understanding of real‑world cloud workloads by revealing trimodal CPU usage, task‑level symmetry, and a robust four‑cluster taxonomy, all of which can inform smarter resource management, cost optimization, and future trace‑driven research.