The Improved Job Scheduling Algorithm of Hadoop Platform

This paper discusses several job scheduling algorithms for the Hadoop platform and, in view of the shortcomings of the algorithms currently in use, proposes a job scheduling optimization algorithm based on Bayes classification. The proposed algorithm can be summarized as follows. The jobs in the job queue are classified into good jobs and bad jobs by Bayes classification. When the JobTracker receives a task request, it selects a good job from the job queue and allocates tasks from that job to the requesting TaskTracker; the execution result is then fed back to the JobTracker. By learning from this feedback, the algorithm adjusts the job classification, so that the JobTracker selects the most appropriate job to execute on the TaskTracker each time. Two kinds of feature variables are considered: job features, which describe a job's resource usage (for instance, its average CPU usage rate and average memory usage rate), and node features, which describe the influence of TaskTracker resources on task execution (such as the CPU usage rate and the size of idle physical memory). Results show a significant improvement in the execution efficiency and stability of job scheduling.

💡 Research Summary

The paper addresses the well‑known problem of inefficient job scheduling in Hadoop clusters, where traditional schedulers (FIFO, Capacity, Fair) allocate tasks based primarily on static priorities or simple resource quotas. The authors argue that such approaches ignore two crucial dimensions: (1) the typical resource consumption pattern of each job (average CPU usage, average memory usage, I/O intensity, etc.) and (2) the current state of each TaskTracker node (available CPU percentage, free physical memory, network bandwidth, etc.). To bridge this gap, they propose a Bayesian‑based scheduling algorithm that classifies incoming jobs into “good” or “bad” categories using a Naïve Bayes classifier.
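The classify, select, and feed-back loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name JobClassifier, the function select_job, the discretized feature values (such as "cpu_high" or "busy"), the Laplace smoothing, and the rule of taking the first queued job classified as good are all assumptions made for illustration.

```python
from collections import defaultdict


class JobClassifier:
    """Tiny categorical Naive Bayes over discretized feature values (illustrative only)."""

    def __init__(self, labels=("good", "bad")):
        self.labels = labels
        self.label_counts = {label: 0 for label in labels}
        # feature_counts[label][(feature_index, value)] -> number of observations
        self.feature_counts = {label: defaultdict(int) for label in labels}
        # values_seen[feature_index] -> set of values observed so far
        self.values_seen = defaultdict(set)

    def update(self, features, label):
        """Feedback step: record the observed outcome for one feature vector."""
        self.label_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][(i, value)] += 1
            self.values_seen[i].add(value)

    def score(self, features, label):
        """Unnormalized posterior P(label) * prod_i P(x_i | label), Laplace-smoothed."""
        total = sum(self.label_counts.values())
        n_label = self.label_counts[label]
        prob = (n_label + 1) / (total + len(self.labels))
        for i, value in enumerate(features):
            n_values = len(self.values_seen[i] | {value})
            prob *= (self.feature_counts[label][(i, value)] + 1) / (n_label + n_values)
        return prob

    def classify(self, features):
        """Return the label with the highest score (ties go to the first label)."""
        return max(self.labels, key=lambda label: self.score(features, label))


def select_job(classifier, job_queue, node_features):
    """Return the first queued job classified as good on this node, else None.

    job_queue is a list of (job_name, job_features) pairs; job and node
    features are tuples that are concatenated into one feature vector.
    """
    for job_name, job_features in job_queue:
        if classifier.classify(job_features + node_features) == "good":
            return job_name
    return None


# Hypothetical feedback history: (job CPU, job memory, node load) -> outcome.
clf = JobClassifier()
clf.update(("cpu_high", "mem_high", "busy"), "bad")
clf.update(("cpu_high", "mem_high", "busy"), "bad")
clf.update(("cpu_low", "mem_low", "busy"), "good")
clf.update(("cpu_low", "mem_low", "idle"), "good")

queue = [("heavy", ("cpu_high", "mem_high")), ("light", ("cpu_low", "mem_low"))]
chosen = select_job(clf, queue, ("busy",))
```

With this hypothetical history, a busy node is offered the lightweight job rather than the CPU- and memory-heavy one. Calling update after each task completes plays the role of the feedback step, so later classifications reflect the observed outcomes.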

Algorithm Overview

Feature Extraction – For every job, the system records historical averages of CPU, memory, and I/O usage (job features). For every node, it continuously monitors real‑time CPU load, idle memory, and other resource metrics (node features). These two sets are concatenated into a feature vector x.
Bayesian Classification – The classifier computes the posterior probability
\