Grid technology is evolving into a global, service-orientated architecture: a universal platform for delivering future high-demand computational services. Strong adoption of the Grid and of the utility computing concept is leading to an increasing number of Grid installations running a wide range of applications of different size and complexity. In this paper we address the problem of delivering deadline/economy-based scheduling in a heterogeneous application environment using statistical properties of historical job executions and their associated meta-data. This approach is motivated by a study of six months of computational load generated by Grid applications in a multi-purpose Grid cluster serving a community of twenty e-Science projects. The observed job statistics, resource utilisation and user behaviour are discussed in the context of the management approaches and models most suitable for supporting a probabilistic and autonomous scheduling architecture.
Increasing demand for high-performance computer systems in recent years has helped establish Grid technology as an attractive choice for academic research clusters and high-demand business computing alike [1]. However, the proliferation of Grid installations, encouraged by the low entry barriers typical of off-the-shelf components and open-source licensing, has also meant that the Grid is evolving into a global, service-orientated utility platform. The Grid's migration from the cutting edge to the mainstream is already evident, driven in part by financial benefits (such as reduced total cost of ownership and better value through economies of scale) as well as by policy shifts within large computational consumers. As this shift takes place, an associated change in the Grid application landscape will follow. Our work is focused on enabling more efficient and user-friendly scheduling in a future Grid serving a rich mix of diverse e-Science applications. We will argue that in such environments, a combination of deadline- and economy-based scheduling can be delivered by utilising statistical models of application execution times and resource requirements based on historical data, the meta-data normally associated with submitted jobs, and the current state of the Grid system.
The need for more efficient scheduling has been discussed previously [2][3][4][5]. Departing from the batch model inherited from legacy cluster systems, a deadline/economy-based system better suited to human workflow would allow users to specify a deadline by which they expect their job to finish and a nominal price they are prepared to pay. Such scheduling systems cannot be built unless predictions can be made of the execution length and resource requirements of the jobs pending in the queue. Previous research in predicting execution times has focused on deep analysis of the application source code [6], instrumentation of the application, or relinking to specialised prediction libraries customised for both the type of application and its target hardware [7]. Some of these methods have given encouraging results [8], but the development, deployment and administration effort can only be justified for the highest-end computational resources.
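To make the deadline/economy idea concrete, the following minimal sketch shows one possible way a job request carrying a deadline and a price limit could be checked against a predicted runtime. The class, field and parameter names (JobRequest, predicted_runtime, price_per_cpu_hour, and so on) are illustrative assumptions and not part of the scheduler described in this paper.

```python
# Illustrative sketch only: names and the admission rule are assumptions,
# not the scheduling system described in this paper.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRequest:
    job_id: str
    deadline: datetime   # time by which the user expects the job to finish
    max_price: float     # nominal price the user is prepared to pay
    meta: dict           # meta-data submitted with the job (group, app name, ...)


def admit(job: JobRequest, predicted_runtime: timedelta,
          queue_wait: timedelta, price_per_cpu_hour: float) -> bool:
    """Accept the job only if the predicted completion time meets the deadline
    and the estimated cost stays within the user's price limit."""
    expected_finish = datetime.now() + queue_wait + predicted_runtime
    estimated_cost = price_per_cpu_hour * predicted_runtime.total_seconds() / 3600.0
    return expected_finish <= job.deadline and estimated_cost <= job.max_price
```

In such a scheme, the quality of the admission decision rests entirely on the runtime prediction, which is why statistical predictions derived from historical executions are central to the approach argued for here.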
We have taken the view that in delivering predictions for a varied application landscape executing on a utility Grid platform, spot prediction accuracy comes second to prediction speed and the level of administrator intervention required. With this in mind, we have strived to develop a largely self-managing system. Working in a heterogeneous application space and delivering statistics-based predictions requires a good understanding of the computational load presented to a Grid cluster. Studies of production, multi-purpose Grids are rare, and the interaction and multiplexing effects of various applications are unknown. Therefore, in Section 2 we present important aspects of our six-month study of the University College London (UCL) production Grid cluster and their implications for cluster resource management and scheduling. Section 3 details the proposed probabilistic Grid scheduling model, while Section 4 gives preliminary simulation results. The direction of our future work and conclusions are outlined in Section 5, while Section 6 gives references.
To establish the feasibility of statistical prediction methods, and to gain further insight into the dynamics of job scheduling on a production Grid, a study of the workload generated by twenty e-Science projects on a two-hundred-CPU Sun Grid Engine [9] cluster was undertaken. The fully analysed data includes around 50,000 jobs from the six-month period July to December 2004, while validation of the observations was run on data up to and including May 2005. Considering that the ultimate source of the computational load was human workflow, a significant degree of correlation, patterns and trends was expected. Our aim was to establish whether these really do occur, and if so, whether they can be exploited for predictions with usable accuracy in the context of resource management and scheduling.
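One crude indicator of whether such patterns exist is the correlation between the execution times of consecutive jobs. The sketch below, a simplifying assumption rather than the analysis actually performed in the study, computes the lag-1 autocorrelation of log-transformed wall-clock times supplied as a plain list of seconds in submission order.

```python
# Minimal sketch, assuming wall-clock times are available as a list of seconds
# per job in submission order; the lag-1 autocorrelation of the log-transformed
# times is one crude indicator of exploitable patterns between consecutive jobs.
import math


def lag1_autocorrelation(wallclock_seconds: list) -> float:
    logs = [math.log(t) for t in wallclock_seconds if t > 0]
    n = len(logs)
    mean = sum(logs) / n
    var = sum((x - mean) ** 2 for x in logs)
    cov = sum((logs[i] - mean) * (logs[i + 1] - mean) for i in range(n - 1))
    return cov / var


# Example with made-up runtimes (seconds): a value well above zero would
# suggest that consecutive jobs have related execution times.
print(lag1_autocorrelation([120.0, 130.0, 118.0, 3600.0, 3550.0, 3700.0]))
```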
In our previous publication [10], we reported on the data collection and analysis process. The observations presented here are based on the analysis of wall-clock execution times: the real (elapsed) time of process execution, which is greater than or, for a perfectly optimised process, equal to the actual CPU time. The data has been examined in two ways: by treating every job consecutively (Figure 1) and by clustering the jobs according to their meta-data (Figure 2). The latter shows clustering based on one of the readily available fields in the accounting file, the Unix group name of the submitter. The group is assigned by the administrator and is loosely related to the e-Science project the users are involved with.

Fig. 1. Time-series plot of wall-clock execution times for a six month period

Figure 1 shows a log-normal plot of all recorded job execution times in the observed six month period.
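As a rough illustration of the kind of grouping behind Figure 2, the sketch below reads wall-clock times from a Sun Grid Engine accounting file and clusters them by the submitter's Unix group. The field positions used (group at index 2, ru_wallclock at index 13) follow the commonly documented accounting file layout but are an assumption here and should be verified against the local installation.

```python
# Sketch of clustering wall-clock times by the submitter's Unix group, as read
# from an SGE accounting file. Field positions are assumed, not taken from the
# paper, and should be checked against the local accounting(5) layout.
from collections import defaultdict
from statistics import mean, median


def wallclock_by_group(accounting_path: str) -> dict:
    groups = defaultdict(list)
    with open(accounting_path) as f:
        for line in f:
            if line.startswith("#"):          # skip comment/header lines
                continue
            fields = line.rstrip("\n").split(":")
            if len(fields) < 14:
                continue
            group, wallclock = fields[2], float(fields[13])
            groups[group].append(wallclock)
    return groups


if __name__ == "__main__":
    for group, times in wallclock_by_group("accounting").items():
        print(f"{group}: n={len(times)} mean={mean(times):.0f}s median={median(times):.0f}s")
```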