Dynamic Deferral of Workload for Capacity Provisioning in Data Centers


Authors: Muhammad Abdullah Adnan (University of California San Diego), Ryo Sugihara (Amazon.com), Yan Ma (Shandong University), and Rajesh K. Gupta (University of California San Diego)

Abstract — Recent increases in energy prices have led researchers to seek better ways of provisioning capacity in data centers to reduce the energy wasted due to workload variation. This paper explores the opportunity for cost saving by utilizing the flexibility in Service Level Agreements (SLAs) and proposes a novel approach for capacity provisioning under bounded-latency requirements on the workload. We investigate how many servers to keep active and how much workload to delay for energy saving while meeting every deadline. We present an offline LP formulation for capacity provisioning by dynamic deferral and give two online algorithms that determine the capacity of the data center and the assignment of workload to servers dynamically. We prove the feasibility of the online algorithms and show that their worst-case performance is bounded by a constant factor with respect to the offline formulation. We validate our algorithms on a MapReduce workload by provisioning capacity on a Hadoop cluster and show that the algorithms perform much better in practice than naive 'follow the workload' provisioning, resulting in 20-40% cost savings.

I. INTRODUCTION

With the advent of cloud computing, data centers are emerging all over the world and their energy consumption has become significant: an estimated 61 million MWh per year, costing about 4.5 billion dollars [1]. Naturally, energy efficiency in data centers has been pursued in various ways, including the use of renewable energy [2], [3] and improved cooling efficiency [4], [5], [6]. Among these, improved scheduling algorithms are a promising approach because of their broad applicability regardless of hardware configuration.
Among the attempts to improve scheduling [6], [7], recent effort has focused on optimizing schedules under performance constraints imposed by Service Level Agreements (SLAs). Typically, an SLA specification provides a measure of flexibility in scheduling that can be exploited to improve performance and efficiency [8], [9]. In particular, latency is an important performance metric for any web-based service and is of great interest to service providers who run their services in data centers. The goal of this paper is to utilize the flexibility from the SLAs for different types of workload to reduce energy consumption. The idea of utilizing SLA information to improve performance and efficiency is not entirely new: recent work explores the use of application deadline information to improve application performance (e.g., see [9], [10]). However, the opportunities for energy efficiency remain unexplored, certainly in a manner that establishes bounds on the energy cost of the proposed solutions.

In this paper, we are interested in minimizing the energy consumption of a data center under latency/deadline guarantees. We use the deadline information to defer some tasks so that we can reduce the total cost of the energy consumed for executing the workload and for switching the state of the servers. We determine the portion of the released workload to be executed in the current time slot and the portions to be deferred to later time slots without violating deadlines. Our approach is similar to 'valley filling', which is widely used in data centers to utilize server capacity during periods of low load [7]. However, the load used for valley filling mostly consists of background/maintenance tasks (e.g., web indexing, data backup), which differ from the actual workload. In fact, current valley-filling approaches ignore the workload characteristics for capacity provisioning.
In this paper, we determine how much work to defer for valley filling in order to reduce current and future energy consumption while provably ensuring satisfaction of SLA requirements. Later, we generalize our approach to workloads in which different jobs have different deadlines. This paper makes three contributions.

First, we present an LP formulation for capacity provisioning with dynamic deferral of workload. The formulation determines not only the capacity but also the assignment of workload for each time slot. As a result, the utilization of each server can be determined easily and resources can be allocated accordingly; the method therefore adapts well to other scheduling policies that take into account dynamic resource allocation, priority-aware scheduling, etc.

Second, we design two optimization-based online algorithms, depending on the nature of the deadlines. For uniform deadlines, our algorithm, named Valley Filling with Workload (VFW(δ)), looks ahead δ slots to optimize the total energy consumption. The algorithm uses the valley-filling approach to defer some workload for execution in periods of low load. For nonuniform deadlines, we design a Generalized Capacity Provisioning (GCP) algorithm that reduces the switching (on/off) of servers by balancing the workload across adjacent time slots, thereby reducing energy consumption. We prove the feasibility of the solutions and show that the performance of the online algorithms is bounded by a constant factor with respect to the offline formulation.

Third, we validate our algorithms using MapReduce traces (a representative workload for data centers) and evaluate the cost savings achieved via dynamic deferral. We run simulations over a wide range of settings and show significant savings in each of them. Over a period of 24 hours, we find more than 40% total cost saving for GCP and around 20% total cost saving for VFW(δ), even for small deadline requirements.
We compare the two online algorithms under different parameter settings and find that GCP gives more cost savings than VFW(δ). To show that our algorithms work on real systems, we perform experiments on a 35-node Hadoop cluster and find energy savings of ~6.02% for VFW(δ) and ~12% for GCP over a period of 4 hours. The experimental results show that the peak energy consumption for the operation of a data center can be reduced by provisioning capacity and scheduling workload with our algorithms.

Fig. 1. Illustration of (a) the original workload and (b) the distinction between batch and small interactive jobs.

The rest of the paper is organized as follows. Section II presents the model that we use to formulate the optimization and gives the offline formulation. In Section III, we present the VFW(δ) algorithm for determining capacity and workload assignment dynamically when the deadline is uniform. In Section IV, we present the GCP algorithm for nonuniform deadlines. Section V discusses the extension of our algorithms to long jobs. Sections VI and VII present the simulation and experimental results, respectively. Section VIII describes the state of the art in capacity provisioning, and Section IX concludes the paper.

II. MODEL FORMULATION

In this section, we describe the model we use for capacity provisioning via dynamic deferral. We note that the assumptions used in this model are minimal, and the formulation captures many properties of current data center capacity and workload characteristics.

A. Workload Traces

To build a realistic model, we need real workload traces from data centers, but data center providers are reluctant to publish their production traces due to privacy issues and competitive concerns.
To overcome the scarcity of publicly available traces, efforts have been made to extract summary statistics from production traces, and workload generators based on those statistics have been proposed [11], [12]. For the purposes of this paper, we use such a workload generator and the MapReduce traces released by Chen et al. [11]. The MapReduce framework is widely used in data centers and serves as a representative workload in which each job consists of three steps of computation: map, shuffle, and reduce [14]. Figure 1(a) illustrates the statistical MapReduce traces over 24 hours generated from real Facebook traces. Typically, the workload traces consist of a mix of batch and interactive jobs. Chen et al. carried out an interactive analysis to classify the workload and showed that it is dominated (~98%) by small, interactive jobs exhibiting significant and unpredictable variation over time. Table I shows the classification of the MapReduce traces by k-means clustering on the sizes of the map, shuffle, and reduce stages (in bytes) with k = 10, and Figure 1(b) shows the difference in time variation between the long batch jobs and the small interactive jobs.

TABLE I
CLUSTER SIZES AND MEDIANS BY k-MEANS CLUSTERING ON THE MAPREDUCE TRACE

# Jobs   % Jobs   Input    Shuffle   Output
5691     96.56    15 KB    0         685 KB
116      1.97     44 GB    15 GB     84 MB
27       0.46     56 GB    145 GB    16 GB
23       0.39     123 GB   0         52 MB
19       0.32     339 KB   0         48 GB
8        0.14     203 GB   404 GB    3 GB
5        0.08     529 GB   0         53 KB
3        0.05     46 KB    0         199 GB
1        0.02     7 TB     48 GB     101 GB
1        0.02     913 GB   8 TB      61 KB

To cope with the large variation in the small, interactive workload, valley-filling methods have been proposed that use low-priority batch jobs to fill in the periods of low workload [13]. However, Chen et al. have shown that the portion of low-priority long jobs (~2%) is insufficient to reduce the variation in (i.e., to smooth) the workload curve [12].
In this paper, we propose valley filling with workload (a mix of long and interactive jobs) and devise algorithms for capacity provisioning by scheduling jobs under bounded-latency requirements.

B. Workload Model

We consider a workload model in which the total workload varies over time. The time interval of interest is t ∈ {0, 1, ..., T}, where T can be arbitrarily large. In practice, T can be a year, and the length of a time slot τ can be as small as 2 minutes (the minimum time required to change the power state of a server). In our model, each job has a deadline D (in number of slots) associated with it, where D is a nonnegative integer. In other words, a job released at time t must be executed by the end of time slot t + D. We first model small interactive jobs with length less than τ; in Section V, we extend the model to a mix of jobs with arbitrary lengths. Based on the nature of the deadlines, we have two cases: (i) uniform deadline, where the deadline is the same for all jobs, and (ii) nonuniform deadlines, where different jobs have different deadlines. Sections II and III formulate the model and algorithm for the uniform-deadline case; the nonuniform-deadline case is considered in Section IV.

Let L_t be the amount of workload released at time slot t. Since the deadline D is uniform for all jobs, the total amount of work L_t must be executed by the end of time slot t + D. Since L_t varies over time, we often refer to it as the workload curve. We consider the data center as a collection of homogeneous servers. The total number of servers M is fixed and given, but each server can be turned on/off to execute the workload. We normalize L_t by the processing capability of each server, i.e., L_t denotes the number of servers required to execute the workload at time t. We assume L_t ≤ M for all t.
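A minimal sketch of the uniform-deadline model above: work released at slot t must finish by slot t + D, which is conveniently expressed as a "deadline curve" l_t = L_{t−D}. The numbers below are illustrative only, not from the paper's traces.

```python
# Sketch: with uniform deadline D, workload L_t released at slot t must be
# finished by slot t + D. The deadline curve l_t = L_{t-D} (0 for the first
# D slots) lower-bounds the cumulative execution. Numbers are illustrative.

def deadline_curve(L, D):
    """Return l_t = L_{t-D}, with l_t = 0 for t <= D, per the model."""
    return [0] * min(D, len(L)) + L[: max(0, len(L) - D)]

L = [3, 5, 2, 4, 1, 0, 6]   # hypothetical normalized workload per slot
D = 2
l = deadline_curve(L, D)
assert l == [0, 0, 3, 5, 2, 4, 1]

# Any cumulative schedule X must then satisfy, for every t,
# sum(l[:t]) <= sum(X[:t]) <= sum(L[:t])  (constraints (C1) and (C2) below).
```

The same helper also produces the δ-delayed curve used later by VFW(δ), by passing δ instead of D.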
Let x_{i,d,t} be the portion of the released workload L_t that is assigned for execution on server i at time slot t + d, where d represents the deferral with 0 ≤ d ≤ D. Let m_t be the number of active servers during time slot t. Then Σ_{i=1}^{m_t} Σ_{d=0}^{D} x_{i,d,t} = L_t and 0 ≤ x_{i,d,t} ≤ 1. Let x_{i,t} be the total workload assigned at time t to server i, and x_t the total assignment at time t. We can then think of x_{i,t} as the utilization of the i-th server at time t, i.e., 0 ≤ x_{i,t} ≤ 1. Thus Σ_{d=0}^{D} x_{i,d,t−d} = x_{i,t} and Σ_{i=1}^{m_t} x_{i,t} = x_t. From the data center's perspective, we focus on two important decisions in each time slot t: (i) determining m_t, the number of active servers, and (ii) determining x_{i,d,t}, the assignment of workload to the servers.

C. Cost Model

The goal of this paper is to minimize the cost (price) of energy consumption in data centers. The energy cost function consists of two parts: operating cost and switching cost. Operating cost is the cost of executing the workload, which in our model is proportional to the assigned workload. We use the common energy cost model for typical servers, an affine function C(x) = e_0 + e_1 x, where e_0 and e_1 are constants (e.g., see [15]) and x is the assigned workload (utilization) of a server in a time slot. Although we use this general cost model, other models involving nonlinear parameters such as temperature or frequency can easily be adopted, which turns the problem into a nonlinear optimization. Our algorithms can still be applied to such nonlinear models by using nonlinear optimization techniques, since each optimization is a single independent step in the algorithms. The switching cost β is the cost incurred for changing the state (on/off) of a server; we account for both turning a server on and turning it off. The switching cost at time t is defined as S_t = β |m_t − m_{t−1}|, where β is a constant (e.g., see [7], [16]).
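The cost model can be sketched in a few lines. Note that m_t · C(x_t/m_t) = e_0·m_t + e_1·x_t, so the operating cost is linear in the capacity and the assignment. The constants below are illustrative placeholders, not values from the paper.

```python
# Sketch of the cost model: operating cost C(x) = e0 + e1*x per active server
# per slot (so m_t * C(x_t/m_t) = e0*m_t + e1*x_t), plus switching cost
# beta*|m_t - m_{t-1}| with m_0 = 0. Constants are illustrative.

def total_cost(m, x, e0=1.0, e1=1.0, beta=6.0):
    """m[t]: active servers at slot t; x[t]: total assigned workload at t."""
    operating = sum(e0 * m_t + e1 * x_t for m_t, x_t in zip(m, x))
    prev, switching = 0, 0.0          # servers assumed off before t = 1
    for m_t in m:
        switching += beta * abs(m_t - prev)
        prev = m_t
    return operating + switching

# Keeping capacity flat can beat 'follow the workload' when beta is large,
# even though some capacity idles:
assert total_cost([4, 4, 4], [4, 2, 4]) < total_cost([4, 2, 4], [4, 2, 4])
```

This trade-off between idle operating cost and switching cost is exactly what the optimizations below balance.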
D. Optimization Problem

Given the models above, the goal of the data center is to choose the number of active servers (capacity) m_t and the dispatching rule x_{i,d,t} to minimize the total cost during [1, T], which is captured by the following optimization:

min_{x,m}  Σ_{t=1}^{T} Σ_{i=1}^{m_t} C(x_{i,t}) + β Σ_{t=1}^{T} |m_t − m_{t−1}|    (1)

subject to
  Σ_{i=1}^{m_t} Σ_{d=0}^{D} x_{i,d,t} = L_t        ∀t
  Σ_{i=1}^{m_t} Σ_{d=0}^{D} x_{i,d,t−d} ≤ m_t      ∀t
  Σ_{d=0}^{D} x_{i,d,t−d} ≤ 1                      ∀i, ∀t
  0 ≤ m_t ≤ M                                      ∀t
  x_{i,d,t} ≥ 0                                    ∀i, ∀d, ∀t.

Since the servers are identical, we can simplify the problem by dropping the index i from x. More specifically, from any feasible solution x_{i,d,t} we can construct another solution x̄_{d,t} = Σ_{i=1}^{m_t} x_{i,d,t} / m_t (i.e., replacing every x_{i,d,t} by the average over all i) without changing the value of the objective function, while all constraints remain satisfied after this conversion. We then have the following optimization, equivalent to (1):

min_{x,m}  Σ_{t=1}^{T} m_t C(x_t/m_t) + β Σ_{t=1}^{T} |m_t − m_{t−1}|    (2)

subject to
  Σ_{d=0}^{D} x_{d,t} = L_t        ∀t
  Σ_{d=0}^{D} x_{d,t−d} ≤ m_t      ∀t
  0 ≤ m_t ≤ M                      ∀t
  x_{d,t} ≥ 0                      ∀d, ∀t,

where x_{d,t} represents the portion of the workload L_t to be executed on a server at time t + d. We simplify the problem further by showing that any optimal assignment for (2) can be converted to an equivalent assignment that follows the earliest-deadline-first (EDF) policy (see Figure 2). More formally, we have the following lemma:

Lemma 1: Let x*_{t_r} and x*_{t_s} be the optimal assignments of workload obtained from the solution of optimization (2) at times t_r and t_s respectively, where t_s > t_r and t_s − t_r = θ < D. If there exists δ with Σ_{d=0}^{δ−1} x*_{d,t_r−d} ≠ 0 and Σ_{d=θ+δ+1}^{D} x*_{d,t_s−d} ≠ 0 for any 0 < δ < (D − θ), then we can obtain alternative assignments with x̃_{t_r} = x*_{t_r} and x̃_{t_s} = x*_{t_s} such that Σ_{d=0}^{δ−1} x̃_{d,t_r−d} = 0 and Σ_{d=θ+δ+1}^{D} x̃_{d,t_s−d} = 0.
Proof: We prove the lemma by constructing x̃_{t_r} and x̃_{t_s} from x*_{t_r} and x*_{t_s}. We change the assignments x*_{d,t_r} for 0 ≤ d ≤ (D − θ) and x*_{d,t_s} for θ ≤ d ≤ D to obtain x̃_{t_r} and x̃_{t_s}, as illustrated in Figure 2. We now determine δ.

Fig. 2. Assignments can be determined from their release times and the EDF policy.

Note that all the workload released between (and including) time slots t_s − D and t_r can be executed at time t_r without violating any deadline, since t_r − D < t_s − D < t_r − δ < t_r. Likewise, all the workload released between (and including) time slots t_s − D and t_r can be executed at time t_s without violating any deadline, since t_s − D < t_r − δ < t_r < t_s. Hence the new assignment of workload cannot violate any deadline. We choose δ at a point where Σ_{d=δ+1}^{D−θ} x̃_{d,t_r−d} = Σ_{d=δ+1}^{D−θ} x*_{d,t_r−d} + Σ_{d=θ+δ+1}^{D} x*_{d,t_s−d}, with Σ_{d=0}^{δ−1} x̃_{d,t_r−d} = 0 and x̃_{δ,t_r−δ} = Σ_{d=0}^{D−θ} x*_{d,t_r} − Σ_{d=δ+1}^{D−θ} x̃_{d,t_r−d}, so that x̃_{t_r} = x*_{t_r}. Similarly, for x̃_{t_s} the new assignment satisfies Σ_{d=θ}^{θ+δ−1} x̃_{d,t_s−d} = Σ_{d=0}^{δ−1} x*_{d,t_r−d} + Σ_{d=θ}^{θ+δ−1} x*_{d,t_s−d}, with Σ_{d=θ+δ+1}^{D} x̃_{d,t_s−d} = 0 and x̃_{θ+δ,t_s−θ−δ} = Σ_{d=θ}^{D} x*_{d,t_s} − Σ_{d=θ}^{θ+δ−1} x̃_{d,t_s−d}, so that x̃_{t_s} = x*_{t_s}.

By Lemma 1, we do not need both t and d as indices of x: we can use the release time t to determine the deadline t + D and differentiate between jobs by their deadlines. We therefore drop the index d of x. At time t, unassigned workload from L_{t−D} to L_t is executed according to the EDF policy while minimizing the objective function.

Fig. 3. Illustration of (a) the offline optimal solution and (b) VFW(δ) for an arbitrary, randomly generated workload; time slot length = 2 min, D = 15, δ = 10.

To formulate the constraint that no assignment violates any deadline, we define the delayed workload l_t with maximum deadline D:
l_t = 0 if t ≤ D, and l_t = L_{t−D} otherwise.

We call the delayed curve l_t the deadline curve of the workload. We then have two fundamental constraints on the assignment of workload, for all t:

(C1) Deadline constraint: Σ_{j=1}^{t} l_j ≤ Σ_{j=1}^{t} x_j
(C2) Release constraint: Σ_{j=1}^{t} x_j ≤ Σ_{j=1}^{t} L_j

Constraint (C1) says that the workload assigned up to time t cannot violate any deadline, and constraint (C2) says that the workload assigned up to time t cannot exceed the total workload released up to time t. Using these constraints, we reformulate optimization (2) as follows:

min_{x,m}  Σ_{t=1}^{T} m_t C(x_t/m_t) + β Σ_{t=1}^{T} |m_t − m_{t−1}|    (3)

subject to
  Σ_{j=1}^{t} l_j ≤ Σ_{j=1}^{t} x_j ≤ Σ_{j=1}^{t} L_j    ∀t
  Σ_{j=1}^{T} x_j = Σ_{j=1}^{T} L_j
  0 ≤ x_t ≤ m_t ≤ M                                      ∀t

Since the operating cost function C(·) is affine, the objective function is linear, as are the constraints; hence optimization (3) is a linear program. Note that the capacity m_t in this formulation is not constrained to be an integer. This is acceptable because data centers consist of thousands of active servers, and we can round the resulting solution with minimal increase in cost. Figure 3(a) illustrates the offline optimal solutions for x_t and m_t on a randomly generated dynamic workload. The performance of the optimal offline solution on two realistic workloads is presented in Section VI.

III. VALLEY FILLING WITH WORKLOAD

In this section, we consider the online case, in which at any time t we have no information about the future workload L_{t'} for t' > t. At each time t, we determine x_t and m_t by optimizing over the already released, unassigned workload whose deadlines fall within the next D slots. Note that workload released at or before t cannot be delayed for assignment after time slot t + D; hence we never optimize over more than D + 1 slots.
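The feasible region of the offline LP (3) is simple enough to check directly: the cumulative assignment must stay between the deadline curve and the release curve, everything must be executed by T, and 0 ≤ x_t ≤ m_t ≤ M. A pure-Python feasibility checker, on hypothetical inputs:

```python
# Feasibility checker for the constraints of the offline LP (3).
# Inputs are hypothetical; this is a sketch, not the paper's solver.

def feasible(L, x, m, D, M):
    T = len(L)
    l = [0] * min(D, T) + L[: max(0, T - D)]      # deadline curve l_t = L_{t-D}
    cl = cx = cL = 0.0
    for t in range(T):
        cl, cx, cL = cl + l[t], cx + x[t], cL + L[t]
        if not (cl <= cx <= cL):                   # constraints (C1) and (C2)
            return False
        if not (0 <= x[t] <= m[t] <= M):           # capacity bounds
            return False
    return abs(cx - cL) < 1e-9                     # all workload executed by T

L = [3, 5, 2, 4]
assert feasible(L, [3, 5, 2, 4], [3, 5, 2, 4], D=1, M=10)      # follow the workload
assert feasible(L, [3, 4, 3, 4], [4, 4, 4, 4], D=1, M=10)      # defer 1 unit one slot
assert not feasible(L, [2, 4, 4, 4], [4, 4, 4, 4], D=0, M=10)  # violates a deadline
```

The second schedule defers one unit of work by one slot and keeps capacity flat, which is exactly the kind of trade the LP exploits when β is large.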
We simplify the online optimization by solving only for m_t and setting x_t = m_t at time t. This prevents the online algorithm from wasting any execution capacity that could not otherwise be used later for executing workload. The switching cost of the online algorithm may, however, be higher than that of the offline algorithm, so our goal is to design strategies that reduce the switching cost. In the online algorithm, we reduce the switching cost by optimizing the total cost over the interval [t, t + D]. When the deadline is uniform, we can reduce the switching cost even further by looking beyond D slots: we accumulate some workload during periods of high load and execute it later, in valleys, without violating constraints (C1) and (C2). Note that accumulation does not violate any deadline, because at each slot we execute a portion of the accumulated workload, swapping it with newly released workload according to the EDF policy. To determine the amounts of accumulation and execution, we use the 'δ-delayed workload'. The online algorithm, named Valley Filling with Workload (VFW(δ)), thus looks ahead δ slots to determine the amount of execution. Let l^δ_t be the δ-delayed curve with a delay of δ slots, for 0 < δ < D:

l^δ_t = 0 if t ≤ δ, and l^δ_t = L_{t−δ} otherwise.

Accordingly, we can regard the deadline curve as the D-delayed curve and denote it by l^D_t. We determine the amounts of accumulation and execution by controlling the set of feasible choices for m_t in the optimization. For this purpose, we use the δ-delayed curve to restrict the amount of accumulation. By placing a lower bound on m_t in the valley (low workload) and an upper bound on it during high workload, we control execution in the valley and accumulation in the other parts of the curve. The online algorithm therefore uses two types of optimization: local optimization and valley optimization.
Local optimization smooths the 'wrinkles' (small variations in the workload across adjacent slots; see, e.g., Figure 4) within D consecutive slots and accumulates some workload. Valley optimization, on the other hand, fills the valleys with the accumulated workload.

A. Local Optimization

The local optimization optimizes over the next D slots and finds the optimal capacity for the current slot while executing no more than the δ-delayed workload. Let t be the current time slot. At this slot, we apply a slightly modified version of the offline optimization (3) over the interval [t, t + D]. We apply the following optimization, LOPT(l_t, l^δ_t, m_{t−1}, M), to determine m_t, smoothing the wrinkles by optimizing over D consecutive slots while restricting the amount of execution to no more than the δ-delayed workload and satisfying the deadline constraint (C1).

Fig. 4. The curves L_t and l^δ_t and their intersection points. The peak of the l^δ_t curve is cut and used to fill the valley of the same curve. The amount of workload that is accumulated/delayed is bounded by m_t D.

min_{m}  (e_0 + e_1) Σ_{j=t}^{t+D} m_j + β Σ_{j=t}^{t+D} |m_j − m_{j−1}|    (4)

subject to
  Σ_{j=1}^{t} l^D_j ≤ Σ_{j=1}^{t} m_j
  Σ_{j=1}^{t+D} m_j = Σ_{j=1}^{t} l^δ_j
  0 ≤ m_k ≤ M    for t ≤ k ≤ t + D

After solving the local optimization, we obtain the value of m_t for the current time slot and assign x_t = m_t. For the next time slot t + 1, we solve the local optimization again to find x_{t+1} and m_{t+1}. Note that the deadline constraint (C1) and the release constraint (C2) are satisfied at time t, since by the formulation Σ_{j=1}^{t} l^D_j ≤ Σ_{j=1}^{t} m_j ≤ Σ_{j=1}^{t} l^δ_j ≤ Σ_{j=1}^{t} L_j.

B. Valley Optimization

In valley optimization, the workload accumulated by the local optimization is executed in 'global valleys'. Before giving the formulation for the valley optimization, we need to detect a valley.
Let p_1, p_2, ..., p_n be the sequence of intersection points of the curves L_t and l^δ_t (see Figure 4), in nondecreasing order of their t values. Let p'_1, p'_2, ..., p'_n be the sequence of points on l^δ_t obtained by adding the delay δ to each intersection point p_1, p_2, ..., p_n, so that t'_s = t_s + δ for all 1 ≤ s ≤ n. We discard from the sequence any intersection points between p_s and p'_s such that t_{s+1} ≥ t'_s. Note that at each intersection point p_s, the curve from p_s to p'_s is known. To determine whether the curve l^δ_t between p_s and p'_s is a valley, we calculate the area

A = Σ_{t=t_s}^{t'_s} (l^δ_t − l^δ_{t_s}).

If A is negative, we regard the curve between p_s and p'_s as a global valley, though it may contain several small peaks and valleys. If the curve between p_s and p'_s is a global valley, we fill it with some (possibly all) of the accumulated workload by executing more than the δ-delayed workload while satisfying the release constraint (C2). For each t with t_s ≤ t ≤ t'_s, we apply the following optimization, VOPT(l_t, L_t, m_{t−1}, M), over the interval [t, t + D] to find the value of m_t.

min_{m}  (e_0 + e_1) Σ_{j=t}^{t+D} m_j + β Σ_{j=t}^{t+D} |m_j − m_{j−1}|    (5)

subject to
  Σ_{j=1}^{t} l^D_j ≤ Σ_{j=1}^{t} m_j
  Σ_{j=1}^{t+D} m_j = Σ_{j=1}^{t} L_j
  0 ≤ m_k ≤ M    for t ≤ k ≤ t + D

Note that the deadline constraint (C1) and the release constraint (C2) are satisfied at time t, since Σ_{j=1}^{t} l^D_j ≤ Σ_{j=1}^{t} m_j ≤ Σ_{j=1}^{t} L_j. We apply the valley optimization (5) for each t_s ≤ t ≤ t'_s, and the local optimization (4) for every time slot t ∈ {[1, T − D − 1] − [t_s, t'_s]} over all s. For each t ∈ [T − D, T], we apply the valley optimization (5) to the global valley in the interval [t, T] in order to execute all the accumulated workload. Algorithm 1 summarizes the VFW(δ) procedure.
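The valley test above reduces to a signed-area computation. A sketch on illustrative data:

```python
# Sketch of the valley test in VFW(delta): between an intersection point p_s
# (at slot t_s) and p'_s (at slot t'_s = t_s + delta), the delayed curve
# l^delta forms a "global valley" iff the signed area relative to the level
# l^delta_{t_s} is negative. Data are illustrative.

def is_valley(l_delta, t_s, t_s_prime):
    area = sum(l_delta[t] - l_delta[t_s] for t in range(t_s, t_s_prime + 1))
    return area < 0

assert is_valley([5, 5, 2, 1, 2, 5, 5], 1, 5)        # a dip: area < 0
assert not is_valley([5, 5, 8, 9, 8, 5, 5], 1, 5)    # a bump: area >= 0
```

Per the definition, a global valley may still contain small interior peaks; only the aggregate area matters.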
For each new time slot t, Algorithm 1 detects a valley by checking whether the curves l^δ_t and L_t intersect. If t is inside a valley, Algorithm 1 applies the valley optimization (VOPT); otherwise it applies the local optimization (LOPT). Figure 3(b) illustrates the nature of the solutions for x_t and m_t produced by VFW(δ). Note that δ is a parameter of the online algorithm VFW(δ).

Algorithm 1 VFW(δ)
1: valley ← 0; m_0 ← 0
2: l^D[1 : D] ← 0; l^δ[1 : δ] ← 0
3: for each new time slot t do
4:   l^D[t + D] ← L[t]
5:   l^δ[t + δ] ← L[t]
6:   if valley = 0 and l^δ intersects L then
7:     calculate area A = Σ_{t=t_s}^{t'_s} (l^δ_t − l^δ_{t_s})
8:     if A < 0 then
9:       valley ← 1
10:    end if
11:  else if valley > 0 and valley ≤ δ then
12:    valley ← valley + 1
13:  else
14:    valley ← 0
15:  end if
16:  if valley = 0 then
17:    m[t : t + D] ← LOPT(l[1 : t], l^δ[1 : t], m_{t−1}, M)
18:  else
19:    m[t : t + D] ← VOPT(l[1 : t], L[1 : t], m_{t−1}, M)
20:  end if
21:  x_t ← m_t
22: end for

C. Analysis of the Algorithm

We first prove the feasibility of the solutions produced by the VFW(δ) algorithm and then analyze its competitive ratio with respect to the offline formulation (3). First, we have the following theorem about feasibility.

Theorem 2: The VFW(δ) algorithm gives a feasible solution for any 0 < δ < D.

Proof: We prove the theorem inductively, by showing that any feasible choice of m_t from an optimization applied over the interval [t, t + D] does not cause infeasibility in the optimization applied over the next interval [t + 1, t + D + 1]. Initially, the optimization in VFW(δ) is applied over the interval [1, D + 1] with Σ_{j=1}^{k} l^D_j = 0 for 1 ≤ k ≤ D. Hence the optimization applied over [1, D + 1] gives a feasible m_1, because Σ_{j=1}^{k} l^D_j ≤ Σ_{j=1}^{k} l^δ_j ≤ Σ_{j=1}^{k} L_j for 1 ≤ k ≤ D. Now suppose VFW(δ) gives a feasible m_t over an interval [t, t + D].
We must show that a feasible choice of m_{t+1} exists for the optimization applied over [t + 1, t + D + 1]. The deadline constraint (C1) and the release constraint (C2) are satisfied for m_t; hence Σ_{j=1}^{t} l^D_j ≤ Σ_{j=1}^{t} l^δ_j ≤ Σ_{j=1}^{t} L_j. Since 0 < δ < D, we have Σ_{j=1}^{t} l^D_j ≤ Σ_{j=1}^{t+1} l^D_j ≤ Σ_{j=1}^{t} l^δ_j ≤ Σ_{j=1}^{t+1} l^δ_j ≤ Σ_{j=1}^{t} L_j ≤ Σ_{j=1}^{t+1} L_j. Thus, for any feasible choice of m_t, we can always obtain a feasible solution for m_{t+1} such that the above inequality holds.

We now analyze the competitive ratio of the online algorithm with respect to the offline formulation (3). We denote the operating cost of the solution vectors X = (x_1, x_2, ..., x_T) and M = (m_1, m_2, ..., m_T) by cost_o(X, M) = Σ_{t=1}^{T} m_t C(x_t/m_t), the switching cost by cost_s(X, M) = β Σ_{t=1}^{T} |m_t − m_{t−1}|, and the total cost by cost(X, M) = cost_o(X, M) + cost_s(X, M). We have the following lemma.

Lemma 3: cost_s(X, M) ≤ 2β Σ_{t=1}^{T} m_t.

Proof: The switching cost at time t is S_t = β|m_t − m_{t−1}| ≤ β(m_t + m_{t−1}), since m_t ≥ 0. Then cost_s(X, M) ≤ β Σ_{t=1}^{T} (m_t + m_{t−1}) ≤ 2β Σ_{t=1}^{T} m_t, where m_0 = 0.

Let X* and M* be the offline solution vectors from optimization (3). The following theorem shows that the competitive ratio of the VFW(δ) algorithm is bounded by a constant with respect to the offline formulation (3).

Theorem 4: cost(X, M) ≤ ((e_0 + e_1 + 2β)/(e_0 + e_1)) · cost(X*, M*).

Proof: Since the offline optimization assigns all the workload in the interval [1, T], Σ_{t=1}^{T} x*_t = Σ_{t=1}^{T} L_t ≤ Σ_{t=1}^{T} m*_t, where we used x*_t ≤ m*_t for all t. Hence cost(X*, M*) ≥ cost_o(X*, M*) = Σ_{t=1}^{T} m*_t C(x*_t/m*_t) = Σ_{t=1}^{T} (e_0 m*_t + e_1 x*_t) ≥ Σ_{t=1}^{T} (e_0 + e_1) L_t. In the online algorithm, we set x_t = m_t and Σ_{j=1}^{t} m_j ≤ Σ_{j=1}^{t} L_j for all t ∈ [1, T].
Hence, by Lemma 3, cost(X, M) = cost_o(X, M) + cost_s(X, M) ≤ Σ_{t=1}^{T} (e_0 + e_1) m_t + 2β Σ_{t=1}^{T} m_t ≤ (e_0 + e_1) Σ_{t=1}^{T} L_t + 2β Σ_{t=1}^{T} L_t = (e_0 + e_1 + 2β) Σ_{t=1}^{T} L_t.

IV. GENERALIZED CAPACITY PROVISIONING

We now consider the general case, in which the deadline requirements are not the same for all jobs in a workload. Let ν be the maximum possible deadline. We decompose the workload according to the associated deadlines: let L_{d,t} ≥ 0 be the portion of the workload released at time t that has deadline d, for 0 ≤ d ≤ ν, so that Σ_{d=0}^{ν} L_{d,t} = L_t. The workload to be executed at any time slot t can come from different previous slots t − d, where 0 ≤ d ≤ ν, as illustrated in Figure 5(a). Hence we redefine the deadline curve l_t and denote it by l'_t. Assuming L_{d,t} = 0 for t ≤ 0, we define l'_t = Σ_{d=0}^{ν} L_{d,(t−d)}.

Fig. 5. Illustration of workload with different deadline requirements: (a) workload released at different times has different deadlines, (b) the delayed workload l'_t may increase the switching cost due to its large variation, (c) distribution of workload across adjacent slots by GCP to reduce the variation in workload.

The offline formulation then remains the same as formulation (3), with the deadline curve l_t replaced by l'_t:

min_{x,m}  Σ_{t=1}^{T} m_t C(x_t/m_t) + β Σ_{t=1}^{T} |m_t − m_{t−1}|    (6)

subject to
  Σ_{j=1}^{t} l'_j ≤ Σ_{j=1}^{t} x_j ≤ Σ_{j=1}^{t} L_j    ∀t
  Σ_{j=1}^{T} x_j = Σ_{j=1}^{T} L_j
  0 ≤ x_t ≤ m_t ≤ M                                       ∀t

We now consider the online case. Delaying the workload up to its maximum deadline may increase the switching cost, since it may increase the variation in the workload relative to the original workload (see Figure 5(b)). Hence, at each time we need to determine the optimal assignment and capacity that reduce the switching cost relative to the original workload while satisfying each individual deadline.
We could apply the VFW(δ) algorithm from the previous section with D = D_min, where D_min is the minimum deadline over the workload. If D_min is small, however, VFW(δ) does not work well, because δ < D_min becomes too small to detect a valley. We therefore use a novel approach that distributes the workload L_t over the D_t slots such that the change in capacity between adjacent time slots is minimal (see Figure 5(c)). We call this the Generalized Capacity Provisioning (GCP) algorithm.

In the GCP algorithm, we apply an optimization to determine m_t at each time slot t and set x_t = m_t. The optimization is applied over the interval [t, t + ν], since at time slot t we can have workload with deadlines up to t + ν slots away. Hence, at each time t, the released workload is a vector of dimension ν + 1: L_t = (L_{0,t}, L_{1,t}, ..., L_{ν,t}), where L_{d,t} = 0 if there is no workload with deadline d at time t. Let y_t be the vector of unassigned workload released up to time t. The vector y_t is obtained from y_{t−1} at each time slot by subtracting the capacity m_{t−1} and then adding L_t. Note that m_{t−1} is subtracted from the vector y_{t−1} so that unused capacity executes already released workload at time t − 1, following the EDF policy (see lines 4-17 in Algorithm 2). Let y'_{t−1} = (y'_{0,t−1}, y'_{1,t−1}, y'_{2,t−1}, ..., y'_{ν,t−1}) be the vector after subtracting m_{t−1}, with y'_{0,t−1} = 0 and y'_{j,t−1} ≥ 0 for 1 ≤ j ≤ ν. Then y_t = L_t + (y'_{1,t−1}, y'_{2,t−1}, ..., y'_{ν,t−1}, 0), where y_t = (0, 0, ..., 0) for t ≤ 0.
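The per-slot state update just described (the EDF drain of lines 4-17 in Algorithm 2, followed by the shift and the addition of L_t) can be sketched as follows; the vectors are illustrative.

```python
# Sketch of the per-slot state update in GCP: the unused capacity m_{t-1}
# consumes unassigned workload in EDF order (earliest deadline first), then
# the remainder shifts one slot (every deadline gets one slot closer) and the
# newly released vector L_t is added. Values are illustrative.

def gcp_update(y, m_prev, L_t):
    """y[d]: unassigned work due in d slots; returns y for the next slot."""
    uc = m_prev
    y_serve = []
    for d in range(len(y)):          # EDF: drain the earliest deadlines first
        served = min(uc, y[d])
        y_serve.append(y[d] - served)
        uc -= served
    shifted = y_serve[1:] + [0]      # deadlines advance by one slot
    return [a + b for a, b in zip(shifted, L_t)]

y = [2, 0, 3]                        # nu = 2: work due in 0, 1, 2 slots
assert gcp_update(y, m_prev=4, L_t=[1, 0, 2]) == [1, 1, 2]
```

Here capacity 4 clears the 2 units due now and 2 of the 3 units due in two slots; the leftover unit shifts to "due in 1 slot" before the new releases are added.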
Then the optimization GCP-OPT(y_t, m_{t−1}, M), applied at each t over the interval [t, t + ν], is as follows:

$$\min_{m_t}\ (e_0 + e_1) \sum_{j=t}^{t+\nu} m_j + \beta \sum_{j=t}^{t+\nu} |m_j - m_{j-1}| \qquad (7a)$$

subject to

$$\sum_{j=0}^{\nu} m_{t+j} = \sum_{j=0}^{\nu} y_{j,t}, \qquad (7b)$$
$$\sum_{k=0}^{j} m_{t+k} \ge \sum_{k=0}^{j} y_{k,t}, \quad 0 \le j \le \nu - 1, \qquad (7c)$$
$$0 \le m_{t+j} \le M, \quad 0 \le j \le \nu. \qquad (7d)$$

Algorithm 2 GCP
 1: y[0:ν] ← 0
 2: m_0 ← 0
 3: for each new time slot t do
 4:   uc ← m_{t−1}   {uc represents the unused capacity}
 5:   for i = 0 to ν do
 6:     if uc ≤ 0 then
 7:       y′[i] ← y[i]
 8:     else
 9:       uc ← uc − y[i]
10:       if uc ≤ 0 then
11:         y′[i] ← −uc
12:       else
13:         y′[i] ← 0
14:       end if
15:     end if
16:   end for
17:   y[0:ν] ← {y′[1:ν], 0} + L_t[0:ν]
18:   m[t:t+ν] ← GCP-OPT(y[0:ν], m_{t−1}, M)
19:   x_t ← m_t
20: end for

Note that optimization (7) solves for ν + 1 values; we use only m_t as the capacity and workload assignment at time t. Algorithm 2 summarizes the GCP procedure. The GCP algorithm gives feasible solutions because it works with the unassigned workload: constraint (7c) ensures the deadline constraint (C1), and constraint (7b) ensures the release constraint (C2). The competitive ratio for GCP is the same as that for VFW(δ), because in GCP m_t = x_t and the release constraint (C2) holds at every t, making Σ_{t=1}^{T} m_t = Σ_{t=1}^{T} x_t ≤ Σ_{t=1}^{T} L_t.

V. EXTENSION FOR LONG JOBS

In this section, we extend our model to 'long jobs', i.e., jobs whose length is greater than the time slot length τ and whose given deadline requirement is greater than their length. To extend the model for long jobs, we estimate the length of each long job, decompose it into small pieces (≤ τ), and assign a deadline to each piece.
A. Estimation of Execution Time

If all the jobs are short interactive jobs with execution time less than the time-slot length, we do not need to estimate job completion times. But for a mix of short and long jobs, we need estimation. Since our target workload is MapReduce, we present a method for estimating the execution time of a MapReduce job; the performance model is illustrated in Figure 6. A MapReduce job J is defined in [18] as a 7-tuple (S, S′, S″, X, Y, f(x), g(x)), where S is the size of the map input data; S′ is the size of the intermediate shuffle data; S″ is the size of the reduce output data; X is the number of mappers that J is divided into; Y is the number of reducers assigned to J; f(x) is the running time of a mapper with input size x; and g(x) is the running time of a reducer with input size x.

Fig. 6. MapReduce performance model.

We compute the number of map and reduce tasks by dividing the input size S and the output size S″ by the HDFS (Hadoop Distributed File System) block size, respectively. Let V_i, V_o and V_n be the data read rate, data output rate, and network transfer rate, respectively. Then the execution times of the map, shuffle and reduce tasks can be estimated by the following equations [18]:

$$T_m = \frac{S}{X V_i} + f\left(\frac{S}{X}\right) + \frac{S'}{X V_o}, \qquad T_s = \frac{S'}{X Y V_n}, \qquad T_r = g\left(\frac{S'}{Y}\right) + \frac{S''}{Y V_o}.$$

The map and reduce tasks run in parallel and may need several rounds to finish if the total number of available task slots is less than the number of mappers or reducers. If at most M mappers and R reducers are completed in each round, then the numbers of rounds of map and reduce tasks are λ_m = ⌈X/M⌉ and λ_r = ⌈Y/R⌉, respectively. Based on whether the reducers have to wait for data transfer to complete, we have two cases: if T_m ≥ M T_s, reducers have to wait for the data transfers to complete; if T_m < M T_s, reducers do not need to wait for the completion of mappers except during the first round.
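The estimates above, combined according to the two cases just described, can be transcribed directly. This is a sketch: the function name `estimate_mapreduce_time` and its argument names are ours, and f and g are passed in as the per-task running-time functions.

```python
from math import ceil

def estimate_mapreduce_time(S, S1, S2, X, Y, M, R, f, g,
                            Vi=100.0, Vo=100.0, Vn=10.0):
    """Estimate MapReduce job time from the model of [18].

    S, S1, S2: map input / shuffle / reduce output sizes (MB);
    X, Y: numbers of mappers and reducers; M, R: task slots per round;
    f, g: mapper/reducer running-time functions; V*: data rates (MB/s)."""
    Tm = S / (X * Vi) + f(S / X) + S1 / (X * Vo)   # one map round
    Ts = S1 / (X * Y * Vn)                         # one shuffle transfer
    Tr = g(S1 / Y) + S2 / (Y * Vo)                 # one reduce round
    lam_m, lam_r = ceil(X / M), ceil(Y / R)        # rounds needed
    if Tm < M * Ts:
        # Reducers wait for mappers only during the first round.
        return Tm + lam_r * (X * Ts + Tr)
    # Reducers wait for all map-side transfers to complete.
    return lam_m * Tm + M * Ts + lam_r * (X * Ts + Tr)
```

With the linear f(x) = 0.8x and g(x) = 0.9x and the data rates used later in the simulation, a 1280 MB job split across 10 mappers and 1 reducer falls into the second case and the rounds collapse to λ_m = λ_r = 1.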
Then the estimated execution time for job J is

$$T_J = \begin{cases} T_m + \lambda_r (X T_s + T_r), & T_m < M T_s \\ \lambda_m T_m + M T_s + \lambda_r (X T_s + T_r), & T_m \ge M T_s \end{cases}$$

B. Decomposition and Deadline Assignment

Long jobs can be preemptive or non-preemptive; for these two types we have two ways to decompose them and assign deadlines.

Preemptive jobs: Let J be a preemptive job with execution time T_J (> τ), release time t_J and deadline D (in number of slots). Then the length of job J is ℓ = ⌈T_J/τ⌉ time slots. Since the job is preemptive, we can safely decompose it into small pieces J_1, J_2, …, J_ℓ with T_{J_i} ≤ τ for 1 ≤ i ≤ ℓ. We assign each piece the deadline ⌊D/ℓ⌋ − 1 (see Figure 7(a)). We set the release time of the first piece t_{J_1} = t_J and the release times of the other pieces t_{J_i} = t_{J_{i−1}} + ⌊D/ℓ⌋ for 1 < i ≤ ℓ. Since a piece J_i is released after its preceding piece J_{i−1}, this technique satisfies any precedence constraint the job may have.

Fig. 7. Release times and deadlines for pieces of long jobs: (a) preemptive jobs, (b) non-preemptive jobs.

Non-preemptive jobs: To make scheduling decisions for a job J, we consider it to consist of small pieces J_1, J_2, …, J_ℓ, where T_{J_i} = τ for 1 ≤ i < ℓ and T_{J_ℓ} ≤ τ. We schedule the first piece J_1 with deadline D − ℓ and suppose that our algorithm schedules it at time t′_{J_1} (see Figure 7(b)). We then set the release time of each remaining piece J_i, 1 < i ≤ ℓ, to t_{J_i} = t′_{J_{i−1}} + 1 and assign it deadline D = 0. Since the deadline is zero for the later pieces and each is released as soon as the previous piece finishes, the algorithm ensures enough capacity to execute the job without preemption.

VI. SIMULATION

In this section, we evaluate the cost incurred by the VFW(δ) and GCP algorithms relative to the optimal solution, in the context of workload generated from realistic data.
First, we motivate our evaluation by a detailed analysis of simulation results. Then, in Section VII, we validate the simulation results by performing experiments on a Hadoop cluster.

A. Simulation Setup

We use realistic parameters in the simulation setup and provide conservative estimates of the cost savings resulting from our proposed VFW(δ) and GCP algorithms.

Cost benchmark: Data centers currently do not typically use dynamic capacity provisioning based on workload variation [7]. A naive approach to capacity provisioning is to follow the workload curve and determine the capacity and the assignment of workload accordingly. This is clearly not a good approach, because it does not take the switching cost into account; yet it is a very conservative benchmark, as it wastes no execution capacity and meets every deadline. We compare the total cost of the VFW(δ) and GCP algorithms with this 'follow the workload' (x = m = L) strategy and evaluate the cost reduction.

Cost function parameters: The total cost is characterized by e_0 and e_1 for the operating cost and β for the switching cost. In the operating cost, e_0 represents the fixed cost and e_1 the load-dependent energy consumption. The energy consumption of current servers is dominated by the fixed cost [17]; we therefore choose e_0 = 1 and e_1 = 0. The switching cost parameter β represents the wear-and-tear due to changing power states in the servers. We choose β = 12 for a slot length of 5 minutes, so that it estimates the time a server should be powered down (typically one hour [7], [16]) for the saved operating cost to outweigh the switching cost.

Fig. 8. Illustration of the MapReduce traces used as dynamic workload in the experiments, with active jobs shown using job-length estimation: (a) Workload A, (b) Workload B.
Workload description: We use two publicly available MapReduce traces as examples of dynamic workload. The traces, released by Chen et al. [11], are produced from real Facebook traces covering one day (24 hours) on a cluster of 600 machines. We count the number of job submissions of each type over time slots of 5 minutes to obtain the released-workload curve (Figure 8). We then use the length estimation to obtain the actual workload curve showing active jobs over time, and use that as the dynamic workload for simulation (Figure 8). The two samples exhibit strong diurnal properties and range from a typical workload (Workload A) to a bursty workload (Workload B). Released jobs are delayed to reduce the variation in the active-workload curve.

Length estimation parameters: Since we do not know any parameters of the Facebook cluster, we assume the cluster has enough task slots that all map and reduce tasks complete in the first round; hence λ_m = 1 and λ_r = 1. The HDFS block size is typically 128 MB, so X = ⌈S/128 MB⌉ and Y = ⌈S″/128 MB⌉. We assume f(x) and g(x) to be linear: f(x) = α_1 x and g(x) = α_2 x with α_1 = 0.8 s/MB and α_2 = 0.9 s/MB [18]. We use typical values for the data rates: V_i = 100 MB/s, V_o = 100 MB/s and V_n = 10 MB/s [8], [18]. For our experiments, we consider the jobs to be non-preemptive; hence we do not need to decompose the jobs and only assign deadlines to them. Figure 8 shows the variation in active workload over time with the estimated job lengths.

Deadline assignment: For VFW(δ), the deadline D is uniform and is assigned as the number of slots the workload can be delayed. In our simulation we vary D from 1 to 12 slots, giving latencies from 5 minutes up to 1 hour. This is realistic, as deadlines of 8-30 minutes for MapReduce workloads have been used in the literature [9], [23].
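The active-workload curves of Figure 8 can be derived from the release slots and estimated job lengths with a simple counting pass. A minimal sketch, assuming lengths are given in whole slots; the function name `active_workload` is ours:

```python
def active_workload(releases, lengths, T):
    """Count active jobs per slot over a horizon of T slots.

    releases[k]: slot in which job k is released;
    lengths[k]: estimated length of job k in slots."""
    curve = [0] * T
    for r, l in zip(releases, lengths):
        # Job k occupies slots r .. r+l-1 (clipped to the horizon).
        for t in range(r, min(r + l, T)):
            curve[t] += 1
    return curve
```

For instance, two jobs released in slots 0 and 1 with lengths 2 and 3 overlap only in slot 1, giving the curve (1, 2, 1, 1, 0) over 5 slots.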
For GCP, we use k-means clustering to classify the workload into 10 groups based on the map, shuffle and reduce bytes (S, S′, S″). The characteristics of each group are shown in Table II; it is evident that smaller jobs dominate the workload mix, as discussed in Section II-A. To each class of jobs we assign a deadline from 1 to 10 slots, such that smaller job classes receive larger deadlines and larger job classes receive smaller deadlines. The deadline for a job should not be less than the length of the job; if the assigned deadline is less than the job length, we update the deadline to equal the job length.

TABLE II: Cluster sizes and deadlines for workload classification for GCP

| Cluster | A: # Jobs | A: % Jobs | A: S (MB) | A: S′ (MB) | A: S″ (MB) | B: # Jobs | B: % Jobs | B: S (MB) | B: S′ (MB) | B: S″ (MB) | Deadline (# slots) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5691 | 96.56 | 0.02 | 0.00 | 0.67 | 6313 | 95.10 | 0.02 | 0.00 | 0.48 | 1 |
| 2 | 116 | 1.97 | 44856.77 | 15493.69 | 83.89 | 223 | 3.36 | 39356.46 | 6594.93 | 99.26 | 2 |
| 3 | 27 | 0.46 | 57121.85 | 148012.87 | 16090.40 | 41 | 0.62 | 110076.24 | 282.08 | 1.60 | 3 |
| 4 | 23 | 0.39 | 125953.59 | 0.00 | 51.89 | 25 | 0.38 | 379363.01 | 0.00 | 521.45 | 4 |
| 5 | 19 | 0.32 | 0.33 | 0.00 | 49045.29 | 16 | 0.24 | 0.04 | 0.00 | 40355.53 | 5 |
| 6 | 8 | 0.14 | 207984.10 | 414045.45 | 3095.56 | 7 | 0.11 | 132529.27 | 383548.19 | 31344.38 | 6 |
| 7 | 5 | 0.08 | 541522.77 | 0.00 | 0.05 | 4 | 0.06 | 258152.65 | 1020741.05 | 22631.52 | 7 |
| 8 | 3 | 0.05 | 0.05 | 0.00 | 203880.59 | 3 | 0.05 | 0.29 | 0.00 | 311410.40 | 8 |
| 9 | 1 | 0.02 | 7201446.27 | 48674.26 | 0.10 | 3 | 0.05 | 1182734.09 | 3.93 | 0.01 | 9 |
| 10 | 1 | 0.02 | 934594.27 | 8413335.44 | 0.06 | 3 | 0.05 | 0.56 | 0.00 | 622103.12 | 10 |

Fig. 9. Impact of deadline on cost incurred by GCP-U, Offline and VFW(δ) with δ = D/2: (a) Workload A, (b) Workload B.

B. Analysis of the Simulation

We now analyze the impact of different parameters on the cost savings provided by VFW(δ) and GCP, and then compare VFW(δ) with GCP under uniform deadlines (GCP-U).

Impact of deadline: The first parameter we study is the impact of different deadline requirements of the workload on the cost savings.
Figure 9 shows that even for a deadline D as small as 2 slots, the cost is reduced by ~40% for GCP-U and ~20% for VFW(δ), while the offline algorithm gives a cost saving of ~60% compared to the naive algorithm. It also shows that, for all the algorithms, a larger D gives more cost savings, as more workload can be delayed to reduce the variation in the workload. As D grows, the cost reductions from GCP-U and VFW(δ) approach the offline cost saving, which is as much as 70%. For VFW(δ), the cost saving is always less than that of GCP-U for both workloads.

Impact of δ for VFW(δ): The parameter δ is used as a lookahead to detect a valley in the VFW(δ) algorithm. If δ is large, valley detection performs well, but it may be too late to fill the valley because of the deadlines. On the other hand, if δ is small, valley detection does not work well because the capacity has already dropped to its lowest value. Figure 10 illustrates valley detection for small and large δ. Although the cost savings from VFW(δ) depend largely on the nature of the workload curve, Figure 11 shows that δ ≈ D/2 is a conservative estimate for good cost savings.

Fig. 10. Valley detection for (a) small δ and (b) large δ for VFW(δ).

Fig. 11. Impact of δ for VFW(δ) with deadline D = 12: (a) Workload A, (b) Workload B.

Performance of GCP: We evaluated the cost savings from GCP by assigning different deadlines through the workload classification shown in Table II. For conservative estimates of the deadline requirements (1-10 slots), we found a 47.66% cost reduction for Workload A and a 45.65% cost reduction for Workload B, each of which remains close to the offline optimal solution.

Comparison of VFW(δ) and GCP: We compare GCP with uniform deadlines (GCP-U) against VFW(δ) with δ = D/2. Figure 9 illustrates the cost reduction for VFW(δ) and GCP-U with deadlines D = 1 to 12. For both workloads, GCP-U performs better than VFW(δ).
However, for some workloads, valley filling as in VFW(δ) can be more beneficial than provisioning capacity for D consecutive slots as in GCP. Hence we conclude that the comparative performance of the online algorithms depends largely on the nature of the workload. Since both algorithms are based on linear programs, they take around 10-12 ms to compute the schedule at each step.

VII. EXPERIMENTATION

In this section, we validate our algorithms on a MapReduce workload by provisioning capacity on a Hadoop cluster. We evaluate the cost savings via the energy consumption calculated from a common power model using different measured metrics.

A. Experimental Setup

We set up a Hadoop cluster (version 0.20.205) consisting of 35 nodes on Amazon's Elastic Compute Cloud (EC2) [19], [20]. Each node in the cluster is a small instance with 1 virtual core, 1.7 GB memory, and 160 GB storage. We configured one node as the master, four core nodes to contain the Hadoop DFS, and the other 30 nodes as task nodes. The provisioning is done dynamically on the task nodes. We used the Amazon Elastic MapReduce service for provisioning capacity on the Hadoop cluster; Amazon Elastic MapReduce takes care of provisioning machines and migrating tasks between machines while keeping all data available. We used the Statistical Workload Injector for MapReduce (SWIM) [11] to generate the MapReduce workload for our cluster from the Facebook traces of Figure 8(a). We ran our experiment for 4 hours with a slot length of 5 minutes; for the traces of Figure 8(a), 602 jobs were released in the first 48 slots. We first schedule the jobs and provision the task nodes by the 'follow the workload' strategy. We then schedule the same jobs and provision the task nodes using our algorithms, as illustrated in Figure 12. To compare the VFW(δ) and GCP algorithms, we used a uniform deadline of 10 minutes (D = 2).
In each experiment, we measured the seven metrics (available from Amazon CloudWatch) for each 'running' node in each time slot over an interval of 4 hours and 10 minutes (50 slots). In the last 2 slots, the capacity of the task nodes was provisioned to zero for the 'follow the workload' algorithm, while our algorithms execute the delayed workload in those slots. All jobs released in the first 48 slots were completed before the end of the 50th slot. The seven metrics available for each virtual machine are: CPUUtilization, DiskReadBytes, DiskReadOps, DiskWriteBytes, DiskWriteOps, NetworkIn and NetworkOut.

B. Experimental Results

We now discuss the results of the experiments and compare the energy consumption of the different algorithms.

Power metering: We use a general power model to evaluate the energy consumption of the algorithms [21], [22]. The energy consumed by a virtual machine is the sum of the energy consumption for utilization, disk operations and network I/O:

$$E_{vm}(T) = E_{util,vm}(T) + E_{disk,vm}(T) + E_{net,vm}(T) \qquad (8)$$

where the energy consumption is over the duration T. The energy consumption for each component over a time slot t (of length τ) can be computed by the following equations:

$$E_{util,vm}(t) = \alpha_{cpu}\, u_{cpu}(t) + \gamma_{cpu} \qquad (9)$$
$$E_{disk,vm}(t) = \alpha_{rb}\, b_r(t) + \alpha_{wb}\, b_w(t) + \alpha_{ro}\, n_r(t) + \alpha_{wo}\, n_w(t) + \gamma_{disk}$$
$$E_{net,vm}(t) = \alpha_{ni}\, b_{in}(t) + \alpha_{no}\, b_{out}(t) + \gamma_{net}$$

where u_cpu(t) is the average utilization, b_r(t) and b_w(t) are the total bytes read from and written to disk, n_r(t) and n_w(t) are the total numbers of disk reads and writes, and b_in(t) and b_out(t) are the total bytes of network I/O for the virtual machine over time slot t.
Since the differences in energy between disk reads and writes, and between network input and output, are negligible [21], we use combined values b_db(t), b_do(t), and b_net(t), obtained by summing the respective quantities. We normalize each of these values by its maximum over the interval T, so that each becomes a fraction between 0 and 1 and can be substituted into equation (8):

$$E_{vm}(t) = \alpha_{cpu} u_{cpu}(t) + \gamma_{cpu} + \alpha_{disk} u_{disk}(t) + \gamma_{disk} + \alpha_{dops} u_{dops}(t) + \gamma_{dops} + \alpha_{net} u_{net}(t) + \gamma_{net} \qquad (10)$$

where u_disk(t), u_dops(t) and u_net(t) are the normalized values of b_db(t), b_do(t), and b_net(t), respectively. If m_t machines are active in time slot t, the total energy consumed over the interval T is

$$E(T) = \sum_{t=1}^{T} \sum_{i=1}^{m_t} E_i(t) \cdot \frac{\tau}{3600} \ \text{Watt-hours} \qquad (11)$$

where E_i(t) is the energy consumed by machine i in time slot t. To compute the energy consumption, we used the parameters from [22] listed in Table III. Typical values are used for CPU utilization, disk I/O and network I/O; idle disk and network powers are negligible relative to the dynamic power and the scale of the workload.

TABLE III: Power model parameters

| Parameter | Comment | Value |
|---|---|---|
| α_cpu | Scaling factor: utilization | 25.70 |
| α_disk | Scaling factor: disk read/write | 7.21 |
| α_dops | Scaling factor: disk operations | 0 |
| α_net | Scaling factor: network I/O | 0.66 |
| γ_cpu | Idle CPU power consumption | 60.30 |
| γ_disk, γ_dops | Idle disk power consumption | 0 |
| γ_net | Idle network power consumption | 0 |

The total energy consumption, and the percentage reduction with respect to 'follow the workload' in each of the metrics for the different schedules, are shown in Table IV. For the period of 4 hours 10 minutes (50 slots), the GCP algorithm gives an energy reduction of ~12%, significantly better than the ~6.02% reduction from the VFW(δ) algorithm.
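Equations (10) and (11), with the Table III parameters, reduce to a few lines of code. A minimal sketch: the function names `vm_power` and `total_energy_wh` are ours, and the utilizations are assumed to be already normalized to [0, 1].

```python
# Power model parameters from Table III (scaling factors in Watts).
PARAMS = dict(a_cpu=25.70, a_disk=7.21, a_dops=0.0, a_net=0.66,
              g_cpu=60.30, g_disk=0.0, g_dops=0.0, g_net=0.0)

def vm_power(u_cpu, u_disk, u_dops, u_net, p=PARAMS):
    """Per-VM power in Watts from normalized utilizations, eq. (10)."""
    return (p["a_cpu"] * u_cpu + p["g_cpu"]
            + p["a_disk"] * u_disk + p["g_disk"]
            + p["a_dops"] * u_dops + p["g_dops"]
            + p["a_net"] * u_net + p["g_net"])

def total_energy_wh(per_slot_vm_utils, tau=300.0):
    """Total energy over the horizon, eq. (11): sum power over every
    active VM in every slot, times slot length tau (seconds), in Wh."""
    total = 0.0
    for slot in per_slot_vm_utils:       # one list of active VMs per slot
        for u in slot:                   # u = (u_cpu, u_disk, u_dops, u_net)
            total += vm_power(*u) * tau / 3600.0
    return total
```

For example, a single fully idle VM draws γ_cpu = 60.30 W, which over one 5-minute slot is 60.30 × 300/3600 ≈ 5.03 Wh; the fixed idle term dominating the dynamic terms is exactly why the paper sets e_0 = 1, e_1 = 0.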
The reductions from both algorithms are far better with respect to the workload schedule without provisioning. Table IV also shows the variation in CPU utilization, disk I/O and network I/O across the algorithms; this variation results from the differences in capacity provisioning, which change job migration and disk I/O in the cluster. Figure 13 illustrates the average energy consumption within each slot over the time interval, showing a significant reduction in peak energy consumption. As the provisioning algorithms cut peaks off the workload and provision the machines without wasting computation capacity, they reduce the peak energy consumption of the data center.

Fig. 12. The solutions for (a) 'follow the workload', (b) VFW(δ) and (c) GCP-U with uniform deadline D = 2 slots, δ = 1 and time slot = 5 minutes.

Fig. 13. Average energy consumption from the cluster with time slots of 5 minutes, over a period of 4 hours.

TABLE IV: Total energy consumption and total values for the different metrics from the cluster under the different schedules

| Metric | No Provisioning | Follow | VFW(δ) | % Reduction | GCP-U | % Reduction |
|---|---|---|---|---|---|---|
| Energy Consumption (kWh) | 8.60 | 4.46 | 4.19 | 6.02 | 3.93 | 11.96 |
| CPUUtilization (sum) | 32505.95 | 22805.98 | 21014.51 | 7.86 | 20400.02 | 10.55 |
| DiskReadBytes (GB) | 0.25 | 12.95 | 7.56 | 41.64 | 3.85 | 70.29 |
| DiskWriteBytes (GB) | 10.42 | 8.01 | 8.44 | -5.48 | 6.55 | 18.19 |
| DiskReadOps (count) | 18883 | 1109320 | 710451 | 35.96 | 396070 | 64.30 |
| DiskWriteOps (count) | 1746347 | 1134108 | 1020343 | 10.03 | 901860 | 20.48 |
| NetworkIn (GB) | 45.69 | 42.30 | 43.69 | -3.29 | 42.88 | -1.38 |
| NetworkOut (GB) | 44.21 | 42.45 | 38.64 | 8.97 | 41.48 | 2.29 |

VIII. RELATED WORK

With the growing importance of energy management in data centers, many researchers have applied energy-aware scheduling because of its low cost and broad applicability. In energy-aware scheduling, most work tries to find a balance between energy cost and performance loss through DVFS (Dynamic Voltage and Frequency Scaling) and DPM (Dynamic Power Management), which are the most common system-level power-saving methods. Beloglazov et al. [25] give a taxonomy and survey of energy management in data centers. Dynamic capacity provisioning is part of the DPM technique. Chase et al. [26] introduce executable utility functions to quantify the value of performance and use an economic approach to achieve resource provisioning. Pinheiro et al. [27] consider resource provisioning at both the application and operating-system level; they dynamically turn nodes on or off to adapt to the changing load, but do not consider the switching cost.

Most work on dynamic capacity provisioning for independent workload uses models based on queueing theory [28], [29] or control theory [30], [31]. Recently, Lin et al. [7] used a more general and common energy model and delay model and designed a provisioning algorithm for service jobs (e.g., HTTP requests) that considers the switching cost of the machines. They proposed a lazy capacity provisioning (LCP) algorithm which dynamically turns servers in a data center on and off to minimize the energy cost and delay cost of scheduling the workload. However, their algorithm does not perform well for workloads with a high peak-to-mean ratio (PMR) and does not provide a bound on the maximum delay. Moreover, LCP aims at minimizing the average delay, while we regard latency as a deadline constraint: instead of penalizing delay, we purposely defer jobs within their deadlines in order to reduce the switching cost of the servers.

Many real-world applications require a delay bound or deadline constraint; see, e.g., Lee et al. [24]. In the context of energy conservation, the deadline is usually a critical knob between performance loss and energy consumption. Energy-efficient deadline scheduling was first studied by Yao et al. [32].
They proposed algorithms that aim to minimize energy consumption for independent jobs with deadline constraints on a single variable-speed processor. Subsequently, a series of works considered online deadline scheduling in different scenarios, such as discrete-voltage processors, tree-structured tasks, processors with a sleep state, and overloaded systems [33], [34]. In the data center context, most prior work on energy management merely minimizes the average delay without any bound on the delay. Recently, Mukherjee et al. [6] proposed online algorithms with deadline constraints to minimize the computation, cooling and migration energy for machines. Goiri et al. [35] considered only batch jobs and proposed GreenSlot, a scheduler with deadline requirements for the jobs; GreenSlot predicts the amount of solar energy that will be available in the near future and schedules the workload to maximize green energy consumption while meeting the jobs' deadlines. However, these works address the job assignment problem rather than the dynamic resource provisioning problem, where the number of needed servers is given in advance.

Recently, researchers have used scheduling with deferral to improve the performance of MapReduce jobs [10], [23]. Although MapReduce was designed for batch jobs, it is increasingly used for small time-sensitive jobs. Delay scheduling with performance goals was proposed by Zaharia et al. [10] for scheduling jobs inside a Hadoop cluster with given resources. Verma et al. introduced an SLA-driven scheduling and resource provisioning framework considering given soft-deadline requirements for MapReduce jobs [8], [9]. In contrast to these works, we consider hard deadlines, schedule jobs within those deadlines, and provision capacity to save energy. Recently, Chen et al.
[12] identified a large class of interactive MapReduce workloads and proposed policies for scheduling batch and small interactive jobs in separate clusters, without any provisioning mechanism for the machines in the cluster. In contrast, we propose provisioning algorithms for a mix of batch and interactive jobs under bounded latency, with a constant competitive ratio.

IX. CONCLUSION

We have shown that significant reductions in energy consumption can be achieved by dynamic deferral of workload for capacity provisioning inside data centers. We have proposed two new algorithms, VFW(δ) and GCP, for provisioning the capacity and scheduling the workload while guaranteeing the deadlines. The algorithms exploit the flexibility in the latency requirements of the workloads for energy savings, and guarantee bounded cost and bounded latency under very general settings: arbitrary workload, general deadlines and general energy cost models. Further, both the VFW(δ) and GCP algorithms are simple to implement and do not require significant computational overhead. Additionally, the algorithms have constant competitive ratios and offer noteworthy cost savings, as proved by theory and demonstrated by simulation, respectively. We have validated our algorithms on a MapReduce workload by provisioning capacity on a Hadoop cluster. For a small interval of 4 hours, we found ~6.02% total energy savings for VFW(δ) and ~12% for GCP with respect to the naive 'follow the workload' approach. Both algorithms achieve more than 50% reduction in energy consumption with respect to 'no provisioning', which is common practice among current data center providers. Although we have used MapReduce workload for validation, our algorithms can be applied to any workload, as data centers have separate (physical/virtual) clusters for MapReduce and non-MapReduce jobs; the provisioning can be done on each such cluster.
To reduce their energy consumption, data center providers should provision their (physical/virtual) capacity and utilize the flexibilities in SLAs via dynamic deferral.

ACKNOWLEDGMENT

This work was sponsored in part by the Multiscale Systems Center (MuSyC) and the NSF Variability Expedition.

REFERENCES

[1] Server and Data Center Energy Efficiency, Final Report to Congress, U.S. Environmental Protection Agency, 2007.
[2] Z. Liu, M. Lin, A. Wierman, S. Low, and L. Andrew, Greening Geographical Load Balancing, in Proc. ACM SIGMETRICS, 2011.
[3] C. Stewart and K. Shen, Some Joules Are More Precious Than Others: Managing Renewable Energy in the Datacenter, in Proc. Power Aware Comput. and Sys., October 2009.
[4] E. Pakbaznia and M. Pedram, Minimizing data center cooling and server power costs, in Proc. ISLPED, 2009.
[5] R. K. Sharma, C. E. Bash, C. D. Patel, R. J. Friedrich, and J. S. Chase, Balance of Power: Dynamic Thermal Management for Internet Data Centers, IEEE Internet Computing, 9(1), pp. 42-49, 2005.
[6] T. Mukherjee, A. Banerjee, G. Varsamopoulos, and S. K. S. Gupta, Spatio-Temporal Thermal-Aware Job Scheduling to Minimize Energy Consumption in Virtualized Heterogeneous Data Centers, Computer Networks, 53(17), 2009.
[7] M. Lin, A. Wierman, L. H. Andrew, and E. Thereska, Dynamic right-sizing for power-proportional data centers, in Proc. IEEE INFOCOM, 2011.
[8] A. Verma, L. Cherkasova, and R. Campbell, SLO-Driven Right-Sizing and Resource Provisioning of MapReduce Jobs, in Proc. LADIS, 2011.
[9] A. Verma, L. Cherkasova, and R. Campbell, Resource Provisioning Framework for MapReduce Jobs with Performance Goals, in Proc. Middleware, 2011.
[10] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, in Proc. EuroSys, 2010.
[11] Y. Chen, A. Ganapathi, R. Griffith, and R.
Katz, The Case for Evaluating MapReduce Performance Using Workload Suites, in Proc. IEEE MASCOTS, 2011.
[12] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz, Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, in Proc. EuroSys, 2012.
[13] D. Xu and X. Liu, Geographic trough filling for Internet datacenters, in Proc. IEEE INFOCOM, 2012.
[14] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Comm. of the ACM, 51(1), pp. 107-113, 2008.
[15] SPEC power data on the SPEC website at http://www.spec.org.
[16] P. Bodik, M. P. Armbrust, K. Canini, A. Fox, M. Jordan, and D. A. Patterson, A case for adaptive datacenters to conserve energy and improve reliability, UC Berkeley Tech. Report UCB/EECS-2008-127, 2008.
[17] L. A. Barroso and U. Hölzle, The case for energy-proportional computing, IEEE Computer, 40(12), pp. 33-37, 2007.
[18] Y. Xiao and S. Jianling, Reliable Estimation of Execution Time of MapReduce Program, China Communications, 8(6), pp. 11-18, 2011.
[19] Amazon Elastic MapReduce. http://aws.amazon.com/elasticmapreduce/.
[20] Apache Hadoop. http://hadoop.apache.org/.
[21] A. Kansal, F. Zhao, J. Liu, N. Kothari, and A. Bhattacharya, Virtual Machine Power Metering and Provisioning, in Proc. SoCC, 2010.
[22] R. Lent, Evaluating the performance and power consumption of systems with virtual machines, in Proc. IEEE CloudCom, Nov. 2011.
[23] K. Kc and K. Anyanwu, Scheduling Hadoop Jobs to Meet Deadlines, in Proc. IEEE CloudCom, 2010.
[24] C. B. Lee and A. Snavely, Precise and realistic utility functions for user-centric performance analysis of schedulers, in Proc. IEEE HPDC, 2007.
[25] A. Beloglazov, R. Buyya, Y. C. Lee, and A. Zomaya, A taxonomy and survey of energy-efficient data centers and cloud computing systems, Advances in Computers, Elsevier: Amsterdam, 2011.
[26] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P.
Doyle, Managing energy and server resources in hosting centers, in Proc. ACM SOSP, pp. 103-116, 2001.
[27] E. Pinheiro, R. Bianchini, E. Carrera, and T. Heath, Load balancing and unbalancing for power and performance in cluster-based systems, in Proc. Compilers and Operating Sys. for Low Power, 2001.
[28] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, Optimal power allocation in server farms, in Proc. ACM Sigmetrics, 2009.
[29] D. Meisner, B. T. Gold, and T. F. Wenisch, The PowerNap Server Architecture, ACM Trans. Computer Systems (TOCS), 29(1), 2011.
[30] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam, Managing server energy and operational costs in hosting centers, in Proc. ACM Sigmetrics, 2005.
[31] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely, Dynamic resource allocation and power management in virtualized data centers, in Proc. IEEE/IFIP NOMS, 2010.
[32] F. Yao, A. Demers, and S. Shenker, A scheduling model for reduced CPU energy, in Proc. IEEE FOCS, pp. 374-382, 1995.
[33] H. L. Chan, J. W. Chan, T. W. Lam, L. K. Lee, K. S. Mak, and P. W. Wong, Optimizing throughput and energy in online deadline scheduling, ACM Trans. Algorithms, 6(1), pp. 1-10, 2009.
[34] X. Han, T. W. Lam, L. K. Lee, I. K. To, and P. W. Wong, Deadline scheduling and power management for speed bounded processors, Theor. Comput. Sci., 411(40-42), pp. 3587-3600, 2010.
[35] I. Goiri et al., GreenSlot: Scheduling Energy Consumption in Green Datacenters, in Proc. of Supercomputing, November 2011.
