

A Hierarchical Framework of Cloud Resource Allocation and Power Management Using Deep Reinforcement Learning

Ning Liu*, Zhe Li*, Jielong Xu*, Zhiyuan Xu, Sheng Lin, Qinru Qiu, Jian Tang, Yanzhi Wang

Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY 13244, USA

{nliu03, zli89, zxu105, jxu21, shlin, qiqiu, jtang02, ywang393}@syr.edu

Abstract—Automatic decision-making approaches, such as reinforcement learning (RL), have been applied to (partially) solve the resource allocation problem adaptively in the cloud computing system. However, a complete cloud resource allocation framework exhibits high dimensions in state and action spaces, which prohibit the usefulness of traditional RL techniques. In addition, high power consumption has become one of the critical concerns in the design and control of cloud computing systems; it degrades system reliability and increases cooling cost. An effective dynamic power management (DPM) policy should minimize power consumption while keeping performance degradation within an acceptable level. Thus, a joint virtual machine (VM) resource allocation and power management framework is critical to the overall cloud computing system. Moreover, a novel solution framework is necessary to address the even higher dimensions in state and action spaces.

In this paper, we propose a novel hierarchical framework for solving the overall resource allocation and power management problem in cloud computing systems. The proposed hierarchical framework comprises a global tier for VM resource allocation to the servers and a local tier for distributed power management of local servers. The emerging deep reinforcement learning (DRL) technique, which can deal with complicated control problems with large state spaces, is adopted to solve the global tier problem. Furthermore, an autoencoder and a novel weight sharing structure are adopted to handle the high-dimensional state space and accelerate the convergence speed. On the other hand, the local tier of distributed server power management comprises an LSTM-based workload predictor and a model-free RL-based power manager, operating in a distributed manner. Experiment results using actual Google cluster traces show that our proposed hierarchical framework significantly reduces power consumption and energy usage compared with the baseline while incurring no severe latency degradation. Meanwhile, the proposed framework achieves the best trade-off between latency and power/energy consumption in a server cluster.

Keywords—Deep reinforcement learning; hierarchical framework; resource allocation; distributed algorithm

I. INTRODUCTION

Cloud computing has emerged as the most popular computing paradigm in today's computer industry. Cloud computing with virtualization technology enables computational resources (including CPU, memory, disk, communication bandwidth, etc.) in data centers or server clusters to be shared by allocating virtual machines (VMs) in an on-demand manner.

*Ning Liu, Zhe Li and Jielong Xu contributed equally to this work.

The effective adoption of VMs in data centers is one of the key enablers of the large-scale cloud computing paradigm. To support such a feature, an efficient and robust scheme of VM allocation and management is critical. Due to the time-variance of workloads [1,2], it is desirable to perform cloud resource allocation and management in an online adaptive manner. Automatic decision-making approaches, such as reinforcement learning (RL) [3], have been applied to (partially) solve the resource allocation problem in the cloud computing system [4-6]. Key properties of RL-based methods are suitable for cloud computing systems because they do not require a priori modeling of state transition, workload, and power/performance of the underlying system, i.e., they are essentially model-free. Instead, the RL-based agents learn the optimal resource allocation decision and control the system operation in an online fashion as the system runs.

However, a complete resource allocation framework in the cloud computing systems exhibits high dimensions in state and action spaces. For example, a state in the state space may be the Cartesian product of the characteristics and current resource utilization level of each server (for hundreds of servers) as well as the current workload level (number and characteristics of VMs for allocation). An action in the action space may be the allocation of VMs to the servers (a.k.a. physical machines) and allocating resources in the servers for VM execution. The high dimensions in state and action spaces prohibit the usefulness of traditional RL techniques to the overall cloud computing system, in that the convergence speed of traditional RL techniques is in general proportional to the number of state-action pairs [3] and will be impractically long with high state and action dimensions. As a result, previous works only used RL to dynamically adjust the power status of a single physical server [7] or the number of homogeneous VMs assigned to an application [5,8], in order to restrict the state and action spaces.

In addition, power consumption has become one of the critical concerns in the design and control of cloud computing systems. High power consumption degrades system reliability and increases the cooling cost for high-performance systems. Dynamic power management (DPM), defined as the selective shut-off or slow-down of system components that are idle



or underutilized, has proven to be an effective technique for reducing power dissipation at the system level [9]. An effective DPM policy should minimize power consumption while maintaining performance degradation at an acceptable level. The design of such DPM policies has been an active research area, and a variety of adaptive power management techniques including machine learning have been applied [9-11]. DPM policies have mainly been applied at the server level, and therefore depend on the results of VM resource allocation. Therefore, a joint VM resource allocation and power management framework, which targets the whole data center or server cluster, would be critical to the overall cloud computing system. Obviously, the joint management framework exhibits even higher dimensions in state and action spaces and requires a novel solution framework.

Recent breakthroughs of deep reinforcement learning (DRL) in AlphaGo and Atari game playing set a good example in handling the large state space of complicated control problems [12,13], and could potentially be utilized to solve the overall cloud resource allocation and power management problem. The convolutional neural network, one example of deep neural networks (DNNs), was used to effectively extract useful information from high-dimensional image input and build a correlation between each state-action pair (s, a) and the associated value function Q(s, a), which is the expected accumulative reward (or cost) that the agent aims to maximize (or minimize) in RL. For online operation, a deep Q-learning framework was also proposed to derive the optimal action a at each state s in order to maximize (or minimize) the corresponding Q(s, a) value. Although promising, both the DNN and the deep Q-learning framework need to be modified for applications in cloud resource allocation and power management. Moreover, the DRL framework requires a relatively low-dimensional action space [13], because in each decision epoch the DRL agent needs to enumerate all possible actions at the current state and perform inference using the DNN to derive the optimal Q(s, a) value estimate, which implies that the action space in cloud resource allocation and power management needs to be significantly reduced.

To fulfill the above objectives, in this paper we propose a novel hierarchical framework for solving the overall resource allocation and power management problem in cloud computing systems. The proposed hierarchical framework comprises a global tier for VM resource allocation to the servers and a local tier for power management of local servers. Besides the enhanced scalability and reduced state/action space dimensions, the proposed hierarchical framework makes it possible to perform the local power management of servers in an online and distributed manner, which further enhances the parallelism degree and reduces the online computational complexity.

The global tier of VM resource allocation exhibits large state and action spaces, and thus the emerging DRL technique is adopted to solve the global tier problem. In order to significantly reduce the action space, we adopt a continuous-time and event-driven decision framework in which each decision epoch coincides with the arrival time of a new VM request. In this way, the action at each decision epoch is simply the target

Fig. 1. Illustration of the agent-environment interaction system.

server for VM allocation, which ensures that the available actions are enumerable. Furthermore, an autoencoder [14] and a novel weight sharing structure are adopted to handle the high-dimensional state space and accelerate the convergence speed, making use of the specific characteristics of cloud computing systems. On the other hand, the local tier of server power management comprises a workload predictor and a power manager. The workload predictor is responsible for providing future workload predictions to facilitate the power management algorithm, and we adopt the long short-term memory (LSTM) network [15] due to its ability to capture long-term dependencies in time-series prediction. Based on the workload prediction results and the current information, the power manager adopts the model-free RL technique to adaptively determine the suitable action for turning ON/OFF the servers in order to simultaneously reduce power/energy consumption and job (VM) latency.

Experiment results using actual Google cluster traces [16] show that the proposed hierarchical framework significantly reduces power consumption and energy usage compared with the baseline while achieving similar average latency. In a 30-server cluster with 95,000 job requests, the proposed hierarchical framework saves 53.97% of power and energy consumption. Meanwhile, the proposed framework achieves the best trade-off between latency and power/energy consumption in a server cluster. In the same case, the proposed framework reduces average per-job latency by up to 16.16% under the same energy usage, and reduces average power/energy consumption by up to 16.20% under the same latency.

II. BACKGROUND OF THE AGENT-ENVIRONMENT INTERACTION SYSTEM AND CONTINUOUS-TIME Q-LEARNING

A. Agent-Environment Interaction System

As shown in Fig. 1, the general agent-environment interaction model (of both traditional RL and the emerging DRL) consists of an agent, an environment, a finite state space S, a set of available actions A, and a reward function S × A → R. The decision maker is called the agent, and is trained as the interaction system runs. The agent interacts with the outside world, which is called the environment. The interaction between the agent and the environment is a continual process. At each decision epoch k, the agent makes decision a_k based on the current state s_k of the environment. Once the

Page 3: A Hierarchical Framework of Cloud Resource Allocation and ... · Abstract—Automatic decision-making approaches, such as re-inforcement learning (RL), have been applied to (partially)

decision is made, the environment receives the decision, makes corresponding changes, and presents the updated new state s_{k+1} to the agent for making future decisions. The environment also provides reward r_k to the agent depending on the decision a_k, and the agent tries to maximize some notion of cumulative reward over time. This simple reward feedback mechanism is required for the agent to learn its optimal behavior and policy.
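To make this loop concrete, the following minimal Python sketch mirrors the interaction described above; the Environment and Agent classes, the random transition dynamics, and the placeholder policy are purely illustrative assumptions, not part of the proposed framework:

import random

class Environment:
    """Toy stand-in for the controlled system (illustrative only)."""
    def __init__(self, num_states=4, num_actions=2):
        self.num_states, self.num_actions = num_states, num_actions
        self.state = 0

    def step(self, action):
        # A real environment would model the server cluster dynamics;
        # here the transition and reward are random placeholders.
        self.state = random.randrange(self.num_states)
        return self.state, random.random()

class Agent:
    """Decision maker that observes states/rewards and picks actions."""
    def __init__(self, num_actions):
        self.num_actions = num_actions

    def act(self, state):
        return random.randrange(self.num_actions)   # placeholder policy

    def observe(self, state, action, reward, next_state):
        pass                                        # learning update would go here

env, agent = Environment(), Agent(num_actions=2)
state = env.state
for k in range(10):                                 # decision epochs k = 0, 1, ...
    action = agent.act(state)                       # a_k chosen from s_k
    next_state, reward = env.step(action)           # environment returns s_{k+1}, r_k
    agent.observe(state, action, reward, next_state)
    state = next_state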

B. Continuous-Time Q-Learning for Semi-Markov Decision Process (SMDP)

In the Q-learning procedure [17], a representative algorithm in general RL, the agent aims to maximize a value function Q(s, a), which is the expected accumulated (with discounts) reward when the system starts at state s and follows action a (and a certain policy thereafter). For continuous-time systems, Q(s, a) is given as:

Q(s, a) = E[ ∫_{t_0}^{∞} e^{−β(t − t_0)} r(t) dt | s_0 = s, a_0 = a ]    (1)

where r(t) is the reward rate function and β is the discount rate. The Q(s, a) for discrete-time systems can be defined similarly.

The Q-learning for SMDP is an online adaptive RL technique that operates in the continuous time domain in an event-driven manner [18], which reduces the overheads associated with periodic value updates in discrete-time RL techniques. Please note that the name of the technique contains the term "SMDP" only because it is proven to achieve the optimal policy under an SMDP environment. In fact, it can be applied to non-stationary environments as well with excellent results [19]. In Q-learning for SMDP, the definition of the value function Q(s, a) is continuous-time based and given in Eqn. (1). At each decision epoch t_k, the RL agent selects the action a_k using a certain policy, e.g., the ε-greedy policy [20], similar to discrete-time RL techniques. At the next decision epoch t_{k+1}, triggered by a state transition, the value updating rule (from the k-th estimate to the (k + 1)-th estimate) is given by the following:

Q^{(k+1)}(s_k, a_k) ← Q^{(k)}(s_k, a_k) + α · ( (1 − e^{−β τ_k})/β · r(s_k, a_k) + e^{−β τ_k} · max_{a'} Q^{(k)}(s_{k+1}, a') − Q^{(k)}(s_k, a_k) )    (2)

where Q^{(k)}(s_k, a_k) is the value estimate at decision epoch t_k, r(s_k, a_k) is the reward function, τ_k is the sojourn time that the system remains in state s_k before a transition occurs, α ≤ 1 is the learning rate, and β is the discount rate.
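A minimal tabular sketch of this update rule is given below (Python); it assumes a constant reward rate over each sojourn interval, and the class and variable names are illustrative rather than taken from the paper:

import math
from collections import defaultdict

class SMDPQLearner:
    """Tabular continuous-time Q-learning for SMDP, implementing Eqn. (2)."""
    def __init__(self, alpha=0.1, beta=0.5):
        self.alpha = alpha            # learning rate alpha <= 1
        self.beta = beta              # discount rate beta
        self.Q = defaultdict(float)   # Q[(state, action)] -> value estimate

    def update(self, s_k, a_k, reward_rate, tau_k, s_next, next_actions):
        """One value update at epoch t_{k+1}, where tau_k = t_{k+1} - t_k."""
        decay = math.exp(-self.beta * tau_k)
        # Discounted reward accumulated over the sojourn time tau_k.
        accumulated = (1.0 - decay) / self.beta * reward_rate
        best_next = max(self.Q[(s_next, a)] for a in next_actions)
        target = accumulated + decay * best_next
        self.Q[(s_k, a_k)] += self.alpha * (target - self.Q[(s_k, a_k)])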

III. SYSTEM MODEL AND PROBLEM STATEMENT

We consider a server cluster with M physical servers that offer D types of resources with respect to the cloud resource allocation and power management framework in this paper. Usually, a server can be in the active mode or the sleep mode for power saving. We denote M as the set of physical servers, and D as the set of resources. Fig. 2 provides an illustration of the

This work can be generally applied to multiple server clusters or an overalldata center.

Fig. 2. Cloud resource allocation and power management framework, comprising both the global tier and the local tier.

Fig. 3. An example of job execution on a server when the server is active. (CPU utilization (%) versus time; Job 1 latency: t_4 − t_1, Job 2 latency: t_5 − t_2, Job 3 latency: t_6 − t_3.)

hierarchical cloud resource allocation and power management framework, comprising a global tier and a local tier. A job broker, controlled by the global tier in the proposed hierarchical framework, dispatches jobs that request resources for processing to one of the servers in the cluster at their arrival time, as shown in Fig. 2. Each server queues all assigned jobs and allocates resources for them in a first-come-first-serve (FCFS) manner. If a server has insufficient resources to process a job, the job waits until sufficient resources are released by completed jobs. On the other hand, the local tier performs power management and turns ON/OFF each server in a distributed manner. Both the job scheduling and the local power management significantly affect the overall power consumption and performance of the server cluster.

We demonstrate how jobs are executed on an active server with only CPU resource usage in Fig. 3 as an example. Job 1 consumes 50% of the CPU, while each of Jobs 2 and 3 requires 40%. Jobs 1, 2, and 3 arrive at t_1, t_2, and t_3, respectively, and complete at t_4, t_5, and t_6, respectively. We assume that the server is in the active mode at time 0. When Jobs 1 and 2 arrive, there are enough CPU resources, so their requirements are satisfied immediately. When Job 3 arrives, it waits until Job 1 is completed, and the waiting time is t_4 − t_3. The job latency is defined as the duration between the job arrival and completion. Therefore, the latency of Job 3 is t_6 − t_3, which is longer than the job duration. To reduce job latency, the job broker should avoid overloading servers. A scheduling scheme must be developed for dynamically assigning jobs to servers and allocating resources in each server.
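The following short Python sketch reproduces this FCFS behavior for a single server with one CPU-like resource; the job parameters in the example are made-up placeholders rather than values from the traces:

import heapq

def fcfs_latencies(jobs, capacity=1.0):
    """Single-server FCFS simulation (a simplified sketch; one resource type only).

    jobs: list of (arrival_time, duration, cpu_fraction), sorted by arrival time.
    Returns per-job latency = completion_time - arrival_time.
    """
    running = []                      # min-heap of (finish_time, cpu_fraction)
    used = 0.0
    latencies = []
    for arrival, duration, cpu in jobs:
        start = arrival
        # Free resources of jobs finishing by the candidate start time,
        # and push the start time forward until the new job fits.
        while True:
            while running and running[0][0] <= start:
                used -= heapq.heappop(running)[1]
            if used + cpu <= capacity + 1e-9:
                break
            start = running[0][0]     # wait for the next job completion
        finish = start + duration
        heapq.heappush(running, (finish, cpu))
        used += cpu
        latencies.append(finish - arrival)
    return latencies

# Qualitative analogue of Fig. 3 with made-up times: Job 3 waits for Job 1.
print(fcfs_latencies([(0.0, 5.0, 0.5), (1.0, 5.0, 0.4), (2.0, 3.0, 0.4)]))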

When a job is assigned to a server in the sleep mode, it takes T_on time to switch the server into the active mode. Similarly,


Fig. 4. Illustration of the effectiveness of server power management in the local tier: (a) power management in an ad hoc manner; (b) power management with the DPM technique. (Both panels show the power usage of the server over time while serving Job 1 at P(50%) and Job 2 at P(70%).)

it takes T_off time to switch the server back to the sleep mode; this decision is made by the local tier of the power manager (in a distributed manner). We assume that the power consumption of a server in the sleep mode is zero, and that the power consumption of a server at time t in the active mode is a function of the CPU utilization [21]:

P(x_t) = P(0%) + (P(100%) − P(0%)) · (2x_t − x_t^{1.4})    (3)

where x_t denotes the CPU utilization of the server at time t, P(0%) denotes the power consumption of the server in the idle mode, and P(100%) denotes the power consumption of the server at full load. In general, the power consumption of the server during the sleep-to-active transition is higher than P(0%) [21,22].
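A direct Python rendering of this power model is given below; the default idle and peak values (87 W and 145 W) are the ones used later in the experimental setup and should be treated as configurable parameters:

def server_power(cpu_utilization, p_idle=87.0, p_peak=145.0):
    """Active-mode server power (W) as a function of CPU utilization in [0, 1],
    following Eqn. (3)."""
    x = cpu_utilization
    return p_idle + (p_peak - p_idle) * (2.0 * x - x ** 1.4)

# Example: a server at 50% CPU utilization.
print(round(server_power(0.5), 1))   # roughly 123 W under this model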

We explain the effectiveness of power management in the local tier using Fig. 4. Assume that at time 0 the server is in the sleep mode, and the arrival times and CPU resource utilizations of Jobs 1 and 2 are <t_1, 50%> and <t_3, 70%>, respectively. Job 1's arrival triggers the server to switch to the active mode, and it serves Job 1 from t_1 + T_on to t_2. Fig. 4(a) illustrates the case when power management is performed in an ad hoc manner. In this case, the server turns back to the sleep mode from t_2, and the expected completion time of the power mode switching is t_2 + T_off, which is later than the arrival time of Job 2 (t_3). Thus, the server switches back to the active mode immediately after t_2 + T_off. At t_2 + T_off + T_on, the server starts serving Job 2 and completes it at time t_4. In this case, a portion of power consumption is wasted on the frequent turning ON/OFF of the server, and extra latency is incurred on Job 2. On the other hand, Fig. 4(b) illustrates the case when an effective DPM technique is in place. In this case, an effective timeout is set after time t_2 and the server stays in the idle state. If Job 2 arrives before the timeout expires, the server processes Job 2 immediately and finishes it at an earlier time t'_4, with t'_4 < t_4.

Fig. 5. Power management at each local server, comprising the workload predictor and the power manager.

A simultaneous reduction in power/energy consumption and average job latency can be achieved through the effective DPM technique at the server level, and the DPM technique for each server can be performed in a distributed manner.

The effective DPM technique in the local tier properly determines the most desirable timeout values in an online adaptive manner, based on the VM allocation results from the global tier. The DPM framework at the local server level is illustrated in Fig. 5. A workload predictor is incorporated in the DPM framework; it is critical to the performance of DPM because it provides predictions of future workloads for the power manager. The prediction, together with current information such as the number of pending jobs in the queue, is fed into the power manager and serves as the state of the power manager for selecting corresponding actions and learning in the observation domain of the system.

The DPM framework of local servers relies heavily on a confident workload prediction and a properly designed power manager. In [19], a Naïve Bayes classifier is adopted to perform the workload prediction. In this paper, we want to perform a more accurate (time-series) prediction with continuous values, and thus a recurrent neural network (RNN) [23] or even a long short-term memory (LSTM) network [15] becomes a good candidate for the workload predictor. With accurate prediction and current information of the system under management, the power manager has to derive the most appropriate actions (timeout values) to simultaneously reduce the power consumption of the server and the job latency, and the model-free RL technique [19,24] serves as a good candidate for the adaptive power management algorithm.

In Sections V and VI, we describe the global and local tiers of the overall cloud resource allocation and power management framework. The global tier of cloud resource (VM) allocation exhibits high dimensions in state and action spaces, which prohibit the usefulness of traditional RL techniques. This is because the convergence speed of traditional RL techniques is in general proportional to the number of state-action pairs [3]. Therefore, we adopt the emerging DRL


technique [13], which has the potential of handling the large state space of complicated control problems, to solve the global tier problem. For the local tier of power management, we adopt a model-free RL-based DPM technique integrated with an effective workload predictor using the LSTM network.

IV. OVERVIEW OF DEEP REINFORCEMENT LEARNING

In this section, we present a generalized form of the DRL technique compared with the prior work, which can be utilized for resource allocation and other problems as well. The DRL technique comprises an offline DNN construction phase and an online deep Q-learning phase [13,25]. The offline phase adopts a DNN to derive the correlation between each state-action pair (s, a) of the system under control and its value function Q(s, a). The offline DNN construction phase needs to accumulate enough samples of Q(s, a) value estimates and corresponding (s, a) pairs to construct a sufficiently accurate DNN; these can come from a model-based procedure or from actual measurement data [12,26]. For the game playing applications [13], this procedure includes the pre-processing of game playing profiles/replays and obtaining the state transition profiles and Q(s, a) value estimates (e.g., win/lose or the score achieved). For the cloud resource allocation application, we make use of the real job arrival traces to obtain enough state transition profiles and Q(s, a) value estimates, which can be a composite of power consumption, performance (job latency), and resiliency metrics, for the DNN construction. An arbitrary policy or a gradually refined policy can be applied in this procedure. The transition profiles are stored in an experience memory D with capacity N_D. The use of experience memory can smooth out learning and avoid oscillations or divergence in the parameters [13]. Based on the stored state transition profiles and Q(s, a) value estimates, the DNN is constructed with a weight set θ trained using standard training algorithms [27].

We summarize the key steps of the offline procedure, which are shown in lines 1-4 of Algorithm 1.

The deep Q-learning technique is adopted for online control based on the offline-trained DNN. More specifically, at each decision epoch t_k of an execution sequence, the system under control is in a state s_k. The DRL agent performs inference using the DNN to derive the Q(s_k, a) estimate of each state-action pair (s_k, a), and uses the ε-greedy policy to select the action with the highest Q(s_k, a) with probability 1 − ε and choose among the other actions randomly with total probability ε. The chosen action is denoted by a_k. At the next decision epoch t_{k+1}, the DRL agent performs Q-value updating based on the total reward (or cost) r_k(s_k, a_k) observed during the time period [t_k, t_{k+1}). At the end of the execution sequence, the DRL agent updates the DNN using the newly observed Q-value estimates, and the updated DNN is utilized in the next execution sequence. More detailed procedures are shown in Algorithm 1.

It can be observed from the above procedure that the DRL framework is highly scalable for large state spaces and can deal with the case of a continuous state space, which is distinct from traditional RL techniques. On the other hand, the DRL

Algorithm 1 Illustration of the General DRL Framework

Offline phase:
1: Extract real data profiles using certain control policies and obtain the corresponding state transition profiles and Q(s, a) value estimates;
2: Store the state transition profiles and Q(s, a) value estimates in the experience memory D with capacity N_D;
3: Iterations may be needed in the above procedure;
4: Pre-train a DNN with features (s, a) and outcome Q(s, a);

Online phase:
5: for each execution sequence do
6:   for each decision epoch t_k do
7:     With probability ε select a random action; otherwise a_k = argmax_a Q(s_k, a), in which Q(s_k, a) is derived (estimated) from the DNN;
8:     Perform system control using the chosen action;
9:     Observe the state transition at the next decision epoch t_{k+1} with new state s_{k+1}; receive reward r_k(s_k, a_k) during the time period [t_k, t_{k+1});
10:    Store transition (s_k, a_k, r_k, s_{k+1}) in D;
11:    Update Q(s_k, a_k) based on r_k(s_k, a_k) and max_{a'} Q(s_{k+1}, a') using the Q-learning updating rule;
12:   end for
13:   Update DNN parameters θ using the new Q-value estimates;
14: end for

framework requires a relatively low-dimensional action space, because in each decision epoch the DRL agent needs to enumerate all possible actions at the current state and perform inference using the DNN to derive the optimal Q(s, a) value estimate, which implies that the action space in the cloud resource allocation framework needs to be reduced.
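For illustration, the ε-greedy action selection of line 7 in Algorithm 1 can be sketched as follows (Python/PyTorch); the q_network is assumed to map a state tensor to a vector with one Q(s, a) entry per enumerable action:

import random
import torch

def select_action(q_network, state, num_actions, epsilon=0.1):
    """epsilon-greedy selection over the DNN's Q-value estimates (a sketch)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                 # explore
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0)).squeeze(0)  # shape: [num_actions]
    return int(torch.argmax(q_values).item())                # exploit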

V. THE GLOBAL TIER OF THE HIERARCHICAL FRAMEWORK – DRL-BASED CLOUD RESOURCE ALLOCATION

In this paper, we develop a scalable hierarchical framework for the overall cloud resource allocation and power management problem, comprising a global tier of cloud VM resource allocation and a local tier of power management. The global tier adopts the emerging DRL technique to handle the high-dimensional state space in the VM resource allocation framework. In order to significantly reduce the action space, we adopt a continuous-time and event-driven decision framework in which each decision epoch coincides with the arrival time of a new VM (job) request. In this way, the action at each decision epoch is simply the target server for VM allocation, which ensures that the available actions are enumerable at each epoch. The continuous-time Q-learning for SMDP [18] is chosen as the underlying RL technique in the DRL framework.

In the following, we present the proposed DRL-based global tier of cloud resource allocation, including novel aspects in both the offline and online phases.


A. DRL-based Global Tier of Resource Allocation

In the DRL-based global tier of cloud resource allocation, the job broker is controlled by the DRL agent, and the server cluster is the environment. The DRL-based cloud resource allocation framework is continuous-time based and event-driven, in that the DRL agent selects an action at the time of each VM (job) request arrival, i.e., a decision epoch. The state space, action space, and reward function of the DRL-based global tier of resource allocation are defined as follows:

State Space: In the DRL-based resource allocation, we define the state at job j's arrival time t_j, s^{t_j}, as the union of the server cluster state at job j's arrival time, s_c^{t_j}, and job j's state s_j, i.e., s^{t_j} = s_c^{t_j} ∪ s_j. All the M servers can be equally divided into K groups, G_1, · · · , G_K. We define the state of the servers in group G_k at time t as g_k^t. We also define the utilization requirement of resource type p of job j as u_{jp}, and the utilization level of resource type p on server m at time t as u_{mp}^t. Therefore, the system state s^{t_j} of the DRL-based cloud resource allocation tier can be represented using the u_{jp}'s and u_{mp}^{t_j}'s as follows:

s^{t_j} = [s_c^{t_j}, s_j] = [g_1^{t_j}, · · · , g_K^{t_j}, s_j] = [u_{11}^{t_j}, · · · , u_{1|D|}^{t_j}, · · · , u_{|M||D|}^{t_j}, u_{j1}, · · · , u_{j|D|}, d_j],

where d_j is the (estimated) job duration. The state space consists of all possible states and has a high dimension.
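As a concrete illustration, the flattened state vector can be assembled as follows (Python/NumPy); the helper name and the example dimensions are assumptions for illustration only:

import numpy as np

def build_state(server_utilization, job_requirements, job_duration):
    """Flatten the global-tier state s^{t_j} into a single vector (a sketch).

    server_utilization: array of shape [num_servers, num_resource_types]
        (utilization level of each resource type on each server, u_{mp}^{t_j}).
    job_requirements: array of shape [num_resource_types] (u_{jp} of job j).
    job_duration: (estimated) duration d_j of job j.
    """
    return np.concatenate([
        server_utilization.reshape(-1),   # u_{mp}^{t_j} for all servers/resources
        job_requirements,                 # u_{jp} for all resource types
        np.array([job_duration]),         # d_j
    ])

# Example: 4 servers, 3 resource types (CPU, memory, disk), made-up numbers.
state = build_state(np.random.rand(4, 3), np.array([0.2, 0.1, 0.05]), 600.0)
print(state.shape)   # (4*3 + 3 + 1,) = (16,)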

Action Space: The action of the DRL agent for cloud resource allocation is defined as the index of the server for VM (job) allocation. The action space for a cluster with M servers is defined as follows:

A = {a | a ∈ {1, 2, · · · , |M|}}

It can be observed that the action space is significantly reduced (to the same size as the total number of servers) by using an event-driven and continuous-time DRL-based decision framework.

Reward: The overall profit of the server cluster equals the total revenue of processing all the incoming jobs minus the total energy cost and the reliability penalty. The income achieved by processing a job decreases with the increase in job latency, including both waiting time in the queue and processing time in the server. Hot spot avoidance is employed in physical servers because overloading situations can easily lead to resource shortage and affect hardware lifetime, thereby undermining data center reliability. Similarly, for the sake of reliability, a cloud provider can introduce anti-colocation objectives to ensure spatial distances between VMs and the use of disjoint routing paths in the data center, so as to prevent a single failure from affecting multiple VMs belonging to the same cloud customer. It is clear that both the load balancing and anti-colocation objectives are partially in conflict with power saving (and possibly job latency), as they try to avoid high VM consolidation ratios. Through a joint consideration of power consumption, job latency, and reliability issues, we define the reward function r(t) that the

agent of the global tier receives as follows:

r(t) = − w_1 · Total_Power(t) − w_2 · Number_VMs(t) − w_3 · Reli_Obj(t),    (4)

where the three terms are the negatively weighted values of the instantaneous total power consumption, the number of VMs in the system, and the reliability objective function value, respectively. Please note that, according to Little's Theorem [28], the average number of VMs pending in the system is proportional to the average VM (job) latency. Therefore, the DRL agent of the global tier optimizes a linear combination of total power consumption, VM latency, and reliability metrics when using the above instantaneous reward function.
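A direct sketch of this reward computation is shown below; the weights and the placeholder numbers in the example are illustrative, since their concrete values are not fixed here:

def global_reward(total_power, num_vms_in_system, reliability_penalty,
                  w1=1.0, w2=1.0, w3=1.0):
    """Instantaneous reward of the global tier, following Eqn. (4)."""
    return -(w1 * total_power
             + w2 * num_vms_in_system
             + w3 * reliability_penalty)

# Example with placeholder numbers: 1200 W of power, 15 pending VMs, penalty 0.3.
print(global_reward(1200.0, 15, 0.3, w1=0.001, w2=0.05, w3=1.0))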

Offline DNN Construction: The DRL-based global tier of cloud resource allocation comprises an offline DNN construction phase and an online deep Q-learning phase. The offline DNN construction phase derives the correlation between the Q-value estimates and each state-action pair (s^{t_j}, a). A straightforward approach is to use a conventional feed-forward neural network to directly output Q-value estimates. This works well for problems with relatively small state and action spaces (e.g., in [13], the number of actions ranges from 2 to 18). Alternatively, we can train multiple neural networks in which each of them estimates the Q values for a subset of actions, for example, the Q values for assigning job j to one subset of servers G_k. However, this procedure will significantly increase the training time of the neural network, by up to a factor of K, compared with the single-network solution. Moreover, this simple multiple-neural-network method cannot well stress the importance of the related servers of a target action, which slows down the training process.

In order to address these issues, we harness the power of representation learning and weight sharing for DNN construction, with the basic procedure shown in Fig. 6. Specifically, we first use an autoencoder to extract a lower-dimensional high-level representation of the server group state g_k^{t_j} for each possible k value, denoted by ĝ_k^{t_j}. Next, for estimating the Q value of the action of allocating the VM to servers in G_k, the neural network Sub-Q_k takes g_k^{t_j}, s_j, and all ĝ_{k'}^{t_j} (k' ≠ k) as input features. The dimension difference between g_k^{t_j} and ĝ_{k'}^{t_j} (k' ≠ k) reflects the importance of the target server group's own state compared with the other server groups, which determines the degree of reduction in the state space.

In addition, we introduce weight sharing among all K autoencoders, as well as among all Sub-Q_k's. Weight sharing has been successfully applied to convolutional neural networks and image processing [29]. It is used in our model for the following reasons: 1) With weight sharing, any training sample can be used to train all the Sub-Q_k's and autoencoders; without weight sharing, only the training samples allocated to a server in G_k could be used to train Sub-Q_k. This usually leads to a higher degree of scalability. 2) Weight sharing reduces the total number of required parameters and the training time.
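A minimal PyTorch sketch of this weight-sharing structure is given below. It assumes equal-sized server groups, keeps only the encoder half of the autoencoder, and loosely follows the layer sizes reported later in the experiments (30/15-unit ELU encoder, 128-ELU Sub-Q hidden layer); the exact architecture and the autoencoder pre-training are simplified away:

import torch
import torch.nn as nn

class SharedSubQNet(nn.Module):
    """One encoder and one Sub-Q network reused for every server group (a sketch)."""
    def __init__(self, num_groups, group_dim, job_dim, actions_per_group, code_dim=15):
        super().__init__()
        self.num_groups = num_groups
        # Single encoder shared by all K groups (the decoder used for
        # autoencoder pre-training is omitted here).
        self.encoder = nn.Sequential(nn.Linear(group_dim, 30), nn.ELU(),
                                     nn.Linear(30, code_dim), nn.ELU())
        # Single Sub-Q network shared by all K groups.
        in_dim = group_dim + (num_groups - 1) * code_dim + job_dim
        self.sub_q = nn.Sequential(nn.Linear(in_dim, 128), nn.ELU(),
                                   nn.Linear(128, actions_per_group))

    def forward(self, group_states, job_state):
        # group_states: [batch, K, group_dim]; job_state: [batch, job_dim]
        codes = self.encoder(group_states)                    # [batch, K, code_dim]
        q_per_group = []
        for k in range(self.num_groups):
            others = [codes[:, j] for j in range(self.num_groups) if j != k]
            feats = torch.cat([group_states[:, k]] + others + [job_state], dim=1)
            q_per_group.append(self.sub_q(feats))             # Q values for group k
        return torch.cat(q_per_group, dim=1)                  # [batch, K * actions_per_group]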

Online Deep Q-Learning: The online deep Q-learningphase, which is an integration of the DRL discussed in Section


Fig. 6. Deep neural network for Q-value estimation with the autoencoder and weight sharing scheme (K = 2): the state s^{t_j} = [g_1^{t_j}, g_2^{t_j}, s_j] is fed through two weight-shared autoencoders and two weight-shared sub-networks Sub-Q_1 and Sub-Q_2, which produce the Q-value estimates for all actions.

III and the Q-learning for SMDP, is utilized for action selection and value function/DNN updating. At each decision epoch t_j, it performs inference using the DNN (with the autoencoder incorporated) to derive the Q(s^{t_j}, a) value estimate of each state-action pair (s^{t_j}, a), and uses the ε-greedy policy for action selection. At the next decision epoch t_{j+1}, it performs value updating using Eqn. (2) of Q-learning for SMDP. At the end of each execution sequence, it updates the DNN with autoencoders based on the updated Q-value estimates.

B. Convergence and Computational Complexity Analysis of the Global Tier

It has been proven that the Q-learning technique gradually converges to the optimal policy under a stationary Markov decision process (MDP) environment and a sufficiently small learning rate [17]. Hence, the proposed DRL-based resource allocation framework will converge to the optimal policy when (i) the environment evolves as a stationary, memoryless SMDP and (ii) the DNN is sufficiently accurate to return the action associated with the optimal Q(s, a) estimate. In reality, the stationarity property is difficult to satisfy, as actual cloud workloads are time-variant. However, simulation results will demonstrate the effectiveness of the DRL-based global tier in realistic, non-stationary cloud environments.

The proposed DRL-based global tier of resource allocation exhibits low online computational complexity: the computational complexity is proportional to the number of actions (number of servers) at each decision epoch (upon the arrival of each VM), which is insignificant for cloud computing systems.

VI. THE LOCAL TIER OF THE HIERARCHICAL FRAMEWORK – RL-BASED POWER MANAGEMENT FOR SERVERS

In this section, we describe the proposed local tier of power management in the hierarchical framework, which is responsible for controlling the turning ON/OFF of local servers in order to simultaneously reduce power consumption and average job latency. The local tier includes the workload predictor using the LSTM network and the adaptive power manager based on the model-free, continuous-time Q-learning for SMDP. Details are described next.

Fig. 7. Unrolled LSTM neural network for workload prediction at the local server (input hidden layer, LSTM cell layer, and output hidden layer, unrolled from the initial state to the final state).

A. Workload Predictor Using the Long Short-Term Memory (LSTM) Network

The workload predictor is responsible for providing a partial observation of the actual future workload characteristics (i.e., the inter-arrival times of jobs) to the power manager, which is utilized for defining system states in the RL algorithm. The workload characteristics of the servers are in fact the result of the global tier of VM resource allocation. Previous works on workload prediction [30,31] assume that a linear combination of previous idle times (or request inter-arrival times) may be used to infer the future ones, which is not always true. For example, one very long inter-arrival time can ruin a set of subsequent predictions. In order to overcome this effect and achieve much higher prediction accuracy, in this work we adopt the LSTM neural network [15] for workload prediction, which is a recurrent neural network architecture suitable for the prediction of time-series sequences. The LSTM network can eliminate the vanishing gradient problem and capture long-term dependencies in the time series, and works efficiently when the back propagation through time (BPTT) technique is adopted [32].

The LSTM network structure is shown in Fig. 7. The network has three layers: the input hidden layer, the LSTM cell layer, and the output hidden layer. In the proposed LSTM network for workload prediction, we predict the next job inter-arrival time based on the past 35 inter-arrival times as the look-back time steps. The size of both the input and the output of an LSTM cell is 1, according to the dimension of the measured trace. We set 30 hidden units in each LSTM cell, and all LSTM cells share weights. In our LSTM-based workload prediction model, we discretize the output inter-arrival time prediction into n predefined categories, corresponding to n different states in the RL algorithm utilized in the power manager.

In the training process, we first initialize the weights of the input layer and the output layer from a normal distribution with a mean value of 0 and a standard deviation of 1. The bias for both layers is set to a constant value of 0.1. The initial state of the LSTM cell is set to 0 for all cells. In response to the back-propagated errors, the network is updated using the Adam optimizer [27], a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates from


estimates of the first and second moments of the gradients [33]. The state of the LSTM cell and the weights are trained to minimize the propagated errors.
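A compact PyTorch sketch of this predictor and one training step is given below; the number of discretization categories, the classification head, and the placeholder data are assumptions beyond what is stated above:

import torch
import torch.nn as nn

class InterArrivalLSTM(nn.Module):
    """Sketch of the LSTM workload predictor: 35 look-back inter-arrival times,
    30 hidden units per cell, and a linear output layer over n categories.
    (The paper's N(0, 1) weight init and 0.1 bias init are omitted here.)"""
    def __init__(self, look_back=35, hidden_size=30, num_categories=8):
        super().__init__()
        self.input_layer = nn.Linear(1, 1)                           # input hidden layer
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.output_layer = nn.Linear(hidden_size, num_categories)   # output hidden layer
        self.look_back = look_back

    def forward(self, x):
        # x: [batch, look_back, 1] past inter-arrival times
        h = self.input_layer(x)
        out, _ = self.lstm(h)
        return self.output_layer(out[:, -1])   # logits over the n categories

model = InterArrivalLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, as in the paper
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random placeholder data.
x = torch.rand(16, 35, 1)              # batch of 16 look-back windows
y = torch.randint(0, 8, (16,))         # target categories (placeholders)
loss = loss_fn(model(x), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()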

B. Distributed Dynamic Power Management for Local Servers

In this subsection, we describe the proposed adaptive power manager for local servers using model-free RL, based on the workload prediction results. The continuous-time Q-learning for SMDP is adopted as the RL technique in order to reduce the decision frequency and overheads. The power management for local servers is performed in a distributed manner. First, we provide a number of key definitions and notations. Let t_j^G (j = 1, 2, 3, ...) denote the time when the j-th job (VM) arrives at the local machine. Suppose that we are currently at time t with t_j^G ≤ t < t_{j+1}^G. Note that the power manager has no exact information on t_{j+1}^G at the current time t. Instead, it only receives an estimation of the inter-arrival time from the LSTM-based workload predictor.

The power manager monitors the following state parameters of the power-managed system (server):

1) The machine power state M(t), which can be active, idle, sleep, etc.

2) The estimated next job inter-arrival time from the workload predictor.

To apply RL techniques to DPM frameworks, we define decision epochs, i.e., when new decisions are made and updates for the RL algorithm are executed. In our case, the decision epochs coincide with one of the following three cases:

1) The machine enters the idle state (usage = 0) and J_Q(t) = 0 (no waiting job in the buffer queue).

2) The machine is in the idle state and a new job arrives.
3) The machine is in the sleep state and a new job arrives.

The proposed RL-based DPM operates as follows. At each

decision epoch, the power manager finds itself in one of the three aforesaid conditions and makes a decision. If it finds itself in case (1), it uses the RL-based timeout policy. A list of timeout periods, including immediate shutdown (timeout value = 0), serves as the action set A in this case, and the power manager learns to choose the optimal action using the RL technique. If it is in case (2), it starts executing the new job and turns from idle to active. If it is in case (3), the server turns active from the sleep state and starts the new job as soon as it is ready. No updates need to be performed in cases (2) and (3) because there is only one action to choose. As pointed out in reference [34], for non-Markov environments the optimal policy when the machine is idle is often a timeout policy, wherein the machine is turned off to sleep if it has been idle for more than a specified timeout period.

In this local RL-based DPM framework, we use the reward rate defined as follows:

r(t) = −w P(t) − (1 − w) J_Q(t)    (5)

The reward rate function is the negative of a linearly-weighted combination of the power consumption P(t) of the server and the number of jobs J_Q(t) buffered in the service queue, where w is the weighting factor. This is a reasonable reward rate because, as the authors of [35] have pointed out, the average number of buffered service requests/jobs is proportional to the average latency of each job, which is defined as the average time for each job between the moment it is generated and the moment the server finishes processing it, i.e., it includes the queueing time plus the execution time. This observation is based on Little's Law [36]. In this way, the value function Q(s, a) for each state-action pair (s, a) is a combination of the expected total discounted energy consumption and the total latency experienced by all jobs. Since the total number of jobs and the total execution time are fixed, the value function is equivalent to a linear combination of the average power consumption and the average per-job latency. The relative weight w between the average power consumption and latency can be adjusted to obtain the power-latency trade-off curve. The detailed procedure of the RL algorithm is given in Algorithm 2.

Algorithm 2 The RL-based DPM framework in the local tier.

1: At each decision epoch t^D, denote the RL state as s(t^D) for the power-managed system.
2: for each decision epoch t_k^D do
3:   With probability ε select a random action from the action set A (timeout values); otherwise a_k = argmax_a Q(s(t_k^D), a).
4:   Apply the chosen timeout value a_k.
5:   If a job arrives during the timeout period, turn active to process the job until the job queue is empty; otherwise turn to sleep until the next job arrives.
6:   Then we arrive at the next decision epoch t_{k+1}^D.
7:   Observe the state transition at the next decision epoch t_{k+1}^D with new state s(t_{k+1}^D), and calculate the reward rate during the time period [t_k^D, t_{k+1}^D).
8:   Update Q(s(t_k^D), a_k) based on the reward rate and max_{a'} Q(s(t_{k+1}^D), a') based on the updating rule of Q-learning for SMDP (Eqn. (2)).
9: end for
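For illustration, one decision step of this local DPM can be sketched as follows (Python), reusing the SMDPQLearner sketch given after Eqn. (2); the timeout candidates, the state encoding, and the function names are assumptions consistent with the description above, not the paper's exact implementation:

import random

TIMEOUTS = [0.0, 30.0, 60.0, 90.0]       # candidate timeout values in seconds (assumed)

def local_reward_rate(power_watts, jobs_in_queue, w=0.5):
    """Reward rate of Eqn. (5): negative weighted sum of power and queue length."""
    return -(w * power_watts + (1.0 - w) * jobs_in_queue)

def dpm_choose_timeout(learner, machine_state, predicted_category, epsilon=0.1):
    """epsilon-greedy choice of a timeout index at a case-(1) decision epoch."""
    state = (machine_state, predicted_category)   # observation-domain RL state
    if random.random() < epsilon:
        return random.randrange(len(TIMEOUTS))
    return max(range(len(TIMEOUTS)), key=lambda a: learner.Q[(state, a)])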

VII. EXPERIMENTAL RESULTS

In this section, we first describe the simulation setup. Then, from the perspectives of power consumption and job latency, we compare our proposed hierarchical framework for resource allocation and power management with the DRL-based framework for resource allocation only, and with the baseline round-robin VM allocation policy. Finally, we evaluate our proposed framework by investigating the optimal trade-off between power consumption and latency.

A. Simulation Setups

Without loss of generality, in this paper we assume a homogeneous server cluster. The peak power of each server is P(100%) = 145 W, and the idle power consumption is P(0%) = 87 W [21]. We set the server power mode transition times T_on = 30 s and T_off = 30 s. The number of machines in each cluster, M, is set to 30 and 40, respectively. Please note that our proposed framework is applicable to cases with more servers as well.


We use real data center workload traces from the Google cluster-usage traces [16], which provide the server cluster usage data over a month-long period in May 2011. The extracted job traces include the job arrival time (absolute time value), the job duration, and the resource requests of each job, which include CPU, memory, and disk requirements (normalized by the resources of one server). All the extracted jobs have a duration between 1 minute and 2 hours, and are ordered increasingly based on their arrival time. To simulate the workload on a 30-40 machine cluster, we split the traces into 200 segments, and each segment contains about 100,000 jobs, corresponding to the workload of an M-machine cluster in a week.

In this work, for the global tier, we perform offline training, including the experience memory initialization and the training of the autoencoder, using the whole Google cluster traces. To obtain the DNN model in the global tier of the proposed framework, we use workload traces for five different M-machine clusters. We generate four new state transition profiles using the ε-greedy policy [20] and store the transition profiles in the memory before sampling the minibatch for training the DNN. In addition, we clip the gradients to make their norm values less than or equal to 10. The gradient clipping method was introduced to DRL in [37].

In the DNN construction, we use two layers of fully-connected Exponential Linear Units (ELUs) with 30 and 15 neurons, respectively, to build the autoencoder. Each Sub-Q_k mentioned in Section V contains a single fully-connected hidden layer with 128 ELUs, and a fully-connected linear layer with a single output for each valid action in the group. The number of groups varies between 2 and 4. We use β = 0.5 as the Q-learning discount rate in the simulations.

B. Comparison with Baselines

We simulate five different one-week job traces using the proposed hierarchical framework with different parameters and compare the performance against the DRL-based resource allocation framework and the round-robin allocation policy (denoted as the baseline) in terms of the (accumulative) power consumption and the accumulated job latency.

Fig. 8 shows the experimental results when the number of machines in the cluster is M = 30. As shown in Fig. 8(a), the hierarchical framework and the DRL-based resource allocation result in longer latency than the baseline round-robin method. This is because in the round-robin method jobs are dispatched evenly to each machine, and generally the jobs do not need to wait in each machine's job queue. However, from the perspective of energy (accumulative power consumption) usage shown in Fig. 8(b), the round-robin method gives a larger increase rate than the hierarchical framework or the DRL-based resource allocation, which implies that the round-robin method has a larger power consumption. On the other hand, compared with the DRL-based resource allocation, the proposed hierarchical framework shows a reduced job latency, as shown in Fig. 8(a), as well as a lower energy usage, as shown in Fig. 8(b). From Fig. 8(b), we can also observe that the energy curve of the proposed framework is always lower than that of the DRL-based resource allocation or the

Fig. 8. Comparison among the proposed hierarchical framework, the DRL-based resource allocation framework, and the round-robin baseline in the case of M = 30: (a) accumulated job latency versus the number of jobs; (b) energy usage versus the number of jobs.

round-robin baseline, which means our proposed hierarchical framework achieves the lowest power consumption among the three. As shown in Table I, given a number of jobs of 95,000, compared with the round-robin method, our proposed hierarchical framework saves 53.97% of the power and energy. It also saves 16.12% of the power/energy and 16.67% of the latency compared with the DRL-based resource allocation framework.

Similarly, in the case of 40 machines in the cluster, as shown in Table I, given a number of jobs of 95,000, the proposed hierarchical framework consumes 59.99% less power and energy compared with the round-robin baseline, as well as 17.89% less power and energy compared with the DRL-based resource allocation framework. The latency of the hierarchical framework is reduced by 13.32% compared with the DRL-based

TABLE I
SUMMARY OF THE SERVER CLUSTER PERFORMANCE METRICS (ACCUMULATED ENERGY, ACCUMULATED LATENCY, AND AVERAGE POWER CONSUMPTION) WITH JOB NUMBER = 95,000

Round-Robin     Energy (kWh)   Latency (10^6 s)   Power (W)
M = 30          441.47         85.20              2627.79
M = 40          561.13         85.20              3340.06

DRL-based       Energy (kWh)   Latency (10^6 s)   Power (W)
M = 30          242.25         109.73             1441.96
M = 40          273.41         108.76             1627.44

Hierarchical    Energy (kWh)   Latency (10^6 s)   Power (W)
M = 30          203.21         92.53              1209.58
M = 40          224.51         94.26              1336.37


Fig. 9. Comparison among the proposed hierarchical framework, the DRL-based resource allocation framework, and the round-robin baseline in the case of M = 40: (a) accumulated job latency versus the number of jobs; (b) energy usage versus the number of jobs.

resource allocation framework. From Figs. 8(a) and 9(a), we can observe that the latency increase rates of the corresponding hierarchical frameworks in the two figures differ very little, and the same holds for the corresponding DRL-based resource allocation frameworks. In Fig. 8(b) and Fig. 9(b), the trends of energy usage of the corresponding hierarchical frameworks and DRL-based resource allocation frameworks remain close, which means their power consumption remains nearly the same as the number of machines increases. However, the increase rate of energy usage (power) for the round-robin method becomes larger as the number of machines M increases. Thus, when the number of jobs increases, in a larger-size server cluster the round-robin baseline system consumes an unacceptable amount of power/energy. These facts imply that the DRL-based resource allocation framework is capable of managing a large-size server cluster with an enormous number of jobs. With the help of the local and distributed power manager, the proposed hierarchical framework achieves shorter latency and lower power/energy consumption.

C. Trade-off of Power/Energy Consumption and Average Latency

Next, we explore the power (energy) and latency trade-off curves of the proposed framework, as shown in Fig. 10. The latency and energy usage are averaged over jobs. We first set up three baselines that combine the DRL-based resource allocation tier with a local tier using different fixed timeout values, as discussed in Section VI-B.

Fig. 10. Trade-off curves between average latency and average energy consumption of the proposed hierarchical framework and the baseline systems.

The timeout values of the fixed-timeout baselines are set to 30 s, 60 s, and 90 s, respectively. Note that when the timeout value is fixed, the baseline system cannot reach every point in the energy-latency space, so the baseline curves are incomplete. We can observe that the proposed hierarchical framework achieves the smallest area with respect to the power and latency axes, which indicates that it provides a better trade-off than any fixed-timeout baseline. For instance, compared with the baseline with a fixed timeout of 60 s, the maximum average latency saving at the same energy usage is 14.37%, while the maximum power/energy saving at the same average latency is 16.13%; compared with the baseline with a fixed timeout of 90 s, the maximum average latency saving at the same energy usage is 16.16%, while the maximum average power/energy saving at the same latency is 16.20%.
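
For reference, a fixed-timeout local tier corresponds to the classical timeout-based DPM heuristic: a server is put to sleep once it has been idle longer than the timeout, and it pays a wake-up penalty when the next job arrives. The sketch below is a minimal, simplified illustration of such a policy under our own assumptions; the class and parameter names (`FixedTimeoutPowerManager`, `timeout_s`, `wakeup_delay_s`) are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of a fixed-timeout dynamic power management (DPM) baseline.
# The server sleeps after `timeout_s` seconds of idleness and incurs a fixed
# wake-up delay on the next job arrival. Numbers and names are illustrative.
class FixedTimeoutPowerManager:
    def __init__(self, timeout_s: float = 60.0, wakeup_delay_s: float = 5.0):
        self.timeout_s = timeout_s
        self.wakeup_delay_s = wakeup_delay_s
        self.asleep = False
        self.idle_since = None  # time at which the server last became idle

    def on_idle(self, now: float) -> None:
        """Called when the server's job queue becomes empty."""
        self.idle_since = now

    def on_tick(self, now: float) -> None:
        """Periodic check: go to sleep once the idle period exceeds the timeout."""
        if (not self.asleep and self.idle_since is not None
                and now - self.idle_since >= self.timeout_s):
            self.asleep = True  # modeled as (near-)zero power while asleep

    def on_job_arrival(self, now: float) -> float:
        """Return the extra latency this job incurs due to waking the server."""
        self.idle_since = None
        if self.asleep:
            self.asleep = False
            return self.wakeup_delay_s
        return 0.0
```

Because each such baseline commits to a single timeout value, it realizes only a limited set of operating points, which is consistent with the partial baseline curves in Fig. 10, whereas the adaptive RL-based power manager can move along the full energy-latency trade-off.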

VIII. CONCLUSION

In this paper, a hierarchical framework is proposed to solve the resource allocation and power management problems in cloud computing systems. The proposed hierarchical framework comprises a global tier for VM resource allocation to the servers and a local tier for power management of the local servers. Besides the enhanced scalability and reduced state/action space dimensions, the proposed hierarchical framework enables the local power management of servers to be performed in an online and distributed manner, which further increases the degree of parallelism and reduces the online computational complexity. The emerging DRL technique is adopted to solve the global tier problem, while an autoencoder and a novel weight sharing structure are adopted for acceleration. For the local tier, an LSTM-based workload predictor helps a model-free RL-based power manager determine suitable actions for the servers. Experiment results using actual Google cluster traces show that the proposed hierarchical framework significantly reduces power consumption and energy usage compared with the baseline while achieving similar average latency. Meanwhile, the proposed framework achieves the best trade-off between latency and power/energy consumption in a server cluster.

REFERENCES

[1] A. K. Mishra, J. L. Hellerstein, W. Cirne, and C. R. Das, "Towards characterizing cloud backend workloads: insights from google compute clusters," ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 4, pp. 34–41, 2010.

[2] A. Khan, X. Yan, S. Tao, and N. Anerousis, "Workload characterization and prediction in the cloud: A multiple time series approach," in 2012 IEEE Network Operations and Management Symposium. IEEE, 2012, pp. 1287–1294.

[3] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, Cambridge, 1998, vol. 1.

[4] X. Dutreilh, S. Kirgizov, O. Melekhova, J. Malenfant, N. Rivierre, and I. Truck, "Using reinforcement learning for autonomic resource allocation in clouds: towards a fully automated workflow," in ICAS 2011, The Seventh International Conference on Autonomic and Autonomous Systems, 2011, pp. 67–74.

[5] E. Barrett, E. Howley, and J. Duggan, "Applying reinforcement learning towards automating resource allocation and application scalability in the cloud," Concurrency and Computation: Practice and Experience, vol. 25, no. 12, pp. 1656–1674, 2013.

[6] A. Galstyan, K. Czajkowski, and K. Lerman, "Resource allocation in the grid using reinforcement learning," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3. IEEE Computer Society, 2004, pp. 1314–1315.

[7] G. Tesauro, R. Das, H. Chan, J. Kephart, D. Levine, F. Rawson, and C. Lefurgy, "Managing power consumption and performance of computing systems using reinforcement learning," in Advances in Neural Information Processing Systems, 2007, pp. 1497–1504.

[8] X. Lin, Y. Wang, and M. Pedram, "A reinforcement learning-based power management framework for green computing data centers," in 2016 IEEE International Conference on Cloud Engineering (IC2E). IEEE, 2016, pp. 135–138.

[9] L. Benini, A. Bogliolo, and G. De Micheli, "A survey of design techniques for system-level dynamic power management," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 3, pp. 299–316, 2000.

[10] G. Dhiman and T. S. Rosing, "Dynamic power management using machine learning," in Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design. ACM, 2006, pp. 747–754.

[11] H. Jung and M. Pedram, "Dynamic power management under uncertain information," in 2007 Design, Automation & Test in Europe Conference & Exhibition. IEEE, 2007, pp. 1–6.

[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.

[14] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[16] C. Reiss, J. Wilkes, and J. L. Hellerstein, "Google cluster-usage traces: format + schema," Google Inc., Mountain View, CA, USA, Technical Report, Nov. 2011, revised 2012.03.20. Posted at http://code.google.com/p/googleclusterdata/wiki/TraceVersion2.

[17] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3–4, pp. 279–292, 1992.

[18] S. J. Bradtke and M. O. Duff, "Reinforcement learning methods for continuous-time markov decision problems," Advances in Neural Information Processing Systems, vol. 7, p. 393, 1995.

[19] Y. Wang, Q. Xie, A. Ammari, and M. Pedram, "Deriving a near-optimal power management policy using model-free reinforcement learning and bayesian classification," in Proceedings of the 48th Design Automation Conference. ACM, 2011, pp. 41–46.

[20] R. S. Sutton, "Generalization in reinforcement learning: Successful examples using sparse coarse coding," Advances in Neural Information Processing Systems, pp. 1038–1044, 1996.

[21] X. Fan, W.-D. Weber, and L. A. Barroso, "Power provisioning for a warehouse-sized computer," ACM SIGARCH Computer Architecture News, vol. 35, no. 2, pp. 13–23, 2007.

[22] D. Meisner, B. T. Gold, and T. F. Wenisch, "Powernap: eliminating server idle power," in ACM SIGPLAN Notices, vol. 44, no. 3. ACM, 2009, pp. 205–216.

[23] J. T. Connor, R. D. Martin, and L. E. Atlas, "Recurrent neural networks and robust time series prediction," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 240–254, 1994.

[24] Y. Tan, W. Liu, and Q. Qiu, "Adaptive power management using reinforcement learning," in Proceedings of the 2009 International Conference on Computer-Aided Design. ACM, 2009, pp. 461–467.

[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[26] J. Rao, X. Bu, C.-Z. Xu, L. Wang, and G. Yin, "Vconf: a reinforcement learning approach to virtual machines auto-configuration," in Proceedings of the 6th International Conference on Autonomic Computing. ACM, 2009, pp. 137–146.

[27] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[28] D. Gross, Fundamentals of queueing theory. John Wiley & Sons, 2008.

[29] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.

[30] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, "Predictive system shutdown and other architectural techniques for energy efficient programmable computation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 1, pp. 42–55, Mar. 1996. [Online]. Available: http://dx.doi.org/10.1109/92.486080

[31] C.-H. Hwang and A. C. H. Wu, "A predictive system shutdown method for energy saving of event-driven computation," in 1997 Proceedings of IEEE International Conference on Computer Aided Design (ICCAD), Nov. 1997, pp. 28–32.

[32] N. Kalchbrenner, I. Danihelka, and A. Graves, "Grid long short-term memory," arXiv preprint arXiv:1507.01526, 2015.

[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980

[34] T. Simunic, L. Benini, P. Glynn, and G. D. Micheli, "Event-driven power management," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 7, pp. 840–857, Jul. 2001.

[35] Q. Qiu, Y. Tan, and Q. Wu, "Stochastic modeling and optimization for robust power management in a partially observable system," in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2007, pp. 779–784.

[36] J. D. Little and S. C. Graves, "Little's law," in Building Intuition. Springer, 2008, pp. 81–100.

[37] Z. Wang, N. de Freitas, and M. Lanctot, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.