

Incentive-based demand response for smart grid with reinforcement learning and deep neural network

Renzhi Lu, Seung Ho Hong⁎

Department of Electronic Engineering, Hanyang University, Ansan 15588, Republic of Korea

HIGHLIGHTS

• Propose an incentive-based demand response algorithm with artificial intelligence.

• Reinforcement learning is employed to obtain the optimal incentive rates.

• Achieve real-time performance with the aid of deep neural network.

• Customer diversity is taken into account by provision of different incentive rates.

• Service provider payment under cases without and with demand response is compared.

ARTICLE INFO

Keywords: Artificial intelligence; Reinforcement learning; Deep neural network; Incentive-based demand response; Smart grid

ABSTRACT

Balancing electricity generation and consumption is essential for the smooth operation of power grids. Any mismatch between energy supply and demand would increase costs to both the service provider and customers and may even cripple the entire grid. This paper proposes a novel real-time incentive-based demand response algorithm for smart grid systems with reinforcement learning and deep neural network, aiming to help the service provider to purchase energy resources from its subscribed customers to balance energy fluctuations and enhance grid reliability. In particular, to overcome future uncertainties, a deep neural network is used to predict the unknown prices and energy demands. After that, reinforcement learning is adopted to obtain the optimal incentive rates for different customers, considering the profits of both the service provider and customers. Simulation results show that the proposed incentive-based demand response algorithm induces demand-side participation, promotes the profitability of both the service provider and customers, and improves system reliability by balancing energy resources, which can be regarded as a win-win strategy for both the service provider and customers.

1. Introduction

Nowadays, continued growth in energy demand imposes major burdens on power systems, and a mismatch between electricity demand and supply would cause high costs for both the service provider (SP) and customers (CUs) and may even cripple the whole grid [1,2]. This requires the SP to purchase large amounts of energy resources to deal with power fluctuations and enhance grid reliability. Traditionally, conventional generators are used to mitigate power imbalances or shortages; however, such an approach involves heavy investment and high carbon emissions [3]. The smart grid represents a modern power grid that is efficient, reliable, economical, and sustainable in electricity generation, transmission, distribution, and consumption. In smart grid systems, with modern information and communication technology (ICT) and advanced metering infrastructure (AMI), demand response

(DR) has become an efficient way to provide balancing energy resources to strengthen stability, as well as to increase the economic efficiency of the entire power system [4–6]. According to the United States (US) Department of Energy, DR refers to "a tariff or program established to motivate changes in electric use by end-use customers in response to changes in the price of electricity over time, or to give incentive payments designed to induce lower electricity usage at times of high market prices or when grid reliability is jeopardized" [7].

DR programs are broadly divided into two categories: price-based and incentive-based. Price-based DR motivates CUs to change their energy consumption patterns according to time-varying electricity prices. Incentive-based DR provides fixed or time-varying incentives to participating CUs for reducing their electricity usage [8]. Price-based DR is non-dispatchable and thus offers less flexibility from the SP's perspective. Besides, price-based DR imposes wholesale electricity price risks on CUs; risk-averse CUs may hesitate to join such programs [9].

https://doi.org/10.1016/j.apenergy.2018.12.061
Received 22 August 2018; Received in revised form 29 November 2018; Accepted 11 December 2018

⁎ Corresponding author. E-mail address: [email protected] (S.H. Hong).

Applied Energy 236 (2019) 937–949

0306-2619/ © 2018 Elsevier Ltd. All rights reserved.


On the contrary, incentive-based DR is dispatchable and is thus much more flexible with regard to aiding the SP to attain the required DR resources [10]. Also, incentive-based DR is a reward program and is thus more attractive to CUs compared with price-based DR [11]. About 93% of the peak load reductions from existing DR resources in the US is achieved via incentive-based DR programs [12].

This work proposes a novel real-time incentive-based DR program with reinforcement learning (RL) and deep neural network (DNN) in a hierarchical electricity market, aiming to help the SP to purchase energy resources from its different subscribed CUs to balance power fluctuations and enhance grid reliability. Specifically, due to the inherent nature of the real-time electricity market, the SP can only access the price from the wholesale electricity market and the energy demands from its different CUs for the current hour. To deal with future uncertainties, DNN is used to predict the unknown prices and energy demands. Each time a new price and energy demand are obtained, the DNN predicts future prices and energy demands, and this process is repeated hourly to the end of the day. Furthermore, in cooperation with the forecasted future prices and energy demands, RL is adopted to obtain the optimal incentive rates for different CUs, considering the profits of both the SP and CUs. Employing RL to determine the incentive rates provides several advantages. First, RL is model-free. The SP does not require a predefined rule or prior knowledge about how to select the incentive rates, but instead, the SP discovers the optimal incentive rates by "learning" from direct interaction with the CUs. Second, RL is adaptive. The SP can autonomously acquire the incentive rates in an online fashion adapted to different CUs, considering uncertainties and flexibilities of the energy management system. Third, RL is concise. The entire computational process of the algorithm is based on a look-up table, which is much easier to implement in the real world.

To the best of our knowledge, this is the first work to address real-time incentive-based DR with RL and DNN. The main contributions of this paper are summarized as follows:

(1) This paper proposes a novel real-time incentive-based DR algorithm with artificial intelligence (AI) techniques, considering the profitability of both the SP and CUs in a hierarchical electricity market.

(2) Different from conventional optimization methods, this study adopts RL to obtain the optimal incentive rates. RL is adaptive and model-free, enabling the SP to autonomously determine the incentive rates in an online manner without knowing the system dynamics.

(3) In contrast to day-ahead DR, this work achieves real-time performance. To overcome future uncertainties, DNN is used to predict the unknown wholesale electricity prices and CU load patterns.

(4) This paper also takes into account the diversity of CUs, in that different CUs are provided with different incentive rates, since each CU has its own specific requirements and reactions to load adjustment when it enrolls in the incentive-based DR program.

(5) The SP payments under two different cases, without and with DR, are compared, indicating that the presented incentive-based DR algorithm can help the SP procure energy resources at significantly lower cost.

The remainder of this paper is organized as follows. Section 2 offers a complete literature review. In Section 3, mathematical formulations of the system model are presented. Sections 4 and 5 introduce the RL used to obtain optimal incentive rates and the DNN employed for future price and load forecasting, respectively. The detailed algorithm combining DNN and RL is described in Section 6. Section 7 discusses the simulation results to evaluate the proposed algorithm. In Section 8, a guideline for real system implementation is provided. Finally, conclusions and future work are given in Section 9.

2. Literature review

Until now, a great deal of work has been devoted to the design of incentive-based DR mechanisms for smart grids. The study described in [13] presented a coupon incentive-based DR for load serving entities to promote their profit, CU surplus, and social welfare, by providing retail CU incentives to induce demand reduction. In [14], an incentive-based DR algorithm was proposed for residential CUs to reduce network peaks, where the incentive rates were paid based on the load shifts and voltage improvements in the day-ahead electricity market. The authors of [15] presented a day-ahead risk management scheme for the retail electricity provider to reduce its financial losses, by employing an incentive-based DR program with its CUs as well as utilizing distributed generation and energy storage systems. Similarly, the work in [16] developed an incentive-based DR program to manage the market price risks and load variations by the electricity

Nomenclature

Abbreviations

SP service provider
CU customer
ICT information and communication technology
AMI advanced metering infrastructure
DR demand response
RL reinforcement learning
DNN deep neural network
AI artificial intelligence
MDP Markov decision process
GO grid operator
MILP mixed integer linear programming
EV electric vehicle
MAE mean absolute error
MAPE mean absolute percentage error
OpenADR open automated demand response

Variables and parameters

n index for customer
N total number of customers
h index for hour
H final hour of a day
p_h wholesale electricity price at hour h
ΔE_{n,h} demand reduction of customer n at hour h
λ_{n,h} incentive rate for customer n at hour h
λ_min lower bound of incentive rate
λ_max upper bound of incentive rate
φ_{n,h} dissatisfaction cost of customer n at hour h
ρ weighting factor between customer incentive income and discomfort
E_{n,h} energy demand of customer n at hour h
ξ_h elasticity at hour h
K_min lower bound of demand reduction
K_max upper bound of demand reduction
μ_n customer preference parameter of dissatisfaction cost function
ω_n auxiliary coefficient of dissatisfaction cost function
i index for iteration in Q-learning
γ discount factor in Q-learning
θ learning rate in Q-learning
ε tuning parameter in ε-greedy policy


retailer, wherein energy generation and storage units were also considered. The study described in [17] considered an alternative incentive-based DR program and its application to a discrete manufacturing facility based on a mixed integer linear programming (MILP) approach, in which the load reduction patterns were generated a priori, outlining the potential load reduction during the DR process. In [18], the authors developed an incentive-based DR contract between an aggregator and a single CU to explore individual rationality and asymptotic incentive compatibility. The studies of [19,20] proposed incentive-based DR models from the perspective of a grid operator (GO), where the models spanned three hierarchical levels including the GO, multiple SPs, and CUs; a Stackelberg game was then adopted to capture the interactions among these different participants.

However, despite the above-mentioned efforts on incentive-based DR, three significant limitations still remain. First, most studies explored DR in the context of the day-ahead market, but real-time DR affords a greater potential for balancing energy resources, given the highly dynamic nature of, and weaker forecasting constraints on, energy generation and consumption [21]. Second, most studies considered only a single type of load, e.g., either industrial or residential load, without considering the diversity of real-world CUs, while individual CUs have their own specific requirements and reactions when they engage in incentive-based DR programs. Third, the implementation methods still rely on conventional techniques such as MILP or game theory, which are model-based approaches and lack both scalability and flexibility. When building a model-based controller, one must carefully select the appropriate models and estimate the model parameters. As a result, different CUs are expected to have different model parameters and even different models, and a large-scale implementation of model-based controllers requires a robust and stable approach that is able to identify the appropriate model and the corresponding model parameters, which makes setting up the model more challenging, considering the heterogeneity of real-world CUs and the differences in their behavior patterns. Also, the models are usually only estimates of reality and thus may be unrealistic compared to actual energy systems, as the performance of model-based methods is greatly restricted by operator knowledge and experience. Given this, it is desirable to develop innovative incentive-based DR mechanisms for smart grid systems.

Over the past few years, with the rapid evolution of AI, much interest has been generated in adopting AI for decision-making problems. Two outstanding research advancements are AlphaGo and AlphaGo Zero announced by Google, which have illustrated that RL has superb decision-making ability due to its unique features of being "model-free" and having "no need for prior domain knowledge". RL is a computational approach in which an agent acts to maximize its expected cumulative rewards via iterative interaction with an uncertain, unknown, and complex environment [22]. Fig. 1 shows the general architecture of RL; the agent and its environment interact over a sequence of time steps. At each time step, the agent observes the state of the environment and selects an action. As a consequence of this action, the agent receives a numerical reward and enters a new state derived from the current state and the chosen action. RL maps perceived environmental states to the probabilities of selecting any possible actions, to maximize the total amount of rewards over the long run. In this regard, RL endows the agent (a decision-maker) with a powerful capacity to learn how to act optimally when explicit and accurate system modeling is hard or even impossible to obtain.

Some research has also been attempted on adopting RL to solve DR problems in energy management. For example, in [23], a fully automated energy management system based on RL was formulated to schedule residential devices, with the aim of minimizing user dissatisfaction costs and energy bills. The authors of [24,25] applied RL-based algorithms to energy trading games among different strategic players with day-ahead price information, enabling each player to learn a strategy to trade energy so as to maximize the average revenue. In [26,27], RL algorithms were employed to obtain energy management

plans for electric vehicles (EVs) aiming at minimizing energy losses, in which the EV control policy was acquired through RL. Similarly, the authors of [28,29] presented two multi-agent learning-based DR algorithms to enable more efficient energy consumption in smart homes, wherein the electrical devices (i.e., EVs) were also controlled by RL. The studies in [30,31] introduced batch RL algorithms to schedule controllable loads, such as a heat pump thermostat or electric water heater, under a day-ahead pricing scheme without any expert knowledge of the system dynamics. In [32,33], the authors demonstrated a prediction-based multi-agent RL approach to minimize the effect of non-stationarity in smart grid scenarios, in which prediction and pattern change detection abilities were integrated into multi-agent RL. The work in [34] proposed a dynamic price-based DR scheme from the SP perspective via RL methodology, where the scheme was modeled as a Markov decision process (MDP), and Q-learning was then adopted to determine the optimal retail prices. Recently, RL has become a promising tool to realize DR in energy management systems that must deal with continuous changes in several factors [35], e.g., intermittent renewable resources, dynamic electricity prices, and energy consumption amounts. However, although there have been several successful examples illustrating the effectiveness of RL in smart grid systems, most existing studies focused on day-ahead price-based DR; applying RL to incentive-based DR has not been investigated yet. Thus, in order to bridge the aforementioned research gap, a new real-time incentive-based DR algorithm utilizing RL and DNN methodologies is presented in this work.

3. Problem formulation

Fig. 2 shows the general architecture of a hierarchical electricity market including a GO, an SP, and CUs. This study focuses on the incentive-based DR algorithm between the SP and CUs. When the SP enters the wholesale electricity market to sell electricity to its retail CUs, it must fulfill its responsibility to address the peak demands and energy imbalances that might be caused by the CUs. Incentive-based DR is now widely regarded as a promising approach for mitigating supply-demand mismatches.

Next, the models of each participant are introduced.

3.1. Service provider model

An SP is located between the GO and CUs, operating an incentive-based DR program with its CUs to encourage them to sell demand reduction in exchange for incentive payments. On the other side, the SP also participates in the wholesale electricity market to sell the procured energy resources (aggregated load reductions from its CUs) to the GO. Therefore, as a profit-seeking organization, the objective of the SP is to maximize the revenue obtained by trading energy resources with the GO, while minimizing the incentive payments to its CUs, as shown below:

Fig. 1. Reinforcement learning setup.


$$\max \sum_{n=1}^{N} \sum_{h=1}^{H} \left( p_h \cdot \Delta E_{n,h} - \lambda_{n,h} \cdot \Delta E_{n,h} \right) \tag{1}$$

$$\lambda_{\min} \leqslant \lambda_{n,h} \leqslant \lambda_{\max} \tag{2}$$

In (1), n ∈ {1, 2, 3, …, N} indicates CU n, and N is the total number of CUs; h ∈ {1, 2, 3, …, H} represents hour h, and H is the final hour of a day (i.e., H = 24 if the price is updated every hour). p_h is the price from the wholesale electricity market at hour h; ΔE_{n,h} and λ_{n,h} are the demand reduction offered by, and the incentive rate paid to, CU n at hour h, respectively. In (2), λ_max and λ_min are the upper and lower bounds of the incentive rate λ_{n,h}, derived by mutual agreement or regulatory requirement between the SP and CUs, keeping the incentive rates fair and protecting each party's profit [19,36].
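As a minimal illustration of how the SP objective of Eqs. (1) and (2) can be evaluated, the following Python sketch computes the SP profit for given wholesale prices, incentive rates, and demand reductions; the array shapes and all numerical values are placeholders, not data from the paper.

```python
import numpy as np

# Hypothetical dimensions: N = 3 customers, H = 24 hours.
N, H = 3, 24

# Placeholder inputs: wholesale prices p_h, incentive rates lambda_{n,h},
# and demand reductions Delta E_{n,h}.
p = np.full(H, 30.0)              # wholesale price at each hour ($/MWh)
lam = np.full((N, H), 10.0)       # incentive rate paid to each CU ($/MWh)
delta_e = np.full((N, H), 0.05)   # demand reduction offered by each CU (MWh)

lam_min, lam_max = 9.0, 30.0      # incentive-rate bounds of Eq. (2)

# Eq. (2): the incentive rates must stay within the agreed bounds.
assert np.all((lam >= lam_min) & (lam <= lam_max))

# Eq. (1): SP profit = revenue from reselling the aggregated reduction to the
# grid operator minus the incentive payments to the customers.
sp_profit = np.sum((p - lam) * delta_e)
print(f"SP profit: {sp_profit:.2f} $")
```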

3.2. Customer model

CUs enroll in the incentive-based DR program. When informed of the incentive rate by the SP, CUs try to maximize their incentive incomes by decreasing their energy consumption. However, the curtailed energy can cause discomfort for the CUs, and such discomfort is commonly modeled as a dissatisfaction cost. Thus, the goal of the CU is to maximize the weighted combination of incentive income and incurred dissatisfaction cost, as follows:

$$\max \sum_{h=1}^{H} \left[ \rho \cdot \lambda_{n,h} \cdot \Delta E_{n,h} - (1-\rho) \cdot \phi_{n,h}(\Delta E_{n,h}) \right] \tag{3}$$

$$\Delta E_{n,h} = E_{n,h} \cdot \xi_h \cdot \frac{\lambda_{n,h} - \lambda_{\min}}{\lambda_{\min}} \tag{4}$$

$$K_{\min} \leqslant \Delta E_{n,h} \leqslant K_{\max} \tag{5}$$

In (3), the first term represents the incentive income of CU n at hour h for providing demand reduction ΔE_{n,h}, and the second term is the incurred discomfort φ_{n,h}(ΔE_{n,h}). ρ ∈ [0, 1] is a weighting factor indicating the relative importance of the CU incentive income versus discomfort. In (4), ΔE_{n,h} is the demand reduction amount of CU n at hour h [37], in which E_{n,h} indicates the energy demand of CU n at hour h, and ξ_h is the elasticity coefficient at hour h that denotes the ratio of energy demand change to incentive rate variation [38]. In (5), K_min and K_max are the ranges of demand reduction when the incentive rates are in effect, as determined by the CU depending on individual characteristics or requirements [20].

The dissatisfaction cost function φ_{n,h}(ΔE_{n,h}) models the degree of discomfort that a CU may experience when decreasing its energy demand, and is defined as a convex function that increases remarkably toward larger demand reductions [5].

$$\phi_{n,h}(\Delta E_{n,h}) = \frac{\mu_n}{2} (\Delta E_{n,h})^2 + \omega_n \cdot \Delta E_{n,h} \tag{6}$$

$$\mu_n > 0 \tag{7}$$

$$\omega_n > 0 \tag{8}$$

In (6), μ_n and ω_n are CU-dependent parameters: μ_n is a CU preference that varies among different CUs [5], and ω_n is an auxiliary coefficient of the dissatisfaction cost function; a larger ω_n is associated with higher discomfort [19]. μ_n reflects the attitude of a CU with respect to electricity demand reduction: a greater value of μ_n indicates that the CU prefers less demand reduction so as to suffer less discomfort, and vice versa.
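To make the customer model concrete, the sketch below evaluates the demand reduction of Eq. (4), the dissatisfaction cost of Eq. (6), and one hourly term of the customer objective of Eq. (3); the parameter values in the example call are illustrative only.

```python
def demand_reduction(e_nh, xi_h, lam_nh, lam_min):
    """Eq. (4): load reduction induced by incentive rate lam_nh."""
    return e_nh * xi_h * (lam_nh - lam_min) / lam_min

def dissatisfaction(delta_e, mu_n, omega_n):
    """Eq. (6): convex dissatisfaction cost of reducing delta_e."""
    return 0.5 * mu_n * delta_e ** 2 + omega_n * delta_e

def cu_hourly_objective(lam_nh, e_nh, xi_h, lam_min, mu_n, omega_n, rho):
    """One hourly term of Eq. (3): weighted income minus weighted discomfort."""
    delta_e = demand_reduction(e_nh, xi_h, lam_nh, lam_min)
    return rho * lam_nh * delta_e - (1.0 - rho) * dissatisfaction(delta_e, mu_n, omega_n)

# Example with illustrative numbers (not taken from the paper's case study).
print(cu_hourly_objective(lam_nh=12.0, e_nh=1.5, xi_h=0.3,
                          lam_min=9.0, mu_n=3.0, omega_n=5.0, rho=0.5))
```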

3.3. Objective function

In this study, the profits of both the SP and CUs are considered, as shown below:

$$\max \sum_{n=1}^{N} \sum_{h=1}^{H} \left\{ \underbrace{p_h \cdot \Delta E_{n,h} - \lambda_{n,h} \cdot \Delta E_{n,h}}_{\text{SP's profit}} + \underbrace{\rho \cdot \lambda_{n,h} \cdot \Delta E_{n,h} - (1-\rho) \cdot \phi_{n,h}(\Delta E_{n,h})}_{\text{CU's profit}} \right\} \tag{9}$$

4. Reinforcement learning for decision making

RL is a machine-learning algorithm allowing an agent to automatically determine the ideal behavior in a stochastic environment, so as to maximize the cumulative reward. In this section, we first formulate the decision-making problem as a discrete finite MDP; afterwards, Q-learning is adopted to obtain the optimal policy.

4.1. The Markov decision process

Fig. 3 illustrates the decision-making problem with the RL methodology, where the SP serves as the agent and the CUs are the environment; the incentive rate denotes the action, the state is represented by the CU energy reduction, and the objective function (9) defined in Section 3 indicates the reward.

RL is commonly studied through an MDP framework [39], which exhibits the Markov property that the state transitions depend only on the current state and the current action taken, and are independent of all prior environmental states or agent actions. In the MDP model of this study, the reward (profit) and state (energy reduction) depend only on the action (incentive rate) at the corresponding hour, not on the historical data. The key components to be modeled in the MDP include: a discrete hour h ∈ H, a state s_{n,h} ∈ S(ΔE_{n,h}), an action a_{n,h} ∈ A(λ_{n,h}), and a reward r(s_{n,h}, a_{n,h}) ∈ R(ΔE_{n,h}, λ_{n,h}). One episode of this MDP constitutes a finite sequence of hours, states, actions, and rewards:

1, s_{n,1}, a_{n,1}, r(s_{n,1}, a_{n,1}); 2, s_{n,2}, a_{n,2}, r(s_{n,2}, a_{n,2}); …; h, s_{n,h}, a_{n,h}, r(s_{n,h}, a_{n,h}); …; H, s_{n,H}, a_{n,H}, r(s_{n,H}, a_{n,H}).
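A minimal sketch of one way this finite MDP could be represented, assuming the incentive rates (actions) and demand reductions (states) are discretized into a small number of levels; the discretization granularity and bounds shown here are assumptions rather than values taken from the paper.

```python
import numpy as np

# Hypothetical discretization of the action and state spaces.
lam_min, lam_max = 9.0, 30.0
actions = np.linspace(lam_min, lam_max, 8)   # candidate incentive rates lambda_{n,h}
k_min, k_max = 0.0, 0.45
states = np.linspace(k_min, k_max, 10)       # candidate demand reductions Delta E_{n,h}

def reward(p_h, lam_nh, delta_e, mu_n, omega_n, rho):
    """Hourly reward of Eq. (9): SP profit plus weighted CU profit."""
    sp_term = (p_h - lam_nh) * delta_e
    cu_term = rho * lam_nh * delta_e - (1 - rho) * (0.5 * mu_n * delta_e**2 + omega_n * delta_e)
    return sp_term + cu_term
```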

Taking the long-term return into account, the agent must consider not only current but also future rewards. The further into the future a reward lies, the more heavily it is discounted. Thus, the future rewards are

Fig. 2. Hierarchical electricity market model.

Fig. 3. Reinforcement learning for optimal incentive rate making.


multiplied by a discount factor γ. Then, the cumulative discounted reward from hour h can be expressed as:

$$\begin{aligned} R_{n,h} &= r(s_{n,h}, a_{n,h}) + \gamma \cdot r(s_{n,h+1}, a_{n,h+1}) + \gamma^2 \cdot r(s_{n,h+2}, a_{n,h+2}) + \cdots + \gamma^{H-h} \cdot r(s_{n,H}, a_{n,H}) \\ &= r(s_{n,h}, a_{n,h}) + \gamma \cdot R_{n,h+1} \end{aligned} \tag{10}$$

where γ ∈ [0, 1] indicates the relative importance of future versus present rewards. Specifically, when γ = 0, the agent is shortsighted and considers only the current reward, while a factor of 1 makes the agent strive for future rewards. If the agent wishes to balance current and future rewards, γ is set to a fraction between 0 and 1.

We use υ to denote a stochastic policy that maps states to actions, υ(s_{n,h}, a_{n,h}): S(ΔE_{n,h}) × A(λ_{n,h}). The goal of the MDP problem is to find a policy υ that maximizes the cumulative long-term rewards. An optimal policy υ* is the solution of the MDP, i.e., a policy that always chooses the actions maximizing the discounted reward for every state.

4.2. The Q-learning algorithm

Q-learning is a model-free, off-policy RL algorithm [40]. It can be used to find an optimal action selection policy υ* (a sequence of incentive rates for each CU in this work) for any given MDP. One strength of Q-learning is that it can learn directly from the environment without knowing a model of the environment. Additionally, Q-learning can handle problems featuring stochastic transitions and rewards, without requiring any adaptations.

The basic principle of Q-learning is the assignment of a Q-value Q(s_{n,h}, a_{n,h}) to each state-action pair at hour h, and the updating of this value at each iteration in a manner that optimizes the result. The optimal Q-value Q_{υ*}(s_{n,h}, a_{n,h}) denotes the maximum discounted future reward r(s_{n,h}, a_{n,h}) when taking action a_{n,h} at state s_{n,h} while continuing to follow the optimal policy υ*, which satisfies the Bellman equation based on Eq. (10), shown below:

$$Q_{\upsilon^*}(s_{n,h}, a_{n,h}) = r(s_{n,h}, a_{n,h}) + \gamma \cdot \max Q(s_{n,h+1}, a_{n,h+1}) \tag{11}$$

The Q-value Q(s_{n,h}, a_{n,h}) is stored in a state-action table, with each cell reflecting the performance of a specific action in a specific state. Every hour, the agent performs an action, and the Q-value of the corresponding cell is updated based on the Bellman equation, Eq. (11), as follows:

$$Q(s_{n,h}, a_{n,h}) \leftarrow Q(s_{n,h}, a_{n,h}) + \theta \left[ r(s_{n,h}, a_{n,h}) + \gamma \cdot \max Q(s_{n,h+1}, a_{n,h+1}) - Q(s_{n,h}, a_{n,h}) \right] \tag{12}$$

where θ ∈ [0, 1] is the learning rate representing to what extent the new information overrides the old Q-values. A value of 0 indicates that the agent learns nothing, exploiting prior knowledge exclusively, whereas a value of 1 denotes that the agent takes into account only the current estimate, ignoring prior knowledge to explore possibilities. To trade off newly acquired information against old information, θ should be set to a decimal between 0 and 1.

In the Q-learning mechanism, the agent directly interacts with the dynamic environment by executing actions. Then, the agent obtains a reward and moves to a new environmental state based on the current state and the selected action. Learning occurs via trial-and-error during this course, and in such a learning process the Q-value of every state-action pair is stored and updated. After an adequate number of iterations, the Q-value gradually converges to its maximum. Detailed proofs of convergence are provided in [41–43].

Since the Q-value represents the maximum reward when taking action a_{n,h} at state s_{n,h}, we can directly construct the optimal policy

$$\upsilon^* = \arg\max Q(s_{n,h}, a_{n,h}) \tag{13}$$

from which the optimal incentive rates for each CU are acquired.
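The tabular Q-value update of Eq. (12) and the greedy policy extraction of Eq. (13) can be sketched as follows; keeping one Q-table per hour and treating hour H as terminal is one possible interpretation of the per-hour formulation above, and the table sizes are placeholders.

```python
import numpy as np

n_states, n_actions, H = 10, 8, 24
theta, gamma = 0.1, 0.95                     # learning rate and discount factor

# One Q-table per hour: Q[h, s, a], initialized arbitrarily (here, zeros).
Q = np.zeros((H, n_states, n_actions))

def update_q(h, s, a, r, s_next):
    """Eq. (12): temporal-difference update of the state-action value."""
    # At the final hour there is no successor state, so the target is the reward.
    target = r if h == H - 1 else r + gamma * np.max(Q[h + 1, s_next])
    Q[h, s, a] += theta * (target - Q[h, s, a])

def greedy_action(h, s):
    """Eq. (13): index of the best-known incentive rate for state s at hour h."""
    return int(np.argmax(Q[h, s]))
```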

5. Deep neural network for price and load forecasting

Load and price prediction have become popular topics in electrical engineering over the past few years, and several implementation methods have been attempted. Recently, the neural network (NN) has been widely used to forecast wholesale electricity prices [44] and CU energy demands [45,46]. The NN approach is comparatively easy to implement and shows good performance due to its ability to handle non-linear relationship problems more accurately [47], while being less time-consuming than other techniques, such as the ARIMA model [48,49]. The NN is inspired by the human brain, featuring a number of highly interconnected neurons, to approximate complex nonlinear problems when the input-output relationship is neither well defined nor easily computable [50]. An NN is organized in sequential layers, including an input layer, at least one hidden layer, and an output layer; the layers are interconnected by numeric weights (W_i) and biases (b_i), as shown in Fig. 4. An NN consisting of four or more layers is referred to as a DNN [45].

Next, detailed information about the DNN model used in this work is introduced.

5.1. Input and output parameters

Adequate NN input selection is critical for successful forecasting. The input data must contain maximally correlated historical data that are appropriately styled and formatted. In this paper, the inputs of the DNN are chosen based on correlation analysis and some empirical guidelines described in [44]. Detailed inputs for price and load forecasting are listed in Table 1; the outputs are the forecasted prices and loads.

Fig. 4. Deep neural network model for price and load forecasting.

Table 1. Inputs to the deep neural network.

Index | Inputs for price forecasting | Inputs for load forecasting
1 | Day of week (1–7) | Day of week (1–7)
2 | Hour stage of day (1–24) | Hour stage of day (1–24)
3 | Is holiday (0 or 1) | Is holiday (0 or 1)
4 | Price of hour h−1 | Load of hour h−1
5 | Price of hour h−2 | Load of hour h−2
6 | Price of hour h−3 | Load of hour h−3
7 | Price of hour h−24 | Load of hour h−24
8 | Price of hour h−25 | Load of hour h−25
9 | Price of hour h−26 | Load of hour h−26
10 | Price of hour h−48 | Load of hour h−48
11 | Price of hour h−49 | Load of hour h−49
12 | Price of hour h−50 | Load of hour h−50


In general, price and load parameters depend on several factors, particularly near- and long-term historical electricity prices and load demands. The relevant near-term price and load data include those from 1, 2, and 3 h earlier. The relevant long-term price and load data include those for the current hour, as well as 1 and 2 h earlier, from 1 and 2 days ago. Additionally, prices and loads can be expected to vary by the hour of the day, and by whether it is a workday or a weekend day (note the first three inputs of Table 1). Such data can aid the prediction of future prices and loads.
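A sketch of how the 12 inputs of Table 1 might be assembled into a single feature vector for a given forecasting hour, assuming the historical prices (or loads) are stored in a chronological list; the helper name and arguments are illustrative, not an interface defined in the paper.

```python
def build_features(day_of_week, hour_of_day, is_holiday, history, t):
    """Assemble the 12 DNN inputs of Table 1 for forecasting hour t.

    history : chronological list of hourly prices (or loads), where
              history[t - k] is the value k hours before hour t.
    """
    lags = [1, 2, 3, 24, 25, 26, 48, 49, 50]   # near- and long-term lags of Table 1
    return [day_of_week, hour_of_day, int(is_holiday)] + [history[t - k] for k in lags]

# Usage example: features for forecasting the next hour of a price series.
# prices = [...]; x_t = build_features(5, 14, False, prices, t=len(prices))
```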

5.2. Training and prediction

Forecasting with a DNN involves two steps: training and prediction. Training is normally performed in a supervised manner. The training set is given by the historical data, containing both inputs and the corresponding desired outputs. During the training process, the DNN constructs an input-output mapping, adjusting the weights and biases at each iteration to minimize the differences between the produced and desired outputs [44]. Such error minimization is repeated until an acceptable criterion is attained. In this work, the sigmoid function [51] served as the transfer function for each layer, and the weight and bias adjustment in each layer was trained by the Levenberg-Marquardt back-propagation algorithm [52]. After the minimum performance requirements are met, the trained DNN serves as a predictor [53]. Although accurate price and load predictions are essential, our main goal in this study is to develop an efficient and scalable decision-making algorithm for the SP using RL, with price and load predictors of low computational complexity that can be easily implemented. In recent years, price and load forecasting with DNN has appeared in many relevant studies. For a detailed description of DNN basics, please refer to [44,51,52].
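The paper trains its DNN with the MATLAB NN toolbox using sigmoid transfer functions and Levenberg-Marquardt back-propagation. As a rough, non-equivalent illustration of the same supervised workflow, the sketch below fits a multi-layer perceptron with the 30-20-10 hidden-layer structure reported in Section 7.2 using scikit-learn, which offers a logistic (sigmoid) activation but not the Levenberg-Marquardt solver; the training data are random placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder training data: each row is the 12-element feature vector of
# Table 1, each target is the price (or load) of the forecasted hour.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 12))
y_train = rng.random(1000)

# Hidden layers 30-20-10 as reported in Section 7.2; logistic (sigmoid)
# activation mirrors the transfer function used in the paper.  scikit-learn
# does not provide Levenberg-Marquardt, so a quasi-Newton solver is used here.
model = MLPRegressor(hidden_layer_sizes=(30, 20, 10), activation="logistic",
                     solver="lbfgs", max_iter=2000)
model.fit(X_train, y_train)

# After training, the model serves as the hour-ahead predictor.
x_next = rng.random((1, 12))
forecast = model.predict(x_next)
```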

5.3. Accuracy assessment

Mean absolute error (MAE) and mean absolute percentage error (MAPE) are used to assess the prediction accuracy of the DNN models [44,51,52]; they are the most common indicators in load and price forecasting studies and are defined as shown below:

$$MAE = \frac{1}{M} \sum_{m=1}^{M} \left| RTV_m - RTV_m^f \right| \tag{14}$$

$$MAPE = \frac{1}{M} \sum_{m=1}^{M} \frac{\left| RTV_m - RTV_m^f \right|}{RTV_m} \times 100 \tag{15}$$

where M is the total number of forecasted values used to calculate the error, RTV_m is the actual value, and RTV_m^f is the forecasted value.
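The two accuracy metrics of Eqs. (14) and (15) translate directly into code, for example:

```python
import numpy as np

def mae(actual, forecast):
    """Eq. (14): mean absolute error."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(forecast)))

def mape(actual, forecast):
    """Eq. (15): mean absolute percentage error (actual values must be nonzero)."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs(actual - forecast) / actual) * 100.0
```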

6. Algorithm with deep neural network and reinforcement learning

In this work, we propose a real-time incentive-based DR algorithm for smart grid systems with RL and DNN. To overcome future uncertainties, DNN is used to predict the future wholesale electricity prices and CU load patterns. Then, RL is adopted to obtain the optimal incentive rates for different CUs. Table 2 shows the detailed DR algorithm combining DNN and RL.

To better understand the algorithm in Table 2, the flowchart in Fig. 5 shows how the methodologies are implemented to obtain the optimal incentives for the CUs. Specifically, the algorithm starts to run as the day begins (i.e., h = 1). The system first initializes the incentive rate bounds [λ_min, λ_max], the ranges of load reduction [K_min, K_max], the dissatisfaction cost parameters μ_n and ω_n, the elasticity ξ_h, and the weighting parameter ρ in lines 1–4 of Table 2.

After setting these parameters, at each hour h the SP updates the inputs of the price and load forecasting models (i.e., hour of the day, historical electricity prices, and historical load demands), then uses the trained DNN models to predict prices and energy demands for the subsequent hours (lines 6 to 17 in Table 2).

Next, in cooperation with the forecasted future prices and energy demands, the SP computes the optimal incentive rates for each CU based on the RL methodology in an iterative way, from lines 19 to 29 of the algorithm: at each iteration i, the SP observes the energy information of the CU and then selects an incentive rate action λ_{n,h}. When selecting such an action, the SP not only exploits what it already knows to obtain the reward, but also explores for better action selections in the future. Exploration means the evaluation of the values of all available actions, and exploitation refers to the utilization of current knowledge about action values to maximize the reward. The most common way to realize the exploration and exploitation mechanism is via an ε-greedy policy (0 ≤ ε ≤ 1), where the SP either selects a random incentive rate with probability ε or chooses an incentive rate with probability 1 − ε by reference to the Q-value table. Here, random selection indicates that the SP chooses an incentive rate randomly within the incentive rate bounds at that state, and selection from the Q-value table indicates that the SP selects the incentive rate for which the current Q-value is "maximum" at the given state. However, the current "maximum" Q-value could be overridden in future iterations. This gives the system some degree of randomness to explore the incentive rate space, but forbids totally random exploration.
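The exploration-exploitation step described above can be isolated in a few lines; the Q-table row indexing follows the sketch in Section 4, and ε = 0.2 matches the value used later in Section 7.3.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_row, epsilon=0.2):
    """Pick an incentive-rate index: explore with probability epsilon,
    otherwise exploit the currently best-known action for this state."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # random rate within the bounds
    return int(np.argmax(q_row))               # rate with the maximum Q-value

# Usage: a_idx = epsilon_greedy(Q[h, s])
```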

After choosing the incentive rate, the SP gains an immediate reward r(s_{n,h}, a_{n,h}), observes the CU's next state s_{n,h+1}, and updates the Q-value Q(s_{n,h}, a_{n,h}) using Eq. (12) (lines 24 and 25 of the algorithm); this process is repeated until state s_{n,h+1} is terminal. Thereafter, the SP compares the current and last Q-values to confirm whether convergence to the maximum Q-value has occurred; if not, the system moves to the next iteration i + 1 and repeats the process.

The iteration termination criterion on line 27 of Table 2 is calculated as |Q_i − Q_{i−1}| ≤ φ; if the difference between Q_i and Q_{i−1} is no more than φ, the Q-value has converged to its maximum; φ is a system-dependent parameter [54].

Table 2. Algorithm combining deep neural network and reinforcement learning.

Algorithm: real-time incentive-based DR with DNN and RL

1: Initialize: incentive rate bounds λ_min, λ_max, elasticity ξ_h,
2:            demand reduction ranges K_min, K_max,
3:            dissatisfaction cost parameters μ_n, ω_n,
4:            and weighting factor ρ.
5: For each hour do
6:   %% DNN for price forecasting
7:   DofWeek ← updateDayofWeek()
8:   HofDay ← updateHourStageofDay()
9:   IsHoliday ← updateHoliday()
10:  PrData ← updateHistoricalPriceData()
11:  FuPrices ← PretrDNN(DofWeek, HofDay, IsHoliday, PrData)
12:  %% DNN for load forecasting
13:  DofWeek ← updateDayofWeek()
14:  HofDay ← updateHourStageofDay()
15:  IsHoliday ← updateHoliday()
16:  LoData ← updateHistoricalLoadData()
17:  FuLoads ← PretrDNN(DofWeek, HofDay, IsHoliday, LoData)
18:  %% RL for decision making
19:  Initialize Q-value arbitrarily
20:  Repeat (for each iteration i)
21:    Initialize s_{n,h}
22:    Repeat (for each step in i)
23:      Choose a_{n,h} from current s_{n,h} using the ε-greedy policy
24:      Take action a_{n,h}, observe r(s_{n,h}, a_{n,h}) and next s_{n,h+1}
25:      Q(s_{n,h}, a_{n,h}) ← Q(s_{n,h}, a_{n,h}) + θ[r(s_{n,h}, a_{n,h}) + γ·max Q(s_{n,h+1}, a_{n,h+1}) − Q(s_{n,h}, a_{n,h})]
26:    Until s_{n,h+1} is terminal
27:  Until the Q-value has converged, such that |Q_i − Q_{i−1}| ≤ φ
28:  Output the optimal actions
29:  (Only the actions for the current hour will be executed)
30: End for


Finally, the SP obtains the optimal incentive rates for the upcoming H − h hours, but only the optimal incentive rates for the current hour h are sent to the CUs.

Then, the system moves to the next hour h + 1 and repeats the above procedure until it reaches the final hour H.
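Putting the pieces together, the hourly loop of Table 2 and Fig. 5 can be summarized schematically as follows; the four callables stand in for the trained DNN predictors, the Q-learning routine, and the communication with the CUs, and are placeholders rather than functions defined in the paper.

```python
def run_day(forecast_prices, forecast_loads, optimize_incentives, dispatch, H=24):
    """One day of the real-time incentive-based DR scheme (schematic sketch).

    The callables are placeholders: two trained DNN predictors, the Q-learning
    routine of Section 4, and the channel for announcing rates to the CUs.
    """
    for h in range(H):
        # DNN step: forecast the remaining hours using the newest observations.
        future_prices = forecast_prices(h)
        future_loads = forecast_loads(h)

        # RL step: obtain optimal incentive rates for hours h..H-1.
        incentives = optimize_incentives(h, future_prices, future_loads)

        # Only the rates for the current hour h are sent to the customers.
        dispatch(h, incentives[h])
```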

7. Simulation results

This section presents the numerical simulation results to evaluate the performance of the proposed real-time incentive-based DR scheme.

7.1. Simulation scenario configuration

For ease of illustration, simulations were conducted based on one SP and three CUs.

Fig. 5. Flowchart for implementing the algorithm in Table 2.


Table 3 shows the related parameters of the three CUs, derived from [9,16,19], including the dissatisfaction cost parameters μ_n and ω_n, and the load reduction ranges [K_min, K_max] and incentive rate bounds [λ_min, λ_max], which were set to a certain coefficient of the load demand E_{n,h} and of the minimum wholesale electricity price p_min, respectively. Table 4 lists the values of the elasticity ξ_h obtained from [55,56], reflecting different responses in off-/mid-/on-peak hours. For the weighting factor ρ between CU incentive income and dissatisfaction cost, it was assumed that each CU featured the same weight at any given time; 0.1, 0.5, and 0.9 were taken as examples in this study to demonstrate the simulation results [9].

All parameter values in the simulation scenario are specific to this case, and they can change according to the different characteristics of the SP or CUs; however, this does not distort the analysis of the simulation results. Next, the performance of the forecasting model with DNN and of the DR algorithm with RL are examined.

7.2. Performance of the forecasting model with DNN

The price and load data used to train and test the DNN model were obtained from the PJM [57] electricity market. Specifically, the data from January 1, 2017 to February 21, 2018 were used to train the model, which then predicted the prices and loads for February 22–28, 2018. The MATLAB NN toolbox was selected to train the DNN model due to its flexibility and simplicity. After several accuracy tests, the proposed model finally included five layers, containing one input layer with 12 neurons; three hidden layers with 30, 20, and 10 neurons; and one output layer with 1 neuron.

Figs. 6 and 7 compare the forecasted and actual wholesale price curves and load demand patterns for February 22–28, 2018, respectively, where the blue line represents the actual results and the red line denotes the forecasted results. From the figures, it can be seen that the trend of the forecasted results is quite similar to the actual results. The corresponding values of MAE and MAPE are listed in Table 5. Compared with the forecasting results in [44,58,59], the MAEs and MAPEs in this work are lower, indicating that the DNN models in this paper can make reasonable and accurate price and load predictions. For the effects of CU scale on prediction results, please refer to the existing studies [46–49].

7.3. Performance of the DR algorithm with RL

To demonstrate the performance of the proposed DR scheme with RL, the detailed simulations for February 23, 2018 are discussed. In order to keep the agent visiting all the state-action pairs and learning new knowledge from the system, the tuning parameter ε of the ε-greedy policy is set to 0.2. To update Q(s_{n,h}, a_{n,h}) from the experimental experience, we set the discount factor γ to 0.95 and the learning rate θ to 0.1. Currently, there are no general guidelines available in the literature on how to select these parameter values in the Q-learning algorithm; different application scenarios will be set with different values. After executing the simulation, the agent converged to the maximum Q-value, as shown in Fig. 8. It can be seen that, at the outset, the agent had limited knowledge and chose poor actions yielding lower Q-values; however, with each successive iteration, the Q-values increased as the agent gained experience and discovered the actions yielding higher Q-values via trial-and-error, finally achieving the maximum Q-values, at which point the optimal incentive rates for each CU are acquired.

Once the optimal incentive rates are obtained, each CU's optimal energy reduction is also determined based on Eq. (4). In the following, the outcomes and corresponding analysis of the incentive-based DR algorithm with RL are presented.

7.3.1. Optimal incentive rate and load reduction of one CU with different ρ

Taking CU3 as an example, Fig. 9 illustrates the optimal incentive rates and energy reductions (represented by the green shaded areas) of a single CU with different weighting factors ρ.

Looking into Fig. 9, we can see that the trend of the incentive rates was similar to that of the wholesale prices: during low wholesale price hours, the SP offered smaller incentive rates to the CUs, inducing them to provide little demand reduction; whereas, when the wholesale prices were high, indicating peak power usage that may cause fluctuations between energy supply and demand in the electricity market, the SP increased the incentive rate to procure more energy resources to balance power mismatches and enhance grid operational flexibility, relieving the burden on the grid to invest in peak capacity and improving system reliability. The incentive rate bounds of Eq. (2) were never exceeded. Comparing on-peak hours (17:00–21:00) with off-peak hours (1:00–6:00 and 22:00–24:00), the incentive rates in off-peak hours were much smaller than those in on-peak hours, but the energy reductions were nearly the same. This is because during off-peak hours the electricity demand was more elastic (the value of the elasticity ξ_h during these periods was larger), so the CU would rather sell more demand reduction to the SP even with low incentives, based on Eq. (4).

In general, the total quantity of demand reduction was proportional to the incentive rates, i.e., higher incentive rates induced larger demand reductions. Clearly, the average incentive rate of a CU with a greater weighting factor ρ (e.g., ρ = 0.9) was bigger than that with a smaller ρ (e.g., ρ = 0.1), which results in a larger total demand reduction. The reason is obvious: a CU with a greater weighting factor ρ is relatively indifferent to the incurred dissatisfaction; the CU puts more weight on the incentive income and less weight on the dissatisfaction cost, and as such it prefers to sell more load reduction to the SP to gain more incentive income, which in turn leads to a larger average incentive rate (see Fig. 9c).

Table 3. Three customers' related parameters.

    | μ_n | ω_n | K_min | K_max       | λ_min     | λ_max
CU1 | 3.0 | 5.0 | 0     | 0.3·E_{n,h} | 0.3·p_min | p_min
CU2 | 5.0 |     |       |             |           |
CU3 | 8.0 |     |       |             |           |

Table 4. Elasticity in different hours.

               | Off-peak (1:00–6:00, 22:00–24:00) | Mid-peak (7:00–16:00) | On-peak (17:00–21:00)
Elasticity ξ_h | 0.5                               | 0.3                   | 0.1

Fig. 6. Wholesale price forecasting for February 22–28, 2018.



On the contrary, a smaller weighting factor ρ indicates that the CU cares more about the incurred dissatisfaction cost; hence, it was reluctant to curtail demand, and the average incentive rate was also smaller (see Fig. 9a).

7.3.2. Optimal incentive rate and load reduction of different CUs with the same ρ

To further illustrate how each CU responded to the SP incentives, the optimal incentive rates and energy reductions are also plotted for each individual CU in Fig. 10.

In this case, the value of the weighting factor ρ was assumed to be 0.1, indicating that CU discomfort is more important. As shown in Fig. 10, the load demand reduction provided by CU3 was the lowest of the three CUs, which is understandable: a CU with a larger value of the dissatisfaction cost parameter μ_n (e.g., CU3, μ_n = 8.0) reflects a more conservative attitude towards providing demand reduction compared to the other CUs, so the demand reduction provided was much less than its available quantity in Eq. (5) in order to suffer less discomfort (see Fig. 10c). In contrast, a CU with a smaller dissatisfaction cost parameter μ_n (e.g., CU1, μ_n = 3.0) accepted a greater energy reduction when the proposed incentive-based DR algorithm was in effect (see Fig. 10a).

7.3.3. Financial analyses of the SP and different CUs with different ρ

When applying an incentive-based DR approach in the electricity market, it is important to maintain the financial balance of all participants during the resource trading process. Table 6 provides the financial analyses of the SP and CUs. The optimal incentive rates and load reductions of each CU are also provided to help illustrate the analyses.

In particular, the SP payments to the CUs were calculated by multiplying the SP incentive rates by the aggregated demand reductions. The profit of the SP was calculated by Eq. (1) through trading the rate differences with the GO and CUs, and this profit can be regarded as the net revenue after deducting the payments to the CUs from the gross revenue obtained via trading with the GO. For each respective CU, the optimal demand reduction, as well as the corresponding incentive income and dissatisfaction cost, are also provided in Table 6 with respect to the different weighting factors ρ. As shown in Table 6, the sum of the SP payments to the CUs and its profit was exactly the same as the revenue acquired from the GO. Moreover, the profit of each CU was calculated based on Eq. (3), which equals the incentive income given by the SP multiplied by the weighting factor ρ, minus the dissatisfaction cost multiplied by 1 minus the weighting factor ρ. Thus, the financial balance of the entire resource trading framework was maintained.

To confirm the superiority of the proposed algorithm, we also compared the payments in two different cases from the perspective of the SP. For case 1, it was assumed that no incentive-based DR program was applied, and the SP had to purchase all required energy resources from the GO at wholesale prices. In case 2, the payments were calculated based on the optimal incentive rates and optimal total energy reductions; these payments were used by the SP to pay the CUs for their load demand reductions and to buy the remaining energy from the GO. Fig. 11 shows the payment comparison of the two cases under different weighting factors ρ, where the blue bar represents the payment of case 1 and the green bar denotes the payment of case 2. As shown in the figure, the payments under case 2 were lower than those under case 1 by 25.1%, 43.4%, and 45.0%, respectively, indicating that the proposed incentive-based DR algorithm helps the SP to procure adequate energy resources from the demand side at significantly lower cost.
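Under one reading of the two cases described above, the payment comparison of Fig. 11 can be reproduced schematically as follows; all hourly arrays are placeholders rather than the case-study data.

```python
import numpy as np

# Placeholder hourly data for one day (shapes: prices (H,), the rest (N, H)).
H, N = 24, 3
p = np.full(H, 30.0)            # wholesale price
E = np.full((N, H), 1.0)        # customer energy demand
lam = np.full((N, H), 10.0)     # optimal incentive rates
dE = np.full((N, H), 0.2)       # optimal demand reductions

# Case 1: no DR, all required energy is bought from the GO at wholesale prices.
payment_case1 = np.sum(p * E)

# Case 2: pay customers for their reductions and buy the remaining energy.
payment_case2 = np.sum(lam * dE) + np.sum(p * (E - dE))

saving = 100 * (payment_case1 - payment_case2) / payment_case1
print(f"Payment reduction with DR: {saving:.1f}%")
```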

Fig. 7. Load forecasting of the three customers for February 22–28, 2018.

Table 5. Accuracy results of the forecasting model.

                                  | MAE  | MAPE
Wholesale price forecasting model | 2.09 | 8.34
Load forecasting model of CU1     | 0.68 | 3.88
Load forecasting model of CU2     | 0.42 | 2.98
Load forecasting model of CU3     | 0.73 | 4.84

Fig. 8. Convergence of the Q-value with ρ equal to 0.1, 0.5, and 0.9.



7.4. Computational statistics

The simulations were conducted on a desktop computer with a 3.30 GHz 4-core i5-6600 CPU and 8 GB RAM, running the Windows 7 OS. Table 7 gives the corresponding computational statistics for the simulation scenario on February 23, 2018. The computation times for training the price forecasting model with DNN, training the load forecasting model with DNN, and obtaining the optimal actions with RL are on average about 50 s, 40 s, and 1 min, respectively, which fully satisfies the time requirement for deploying the real-time incentive-based DR scheme in smart grid systems.

8. Guidelines for real system implementation

In the smart grid, the large-scale deployment of AMI and ICT enables two-way communication between the different DR entities, named the GO, SP, and CUs in this study. As shown in the hierarchical electricity market model in Fig. 2, the proposed incentive-based DR algorithm using RL and DNN methodologies is installed at the SP side and realized between the SP and CUs to promote SP profitability and reduce CU costs. To better explore the communication between the different entities, a sequence diagram is also drawn to illustrate the detailed information exchanged during the DR process.

Fig. 9. Results of one customer with different ρ.

Fig. 10. Results of different customers with the same ρ.


As shown in Fig. 12, at each hour, the GO calculates and announces the hour-ahead wholesale electricity price to the SP through its own algorithms, taking into account the power generation capacity and procurement cost. Also, the SP collects the hour-ahead energy demand from its CUs as a precondition for taking actions on behalf of its CUs. Upon receipt of the wholesale electricity price from the GO and the load demands from the CUs, the SP first updates the inputs of the DNN (i.e., hour of day, historical electricity prices, and load demands) and uses the trained DNN models to predict the following prices and energy demands. Then, the SP calculates the optimal incentive rates for the CUs by maximizing Eq. (9) via the RL methodology. The detailed steps for obtaining the optimal incentive rates are discussed in Section 6. Once the optimal incentive rates are obtained, the SP announces these rates to its CUs and reports the energy reduction information to the GO. The above procedure is repeated until the final hour of the day is reached. During this process, all DR signals (i.e., incentive rates and amounts of load reduction) exchanged between the SP and CUs can be carried by Open Automated Demand Response (OpenADR) [60], which was created to standardize, automate, and simplify DR programs to facilitate interactions and transactions between grid-side entities and demand-side resources.

9. Conclusions and future work

This paper proposes a novel real-time incentive-based DR algorithm for smart grid systems with RL and DNN, aiming to assist the SP in purchasing energy resources from its various CUs to balance energy fluctuations and enhance grid reliability. In particular, due to the inherent nature of the real-time electricity market, the SP can only access the price from the wholesale electricity market and the energy demand from its CUs for the current hour; to overcome future uncertainties, DNN is used to predict the unknown prices and energy demands. After that, RL is adopted to derive the optimal incentive rates for different CUs, considering the profitability of both the SP and CUs. By employing RL, the SP can adaptively determine the incentives without a predefined complete model of how to select the incentive rates; instead, it discovers the optimal incentive rates by learning them from direct interaction with the CUs. Simulation results show that the proposed incentive-based DR algorithm can induce demand-side participation, promote SP and CU profitability, and improve system reliability by balancing energy resources, which can be regarded as a win-win strategy for both the SP and CUs.

Table 6. Financial analyses of the service provider and the three customers.

                                      Weighting factor ρ = 0.1      Weighting factor ρ = 0.5      Weighting factor ρ = 0.9
Different CUs                         CU1     CU2     CU3           CU1     CU2     CU3           CU1     CU2     CU3
Dissatisfaction cost parameter μn     3.0     5.0     8.0           3.0     5.0     8.0           3.0     5.0     8.0
Average incentive rate ($/MWh)        8.25    7.83    7.54          10.83   10.58   10.25         11.13   11.13   11.13
Total energy reduction (kWh)          77.9    49.8    35.1          127.9   98.6    77.6          131.5   103.2   84.6
CU incentive income (¢)               63.7    39.7    26.8          138.2   103.2   77.6          147.3   115.2   93.8
CU discomfort (¢)                     79.1    53.5    39.4          167.2   152.3   141.3         174.6   163.2   162.3
CU profit (¢)                         −64.8   −44.2   −32.8         −14.5   −24.6   −31.8         115.1   87.3    68.2
SP revenue from GO (¢)                383.1                         709.8                         747.6
SP payment to CUs (¢)                 130.2                         319.0                         356.3
SP profit (¢)                         252.9                         390.8                         391.3
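As an arithmetic consistency check on Table 6 (inferred from the table values themselves rather than restated from this section), the CU profit row follows the ρ-weighted trade-off between incentive income and discomfort, and the SP profit row is simply the revenue received from the GO minus the incentive payment made to the CUs. For example, for ρ = 0.5 and CU1:

\[
\text{CU profit} = \rho \cdot 138.2 - (1-\rho)\cdot 167.2 = 0.5\times 138.2 - 0.5\times 167.2 = -14.5~\text{¢},
\qquad
\text{SP profit} = 709.8 - 319.0 = 390.8~\text{¢}.
\]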

Fig. 11. Payments comparison under different ρ for the service provider.

Table 7. Computational statistics for the case study.

Task                                    Software                         Computation time
Wholesale price forecasting with DNN    Matlab                           50 s
Load demand forecasting with DNN        Matlab                           40 s
Decision making with RL                 Java programming, Eclipse IDE    1 min

Hardware (all tasks): Windows PC, 4-core i5-6600 CPU @ 3.30 GHz, 8 GB RAM.

Fig. 12. Diagram of DR information exchanged between different entities.


In future work, further analysis will be performed on the weighting factor ρ to determine the value that best balances the CU incentive income against the dissatisfaction cost incurred on the CU side. This work can also be extended to a wholesale capacity-resource trading framework involving the GO and multiple SPs. Another interesting direction for extending the current work is to apply RL to both price- and incentive-based DR to provide an integrated DR solution for smart grid systems.

Acknowledgements

This work was supported in part under the framework of an international cooperation program (Korea–China) managed by the National Research Foundation of Korea under Grant NRF-2016K2A9A2A11938310, and in part by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning, with financial resources granted by the Ministry of Trade, Industry and Energy, Republic of Korea, under Grant 20174030201780.

References

[1] Nojavan S, Zare K, Mohammadi-Ivatloo B. Optimal stochastic energy management of retailer based on selling price determination under smart grid environment in the presence of demand response program. Appl Energy 2017;187:449–64.
[2] Jin M, Feng W, Marnay C, Spanos C. Microgrid to enable optimal distributed energy retail and end-user demand response. Appl Energy 2018;210:1321–35.
[3] Alasseri R, Tripathi A, Rao TJ, Sreekanth K. A review on implementation strategies for demand side management (DSM) in Kuwait through incentive-based demand response programs. Renew Sustain Energy Rev 2017;77:617–35.
[4] Luo Z, Hong S-H, Kim J-B. A price-based demand response scheme for discrete manufacturing in smart grids. Energies 2016;9(8):650.
[5] Yu M, Lu R, Hong SH. A real-time decision model for industrial load management in a smart grid. Appl Energy 2016;183:1488–97.
[6] Yu M, Hong SH. Supply-demand balancing for power management in smart grid: a Stackelberg game approach. Appl Energy 2016;164:702–10.
[7] Qdr Q. Benefits of demand response in electricity markets and recommendations for achieving them. US Dept. Energy, Washington, DC, USA, Tech. Rep.; 2006.
[8] Shen B, Ghatikar G, Lei Z, Li J, Wikler G, Martin P, et al. The role of regulatory reforms, market changes, and technology development to make demand response a viable resource in meeting energy challenges. Appl Energy 2014;130:814–23.
[9] Yu M, Hong SH, Kim JB. Incentive-based demand response approach for aggregated demand side participation. Smart grid communications (SmartGridComm), 2016 IEEE international conference on. IEEE; 2016. p. 51–6.
[10] Eissa M. First time real time incentive demand response program in smart grid with i-energy management system with different resources. Appl Energy 2018;212:607–21.

[11] Li Y-C, Hong SH. Real-time demand bidding for energy management in discrete manufacturing facilities. IEEE Trans Ind Electron 2017;64(1):739–49.
[12] Asadinejad A, Tomsovic K. Optimal use of incentive and price based demand response to reduce costs and price volatility. Electr Power Syst Res 2017;144:215–23.
[13] Zhong H, Xie L, Xia Q. Coupon incentive-based demand response: theory and case study. IEEE Trans Power Syst 2013;28(2):1266–76.
[14] Vivekananthan C, Mishra Y, Ledwich G, Li F. Demand response for residential appliances via customer reward scheme. IEEE Trans Smart Grid 2014;5(2):809–20.
[15] Ghazvini MAF, Faria P, Ramos S, Morais H, Vale Z. Incentive-based demand response programs designed by asset-light retail electricity providers for the day-ahead market. Energy 2015;82:786–99.
[16] Ghazvini MAF, Soares J, Horta N, Neves R, Castro R, Vale Z. A multi-objective model for scheduling of short-term incentive-based demand response programs offered by electricity retailers. Appl Energy 2015;151:102–18.
[17] Weitzel T, Glock CH. Scheduling a storage-augmented discrete production facility under incentive-based demand response. Int J Prod Res 2018:1–21.
[18] Vuelvas J, Ruiz F, Gruosso G. Limiting gaming opportunities on incentive-based demand response programs. Appl Energy 2018;225:668–81.
[19] Yu M, Hong SH. Incentive-based demand response considering hierarchical electricity market: a Stackelberg game approach. Appl Energy 2017;203:267–79.
[20] Yu M, Hong SH, Ding Y, Ye X. An incentive-based demand response (DR) model considering composited DR resources. IEEE Trans Ind Electron 2019;66(2):1488–98.
[21] Huang X, Hong SH, Li Y. Hour-ahead price based energy management scheme for industrial facilities. IEEE Trans Ind Inf 2017;13(6):2886–98.

[22] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529.
[23] Wen Z, O'Neill D, Maei H. Optimal demand response using device-based reinforcement learning. IEEE Trans Smart Grid 2015;6(5):2312–24.
[24] Wang H, Huang T, Liao X, Abu-Rub H, Chen G. Reinforcement learning in energy trading game among smart microgrids. IEEE Trans Ind Electron 2016;63(8):5109–19.
[25] Wang H, Huang T, Liao X, Abu-Rub H, Chen G. Reinforcement learning for constrained energy trading games with incomplete information. IEEE Trans Cybernetics 2017;47(10):3404–16.
[26] Xiong R, Cao J, Yu Q. Reinforcement learning-based real-time power management for hybrid energy storage system in the plug-in hybrid electric vehicle. Appl Energy 2018;211:538–48.
[27] Wu J, He H, Peng J, Li Y, Li Z. Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus. Appl Energy 2018;222:799–811.
[28] Dusparic I, Harris C, Marinescu A, Cahill V, Clarke S. Multi-agent residential demand response based on load forecasting. Technologies for sustainability (SusTech), 2013 1st IEEE conference on. IEEE; 2013. p. 90–6.
[29] Dusparic I, Taylor A, Marinescu A, Cahill V, Clarke S. Maximizing renewable energy use with decentralized residential demand response. Smart cities conference (ISC2), 2015 IEEE first international. IEEE; 2015. p. 1–6.
[30] Ruelens F, Claessens BJ, Vandael S, De Schutter B, Babuška R, Belmans R. Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Trans Smart Grid 2017;8(5):2149–59.
[31] Ruelens F, Claessens BJ, Quaiyum S, De Schutter B, Babuška R, Belmans R. Reinforcement learning applied to an electric water heater: from theory to practice. IEEE Trans Smart Grid 2018;9(4):3792–800.
[32] Marinescu A, Dusparic I, Taylor A, Cahill V, Clarke S. P-MARL: prediction-based multi-agent reinforcement learning for non-stationary environments. Proceedings of the 2015 international conference on autonomous agents and multiagent systems. International Foundation for Autonomous Agents and Multiagent Systems; 2015. p. 1897–8.

[33] Marinescu A, Dusparic I, Clarke S. Prediction-based multi-agent reinforcement learning in inherently non-stationary environments. ACM Trans Autonomous Adaptive Syst (TAAS) 2017;12(2):9.
[34] Lu R, Hong SH, Zhang X. A dynamic pricing demand response algorithm for smart grid: reinforcement learning approach. Appl Energy 2018;220:220–30.
[35] Dusparic I, Taylor A, Marinescu A, Golpayegani F, Clarke S. Residential demand response: experimental evaluation and comparison of self-organizing techniques. Renew Sustain Energy Rev 2017;80:1528–36.
[36] Fang X, Hu Q, Li F, Wang B, Li Y. Coupon-based demand response considering wind power uncertainty: a strategic bidding model for load serving entities. IEEE Trans Power Syst 2016;31(2):1025–37.
[37] Wang Y, Ai X, Tan Z, Yan L, Liu S. Interactive dispatch modes and bidding strategy of multiple virtual power plants based on demand response and game theory. IEEE Trans Smart Grid 2016;7(1):510–9.
[38] Asadinejad A, Rahimpour A, Tomsovic K, Qi H, Chen C-F. Evaluation of residential customer elasticity for incentive based demand response programs. Electr Power Syst Res 2018;158:26–36.
[39] Sutton RS, Barto AG. Reinforcement learning: an introduction. MIT Press; 1998.
[40] Watkins CJ, Dayan P. Q-learning. Machine Learning 1992;8(3-4):279–92.
[41] Melo FS. Convergence of Q-learning: a simple proof. Institute of Systems and Robotics; 2001. [Tech. Rep.].
[42] Tsitsiklis JN. Asynchronous stochastic approximation and Q-learning. Machine Learning 1994;16(3):185–202.
[43] Jaakkola T, Jordan MI, Singh SP. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems. 1994. p. 703–10.

[44] Panapakidis IP, Dagoumas AS. Day-ahead electricity price forecasting via the application of artificial neural network based models. Appl Energy 2016;172:132–51.
[45] Ryu S, Noh J, Kim H. Deep neural network based demand side short term load forecasting. Energies 2016;10(1):3.
[46] Marinescu A, Dusparic I, Harris C, Cahill V, Clarke S. A dynamic forecasting method for small scale residential electrical demand. Neural networks (IJCNN), 2014 international joint conference on. IEEE; 2014. p. 3767–74.
[47] Marinescu A, Harris C, Dusparic I, Cahill V, Clarke S. A hybrid approach to very small scale electrical demand forecasting. Innovative smart grid technologies conference (ISGT), 2014 IEEE PES. IEEE; 2014. p. 1–5.
[48] Marinescu A, Harris C, Dusparic I, Clarke S, Cahill V. Residential electrical demand forecasting in very small scale: an evaluation of forecasting methods. Software engineering challenges for the smart grid (SE4SG), 2013 2nd international workshop on. IEEE; 2013. p. 25–32.
[49] Cavallo J, Marinescu A, Dusparic I, Clarke S. Evaluation of forecasting methods for very small-scale networks. International workshop on data analytics for renewable energy integration. Springer; 2015. p. 56–75.
[50] Hagan MT, Demuth HB, Beale MH, De Jesús O. Neural network design, vol. 20. Boston: PWS Pub.; 1996.
[51] Keles D, Scelle J, Paraschiv F, Fichtner W. Extended forecast methods for day-ahead electricity spot prices applying artificial neural networks. Appl Energy 2016;162:218–30.
[52] Wang D, Luo H, Grunder O, Lin Y, Guo H. Multi-step ahead electricity price forecasting using a hybrid model based on two-layer decomposition technique and BP neural network optimized by firefly algorithm. Appl Energy 2017;190:390–407.
[53] Ferreira PVR, Paffenroth R, Wyglinski AM, Hackett TM, Bilén SG, Reinhart RC, et al. Multiobjective reinforcement learning for cognitive satellite communications using deep neural network ensembles. IEEE J Sel Areas Commun 2018;36(5):1030–41.
[54] Hurtado LA, Mocanu E, Nguyen PH, Gibescu M, Kamphuis R. Enabling cooperative behavior for building demand response based on extended joint action learning. IEEE Trans Ind Inform 2018;14:127–36.
[55] Asadinejad A, Tomsovic K, Chen C-F. Sensitivity of incentive based demand response program to residential customer elasticity. North American power symposium (NAPS), 2016. IEEE; 2016. p. 1–6.


[56] Mnatsakanyan A, Kennedy S. Optimal demand response bidding and pricing mechanism: application for a virtual power plant. Technologies for sustainability (SusTech), 2013 1st IEEE conference on. IEEE; 2013. p. 167–74.
[57] PJM. Data Miner 2 <http://dataminer2.pjm.com/list>; 2018. [Online; accessed February 2018].
[58] Ghasemi A, Shayeghi H, Moradzadeh M, Nooshyar M. A novel hybrid algorithm for electricity price and load forecasting in smart grids with demand-side management. Appl Energy 2016;177:40–59.
[59] Hernández L, Baladrón C, Aguiar JM, Carro B, Sánchez-Esguevillas A, Lloret J. Artificial neural networks for short-term load forecasting in microgrids environment. Energy 2014;75:252–64.
[60] Samad T, Koch E, Stluka P. Automated demand response for smart buildings and microgrids: the state of the practice and research challenges. Proc IEEE 2016;104(4):726–44.
