Recurrent Deep Multiagent Q-Learning for Autonomous Agents in Future Smart Grid

Extended Abstract

Yaodong Yang, Jianye Hao∗, Zan Wang
[email protected], {jianye.hao,wangzan}@tju.edu.cn
School of Computer Software, Tianjin University
Tianjin, China

Mingyang Sun, Goran Strbac
{mingyang.sun11,g.strbac}@imperial.ac.uk
Imperial College London
London, England

ABSTRACT
The broker mechanism is widely applied to serve interested parties in the smart grid by deriving long-term policies that reduce costs or gain profits. However, brokers face a number of challenging problems, such as balancing demand and supply from customers and competing with other coexisting brokers to maximize profits. In this paper, we develop an effective pricing strategy for brokers in the local electricity retail market based on recurrent deep multiagent reinforcement learning and sequential clustering. We use real household electricity consumption data to simulate the retail market for evaluating our strategy. The experiments demonstrate the superior performance of the proposed pricing strategy and highlight the effectiveness of our reward shaping mechanism.

KEYWORDS
Deep Multiagent Reinforcement Learning; Smart Grid; Pricing Strategy

ACM Reference Format:
Yaodong Yang, Jianye Hao∗, Zan Wang, Mingyang Sun, and Goran Strbac. 2018. Recurrent Deep Multiagent Q-Learning for Autonomous Agents in Future Smart Grid. In Proc. of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018), Stockholm, Sweden, July 10–15, 2018, IFAAMAS, 3 pages.

1 INTRODUCTION
The traditional power grid is facing fundamental changes with the advent of decentralized power generation technologies and the increasing number of active electricity customers. One critical objective of the smart grid is to guarantee its stability and reliability in terms of the real-time balance between demand and supply. Nevertheless, with the increasing penetration of renewable energy resources, existing centralized control mechanisms are unable to simultaneously accommodate the vast number of small-scale intermittent producers and the dynamic changes in customer demand in response to price variations [5, 13].

One promising way of maintaining real-time balance in the local tariff market is to introduce electricity retail brokers, which offer tariff contracts to both local consumers and small-scale producers. To satisfy the demand of the contracted customers, retail brokers are challenged to optimize their trading strategies to maximize their

* Corresponding author: Jianye Hao.
Proc. of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018), M. Dastani, G. Sukthankar, E. André, S. Koenig (eds.), July 10–15, 2018, Stockholm, Sweden. © 2018 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

profits while balancing demand and supply [21]. Power TAC [9], a rich, competitive, open-source simulation platform, is commonly used to design and evaluate various broker strategies. However, brokers developed on Power TAC mainly purchase electricity through a wholesale market from remote power plants, where traditional fossil fuels are still the main generation resource; small-scale producers are usually overlooked in the local retail market [10, 18, 19].

In a retail market, as the environment can be modeled as a Markov decision process (MDP) [14], reinforcement learning techniques have been applied to learn electricity broker strategies for customers and retailers [1, 3, 13–16]. However, existing works on retail brokers [13, 14, 20] are based on a simple Q-table structure or a linear function approximation, where features are discretized and may need to be constructed manually, which results in information loss. On the other hand, electricity customers exhibit various consumption or production patterns. The broker framework in [20] assigns each type of customer an independent pricing agent. However, it uses independent SARSA for different customers and regards the whole broker's profit as each agent's immediate reward during the Q-value update. This does not distinguish each agent's unique contribution to the broker's profits and thus does not encourage learning towards the optimal strategy.

To address the above problems, we propose a recurrent deep multiagent reinforcement learning (RDMRL) framework augmented with sequential clustering and reward shaping to coordinate internal sub-brokers. Experimental results show the superior performance of our RDMRL broker and highlight the effectiveness of our reward shaping mechanism.

2 RDMRL FRAMEWORK
Figure 1 shows the structure of our multiagent-based reinforcement learning framework. Customers with various electricity consumption patterns are clustered into groups, as detailed in Section 2.2. Then an individual recurrent Deep Q-Network (DQN) is employed for each type of customer to handle the continuous state space explosion problem, as detailed in Section 2.1. Finally, a reward shaping mechanism (Section 2.3) allocates the correct reward signal to each sub-broker.

2.1 Learning Framework of Sub-brokers
Deep reinforcement learning (DRL) has recently been shown to master numerous complex problem domains that suffer from the curse of dimensionality [6, 12]. By employing DRL in the broker pricing domain, we expect to learn more efficient pricing policies.


Figure 1: RDMRL broker framework

Meanwhile, we apply recurrent neural units such as Long Short-Term Memory (LSTM) [7] to capture the temporal information of the tariff market.
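To make the sub-broker learner concrete, the following is a minimal sketch of such a recurrent Q-network in PyTorch. The layer sizes, the number of market features, and the number of discrete tariff actions are illustrative assumptions; the paper does not specify the architecture.

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """LSTM-based Q-network: maps a sequence of market observations to Q-values."""
    def __init__(self, state_dim, hidden_dim, num_actions):
        super().__init__()
        # The LSTM captures temporal information of the tariff market (Section 2.1)
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        # A linear head outputs one Q-value per candidate tariff action
        self.q_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch, seq_len, state_dim) sequence of observed market states
        out, hidden = self.lstm(state_seq, hidden)
        q_values = self.q_head(out[:, -1, :])   # Q-values at the latest time step
        return q_values, hidden

# Example: 8 market features, 64 hidden units, 5 discrete tariff price levels
net = RecurrentQNetwork(state_dim=8, hidden_dim=64, num_actions=5)
q_values, _ = net(torch.randn(1, 24, 8))        # a 24-step observation window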

2.2 Clustering Consumers
We cluster consumers according to their temporal electricity consumption patterns using K-Means [11] with the Dynamic Time Warping (DTW) distance criterion [8], a state-of-the-art measure of temporal sequence similarity.
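As a brief illustration of the distance criterion, the sketch below implements the classic DTW recursion in plain NumPy; the two consumption profiles are made-up examples, and in practice the K-Means step would use such distances (e.g., via a time-series clustering library) rather than this toy loop.

import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Two hypothetical daily profiles: similar shape, peak shifted by one hour.
# DTW keeps them close, whereas a point-wise Euclidean distance would not.
profile_a = np.array([0.2, 0.3, 1.5, 1.4, 0.4, 0.2])
profile_b = np.array([0.2, 0.3, 0.4, 1.5, 1.4, 0.2])
print(dtw_distance(profile_a, profile_b))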

2.3 RDMRL Broker with Reward Shaping
To address the multiagent credit assignment problem [2] of a cooperative multiagent system, we calculate each sub-broker's individual credit by difference rewards [17]:

r^i_t = r_t - \Big( \sum_{j \neq i} p^j_t \psi^j_{t,C} - \sum_{k \neq i} p^k_t \psi^k_{t,P} - \Phi^i_t \Big), \quad j \in C, k \in P        (1)

where i represents the customer type charged by sub-broker i, r_t is the broker's reward, ψ^j_{t,C} denotes the consumption of consumers of type j at time t, and ψ^k_{t,P} denotes the output of producers of type k at time t. Also, p^j_t and p^k_t are the broker's current tariff prices for C_j and P_k respectively. Φ^i_t is the current imbalance fee:

\Phi^i_t = \begin{cases} \phi^- \left( \sum_{j \neq i} \psi^j_{t,C} - \sum_{k \neq i} \psi^k_{t,P} \right), & \text{if } \sum_{j \neq i} \psi^j_{t,C} \geq \sum_{k \neq i} \psi^k_{t,P} \\ \phi^+ \left( \sum_{j \neq i} \psi^j_{t,C} - \sum_{k \neq i} \psi^k_{t,P} \right), & \text{otherwise} \end{cases}        (2)

where ϕ− and ϕ+ are the imbalance fees for short supply and oversupply respectively.
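The following is a minimal sketch of how Eqs. (1) and (2) could be evaluated for one sub-broker, assuming per-customer-type consumption, production, and tariff-price dictionaries; all function names, data structures, and the toy numbers are illustrative and do not come from the paper.

def imbalance_fee(consumption, production, exclude, phi_minus, phi_plus):
    """Imbalance fee Phi^i_t of Eq. (2), computed with sub-broker `exclude` removed."""
    demand = sum(v for j, v in consumption.items() if j != exclude)
    supply = sum(v for k, v in production.items() if k != exclude)
    gap = demand - supply
    # phi_minus applies to short supply (demand >= supply), phi_plus to oversupply
    return phi_minus * gap if demand >= supply else phi_plus * gap

def difference_reward(i, r_t, consumption, production, prices_c, prices_p,
                      phi_minus, phi_plus):
    """Difference reward r^i_t of Eq. (1) for the sub-broker serving customer type i."""
    income = sum(prices_c[j] * consumption[j] for j in consumption if j != i)
    payout = sum(prices_p[k] * production[k] for k in production if k != i)
    phi_i = imbalance_fee(consumption, production, i, phi_minus, phi_plus)
    # Broker reward minus the counterfactual profit obtained without sub-broker i
    return r_t - (income - payout - phi_i)

# Toy example: two consumer types, one producer type, sub-broker "c1" evaluated
consumption = {"c1": 120.0, "c2": 80.0}    # kWh consumed per type at time t
production = {"p1": 150.0}                 # kWh produced per type at time t
prices_c = {"c1": 0.15, "c2": 0.14}        # consumer tariff prices
prices_p = {"p1": 0.10}                    # producer tariff prices
r_c1 = difference_reward("c1", r_t=12.0, consumption=consumption,
                         production=production, prices_c=prices_c,
                         prices_p=prices_p, phi_minus=0.2, phi_plus=0.05)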

3 EXPERIMENTS AND ANALYSIS
In this section, we first compare a DQN-based broker with a Q-table-based broker by following the settings in [14] to demonstrate the superior performance of DQN. The other co-existing strategies are set to be the same as in previous work [14]: Balanced Strategy, Greedy Strategy, Random Strategy and Fixed Strategy. Table 1 shows the detailed results of the first experiment. We can see that the profit of the DQN-based BL is 105% higher than that of the Q-table-based BL, and its imbalance amount is reduced by 10%.

In the second set of experiments, we compare the RDMRL broker with two baseline brokers (single-agent recurrent DQN and multiple

Table 1: Q-table Based BL and Other Brokers' Total Profits

Broker       Profits      ShortSupply     OverSupply
Tabular-Q    1327482$     -244764 kWh     313536 kWh
DQN          2721828$     -226826 kWh     275564 kWh

independent recurrent DQNs) against the same set of competing brokers as in the first experiment. We consider more realistic settings by introducing real-world data [4] to model consumer consumption patterns. First, Figure 2 shows the accumulated profits of the single-agent broker during the evaluation episodes; it fails to gain profits because of its single pricing method. The result indicates that a broker using only one recurrent DQN cannot learn effective pricing strategies in a realistic retail market with consumers of various electricity consumption patterns.

Figure 2: Brokers’ Profits of the evaluation episodes.

Second, Figure 3(a) shows the profits of the RDMRL broker. We can observe that the RDMRL broker gains the most profit. Lastly, Figure 3(b) shows the profits of an incomplete version of the RDMRL broker (denoted RDMRL') obtained by removing reward shaping. The result shows that RDMRL', which uses the global reward instead of reward shaping, performs worst.

(a) RDMRL with reward shaping (b) RDMRL’ without reward shaping

Figure 3: Brokers’ profits of the evaluation episodes.

4 CONCLUSIONS
In this paper, we apply, for the first time, recurrent DQN to retail broker design to learn more accurate and effective pricing strategies in the smart grid domain. We also design a multiagent broker framework with reward shaping to publish tariffs for each group of customers. Extensive simulations validate the superior effectiveness of our broker framework.

ACKNOWLEDGMENTS
The work is supported by the National Natural Science Foundation of China under Grant No. 61702362.


REFERENCES
[1] Angelos Angelidakis and Georgios Chalkiadakis. 2015. Factored MDPs for Optimal Prosumer Decision-Making in Continuous State Spaces. In Proceedings of the 13th European Conference on Multi-Agent Systems. 91–107.
[2] Yu Han Chang, Tracey Ho, and Leslie Pack Kaelbling. 2003. All Learning is Local: Multi-agent learning in global reward games. In Proceedings of the 16th Advances in Neural Information Processing Systems. 807–814.
[3] Moinul Morshed Porag Chowdhury, Russell Y. Folk, Ferdinando Fioretto, Christopher Kiekintveld, and William Yeoh. 2015. Investigation of Learning Strategies for the SPOT Broker in Power TAC. In Proceedings of the 17th International Workshop on Agent-Mediated Electronic Commerce and Trading Agents Design and Analysis.
[4] Energy Consumption Data 2015. Electricity Consumption in a Sample of London Households. (2015). https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households.
[5] Xi Fang, Satyajayant Misra, Guoliang Xue, and Dejun Yang. 2012. Smart Grid - The New and Improved Power Grid: A Survey. IEEE Communications Surveys and Tutorials 14, 4 (2012), 944–980.
[6] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of IEEE International Conference on Robotics and Automation. 3389–3396.
[7] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[8] Eamonn Keogh and Chotirat Ann Ratanamahatana. 2005. Exact indexing of dynamic time warping. Knowledge and Information Systems 7, 3 (2005), 358–386.
[9] Wolfgang Ketter, Markus Peters, and John Collins. 2013. Autonomous agents in future energy markets: the 2012 power trading agent competition. In Proceedings of the 27th Conference on Artificial Intelligence. 1298–1304.
[10] Bart Liefers, Jasper Hoogland, and Han La Poutré. 2014. A Successful Broker Agent for Power TAC. Lecture Notes in Business Information Processing (2014), 99–113.
[11] J. MacQueen. 1967. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 281–297.
[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[13] Markus Peters, Wolfgang Ketter, Maytal Saar-Tsechansky, and John Collins. 2013. A reinforcement learning approach to autonomous decision-making in smart electricity markets. Machine Learning 92 (2013), 5–39.
[14] Prashant P. Reddy and Manuela M. Veloso. 2011. Strategy learning for autonomous agents in smart grid markets. In Proceedings of the 20th International Joint Conference on Artificial Intelligence. 1446–1451.
[15] Prashant P. Reddy and Manuela M. Veloso. 2013. Negotiated learning for smart grid agents: entity selection based on dynamic partially observable features. In Proceedings of the 27th AAAI Conference on Artificial Intelligence. 1313–1319.
[16] Valentin Robu, Meritxell Vinyals, Alex Rogers, and Nicholas R. Jennings. 2014. Efficient Buyer Groups for Prediction-of-Use Electricity Tariffs. In Proceedings of the 28th AAAI Conference on Artificial Intelligence. 451–457.
[17] Kagan Tumer and Adrian Agogino. 2007. Distributed agent-based air traffic flow management. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems. 255.
[18] Daniel Urieli and Peter Stone. 2014. TacTex'13: a champion adaptive power trading agent. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems. 1447–1448.
[19] Daniel Urieli and Peter Stone. 2016. An MDP-Based Winning Approach to Autonomous Power Trading: Formalization and Empirical Analysis. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems. 827–835.
[20] Xishun Wang, Minjie Zhang, and Fenghui Ren. 2016. A Hybrid-learning Based Broker Model for Strategic Power Trading in Smart Grid Markets. Knowledge-Based Systems 119 (2016), 142–151.
[21] Kazem Zare, Mohsen Parsa Moghaddam, and Mohammad Kazem Sheikh-El-Eslami. 2011. Risk-Based Electricity Procurement for Large Consumers. IEEE Transactions on Power Systems 26, 4 (2011), 1826–1835.
