

Deep Reinforcement Learning in World-Earth System Models to Discover Sustainable Management Strategies

copan coevolutionary pathways

Felix M. Strnad 1,2,*, Wolfram Barfuss 1, Jonathan F. Donges 1,3, Jobst Heitzig 1

1) Potsdam Institute for Climate Impact Research, 2) Georg-August Universität Göttingen, 3) Stockholm Resilience Center; * [email protected]

Figure: Deep reinforcement learning as the combination of Q-learning, deep Q-networks, and experience replay. The Q target is composed of the reward of taking action a at state s and the discounted maximum Q-value among all actions from the next state s_{t+1}. A deep Q-network maps a state s to the Q-values Q(s,a_1), ..., Q(s,a_n) of all actions. The agent interacts with the environment; transitions are stored in a replay buffer (entries 1 ... k), from which mini-batches MB are sampled for training.
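Spelled out, the Q target described by these figure labels is the standard DQN target of [4]; in formulas (with discount factor γ and target-network parameters θ⁻, which are standard notation rather than symbols taken from the poster):

$$ y_t = r_t + \gamma \max_{a'} Q\!\left(s_{t+1}, a'; \theta^{-}\right), \qquad \mathcal{L}(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \text{MB}}\!\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right] $$

That is, the network parameters θ are trained to minimize the squared difference between the predicted Q-value and the target y_t, computed over mini-batches MB drawn from the replay buffer.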

Figure: Agent-Environment Interface. The DQN agent sends an action to the environment and receives the next state and a reward. Management options (carbon tax, subsidies for renewables, nature protection policy) define the action set. Two reward functions are considered: survival and boundary distance.
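As an illustration only (the exact formulas did not survive extraction): assuming, as described in [1], that the survival reward pays a constant amount while the state remains inside the planetary boundaries and that the boundary-distance reward grows with the distance from the boundaries, a minimal sketch could look as follows; the names and the normalization are hypothetical:

```python
import numpy as np

def survival_reward(inside_boundaries: bool) -> float:
    """Constant reward for every time step the state stays inside the planetary boundaries."""
    return 1.0 if inside_boundaries else 0.0

def boundary_distance_reward(state: dict, boundaries: dict) -> float:
    """Reward growing with the normalized distance of the state from the planetary boundaries.

    `state` and `boundaries` map variable names (e.g. 'A', 'Y', 'S') to values;
    a state outside the safe operating space receives zero reward.
    """
    margins = {k: (b - state[k]) / b for k, b in boundaries.items()}  # > 0 means inside the boundary
    if any(m < 0 for m in margins.values()):
        return 0.0
    return float(np.mean(list(margins.values())))
```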

Figure: Example pathway in a stylized World-Earth system model. Green lines: attraction basin of the sustainable fixed point. Black lines: attraction basin of the carbon-based economy without renewables. In color: example trajectory of a path towards a sustainable future. The management options are DG (Degrowth = restrict resource use) and ET (Energy Transformation = carbon tax + subsidies for renewables).

1.) Even though the dynamics of the environments are unknown to the agent in advance, it is able to find previously undiscovered trajectories that stay within the planetary boundaries.

2.) Learning is stable, and successful management is possible in various environments.

Figure (caption below): fraction of successful tests vs. number of learned episodes, with curves for agents observing the state-variable subsets LAGTPKS, AGTPKS, GTPKS, TPKS, PKS, and KS.

3.) Even under partial observability of the state s = (L, A, G, T, P, K, S) and noisy measurements, the agent is still capable of finding solutions.
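As an illustration of what partial observability and noisy measurements can mean in this setting, a minimal observation wrapper might hide some state variables and perturb the remaining ones with Gaussian noise; the sketch below uses assumed names and noise levels and is not the implementation from [1]:

```python
import numpy as np

STATE_VARS = ["L", "A", "G", "T", "P", "K", "S"]  # terrestrial carbon ... renewable energy knowledge

def observe(state: dict, observed_vars=("P", "K", "S"), noise_std=0.05, rng=None) -> dict:
    """Return a partial, noisy observation of the full state.

    `state` maps variable names to values; variables not in `observed_vars` are hidden
    from the agent, the rest are perturbed with relative Gaussian noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    return {
        var: state[var] * (1.0 + noise_std * rng.standard_normal())
        for var in STATE_VARS
        if var in observed_vars
    }
```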

References
[1] Strnad et al. (2019). Deep reinforcement learning in World-Earth system models to discover sustainable management strategies. Chaos.
[2] Rockström et al. (2009). A safe operating space for humanity. Nature.
[3] Steffen et al. (2018). Trajectories of the Earth System in the Anthropocene. PNAS.
[4] Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature.
[5] Donges et al. (2017). Closing the loop: Reconnecting human dynamics to Earth System science. The Anthropocene Review.

Find us on GitHub: https://github.com/fstrnad/pyDRLinWESM

Figure: Development of the average total reward per episode. Different deep Q-network architectures are analyzed: DQN = Deep Q-Network, DDQN = Double DQN, DDQN Duel = DDQN with dueling network architecture, DDQN Duel PER IS = DDQN Duel using prioritized experience replay with importance sampling.
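As background on the variants named in the caption (standard formulations from the deep RL literature, not equations taken from the poster): Double DQN selects the next action with the online network but evaluates it with the target network, and the dueling architecture splits the Q-value into a state value and an action advantage:

$$ y_t^{\text{DDQN}} = r_t + \gamma\, Q\!\left(s_{t+1}, \operatorname*{arg\,max}_{a'} Q(s_{t+1}, a'; \theta);\; \theta^{-}\right) $$
$$ Q(s, a; \theta) = V(s; \theta) + A(s, a; \theta) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta) $$

Prioritized experience replay (PER) samples transitions with probability proportional to their temporal-difference error and corrects the induced bias with importance-sampling (IS) weights.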

Figure: Development of the fraction of successful tests for agents with different knowledge about the state variables. The variables denote: L = terrestrial carbon, A = atmospheric carbon, G = geological carbon, T = temperature, P = population, K = capital, S = renewable energy knowledge.

Motivation

Environment: World-Earth Models
- Combine socio-economic World dynamics with biophysical Earth system dynamics [5]
- The models are stylized; their focus is on a qualitative understanding of the complex dynamics of these systems rather than on quantitative prediction.

Key Findings

Figure (see the reward-per-episode caption above): average reward per episode vs. number of learned episodes, with curves for the DQN variants listed in the caption (DQN, DDQN, dueling architecture, prioritized experience replay with importance sampling).

Interface: Combine Agent and Environments
- Map the World-Earth systems to a Markov Decision Process in terms of concrete states of the environment, actions in the action set, and reward functions [3]
- The implementation of the action set and the reward functions is up to the developer and is independent of the agent.
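A minimal sketch of what such an interface could look like, assuming a Gym-style reset/step API with toy drift dynamics; the class name, action effects, and boundary value are illustrative and not the actual pyDRLinWESM interface:

```python
import numpy as np

class WorldEarthEnv:
    """Toy wrapper mapping a stylized World-Earth model onto a Markov Decision Process."""

    ACTIONS = ["default", "DG", "ET", "DG+ET"]  # management options define the action set

    def __init__(self, dt=1.0):
        self.dt = dt
        self.state = None

    def reset(self):
        # illustrative 3-dimensional state (A, Y, S), normalized so that the current state is 0.5
        self.state = np.array([0.5, 0.5, 0.5])
        return self.state

    def step(self, action_index):
        """Apply one management option, advance the model, and return (state, reward, done)."""
        action = self.ACTIONS[action_index]
        drift = np.array([0.02, 0.01, 0.0])  # default: carbon stock and output grow
        if "DG" in action:
            drift[1] -= 0.02                 # degrowth restricts resource use
        if "ET" in action:
            drift[0] -= 0.03                 # energy transformation: less carbon ...
            drift[2] += 0.02                 # ... and growing renewable knowledge
        self.state = np.clip(self.state + self.dt * drift, 0.0, 1.0)
        inside = self.state[0] < 0.8         # stay below a stylized carbon boundary
        reward = 1.0 if inside else 0.0      # "survival" reward
        return self.state, reward, not inside
```

The real environments in pyDRLinWESM wrap the stylized models of [1]; the drift dynamics above only stand in for the actual model integration.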

Agent: Deep Reinforcement Learner
- Acts only on the basis of the current state and the reward signal
- Uses the DQN algorithm for learning [4], i.e. the combination of Q-learning, neural networks, and experience replay.
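A minimal sketch of such a DQN learner, assuming PyTorch for the network (the implementation in [1]/pyDRLinWESM may differ); it plugs into the toy WorldEarthEnv sketched above, e.g. `train_dqn(WorldEarthEnv())`:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

def train_dqn(env, n_states=3, n_actions=4, episodes=200, steps=100,
              gamma=0.99, eps=0.1, batch_size=32, buffer_size=10_000):
    """Minimal DQN loop: epsilon-greedy acting, experience replay, mean-squared TD error."""
    q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    buffer = deque(maxlen=buffer_size)  # replay buffer of transitions

    for episode in range(episodes):
        state = env.reset()
        for _ in range(steps):
            # epsilon-greedy action selection from the online network
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state

            if len(buffer) >= batch_size:
                # sample a mini-batch of transitions from the replay buffer
                batch = random.sample(buffer, batch_size)
                s, a, r, s2, d = map(np.array, zip(*batch))
                s = torch.as_tensor(s, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                r = torch.as_tensor(r, dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)
                a = torch.as_tensor(a, dtype=torch.int64)
                # Q target: reward plus discounted max Q-value of the next state
                with torch.no_grad():
                    target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
                q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(q, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if done:
                break
        # periodically copy the online network into the target network
        if episode % 10 == 0:
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```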

The Framework

Interface Design

Computer models can be used to show possible pathways towards a sustainable future.

Adapted from [3]

Management Pathways

How can the World be steered into a safe and just operating space for humanity [2]?

A Safe and Just Operating Space

How can an intelligent combination of management options be found that reaches a sustainable future and stays within the planetary boundaries at all times?

Deep Reinforcement Learning in Complex Systems

Deep Reinforcement Learning (DRL) algorithms have been shown to find solutions with up to super-human performance in various manageable environments.

Adapted from https://deepmind.com/research/alphago/

Idea: Use DRL to uncover previously unknown, appropriate management strategies for the World-Earth system.

Figure (see the example-pathway caption above): state space spanned by the excess atmospheric carbon stock A [GtC], the economic output Y [10^12 USD/yr], and the renewable knowledge stock S [10^9 GJ], with trajectories for the management options default, DG, ET, and DG+ET starting from the current state.