
Deep Q-Learning, AlphaGo and AlphaZero

Lyle Ungar, with a couple of slides from Eric Eaton

DQN
AlphaGo
AlphaZero, MuZero

Remember Q-Learning

$Q(s,a) \leftarrow Q(s,a) + \alpha\,[R + \gamma\, Q(s',\mu(s')) - Q(s,a)]$

where $Q(s',\mu(s')) = \max_{a'} Q(s',a')$

Converges when the bracketed TD error is zero
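A minimal tabular sketch of this update rule in Python (an illustration, not the lecture's code); the state/action counts and step sizes are placeholder values:

```python
import numpy as np

# Tabular Q-learning update for the rule above.  States and actions are small
# integer indices; n_states, n_actions, alpha, gamma are placeholder values.
n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One step: Q(s,a) += alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error   # learning has converged when this is (near) zero everywhere

q_update(s=0, a=1, r=1.0, s_next=2)   # example: one observed transition
```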

Aside: off- vs. on-policy

$Q(s,a) \leftarrow Q(s,a) + \alpha\,[R + \gamma\, Q(s',\mu(s')) - Q(s,a)]$

Q-learning evaluates the next state off-policy (greedy): $Q(s',\mu(s')) = \max_{a'} Q(s',a')$

SARSA evaluates the next state on-policy (ε-greedy): $Q(s',\mu(s')) = Q(s',\pi(s'))$, which is $\max_{a'} Q(s',a')$ with probability $1-\varepsilon$ and $Q(s',a')$ for a random $a'$ with probability $\varepsilon$

Converges when the bracketed TD error is zero
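The only difference is how the next state is scored. A small sketch (placeholder sizes and ε, not the lecture's code) showing the two bootstrap targets side by side:

```python
import numpy as np

# Sketch contrasting the Q-learning and SARSA targets; sizes and eps are placeholders.
rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
gamma, eps = 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(s):
    """The behaviour policy: a random action with prob eps, else the greedy one."""
    return rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())

def q_learning_target(r, s_next):
    # off-policy: always score s' with the greedy (max) action
    return r + gamma * Q[s_next].max()

def sarsa_target(r, s_next):
    # on-policy: score s' with the action the eps-greedy policy actually takes
    return r + gamma * Q[s_next, epsilon_greedy(s_next)]
```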

Deep Q-Learning (DQN)

Inspired by
$Q(s,a) \leftarrow Q(s,a) + \alpha\,[R + \gamma \max_{a'} Q(s',a') - Q(s,a)]$

Represent $Q(s,a)$ by a neural net

Estimate it using gradient descent with the loss function
$\big(R + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)^2$

The policy, $\pi(s)$, is then given by choosing the action that maximizes the predicted Q-value
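A sketch of this in PyTorch (an illustration, not the slides' implementation): `state_dim` and `n_actions` are placeholder sizes, and the batch tensors are assumed to be float states/rewards and int64 action indices.

```python
import torch
import torch.nn as nn

# Q(s, .) as a small MLP; the loss is the squared TD error from the slide.
state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_loss(s, a, r, s_next, done):
    """Mean of (R + gamma * max_a' Q(s',a') - Q(s,a))^2 over a batch.
    s, s_next: (B, state_dim) float; a: (B,) int64; r, done: (B,) float."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s,a)
    with torch.no_grad():                                           # no gradient through the target
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()

def greedy_policy(s):
    # pi(s): choose the action with the highest predicted Q-value
    with torch.no_grad():
        return int(q_net(s).argmax())
```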

Separate Q- and Target Networks

Issue: instability (e.g., rapid changes) in the Q-function can cause it to diverge

Idea: use two networks to provide stability
- The Q-network is updated regularly
- The target network is an older version of the Q-network, updated only occasionally

Loss: $\big(R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)^2$, where $\max_{a'} Q(s',a')$ is computed via the target network and $Q(s,a)$ via the Q-network
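A sketch of the two-network idea in the same toy PyTorch setup (placeholder sizes, not the lecture's code):

```python
import copy
import torch
import torch.nn as nn

# The target network is an occasionally refreshed copy of the Q-network and is
# used only to compute the bootstrap target.
state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)        # an older snapshot of the Q-network
target_net.requires_grad_(False)

def td_loss(s, a, r, s_next, done):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # computed via the Q-network
    with torch.no_grad():
        # max_a' Q(s', a') computed via the (frozen) target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()

def sync_target():
    # refresh the snapshot occasionally (e.g., every few thousand gradient steps)
    target_net.load_state_dict(q_net.state_dict())
```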

Deep Q-Learning (DQN) Algorithm

Initialize replay memory D
Initialize Q-function weights θ
for episode = 1 ... M do
    Initialize state s_1
    for t = 1 ... T do
        a_t ← a random action with probability ε, otherwise argmax_a Q(s_t, a; θ)   (ε-greedy)
        Execute action a_t, yielding reward r_t and state s_{t+1}
        Store ⟨s_t, a_t, r_t, s_{t+1}⟩ in D; s_t ← s_{t+1}
        Sample a random minibatch of transitions {⟨s_j, a_j, r_j, s_{j+1}⟩}_{j=1..N} from D
        y_j ← r_j                                   for terminal s_{j+1}
        y_j ← r_j + γ max_{a'} Q(s_{j+1}, a'; θ)     for non-terminal s_{j+1}
        Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
    end for
end for

Based on https://arxiv.org/pdf/1312.5602v1.pdf
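The loop above made concrete as a Python sketch: `env` is a hypothetical environment object with `reset() -> state` and `step(a) -> (next_state, reward, done)`, and all sizes and hyperparameters are placeholders rather than the paper's settings.

```python
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4                   # placeholder sizes
gamma, eps, batch_size = 0.99, 0.1, 32
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
replay = deque(maxlen=100_000)                # replay memory D

def train(env, episodes=100, max_steps=1000):
    for episode in range(episodes):
        s = env.reset()
        for t in range(max_steps):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
            s_next, r, done = env.step(a)               # hypothetical env interface
            replay.append((s, a, r, s_next, done))      # store <s_t, a_t, r_t, s_{t+1}> in D
            s = s_next

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                s_b, a_b, r_b, sn_b, d_b = (
                    torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
                q_sa = q_net(s_b).gather(1, a_b.long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():                   # y_j bootstraps only from non-terminal s_{j+1}
                    y = r_b + gamma * (1 - d_b) * q_net(sn_b).max(dim=1).values
                loss = ((y - q_sa) ** 2).mean()         # gradient step on (y_j - Q(s_j, a_j))^2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if done:
                break
```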

DQN on Atari Games

Image sources:
https://towardsdatascience.com/tutorial-double-deep-q-learning-with-dueling-network-architectures-4c1b3fb7f756
https://deepmind.com/blog/going-beyond-average-reinforcement-learning/
https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/

AlphaGo

https://medium.com/@jonathan_hui/alphago-how-it-works-technically-26ddcc085319 (2016)

AlphaGo – Complicated

1. Train a CNN to predict (supervised learning) the moves of human experts
2. Use it as the starting point for policy gradient (self-play against an older self)
3. Train a value network with examples from policy-network self-play
4. Use Monte Carlo tree search to explore possible games

Image from DeepMind's ICML 2016 tutorial on AlphaGo: https://icml.cc/2016/tutorials/AlphaGo-tutorial-slides.pdf

Learn a policy a = f(s), where a = where to play (19×19) and s = a description of the board
State ~ 19×19×48
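A shape-only sketch of such a policy network (illustrative layer sizes, not AlphaGo's actual architecture): it maps the 19×19×48 feature planes to a distribution over the 19×19 board points.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy convolutional policy: 19x19x48 board features -> p(a|s) over 19*19 moves."""
    def __init__(self, planes=48, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(planes, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1),         # one logit per board point
        )

    def forward(self, s):                                # s: (batch, 48, 19, 19)
        logits = self.body(s).flatten(1)                 # (batch, 361)
        return torch.softmax(logits, dim=1)              # where to play

p = PolicyNet()(torch.zeros(1, 48, 19, 19))              # p.shape == (1, 361)
```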

AlphaGo – lots of hacks

- Bootstrap: initialize with a policy learned from human play
- Self-play
- Speed matters:
  - Rollout network (fast, less accurate game play)
  - Monte Carlo search
- It still needed fast computers: > 100 GPU-weeks

AlphaZero

- Self-play with a single, continually updated neural net
  - No annotated features, just the raw board position
- Uses Monte Carlo Tree Search
  - Using Q(s, a)
- Does policy iteration
  - Learns V(s) and P(s)
- Beat AlphaGo (100-0) after just 72 hours of training
  - On 5,000 TPUs

https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go (2017)

Monte Carlo Tree Search (MCTS)

- In each state s_root, select a move a_t ~ π_t either proportionally (exploration) or greedily (exploitation)
- Pick a move a_t with (see the selection sketch below):
  - low visit count (not previously explored often)
  - high move probability (under the policy)
  - high value (averaged over the leaf states of Monte Carlo plays that selected a from s), according to the current neural net
- The MCTS returns an estimate z of v(s_root) and a probability distribution over moves, π = p(a | s_root)
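These three criteria are what a PUCT-style selection score (as used in AlphaZero-family MCTS) combines. A minimal sketch; the constant `c_puct` and the toy numbers are placeholders, not AlphaZero's settings:

```python
import math

def select_move(Q, N, P, c_puct=1.5):
    """Pick argmax_a [ Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) ].
    Q: mean value of plays through (s,a); N: visit counts; P: policy prior."""
    total_visits = sum(N.values())
    def score(a):
        # high value + high prior + low visit count all raise the score
        return Q[a] + c_puct * P[a] * math.sqrt(total_visits + 1e-8) / (1 + N[a])
    return max(P, key=score)

# toy example with three legal moves
Q = {"a1": 0.10, "b2": 0.30, "c3": 0.00}
N = {"a1": 10,   "b2": 50,   "c3": 1}
P = {"a1": 0.20, "b2": 0.50, "c3": 0.30}
best = select_move(Q, N, P)
```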

AlphaZero loss function

NNet: s → (p(s), v(s))
- Minimizes the error between the value function v(s) and the actual game outcome z
- Maximizes the similarity of the policy vector p(s) to the MCTS probabilities π(s)
- L2-regularizes the weights θ
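Putting those three terms together gives the standard AlphaZero-style training loss, l = (z − v)² − π·log p + c‖θ‖². A minimal numpy sketch with toy inputs and a hypothetical weight list:

```python
import numpy as np

def alphazero_loss(v, z, p, pi, weights, c=1e-4):
    """(z - v)^2  - pi . log p  + c * ||theta||^2  (value error + policy cross-entropy + L2)."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))        # similarity of p to the MCTS probabilities pi
    l2 = c * sum(np.sum(w ** 2) for w in weights)        # L2 regularization of the weights theta
    return value_loss + policy_loss + l2

# toy example: 3 legal moves, one hypothetical weight matrix
loss = alphazero_loss(v=0.2, z=1.0,
                      p=np.array([0.5, 0.3, 0.2]),
                      pi=np.array([0.6, 0.3, 0.1]),
                      weights=[np.ones((4, 4))])
```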

AlphaZero

https://web.stanford.edu/~surag/posts/alphazero.html

MuZero

- Slight generalization of AlphaZero
  - Also uses Monte Carlo Tree Search (MCTS) and self-play
- Learns Go, Chess, Atari, ...
- Learns 3 CNNs:
  - Representation model: observation history → state_t
  - Dynamics model: (state_t, action_t) → state_{t+1}
  - Policy: state_t → action_t
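An interface-level sketch of those three models (toy linear layers with made-up sizes, not MuZero's actual networks), showing that planning happens in the learned hidden state rather than on raw observations:

```python
import torch
import torch.nn as nn

OBS_DIM, STATE_DIM, N_ACTIONS = 32, 16, 4                  # placeholder sizes

representation = nn.Linear(OBS_DIM, STATE_DIM)             # observation history -> state_t
dynamics = nn.Linear(STATE_DIM + N_ACTIONS, STATE_DIM)     # (state_t, action_t) -> state_{t+1}
policy = nn.Linear(STATE_DIM, N_ACTIONS)                   # state_t -> scores over actions

obs = torch.zeros(1, OBS_DIM)                              # stand-in for an observation history
state = representation(obs)
action = torch.zeros(1, N_ACTIONS); action[0, 0] = 1.0     # one-hot action_t
next_state = dynamics(torch.cat([state, action], dim=1))   # imagined next state, no real env step
action_scores = policy(state)
```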

StarCraft

- The StarCraft-playing AI model consists of 18 agents, each trained with 16 Google v3 TPUs for 14 days.
- Thus, at current prices ($8.00 per TPU-hour), the company spent roughly $774,000 on this model (18 agents × 16 TPUs × 14 days × 24 h × $8/h ≈ $774,000).

Applied RL Summary

- Superhuman game playing using DQL
  - But it is fragile
- Why is DeepMind losing $500 million/year?
