Dynamic Programming and Reinforcement Learning Applied to the Tetris Game

Suelen Goularte Carvalho
Inteligência Artificial, 2015

TRANSCRIPT

Page 1: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Dynamic Programming and Reinforcement Learning applied to Tetris game

Suelen Goularte Carvalho

Inteligência Artificial 2015

Page 2: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris

Page 3: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris
✓ Board: 20 × 10
✓ 7 types of tetrominoes (pieces)
✓ Moves: down, left, or right
✓ Piece rotation

Page 4: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris One-Piece Controller

Player knows: ✓ board ✓ current piece.

Page 5: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris Two-Piece Controller

Player knows: ✓ board ✓ current piece ✓ next piece

Page 6: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris Evaluation: One-Piece Controller vs. Two-Piece Controller

Page 7: Dynamic Programming and Reinforcement Learning applied to Tetris Game

How many possibilities do we have just here?

Page 8: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris indeed contains a huge number of board configurations. Finding the strategy that maximizes the average score is an NP-complete problem!

— Building Controllers for Tetris, 2009

7.0 × 2^199 ≈ 5.6 × 10^60 states
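As a quick sanity check on that figure (my own verification, not from the slides), Python's arbitrary-precision integers can evaluate the product directly:

# 7 piece types times 2^199 board configurations.
states = 7 * 2 ** 199
print(f"{states:.1e}")  # -> 5.6e+60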

Page 9: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris Complexity

Page 10: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Tetris is a problem of sequential decision making under uncertainty.

In the context of dynamic programming and stochastic control, the most important object is the cost-to-go function, which evaluates the expected future cost from the current state.

— Feature-Based Methods for Large Scale Dynamic Programming

Page 11: Dynamic Programming and Reinforcement Learning applied to Tetris Game

[Figure: from a state Si, each candidate move has an immediate reward (values like 1000–7000) and a future reward (values like 9000 and 13000); the move with the best immediate reward is not necessarily the move with the best future reward.]

Immediate reward vs. future reward

Page 12: Dynamic Programming and Reinforcement Learning applied to Tetris Game

7.0 × 2^199 ≈ 5.6 × 10^60 states

It is essentially impossible to compute, or even store, the value of the cost-to-go function at every possible state.

— Feature-Based Methods for Large Scale Dynamic Programming

Page 13: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Compact representations alleviate the computational time and space requirements of dynamic programming, which otherwise employs an exhaustive look-up table storing one value per state.

— Feature-Based Methods for Large Scale Dynamic Programming

S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n

Page 14: Dynamic Programming and Reinforcement Learning applied to Tetris Game

For example, if the state i represents the number of customers in a queueing system, a possible and often interesting feature f is defined by f(0) = 0 and f(i) = 1 if i > 0. Such a feature focuses on whether the queue is empty or not.

— Feature-Based Methods for Large Scale Dynamic Programming
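That feature is just an indicator function; a minimal sketch in Python:

# Indicator feature: collapses every non-empty queue length to one value.
def f(i: int) -> int:
    return 0 if i == 0 else 1

print(f(0), f(1), f(42))  # -> 0 1 1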

Page 15: Dynamic Programming and Reinforcement Learning applied to Tetris Game

— Feature-Based Methods for Large Scale Dynamic Programming

Feature-based method

S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n

Page 16: Dynamic Programming and Reinforcement Learning applied to Tetris Game

— Feature-Based Methods for Large Scale Dynamic Programming

Features:
★ Height of the current wall: H = {0, …, 20}
★ Number of holes: L = {0, …, 200}

Feature extraction F : S → H × L

Board: 10 × 20

Feature-based method
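A minimal sketch of those two features, assuming the board is a 20 × 10 grid of 0/1 cells stored as a list of rows, top row first (the function names are mine, not from the paper):

def wall_height(board):
    # Height of the highest occupied cell (0 if the board is empty).
    for r, row in enumerate(board):
        if any(row):
            return len(board) - r
    return 0

def holes(board):
    # Empty cells with at least one filled cell somewhere above them.
    count = 0
    for col in range(len(board[0])):
        seen_filled = False
        for row in board:
            if row[col]:
                seen_filled = True
            elif seen_filled:
                count += 1
    return count

empty = [[0] * 10 for _ in range(20)]
print(wall_height(empty), holes(empty))  # -> 0 0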

Page 17: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Using a feature-based evaluation function works better than just choosing the move that realizes the highest immediate reward.

— Building Controllers for Tetris, 2009

Page 18: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Example of features

— Building Controllers for Tetris, 2009

Page 19: Dynamic Programming and Reinforcement Learning applied to Tetris Game

…The problem of building a Tetris controller comes down to building a good evaluation function. Ideally, this function should return high values for the good decisions and low values for the bad ones.

— Building Controllers for Tetris, 2009
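Such evaluation functions are typically a weighted sum of board features. A minimal sketch under that assumption, reusing the wall_height and holes functions from page 16 (the feature set and weights are illustrative, not Dellacherie's actual values):

# Evaluate a board as a weighted sum of features; higher is better.
WEIGHTS = {"wall_height": -1.0, "holes": -4.0}  # illustrative weights

def evaluate(board):
    features = {"wall_height": wall_height(board), "holes": holes(board)}
    return sum(WEIGHTS[name] * value for name, value in features.items())

# A controller then plays the placement whose resulting board scores highest:
# best_board = max(candidate_boards, key=evaluate)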

Page 20: Dynamic Programming and Reinforcement Learning applied to Tetris Game

In the Reinforcement Learning context, algorithms aim at tuning the weights such that the evaluation function approximates well the optimal expected future score from each state.

— Building Controllers for Tetris, 2009
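One standard way to tune such weights is a TD(0)-style update on a linear value function; this generic sketch is an assumption of mine, not the specific method used in the cited papers:

ALPHA, GAMMA = 0.01, 0.9  # learning rate and discount, illustrative

def td_update(weights, features_s, features_s2, reward):
    # Linear value V(s) = w . phi(s); nudge w toward the one-step target.
    v_s  = sum(weights[k] * features_s[k]  for k in weights)
    v_s2 = sum(weights[k] * features_s2[k] for k in weights)
    delta = reward + GAMMA * v_s2 - v_s
    for k in weights:
        weights[k] += ALPHA * delta * features_s[k]
    return weights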

Page 21: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Reinforcement Learning

Page 22: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Reinforcement Learning by The Big Bang Theory

https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C

Page 23: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Reinforcement Learning

Imagine playing a new game whose rules you do not know; after roughly a hundred moves, your opponent announces: "You lost!". In short, that is reinforcement learning.

Page 24: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Supervised Learning

input  1  2  3  4  5  6  7  8  …
output 1  4  9  16 25 36 49 64 …

y = f(x) → function approximation

https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo

Map inputs to outputs: f(x) = x²

Labels score well.
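A minimal sketch of recovering that mapping from labeled examples (a plain least-squares polynomial fit; numpy is my choice here, not from the slides):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 2                        # the labeled outputs from the slide
coeffs = np.polyfit(x, y, deg=2)  # fit y = a*x^2 + b*x + c
print(np.round(coeffs, 6))        # -> approx [1. 0. 0.], i.e., f(x) = x^2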

Page 25: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Unsupervised Learning

[Figure: a scatter of unlabeled points forming two groups, 'x' and 'o'; the algorithm separates them by type.]

f(x) → clusters description

Clusters score well.
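A minimal clustering sketch under the same two-group assumption (plain k-means with k = 2; scikit-learn is my choice, not from the slides):

import numpy as np
from sklearn.cluster import KMeans

# Two unlabeled blobs, like the 'x' and 'o' groups in the figure.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print(labels)  # each point assigned to cluster 0 or 1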

Page 26: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Reinforcement Learning

[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward and the next State.]

Behaviors score well.
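That loop in code, as a minimal sketch (the env/agent interface is hypothetical, mirroring the common reset/step convention):

# Hypothetical interface: env.reset() -> state; env.step(a) -> (state, reward, done).
def run_episode(env, agent):
    state, total_reward = env.reset(), 0.0
    done = False
    while not done:
        action = agent.act(state)               # Agent -> Action
        state, reward, done = env.step(action)  # Environment -> Reward, State
        agent.observe(reward, state)            # learn from the feedback
        total_reward += reward
    return total_reward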

Page 27: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Reinforcement Learning

✓ Agents take actions in an environment and receive rewards

✓ Goal is to find the policy π that maximizes rewards

✓ Inspired by research into psychology and animal learning

Page 28: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Reinforcement Learning Model

Given:
S, a set of states
A, a set of actions
T(s, a, s') ~ P(s' | s, a), a transition model
R, a reward function

[Figure: from state Si, candidate moves with immediate rewards (1000–7000) and future rewards (9000, 13000).]

Find:
π(s) = a, a policy that maximizes the expected cumulative reward
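Given exactly those ingredients (S, A, T, R), a policy can be computed by dynamic programming. A minimal value-iteration sketch, assuming S and A are small lists and T and R are dictionaries (the representation is my assumption):

GAMMA = 0.9  # discount factor, illustrative

def value_iteration(S, A, T, R, n_iters=100):
    # T[(s, a)] is a list of (s2, prob) pairs; R[s2] rewards reaching s2.
    V = {s: 0.0 for s in S}
    for _ in range(n_iters):
        V = {s: max(sum(p * (R[s2] + GAMMA * V[s2]) for s2, p in T[(s, a)])
                    for a in A)
             for s in S}
    # Greedy policy with respect to the final values.
    return {s: max(A, key=lambda a: sum(p * (R[s2] + GAMMA * V[s2])
                                        for s2, p in T[(s, a)]))
            for s in S}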

Page 29: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Solving this model demands heavy computation, processing power, and memory.

Page 30: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Dynamic Programming

Page 31: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Dynamic Programming

Solving problems by breaking them down into simpler subproblems, solving each subproblem just once, and storing its solution.

https://en.wikipedia.org/wiki/Dynamic_programming

Page 32: Dynamic Programming and Reinforcement Learning applied to Tetris Game

[Diagram: the optimal path from A to G that passes through B is composed of the optimal path from A to B followed by the optimal path from B to G.]

Supporting property: optimal substructure

Page 33: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Fibonacci Sequence

0 1 1 2 3 5 8 13 21

Each number is the sum of the two preceding ones.

Page 34: Dynamic Programming and Reinforcement Learning applied to Tetris Game

n = 0 1 2 3 4 5 6 7  8
v = 0 1 1 2 3 5 8 13 21

Recursive formula: f(n) = f(n-1) + f(n-2)

Fibonacci Sequence
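The recursive formula translates directly into code; a minimal sketch of the naive version (the memoized fix appears on page 39):

def fib(n):
    # Direct translation of f(n) = f(n-1) + f(n-2); recomputes subproblems.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print([fib(n) for n in range(9)])  # -> [0, 1, 1, 2, 3, 5, 8, 13, 21]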

Page 35: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Fibonacci

n = 0 1 2 3 4 5 6 7  8
v = 0 1 1 2 3 5 8 13 21

f(6) = f(6-1) + f(6-2)
f(6) = f(5) + f(4)
f(6) = 5 + 3
f(6) = 8

Page 36: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Fibonacci Sequence - Normal computation

[Figure: the recursion tree for f(6); each node spawns calls to f(n-1) and f(n-2), so the subtrees for f(4), f(3), f(2), f(1), and f(0) appear over and over.]

f(n) = f(n-1) + f(n-2)

Page 37: Dynamic Programming and Reinforcement Learning applied to Tetris Game

[Figure: the same recursion tree for f(6), highlighting the repeated subtrees.]

Fibonacci Sequence - Normal computation

Running time: O(2^n) (exponential)

Page 38: Dynamic Programming and Reinforcement Learning applied to Tetris Game

18 of 25 Nodes Are Repeated Calculations!

Page 39: Dynamic Programming and Reinforcement Learning applied to Tetris Game

m = {0: 0, 1: 1}  # memo table seeded with the base cases

def fib(n):
    # Compute each subproblem only once; reuse the stored answer afterwards.
    if n not in m:
        m[n] = fib(n - 1) + fib(n - 2)
    return m[n]

Fibonacci Sequence - Dynamic Programming

Page 40: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Fibonacci Sequence - Dynamic Programming

[Figure: memoized recursion tree for f(5).]

index: 0 1 2 3 4 5
value: 0 1 - - - -

Page 41: Dynamic Programming and Reinforcement Learning applied to Tetris Game

[Figure: f(2) = 1 + 0 = 1 is computed and stored in the table.]

index: 0 1 2 3 4 5
value: 0 1 1 - - -

Fibonacci Sequence - Dynamic Programming

Page 42: Dynamic Programming and Reinforcement Learning applied to Tetris Game

[Figure: f(3) = 1 + 1 = 2 is computed and stored.]

index: 0 1 2 3 4 5
value: 0 1 1 2 - -

Fibonacci Sequence - Dynamic Programming

Page 43: Dynamic Programming and Reinforcement Learning applied to Tetris Game

[Figure: f(4) = 2 + 1 = 3 is computed and stored.]

index: 0 1 2 3 4 5
value: 0 1 1 2 3 -

Fibonacci Sequence - Dynamic Programming

Page 44: Dynamic Programming and Reinforcement Learning applied to Tetris Game

[Figure: f(5) = 3 + 2 = 5 is computed and stored.]

index: 0 1 2 3 4 5
value: 0 1 1 2 3 5

O(n) memory for the memo table (O(1) if only the last two values are kept), O(n) running time.

Fibonacci Sequence - Dynamic Programming
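The O(1)-memory figure holds for the bottom-up variant that keeps only the last two values; a minimal sketch:

def fib(n):
    # Bottom-up: roll the table down to just the last two values.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(8))  # -> 21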

Page 45: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Some scores over time…

Tsitsiklis and van Roy (1996): 31 (100 games played)
Bertsekas and Tsitsiklis (1996): 3,200 (100 games played)
Kakade (2001): 6,800 (without specifying how many game scores are averaged, though)
Farias and van Roy (2006): 4,700 (90 games played)

— Building Controllers for Tetris, 2009

Page 46: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Dellacherie (Fahey, 2003): one-piece controller, tuned by hand: 660 thousand (56 games played).

Dellacherie (Fahey, 2003): two-piece controller with some original features whose weights were tuned by hand: 7.2 million (only 1 game was played, and it took a week).

Current best!

— Building Controllers for Tetris, 2009

Page 47: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Experiment…

Page 48: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Experiment

— Feature-Based Methods for Large Scale Dynamic Programming

An experienced human Tetris player would take about 3 minutes to eliminate 30 rows.

Page 49: Dynamic Programming and Reinforcement Learning applied to Tetris Game

20 players, 3 games each, 3 minutes per game.

Experiment cont.

Average obtained: a score of 24

Page 50: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Player 7 (me), game 1

1,000 points ≈ 1 row

Experiment cont.

Page 51: Dynamic Programming and Reinforcement Learning applied to Tetris Game

• Average: a score of 24 every 3 minutes.

• That is, 5,760 per 12 hours of continuous play.

• A human player only starts to approach the performance of the algorithms (after some optimizations) after roughly 8 hours of continuous play.

Experiment cont.

Page 52: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Conclusion…

Page 53: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Dynamic Programming: optimizes the use of computational power.

Reinforcement Learning: optimizes the weights used on the features.

Tetris: uses feature-based methods to maximize the score.

Page 54: Dynamic Programming and Reinforcement Learning applied to Tetris Game

Questions?

Suelen Goularte Carvalho
Inteligência Artificial

2015