Dynamic Programming and Reinforcement Learning applied to the game of Tetris
Suelen Goularte Carvalho
Inteligência Artificial 2015
Tetris
✓ Board 20 × 10
✓ 7 types of tetrominoes (pieces)
✓ Pieces move down, left, or right
✓ Pieces can be rotated
Tetris One-Piece Controller
Player knows: ✓ board ✓ current piece.
Tetris Two-Piece Controller
Player knows: ✓ board ✓ current piece ✓ next piece
Tetris Evaluation
One-Piece Controller vs. Two-Piece Controller
How many possibilities do we have just here?
Tetris indeed contains a huge number of board configurations. Finding the strategy that maximizes the average score is an NP-Complete problem!
— Building Controllers for Tetris, 2009
Tetris Complexity
7.0 × 2^199 ≃ 5.6 × 10^60 possible states
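A quick arithmetic check of that reconstructed figure (my own sketch, not from the deck): 7 current-piece types times 2^199 board fillings.

    # Check the slide's complexity figure: 7 piece types x 2^199 board fillings.
    states = 7 * 2 ** 199
    print(format(states, ".2e"))    # -> 5.62e+60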
Tetris is a problem of sequential decision making under uncertainty.
In the context of dynamic programming and stochastic control, the most important object is the cost-to-go function, which evaluates the expected future cost from the current state.
— Feature-Based Methods for Large Scale Dynamic Programming
[Diagram: from state Si, candidate moves pay immediate rewards of 1000, 2500, 3000, 4000, 5000, and 7000. The best immediate reward (7000) leads to a future reward of only 9000, while a smaller immediate reward (5000) leads to the best future reward, 13000.]
Immediate reward vs. future reward
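A toy sketch of the trade-off in the diagram, reusing its numbers (the pairing of immediate and future rewards is my reading of the figure):

    # Each option: (immediate reward, total future reward), per the diagram.
    options = [(7000, 9000), (5000, 13000)]
    best_immediate = max(options)[0]                  # 7000: the greedy choice
    best_future = max(options, key=lambda o: o[1])    # (5000, 13000)
    print(best_immediate, best_future)                # greedy loses long-term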
7.0 × 2^199 ≃ 5.6 × 10^60 possible states
Essentially impossible to compute, or even store, the value of the cost-to-go function at every
possible state.
— Feature-Based Methods for Large Scale Dynamic Programming
A compact representation alleviates the computational time and space requirements of dynamic programming, which employs an exhaustive look-up table, storing one value per state.
— Feature-Based Methods for Large Scale Dynamic Programming
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n
For example, if the state i represents the number of customers in a queueing
system, a possible and often interesting feature f is defined by f(0) = 0 and f(i) = 1 if i > 0. Such a feature focuses on whether
a queue is empty or not.
— Feature-Based Methods for Large Scale Dynamic Programming
Feature-based methods
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n
— Feature-Based Methods for Large Scale Dynamic Programming
Features:
★ Height of the current wall: H = {0, …, 20}
★ Number of holes: L = {0, …, 200}
Feature extraction F : S → H × L (board 10 × 20)
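A minimal sketch of this extractor, assuming the board is a 20×10 grid of 0/1 cells (the representation is my choice, not the paper's):

    def features(board):
        """Map a 20x10 board of 0/1 cells to (wall height, number of holes)."""
        rows, cols = len(board), len(board[0])
        max_height, holes = 0, 0
        for c in range(cols):
            # Index of the topmost filled cell in this column (rows if empty).
            top = next((r for r in range(rows) if board[r][c]), rows)
            max_height = max(max_height, rows - top)
            # Holes: empty cells below the column's topmost filled cell.
            holes += sum(1 for r in range(top + 1, rows) if not board[r][c])
        return max_height, holes    # F : S -> H x L, H = {0..20}, L = {0..200}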
Feature-based methods
Using a feature-based evaluation function works better than just choosing the move that realizes the highest immediate reward.
— Building Controllers for Tetris, 2009
Example of features
— Building Controllers for Tetris, 2009
...The problem of building a Tetris controller comes down to building a good evaluation function. Ideally,
this function should return high values for the good decisions and
low values for the bad ones.
— Building Controllers for Tetris, 2009
In the Reinforcement Learning context, algorithms aim at tuning the weights such that the evaluation function approximates well the optimal expected future score from each state.
— Building Controllers for Tetris, 2009
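A minimal sketch of such an evaluation function, linear in the features above; the weights are illustrative placeholders, not the tuned values from the paper:

    # V(board) = w_height * height + w_holes * holes; RL algorithms tune these
    # weights so that V approximates the optimal expected future score.
    WEIGHTS = {"height": -1.0, "holes": -4.0}    # illustrative values only

    def evaluate(board):
        height, holes = features(board)          # extractor sketched above
        return WEIGHTS["height"] * height + WEIGHTS["holes"] * holes

    def best_placement(candidate_boards):
        # Pick the placement whose resulting board the evaluation likes most.
        return max(candidate_boards, key=evaluate)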
Reinforcement Learning
Reinforcement Learning by The Big Bang Theory
https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C
Reinforcement Learning
Imagine playing a new game whose rules you don't know; after roughly a hundred moves, your opponent announces, "You lose!" In a nutshell, that is reinforcement learning.
Supervised Learning
input:  1 2 3 4 5 6 7 8 …
output: 1 4 9 16 25 36 49 64 …
y = f(x) → function approximation
f(x) = x²
Map inputs to outputs — labels score well
https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo
Unsupervised Learning
xx
x
xx
x
xxx
x
o
o
oo
oo
o
o
f(x) -> clusters description
oo x
xx
xxx
x
xx
x
oo
oo
oo
o oootype
clusters
scores well
Reinforcement Learning
[Diagram: the Agent takes an Action in the Environment, which returns a Reward and the next State]
behaviors score well
Reinforcement Learning
✓ Agents take actions in an environment and receive rewards
✓ Goal is to find the policy π that maximizes rewards
✓ Inspired by research into psychology and animal learning
Reinforcement Learning Model
Given:
S — set of states
A — set of actions
T(s, a, s') ~ P(s' | s, a) — transition model
R — reward function
[Diagram repeated: immediate vs. future rewards from state Si]
Find:
π(s) = a — a policy that maximizes the expected future reward
Needs substantial computation, processing, and memory.
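A minimal value-iteration sketch over such an (S, A, T, R) model — dynamic programming applied to the RL model — on a hypothetical two-state MDP; all states, actions, probabilities, and rewards are made up for illustration:

    # Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
    S = ["low", "high"]                          # hypothetical states
    A = ["wait", "work"]                         # hypothetical actions
    T = {("low", "wait"):  {"low": 1.0},
         ("low", "work"):  {"high": 0.8, "low": 0.2},
         ("high", "wait"): {"high": 1.0},
         ("high", "work"): {"high": 0.6, "low": 0.4}}
    R = {("low", "wait"): 0, ("low", "work"): -1,
         ("high", "wait"): 2, ("high", "work"): 3}
    gamma = 0.9                                  # discount factor

    V = {s: 0.0 for s in S}
    for _ in range(100):                         # iterate to near convergence
        V = {s: max(R[s, a] + gamma * sum(p * V[s2] for s2, p in T[s, a].items())
                    for a in A)
             for s in S}
    pi = {s: max(A, key=lambda a: R[s, a]
                 + gamma * sum(p * V[s2] for s2, p in T[s, a].items()))
          for s in S}
    print(V, pi)                                 # value and greedy policy per state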
Dynamic Programming
Solving a problem by breaking it down into simpler subproblems, solving each subproblem just once, and storing their solutions.
https://en.wikipedia.org/wiki/Dynamic_programming
[Diagram: if the optimal path from A to G passes through B, it is the optimal path A → B followed by the optimal path B → G]
Supporting property: Optimal Substructure
Fibonacci Sequence
0 1 1 2 3 5 8 13 21
Each number is the sum of the two preceding numbers.
Recursive formula: f(n) = f(n-1) + f(n-2)
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
Fibonacci Sequence
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
f(6) = f(6-1) + f(6-2)
f(6) = f(5) + f(4)
f(6) = 5 + 3
f(6) = 8
Fibonacci Sequence - Normal computation
[Recursion tree for f(6): each node n branches into n-1 and n-2, down to the base cases 1 and 0]
f(n) = f(n-1) + f(n-2)
O(2^n) running time
18 of 25 nodes are repeated calculations!
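The tree above as direct Python — a plain transcription of the recursive formula, before any memoization:

    def fib(n):
        # Recomputes the same subproblems over and over: O(2^n) calls.
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)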
Fibonacci Sequence - Dynamic Programming

    m = {0: 0, 1: 1}          # memo table

    def fib(n):
        if n not in m:        # compute each value only once
            m[n] = fib(n - 1) + fib(n - 2)
        return m[n]
Fibonacci Sequence - Dynamic Programming
[Recursion tree for f(5), now evaluated with the memo table; the table fills from left to right]
index: 0 1 2 3 4 5
value: 0 1 1 2 3 5
1+0=1, 1+1=2, 2+1=3, 3+2=5
O(n) running time; O(n) memory for the table (O(1) if only the last two values are kept)
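A minimal bottom-up sketch of that last point: keeping only the two most recent table entries gives the O(1)-memory, O(n)-time version.

    def fib(n):
        # Table walk, keeping only the last two entries.
        a, b = 0, 1                # f(0), f(1)
        for _ in range(n):
            a, b = b, a + b        # (f(i), f(i+1)) -> (f(i+1), f(i+2))
        return a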
Some scores over time…
Tsitsiklis and van Roy (1996): 31 (100 games played)
Bertsekas and Tsitsiklis (1996): 3,200 (100 games played)
Kakade (2001): 6,800 (how many game scores are averaged is not specified, though)
Farias and van Roy (2006): 4,700 (90 games played)
— Building Controllers for Tetris, 2009

Current best!
Dellacherie (Fahey, 2003), one-piece controller, tuned by hand: 660,000 (56 games played)
Dellacherie (Fahey, 2003), two-piece controller with some original features, weights tuned by hand: 7,200,000 (only 1 game was played, and it took a week)
— Building Controllers for Tetris, 2009
Experiment
An experienced human Tetris player would take about 3 minutes to eliminate 30 rows.
— Feature-Based Methods for Large Scale Dynamic Programming
20 players. 3 games each. 3 minutes per game.
Experiment cont.
[Chart: scores per game; average obtained: 24 points]
Player 7 (me), game 1
1000 points ≈ 1 row
Experiment cont.
• Average of 24 points every 3 minutes.
• That is, 5,760 points per 12 hours of continuous play.
• A human player only starts to approach the algorithms' performance (after some optimizations) after roughly 8 hours of continuous play.
Conclusion…
Dynamic Programming: optimizes the use of computational power.
Reinforcement Learning: optimizes the weights used for the features.
Tetris: uses feature-based methods to maximize the score.
Questions?
Suelen Goularte Carvalho
Inteligência Artificial 2015