TRANSCRIPT
LegalMoves =
16 19, or
11 15, or
….
SimulateMove = 16 19
V = 30 points
ChooseMove = 16 19
For all legal moves:
- simulate that move
- check the value of the result
Pick the best move
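The three-step loop above can be sketched in a few lines. Everything here is a stand-in invented for illustration: the "board" is just a number, and `legal_moves`, `simulate_move` and `v` are toy functions, not a real checkers engine.

```python
# Toy sketch of ChooseMove, with stand-in functions for the move
# generator, the simulator and the scoring function V.

def choose_move(board, legal_moves, simulate_move, v):
    """Return the legal move whose simulated result scores highest under V."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(board):
        result = simulate_move(board, move)   # simulate that move
        score = v(result)                     # check the value of the result
        if score > best_score:                # keep the best move so far
            best_move, best_score = move, score
    return best_move

# Tiny fake game: a "board" is an int, a move adds its value to the board.
moves = lambda b: [1, 2, 3]
sim = lambda b, m: b + m
value = lambda b: -abs(b - 5)   # boards near 5 score best

print(choose_move(3, moves, sim, value))  # → 2 (3 + 2 = 5 scores highest)
```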
LegalMoves =
12 16, or
11 15, or
….
SimulateMove = 12 16
V = 30 points
We’ll make our computer program learn the function V
- so when it sees a board, it will assign a score to the board
- it has to learn how to assign the correct score to a board
First, we have to define how our function V is supposed to behave…
…but what about boards in the middle of a game?

[diagram: a mid-game board b, its successor(b), and so on through the rest of the game, ending at a final board b']

if the game played out from b ends in a win, then V(b') = 100
…and we set V(b) = V(b') = 100
[diagram: from b, some games end at a winning final board with V(b') = 100, others at a losing final board with V(b') = −100]

a game ending in a win gives V(b) = V(b') = 100
a game ending in a loss gives V(b) = V(b') = −100, so over many games the two pull V(b) towards 0
NB: if this board state occurs in 50% wins and 50% losses, then V(b) will eventually be 0
NB: if this board state occurs in 75% wins and 25% losses, then V(b) will eventually be 50
so each board’s score will eventually be a mix of the number of games that you can win, lose or draw from that board (assuming an ideal scoring function… in fact we often only approximate this)
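The two NB claims above can be checked directly: the long-run score of a board is just the average of the ±100 end-of-game values, weighted by how often that board leads to a win or a loss. The function below is a sketch of that arithmetic.

```python
# Long-run value of a board that leads to a win with probability p_win,
# a loss with probability p_lose, and a draw otherwise, given
# end-of-game scores of +100 / -100 / 0.

def expected_v(p_win, p_lose, p_draw=0.0):
    return 100 * p_win - 100 * p_lose + 0 * p_draw

print(expected_v(0.50, 0.50))  # → 0.0   (50% wins, 50% losses)
print(expected_v(0.75, 0.25))  # → 50.0  (75% wins, 25% losses)
```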
that’s our definition of V – this is how our computer program’s V should learn how to behave…
But in the middle of a game, how can you calculate the path to the final board?
- generate all possibilities? feasible for tic-tac-toe…
- …but for chess it takes too long!
- so approximate V and call it V’ (“V-hat”)
V’(b) ≈ V(b) …so we’ll learn an approximation to V
Now we’ll choose our representation of V’
(what is a board anyway…?)
1) big table?
input: board → output: score
[board diagram] → 50
[board diagram] → 90
….
2) CLIPS rules?
IF my piece is near a side edge
THEN score = 80
IF my piece is near the opponent’s edge
THEN score = 90
….
3) polynomial function?
we define some variables…
X1 = number of white pieces on board
X2 = number of red pieces
X3 = number of white kings
X4 = number of red kings
X5 = number of white pieces threatened by red (can be captured on red’s next turn)
X6 = number of red pieces threatened by white
e.g. for one particular board: X1 = 12, X2 = 11, X3 = 0, X4 = 0, X5 = 1, X6 = 0
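Extracting x1…x6 from a board might look like the sketch below. The board encoding (a flat list of (colour, is_king) pairs) and the precomputed `threatened` set are placeholders invented here — a real checkers program would derive threats from the actual piece positions.

```python
# Hypothetical feature extraction for x1..x6. `pieces` is a list of
# (colour, is_king) pairs; `threatened` is the set of indices of pieces
# that could be captured on the opponent's next turn.

def features(pieces, threatened):
    x1 = sum(1 for c, k in pieces if c == "white")            # white pieces
    x2 = sum(1 for c, k in pieces if c == "red")              # red pieces
    x3 = sum(1 for c, k in pieces if c == "white" and k)      # white kings
    x4 = sum(1 for c, k in pieces if c == "red" and k)        # red kings
    x5 = sum(1 for i, (c, _) in enumerate(pieces)
             if c == "white" and i in threatened)             # white threatened
    x6 = sum(1 for i, (c, _) in enumerate(pieces)
             if c == "red" and i in threatened)               # red threatened
    return [x1, x2, x3, x4, x5, x6]

# The example board above: 12 white, 11 red, no kings, one white piece
# under threat.
board = [("white", False)] * 12 + [("red", False)] * 11
print(features(board, threatened={0}))  # → [12, 11, 0, 0, 1, 0]
```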
arrange as a linear combination:
V’(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
(a linear function of the features; the weights wi are the numbers the program will learn)
The computer program will change the values of the weights – it will learn what the weights should be to give the correct score for each board.
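The linear combination is one line of code once the features are in hand. The only weight values the slides give are w0 = 23 and w1 = 0.5; the zeros below are filler so the example runs.

```python
# V'(b) = w0 + w1*x1 + ... + w6*x6, with the weights as a plain list.

def v_hat(weights, xs):
    """weights = [w0, w1, ..., w6]; xs = [x1, ..., x6]."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], xs))

w = [23, 0.5, 0, 0, 0, 0, 0]            # w0 = 23, w1 = 0.5, rest filler
print(v_hat(w, [12, 11, 0, 0, 1, 0]))   # → 29.0 (23 + 0.5 * 12)
```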
learning process – playing and learning at the same time
computer will play against itself
initialise V’ with random weights (w0=23 etc.)
V’(b) = 23 + 0.5 x1 +…
start a new game - for each board
(a) calculate V’ on all possible legal moves
board 1:
b = <x1=12, x2=12,…,x6=1>
V’(<x1=12, x2=12,…,x6=1>) = 23 + 0.5·x1 + … = 30
For example, one of the boards resulting from a legal move might be:
LegalMoves =
12 16, or
11 15, or
….
SimulateMove = 12 16
V = 30 points
First time round, V’ is essentially random (because we set the weights randomly) – as it learns, V’ should pick better successors.
(b) pick the successor with the highest score
(c) evaluate error
board 1 –(red plays 12 16)→ board 2
At this point:
board 1 = b
board 2 = successor(b)
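With board 1 = b and board 2 = successor(b), a standard choice for step (c) is to take V’ of the successor as the training target for b, echoing the V(b) = V(b’) idea from earlier. The numbers below are illustrative, not taken from a real game.

```python
# Step (c): Vtrain(b) is taken to be V'(successor(b)); the error is the
# gap between that target and the current estimate V'(b).

def training_error(v_hat_b, v_hat_successor):
    v_train = v_hat_successor   # target for b borrowed from its successor
    return v_train - v_hat_b

print(training_error(v_hat_b=30, v_hat_successor=45))  # → 15
```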
(d) modify each weight to correct the error
the update for each weight wi:
wi ← wi + η · error · xi
where error = Vtrain(b) − V’(b), wi is the weight to update, η is the learning rate, and the factor xi modifies each weight in proportion to the size of its attribute value
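Step (d) can be sketched as the LMS-style rule the slide’s labels suggest (error × learning rate × attribute value). The learning rate η and all the numbers below are illustrative.

```python
# LMS-style weight update: w_i <- w_i + eta * error * x_i.
# The bias weight w0 is updated with an implicit x0 = 1.

def update_weights(weights, xs, error, eta=0.01):
    new = [weights[0] + eta * error * 1]                        # bias, x0 = 1
    new += [w + eta * error * x for w, x in zip(weights[1:], xs)]
    return new

w = update_weights([23, 0.5, 0, 0, 0, 0, 0], [12, 11, 0, 0, 1, 0], error=15)
print([round(v, 2) for v in w])  # → [23.15, 2.3, 1.65, 0.0, 0.0, 0.15, 0.0]
```

Note how the weights of large attributes (x1 = 12, x2 = 11) move much more than the weight of the small one (x5 = 1) — that is the "in proportion to attribute value" part of the rule.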
At this point the change probably won’t be in a useful direction, because the weights are all still random.
(e) repeat until end-of-game
board 1 → board 2 → …. → red wins
b is a final board state
Vtrain(b) = 100 …because we won
now we’re shifting our V’ towards something useful (a little bit)
1 million games later and our V’ might now predict a useful score for any board that we might see
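The whole loop, minus the checkers, can be demonstrated on synthetic data: instead of self-play we invent a "true" scoring function over random feature vectors and let the repeated weight updates chase it. The target weights, learning rate, and iteration count are all made up; the point is only that the many-small-updates mechanism converges.

```python
import random

random.seed(0)  # deterministic run

def v_hat(w, xs):
    """Linear V': w[0] is the bias, w[1:] pair up with the features."""
    return w[0] + sum(wi * x for wi, x in zip(w[1:], xs))

true_w = [0, 5, -5, 9, -9, -3, 3]                # invented "ideal" weights
w = [random.uniform(-1, 1) for _ in range(7)]    # start with random weights
eta = 0.001                                      # small learning rate

for _ in range(20000):
    xs = [random.randint(0, 12) for _ in range(6)]   # a random "board"
    error = v_hat(true_w, xs) - v_hat(w, xs)         # stands in for Vtrain - V'
    w = [w[0] + eta * error] + [wi + eta * error * x
                                for wi, x in zip(w[1:], xs)]

xs = [12, 11, 0, 0, 1, 0]
print(round(v_hat(true_w, xs), 2), round(v_hat(w, xs), 2))
```

After 20,000 updates the learned estimate sits on top of the invented target — the "1 million games later" effect of the slide, in miniature.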