TRANSCRIPT
LegalMoves =
16 19, or
11 15, or
….
SimulateMove = 16 19
V = 30 points
ChooseMove = 16 19
For all legal moves:
- simulate that move
- check the value of the result
Pick the best move
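The three-step loop above can be sketched in a few lines. Everything here is a stand-in invented for illustration: the "board" is just a number, and `legal_moves`, `simulate_move` and `v` are toy functions, not a real checkers engine.

```python
# Toy sketch of ChooseMove, with stand-in functions for the move
# generator, the simulator and the scoring function V.

def choose_move(board, legal_moves, simulate_move, v):
    """Return the legal move whose simulated result scores highest under V."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(board):
        result = simulate_move(board, move)   # simulate that move
        score = v(result)                     # check the value of the result
        if score > best_score:                # keep the best move so far
            best_move, best_score = move, score
    return best_move

# Tiny fake game: a "board" is an int, a move adds its value to the board.
moves = lambda b: [1, 2, 3]
sim = lambda b, m: b + m
value = lambda b: -abs(b - 5)   # boards near 5 score best

print(choose_move(3, moves, sim, value))  # → 2 (3 + 2 = 5 scores highest)
```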
LegalMoves =
12 16, or
11 15, or
….
SimulateMove = 12 16
V = 30 points
We’ll make our computer program learn the function V
- so when it sees a board, it will assign a score to the board
- it has to learn how to assign the correct score to a board
First, we have to define how our function V is supposed to behave…
…but what about boards in the middle of a game?

[diagram: a mid-game board b, its successor(b), and so on through the rest of the game, ending at a final board b']

if the game played out from b ends in a win, then V(b') = 100
…and we set V(b) = V(b') = 100
[diagram: from b, some games end at a winning final board with V(b') = 100, others at a losing final board with V(b') = −100]

a game ending in a win gives V(b) = V(b') = 100
a game ending in a loss gives V(b) = V(b') = −100, so over many games the two pull V(b) towards 0
NB: if this board state occurs in 50% wins and 50% losses, then V(b) will eventually be 0
NB: if this board state occurs in 75% wins and 25% losses, then V(b) will eventually be 50
so each board’s score will eventually be a mix of the number of games that you can win, lose or draw from that board (assuming an ideal scoring function… in fact we often only approximate this)
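The two NB claims above can be checked directly: the long-run score of a board is just the average of the ±100 end-of-game values, weighted by how often that board leads to a win or a loss. The function below is a sketch of that arithmetic.

```python
# Long-run value of a board that leads to a win with probability p_win,
# a loss with probability p_lose, and a draw otherwise, given
# end-of-game scores of +100 / -100 / 0.

def expected_v(p_win, p_lose, p_draw=0.0):
    return 100 * p_win - 100 * p_lose + 0 * p_draw

print(expected_v(0.50, 0.50))  # → 0.0   (50% wins, 50% losses)
print(expected_v(0.75, 0.25))  # → 50.0  (75% wins, 25% losses)
```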
that’s our definition of V – this is how our computer program’s V should learn how to behave…
But in the middle of a game, how can you calculate the path to the final board?
- generate all possibilities? feasible for tic-tac-toe…
- …but for chess it takes too long!
- so approximate V and call it V’ (“V-hat”)
V’(b) ≈ V(b) …so we’ll learn an approximation to V
Now we’ll choose our representation of V’
(what is a board anyway…?)
1) big table?
input: board → output: score
[board diagram] → 50
[board diagram] → 90
….
2) CLIPS rules?
IF my piece is near a side edge
THEN score = 80
IF my piece is near the opponent’s edge
THEN score = 90
….
3) polynomial function?
we define some variables…
X1 = number of white pieces on board
X2 = number of red pieces
X3 = number of white kings
X4 = number of red kings
X5 = number of white pieces threatened by red (can be captured on red’s next turn)
X6 = number of red pieces threatened by white
e.g. for one particular board: X1 = 12, X2 = 11, X3 = 0, X4 = 0, X5 = 1, X6 = 0
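Extracting x1…x6 from a board might look like the sketch below. The board encoding (a flat list of (colour, is_king) pairs) and the precomputed `threatened` set are placeholders invented here — a real checkers program would derive threats from the actual piece positions.

```python
# Hypothetical feature extraction for x1..x6. `pieces` is a list of
# (colour, is_king) pairs; `threatened` is the set of indices of pieces
# that could be captured on the opponent's next turn.

def features(pieces, threatened):
    x1 = sum(1 for c, k in pieces if c == "white")            # white pieces
    x2 = sum(1 for c, k in pieces if c == "red")              # red pieces
    x3 = sum(1 for c, k in pieces if c == "white" and k)      # white kings
    x4 = sum(1 for c, k in pieces if c == "red" and k)        # red kings
    x5 = sum(1 for i, (c, _) in enumerate(pieces)
             if c == "white" and i in threatened)             # white threatened
    x6 = sum(1 for i, (c, _) in enumerate(pieces)
             if c == "red" and i in threatened)               # red threatened
    return [x1, x2, x3, x4, x5, x6]

# The example board above: 12 white, 11 red, no kings, one white piece
# under threat.
board = [("white", False)] * 12 + [("red", False)] * 11
print(features(board, threatened={0}))  # → [12, 11, 0, 0, 1, 0]
```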
arrange as a linear combination:
V’(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
(a linear function of the features; the weights wi are the numbers the program will learn)
The computer program will change the values of the weights – it will learn what the weights should be to give the correct score for each board.
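The linear combination is one line of code once the features are in hand. The only weight values the slides give are w0 = 23 and w1 = 0.5; the zeros below are filler so the example runs.

```python
# V'(b) = w0 + w1*x1 + ... + w6*x6, with the weights as a plain list.

def v_hat(weights, xs):
    """weights = [w0, w1, ..., w6]; xs = [x1, ..., x6]."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], xs))

w = [23, 0.5, 0, 0, 0, 0, 0]            # w0 = 23, w1 = 0.5, rest filler
print(v_hat(w, [12, 11, 0, 0, 1, 0]))   # → 29.0 (23 + 0.5 * 12)
```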
learning process – playing and learning at the same time
computer will play against itself
initialise V’ with random weights (w0=23 etc.)
V’(b) = 23 + 0.5 x1 +…
start a new game - for each board
(a) calculate V’ on all possible legal moves
board 1:
b = <x1=12, x2=12,…,x6=1>
V’(<x1=12, x2=12,…,x6=1>) = 23 + 0.5·x1 + … = 30
For example, one of the boards resulting from a legal move might be:
LegalMoves =
12 16, or
11 15, or
….
SimulateMove = 12 16
V = 30 points
First time round, V’ is essentially random (because we set the weights randomly) – as it learns, V’ should pick better successors.
(b) pick the successor with the highest score
(c) evaluate error
board 1 –(red plays 12 16)→ board 2
At this point:
board 1 = b
board 2 = successor(b)
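With board 1 = b and board 2 = successor(b), a standard choice for step (c) is to take V’ of the successor as the training target for b, echoing the V(b) = V(b’) idea from earlier. The numbers below are illustrative, not taken from a real game.

```python
# Step (c): Vtrain(b) is taken to be V'(successor(b)); the error is the
# gap between that target and the current estimate V'(b).

def training_error(v_hat_b, v_hat_successor):
    v_train = v_hat_successor   # target for b borrowed from its successor
    return v_train - v_hat_b

print(training_error(v_hat_b=30, v_hat_successor=45))  # → 15
```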
(d) modify each weight to correct the error
the update for each weight wi:
wi ← wi + η · error · xi
where error = Vtrain(b) − V’(b), wi is the weight to update, η is the learning rate, and the factor xi modifies each weight in proportion to the size of its attribute value
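Step (d) can be sketched as the LMS-style rule the slide’s labels suggest (error × learning rate × attribute value). The learning rate η and all the numbers below are illustrative.

```python
# LMS-style weight update: w_i <- w_i + eta * error * x_i.
# The bias weight w0 is updated with an implicit x0 = 1.

def update_weights(weights, xs, error, eta=0.01):
    new = [weights[0] + eta * error * 1]                        # bias, x0 = 1
    new += [w + eta * error * x for w, x in zip(weights[1:], xs)]
    return new

w = update_weights([23, 0.5, 0, 0, 0, 0, 0], [12, 11, 0, 0, 1, 0], error=15)
print([round(v, 2) for v in w])  # → [23.15, 2.3, 1.65, 0.0, 0.0, 0.15, 0.0]
```

Note how the weights of large attributes (x1 = 12, x2 = 11) move much more than the weight of the small one (x5 = 1) — that is the "in proportion to attribute value" part of the rule.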
At this point the change probably won’t be in a useful direction, because the weights are all still random.
(e) repeat until end-of-game
board 1 → board 2 → …. → red wins
b is a final board state
Vtrain(b) = 100 …because we won
now we’re shifting our V’ towards something useful (a little bit)
1 million games later and our V’ might now predict a useful score for any board that we might see
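The whole loop, minus the checkers, can be demonstrated on synthetic data: instead of self-play we invent a "true" scoring function over random feature vectors and let the repeated weight updates chase it. The target weights, learning rate, and iteration count are all made up; the point is only that the many-small-updates mechanism converges.

```python
import random

random.seed(0)  # deterministic run

def v_hat(w, xs):
    """Linear V': w[0] is the bias, w[1:] pair up with the features."""
    return w[0] + sum(wi * x for wi, x in zip(w[1:], xs))

true_w = [0, 5, -5, 9, -9, -3, 3]                # invented "ideal" weights
w = [random.uniform(-1, 1) for _ in range(7)]    # start with random weights
eta = 0.001                                      # small learning rate

for _ in range(20000):
    xs = [random.randint(0, 12) for _ in range(6)]   # a random "board"
    error = v_hat(true_w, xs) - v_hat(w, xs)         # stands in for Vtrain - V'
    w = [w[0] + eta * error] + [wi + eta * error * x
                                for wi, x in zip(w[1:], xs)]

xs = [12, 11, 0, 0, 1, 0]
print(round(v_hat(true_w, xs), 2), round(v_hat(w, xs), 2))
```

After 20,000 updates the learned estimate sits on top of the invented target — the "1 million games later" effect of the slide, in miniature.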