

UCT (Upper Confidence based Tree Search)

An efficient game tree search algorithm for the game of Go by Levente Kocsis and Csaba Szepesvari [1]. The UCB1 (Upper Confidence Bounds) algorithm [2] is employed to search the game tree.

Features of UCT include:
- balancing the exploration-exploitation dilemma,
- a small probability of selecting erroneous moves even if the algorithm is stopped prematurely,
- convergence to the best action if sufficient time is allowed,
- selective sampling of actions with a fixed look-ahead depth d,
- a tree built incrementally by selecting a branch or an episode selectively over time (Figure 2).

Seen from UCT, the Go problem is a multi-armed bandit problem [3], i.e. a K-armed bandit (Figure 1).

Figure 1. The multi-arm bandit problem. Figure 2. The game tree expressed in terms of the Markov Decision Process.

A bandit problem is similar to a slot machine with multiple levers. The goal is to maximize the expected payoff. The problem can be formulated as a Markov Decision Process:
- State s is the current board status.
- Action a is the choice of a bandit, with action set A = {1, …, K}.
- Reward r_{a,n} is the payoff of action a at index n, in the range [0, 1].
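The K-armed bandit formulation above can be sketched in a few lines of Python. This is an illustrative toy, not part of the poster: the hidden payoff probabilities are made-up values, chosen only so the example has something to pull.

```python
import random

random.seed(0)

class KArmedBandit:
    """A K-armed bandit: pulling arm a pays a reward in [0, 1]."""
    def __init__(self, payoff_probs):
        self.payoff_probs = payoff_probs  # hidden mean payoff of each arm
        self.K = len(payoff_probs)

    def pull(self, a):
        """Action a in {0, ..., K-1}; Bernoulli payoff, so r_{a,n} is in [0, 1]."""
        return 1.0 if random.random() < self.payoff_probs[a] else 0.0

# Example: three levers with hidden payoff rates; pull the third repeatedly.
bandit = KArmedBandit([0.2, 0.5, 0.8])
rewards = [bandit.pull(2) for _ in range(1000)]
avg = sum(rewards) / len(rewards)  # the running average approaches the hidden mean 0.8
```

Repeated pulls of one arm estimate that arm's expected payoff; the bandit problem is deciding which arm to pull next when all the means are unknown.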

In the problem of Go, the goal is to select the best move (action, or bandit) to maximize the expected total reward given a state (the positions on the Go board).

UCT selects an action a that maximizes the sum of the estimated action-value function at depth d, Q_t(s,a), and a bias term, b(s,a):

    a = argmax_{a ∈ {1, …, K}} [ Q_t(s,a) + b(s,a) ],   s ∈ S        (1)

[Pseudocode for a generic Monte-Carlo planning algorithm]

The action-value function Q_t(s,a) is estimated by averaging the rewards r(a, N_{s,a,t}), where N_{s,a,t} is the number of times action a has been selected by time t:

    Q_t(s,a) = E[ r(a, N_{s,a,t}) ]        (2)

The bias term used by UCT is given below, where C_d is a non-zero constant to prevent drifting and N_{s,t} is the number of times state s has been visited by time t:

    b(s,a) = C_d * sqrt( 2 ln(N_{s,t}) / N_{s,a,t} )        (3)
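The selection rule of equations (1)-(3) can be sketched as a plain UCB1 loop in Python. This is a minimal illustration, not the poster's implementation: the constant C_d, the arm payoffs, and the 2000-step horizon are assumptions chosen for the example.

```python
import math
import random

random.seed(1)

def ucb1_select(counts, values, t, C_d=1.0):
    """Choose argmax_a [ Q_t(s,a) + b(s,a) ], i.e. equations (1) and (3).
    counts[a] holds N_{s,a,t}; values[a] is the running average Q_t(s,a)."""
    for a, n in enumerate(counts):
        if n == 0:          # pull every arm once so N_{s,a,t} > 0 in the bias term
            return a
    N_s_t = t               # number of times state s has been visited by time t
    def score(a):
        bias = C_d * math.sqrt(2.0 * math.log(N_s_t) / counts[a])  # eq. (3)
        return values[a] + bias
    return max(range(len(counts)), key=score)

# Run UCB1 on a 3-armed Bernoulli bandit with hidden mean payoffs.
true_means = [0.2, 0.5, 0.8]
K = len(true_means)
counts, values = [0] * K, [0.0] * K
for t in range(1, 2001):
    a = ucb1_select(counts, values, t)
    r = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]  # incremental reward average, eq. (2)

best = max(range(K), key=lambda a: counts[a])  # the most-pulled arm
```

As t grows the bias term shrinks for well-sampled arms, so pulls concentrate on the arm with the highest estimated value while every arm keeps receiving occasional exploratory pulls, which is exactly the exploration-exploitation balance UCT inherits.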

A Go Board. Image source: http://upload.wikimedia.org/wikipedia/en/2/2e/Go_board.jpg

GPU GO: Accelerating Monte-Carlo UCT

Tae-hyung Kim*, Ryan Meuth*, Paul Robinette*, Donald C. Wunsch II*

Department of Electrical and Computer Engineering University of Missouri – Rolla, Rolla, MO 65409

[email protected], [email protected], [email protected], [email protected]

Any solution to Go, and this algorithm in particular, can be applied to decision processes such as how to price gas for maximum profit. Image Source: http://www.gasbuddy.com/gb_gastemperaturemap.aspx

The exploration-exploitation components of this algorithm combined with the concept of territory control offered by Go can give insight in deciding where to place a new Starbucks in Manhattan. Image Source: http://gothamist.com/attachments/jake/2006_10_starbuckslocations.jpg

NVIDIA Tesla GPU. Image Source: http://www.nvidia.com/docs/IO/43399/tesla_main.jpg

Applications in Strategy and Economics

Combining novel computational intelligence methods with state-of-the-art brute-force acceleration methods to solve a game with 10^60 states and wide-ranging applications in science and industry.

Go is considered the most important strategy game yet to be solved by computers.

GPU: Graphics Processing Units

GPU capabilities are increasing exponentially through the utilization of parallel architectures.

[Chart: HDP training performance, GPU vs. CPU — completion time (log scale, 0.01 to 1000) against number of iterations (100 to 1E+06).]

The GPU HDP implementation yields a 22x performance increase on legacy hardware.

Modern graphics processing units (GPUs) and game consoles are used for much more than 3D graphics applications and video games. From machine vision to finite element analysis, GPUs are being used in diverse applications, collectively called General-Purpose computation on Graphics Processing Units (GPGPU). Additionally, game consoles are entering the market of high-performance computing as inexpensive nodes in computing clusters.

Significant performance gains can be elicited from implementing Neural Networks and Approximate Dynamic Programming algorithms on graphics processing units. However, there is a degree of art to these implementations. In some cases the performance gains can be as high as 200 times, but they can also be as low as 2 times, or even slower than CPU operation. Thus it is necessary to understand the limitations of the graphics processing hardware and to take these limitations into account when developing algorithms targeted at the GPU.

In the context of Go, GPUs can not only reduce the time to evaluate game states; they can also serve as massively parallel processors, computing move trees in parallel.
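The root-level parallelism described above can be sketched with a thread pool in Python. This is a structural illustration only, not the poster's GPU code: the per-move "win probability" is a toy stand-in for a real Go playout simulator, and on a GPU each task would map to a block of threads rather than a Python thread.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def playout(move_strength, n, seed):
    """Run n random playouts after a candidate move; return the mean reward.
    The hidden 'win probability' per move stands in for a real Go simulator."""
    rng = random.Random(seed)  # each worker gets its own RNG stream
    wins = sum(1 for _ in range(n) if rng.random() < move_strength)
    return wins / n

def evaluate_moves_parallel(move_strengths, n_playouts=2000):
    """Evaluate every candidate move concurrently, one task per move.
    This mirrors computing independent subtrees of the move tree in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(playout, s, n_playouts, i)
                   for i, s in enumerate(move_strengths)]
        return [f.result() for f in futures]

# Toy example: four candidate moves with hidden win rates.
scores = evaluate_moves_parallel([0.3, 0.6, 0.45, 0.7])
best_move = max(range(len(scores)), key=scores.__getitem__)
```

Because each playout batch is independent, the work partitions cleanly across processors; the same decomposition is what makes Monte-Carlo tree search a good fit for massively parallel hardware.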

[Figure 1: a decision maker chooses among the 1st, 2nd, …, Kth bandit via actions a_1, a_2, …, a_K.]

[Figure 2: an episode s_0, a_0, s_1, a_1, …, s_n through the game tree.]

[1] Levente Kocsis and Csaba Szepesvari, "Bandit based Monte-Carlo Planning", Lecture Notes in Artificial Intelligence, 2006.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem", Machine Learning, 2002.
[3] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction", MIT Press, 1998.

The most natural application of Go is in military strategy. The UCT algorithm can be applied to maximizing zones of control, planning efficient and safe routes, and allocating resources correctly. Image Source: http://www.globalsecurity.org/military/ops/images/oif_mil-ground-routes-map_may04.jpg