bachelor's thesis (extended abstract), faculty of computer

4
Supervisor: Prof. Hiroshi Hosobe Q Applying Q-Learning to a Dot-Eat Game Takahiro Morita E-mail: [email protected] Abstract This paper proposes UCB fuzzy Q-learning by combining fuzzy Q-learning and the UCBQ algorithm and applies it to a dot-eat game. The UCBQ algorithm improved the action selection method called the UCB algorithm by applying it to Q-learning. The UCB algorithm selects the action with the highest value, called the UCB value, instead of a value estimate. In addition, since this algorithm is based on the premise that any unselected actions are selected and value estimates are obtained, the number of unselected actions is small, and it is possible to prevent local solutions. The proposed method aims to promote learning efficiently by reducing unselected actions and preventing the Q value from becoming a local solution in fuzzy Q-learning in a dot-eat game. This paper applies the proposed method to a dot- eat game called Ms. PacMan, and presents an experiment on finding optimum values used in the method. Its evaluation is performed by comparing the game score with the score obtained by the AI of a previous study. The result shows that the proposed method significantly reduced unselected actions. 1. AI AI 2017 AI Ponanza AI AI AlphaGo AI 3 AI Q [1] Q Q Q Q [2] Q Q Q UCBQ [3] UCB Q Q Q UCBQ UCB [4] Q UCB UCB Ms. PacMan [5] AI 2. Q AI [2] Q [6] Q 2 1 Ms. PacMan AI DeLooze [5] Q 2017 Microsoft AI [7] Ms. PacMan AI Hybrid Reward Architecture 150 Bachelor's Thesis (Extended Abstract), Faculty of Computer and Information Sciences, Hosei University 1

Upload: others

Post on 26-Nov-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bachelor's Thesis (Extended Abstract), Faculty of Computer

Supervisor: Prof. Hiroshi Hosobe

Q Applying Q-Learning to a Dot-Eat Game

Takahiro Morita

E-mail: [email protected]

Abstract

This paper proposes UCB fuzzy Q-learning by combining fuzzy Q-learning and the UCBQ algorithm and applies it to a dot-eat game. The UCBQ algorithm improved the action selection method called the UCB algorithm by applying it to Q-learning. The UCB algorithm selects the action with the highest value, called the UCB value, instead of a value estimate. In addition, since this algorithm is based on the premise that any unselected actions are selected and value estimates are obtained, the number of unselected actions is small, and it is possible to prevent local solutions. The proposed method aims to promote learning efficiently by reducing unselected actions and preventing the Q value from becoming a local solution in fuzzy Q-learning in a dot-eat game. This paper applies the proposed method to a dot-eat game called Ms. PacMan, and presents an experiment on finding optimum values used in the method. Its evaluation is performed by comparing the game score with the score obtained by the AI of a previous study. The result shows that the proposed method significantly reduced unselected actions.

1. AI AI

2017AI Ponanza AI

AI AlphaGo AI 3AI

Q [1] QQ

QQ [2]

Q Q

Q UCBQ [3]

UCB Q

QQ

UCBQUCB [4] Q

UCBUCB

Ms. PacMan

[5] AI

2. Q AI

[2] Q

[6] Q2

1

Ms. PacMan

AI DeLooze [5]

Q2017 Microsoft AI [7] Ms. PacMan

AI Hybrid Reward Architecture

150

Bachelor's Thesis (Extended Abstract), Faculty of Computer and Information Sciences, Hosei University

1

Page 2: Bachelor's Thesis (Extended Abstract), Faculty of Computer

3. Ms. PacMan Ms.PacMan 1980 PacMan

( 1)

200 400 800 1600

1 Ms. PacMan

PacMan ( )AI

Ms. PacMan

IEEE CEC

4.

4.1. Q Q 1992 Watkins [1]

()

QQ

4.2. Q Q [2] 2008

Q

QQ

DeLooze [5] Q Ms. PacMan

Q Q

4.3. UCB UCB [4] Q Q

UCB

QUCB

4.4. UCBQ [3] UCB Q

UCBQQ UCB

-greedy

Q

UCB

Q UCBQQ UCB Q

Ms. PacMan

UCB

Bachelor's Thesis (Extended Abstract), Faculty of Computer and Information Sciences, Hosei University

2

Page 3: Bachelor's Thesis (Extended Abstract), Faculty of Computer

1. ( ) ( ) ( )( )

2. UCB

3.

Q 2

5~20 1 UCB Q

6UCB

11

1415

Q 17s a

18 Q UCB

procedure UCB fuzzy Q-learning

1 begin 2 initialize , UCB, , ,

numEpisode 3 for cycle := 1 to numEpisode 4 # 5 while (not done) do 6 if 7 8 else 9

10 end 11

12 13 for cycle := 0 to len(Q) 14

15

16 end 17

18

19 20 end 21 end 22 end

2 UCB Q

6. Ms. PacMan OpenAIGym

OpenAIGym OpenAI

OpenAIGym

ATARI Ms. PacMan

OpenCV OpenCV

OpenAIGym Ms. PacMan

( )

3 427

3

50

3

4 ( )

UCBQ

Q 1

00 UCBQ

Q

0

1

100 150 200

0

1

20 60 100

Bachelor's Thesis (Extended Abstract), Faculty of Computer and Information Sciences, Hosei University

3

Page 4: Bachelor's Thesis (Extended Abstract), Faculty of Computer

2 6 UCB Q2 4 2

5 6

1

0 0 0 3360 1 0 0.2 3480 2 0.01 0.2 3560 3 0 0.2 3120 4 0.01 0.2 3300 5 0.1 0.2 3280 6 3.0 0.2 2460

0 Q 50

2 4 5 62 4 5 6 Q

50Q

8.

2 5 6 4.3

6

0.012

Q

02

0.03 QMs. PacMan Q

Q 2 Q1 Q

2

100

2

3

9. Q UCBQ

UCB QMs. PacMan UCBQ

Ms. PacMan 3

UCB Q

[1] C. J. Watkins and P. Dayan, "Q-Learning," Kluwer

Academic Publishers, 1992.

[2] , , , " Q," ,

vol. 15, no. 6, pp. 702-707, 2003.

[3] , , , " UCB," 30, pp. 174-179, 2014.

[4] P. Auer, N. Cesa-Bianchi and P. Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning, vol. 47, pp. 235-256, 2002.

[5] L. L. DeLooze and W. R. Viner, "Fuzzy Q-Learning in a Nondeterministic Environment: Developing an Intelligent Ms. Pac-Man Agent," Proc. IEEE Symposium on Computational Intelligence and Games, pp. 162-169, 2009.

[6] , , , "Q ," 29

, pp. 1006-1011, 2013.

[7] H. v. Senjen, M. Fateme, J. Romoff, R. Laroche, T. Barnes and J. Tsang, "Hybrid Reward Architecture for Reinforcement Learning," arXiv:1706.04208v2, 2017.

Bachelor's Thesis (Extended Abstract), Faculty of Computer and Information Sciences, Hosei University

4