
Page 1:

TTIC 31230, Fundamentals of Deep Learning

David McAllester, Winter 2020

AlphaZero Background Algorithms

1

Page 2:

AlphaGo Fan (October 2015)

AlphaGo Defeats Fan Hui, European Go Champion.

2

Page 3:

AlphaGo Lee (March 2016)

3

Page 4:

AlphaGo Zero vs. AlphaGo Lee (April 2017)

AlphaGo Lee:

• Trained on both human games and self play.

• Trained for months.

• Run on many machines with 48 TPUs for the Lee Sedol match.

AlphaGo Zero:

• Trained on self play only.

• Trained for 3 days.

• Run on one machine with 4 TPUs.

• Defeated AlphaGo Lee under match conditions, 100 games to 0.

4

Page 5:

AlphaZero Defeats Stockfish in Chess (December 2017)

AlphaGo Zero was a fundamental algorithmic advance for general RL.

The general RL algorithm of AlphaZero is essentially the same as that of AlphaGo Zero.

5

Page 6:

Some Background

6

Page 7:

Monte-Carlo Tree Search (MCTS)

Brugmann (1993)

To estimate the value of a position (who is ahead and by how much), run a cheap stochastic policy to generate a sequence of moves (a rollout) and see who wins.

Select the move with the best rollout value.
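As a concrete illustration, here is a minimal Python sketch of rollout-based move selection. The position interface (copy, is_terminal, legal_moves, play, winner) is hypothetical and not from the slides; the point is the shape of the computation: average random-playout outcomes and pick the move with the best average.

```python
import random

def rollout_value(position, player, num_rollouts=100):
    """Estimate the value of `position` for `player` by averaging the
    outcomes of cheap random playouts (Brugmann-style Monte-Carlo evaluation)."""
    wins = 0
    for _ in range(num_rollouts):
        pos = position.copy()
        while not pos.is_terminal():
            # cheap stochastic policy: a uniformly random legal move
            pos = pos.play(random.choice(pos.legal_moves()))
        if pos.winner() == player:
            wins += 1
    return wins / num_rollouts

def best_move(position, player, num_rollouts=100):
    """Select the move whose successor position has the best rollout value."""
    return max(position.legal_moves(),
               key=lambda m: rollout_value(position.play(m), player, num_rollouts))
```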

7

Page 8:

(One Armed) Bandit Problems

Robbins (1952)

Consider a set of choices (different slot machines). Each choice gets a stochastic reward.

We can select a choice and get a reward as often as we like.

We would like to determine which choice is best and also to get reward as quickly as possible.
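A minimal sketch of the setting, assuming (purely for illustration) that each machine pays a Gaussian reward with an unknown mean: the player only sees sampled rewards and must trade off exploring the machines against exploiting the one that currently looks best.

```python
import random

class SlotMachine:
    """One bandit arm with a hidden mean reward (illustrative Gaussian model)."""
    def __init__(self, mean):
        self._mean = mean                        # unknown to the player

    def pull(self):
        return random.gauss(self._mean, 1.0)     # stochastic reward

# The player's problem: given only the reward histories below, which arm
# should be pulled next so as to learn the best arm while earning reward?
machines = [SlotMachine(0.2), SlotMachine(0.5), SlotMachine(0.8)]
history = {a: [m.pull()] for a, m in enumerate(machines)}   # try each arm once
```

The UCB rule on the next slide answers exactly this question given such a reward history.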

8

Page 9:

The Upper Confidence Bound (UCB) Algorithm

Lai and Robbins (1985)

For each action choice (bandit) a, construct a confidence interval for its average reward based on the n(a) trials of that action.

µ(a) ∈ µ̂(a) ± 2σ(a)/√n(a)

Always select

argmax_a  µ̂(a) + 2σ(a)/√n(a)
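Here is a minimal sketch of this rule, assuming every arm has already been tried at least once so that an empirical mean and standard deviation exist; the dictionary-of-reward-lists representation (e.g. the `history` from the previous sketch) is an assumption, not something specified on the slide.

```python
import math

def ucb_choice(history):
    """history: dict mapping action a -> nonempty list of observed rewards.
    Return argmax_a  mu_hat(a) + 2*sigma(a)/sqrt(n(a))  as on the slide."""
    def upper_bound(rewards):
        n = len(rewards)
        mu_hat = sum(rewards) / n
        sigma = math.sqrt(sum((r - mu_hat) ** 2 for r in rewards) / n)
        return mu_hat + 2.0 * sigma / math.sqrt(n)
    return max(history, key=lambda a: upper_bound(history[a]))
```

The optimistic term 2σ(a)/√n(a) shrinks as an arm is tried more often, so an arm keeps getting selected only if its empirical mean stays competitive.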

9

Page 10:

The Upper Confidence Tree (UCT) Algorithm

Kocsis and Szepesvari (2006), Gelly and Silver (2007)

The UCT algorithm grows a tree by running “simulations”.

Each simulation descends into the tree to a leaf node, expands that leaf, and returns a value.

In the UCT algorithm each move choice at each position is treated as a bandit problem.

We select the child (bandit) with the largest upper confidence bound, as computed from the simulations that selected that child.
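The sketch below shows one UCT simulation under several assumptions not fixed by the slide: a UCB1-style exploration term with constant c, the hypothetical position interface and `rollout_value` from the MCTS sketch above (a single random rollout serves as the leaf value), and no alternating-player sign handling, which a real Go implementation would need.

```python
import math

class Node:
    """A search-tree node: position, children indexed by move, visit statistics."""
    def __init__(self, position):
        self.position = position
        self.children = {}        # move -> Node
        self.visits = 0
        self.value_sum = 0.0      # sum of simulation values backed up through this node

def uct_select(node, c=1.4):
    """Pick the child with the largest upper confidence bound (UCB1-style term)."""
    def ucb(child):
        if child.visits == 0:
            return float("inf")   # unvisited children are tried first
        mean = child.value_sum / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=ucb)

def uct_simulate(root, player):
    """One simulation: descend to a leaf, expand it, evaluate it with a rollout,
    and back the value up along the path."""
    node, path = root, [root]
    while node.children and not node.position.is_terminal():
        node = uct_select(node)
        path.append(node)
    if not node.position.is_terminal():
        for m in node.position.legal_moves():                 # expand the leaf
            node.children[m] = Node(node.position.play(m))
    value = rollout_value(node.position, player, num_rollouts=1)  # from the MCTS sketch
    for n in path:                                            # back up the value
        n.visits += 1
        n.value_sum += value
    return value
```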

10

Page 11:

Bootstrapping from Game Tree Search

Veness, Silver, Blair and Uther, NeurIPS 2009

In bootstrapped tree search we compute a min-max value Vmm(s) by tree search with a static evaluator VΦ(s). We then try to fit the static value to the min-max value.

∆Φ = −η∇Φ (VΦ(s) − Vmm(s))²

This is similar to minimizing a Bellman error between VΦ(s) and a rollout estimate of the value of s, but where the rollout estimate is replaced by a min-max tree search estimate.
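A minimal sketch of one such update in PyTorch, under assumptions not stated on the slide: `value_net` maps a state to a scalar tensor VΦ(s), and `minimax_value(state, value_net)` is a hypothetical tree-search routine returning the min-max value Vmm(s) as a plain number, which is treated as a fixed regression target.

```python
import torch

def bootstrap_step(value_net, optimizer, state, minimax_value):
    """One gradient step fitting the static evaluator V_Phi(s) to the
    min-max value V_mm(s) obtained by tree search with V_Phi at the leaves."""
    v_mm = float(minimax_value(state, value_net))   # search value, held fixed as the target
    v = value_net(state)                            # V_Phi(s), a scalar tensor
    loss = (v - v_mm) ** 2                          # squared error from the slide
    optimizer.zero_grad()
    loss.backward()       # with plain SGD at learning rate eta, this realizes
    optimizer.step()      # Delta_Phi = -eta * grad_Phi (V_Phi(s) - V_mm(s))^2
    return loss.item()
```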

11

Page 12:

END