
Page 1:

Distributed Asynchronous Online Learning for Natural Language Processing

Kevin Gimpel, Dipanjan Das, Noah A. Smith

Page 2:

Introduction

- Two recent lines of research in speeding up large learning problems:
  - Parallel/distributed computing
  - Online (and mini-batch) learning algorithms: stochastic gradient descent, perceptron, MIRA, stepwise EM
- How can we bring together the benefits of parallel computing and online learning?

Page 3:

Introduction

- We use asynchronous algorithms (Nedic, Bertsekas, and Borkar, 2001; Langford, Smola, and Zinkevich, 2009)
- We apply them to structured prediction tasks:
  - Supervised learning
  - Unsupervised learning with both convex and non-convex objectives
- Asynchronous learning speeds convergence and works best with small mini-batches

Page 4:

Problem Setting

- Iterative learning
- Moderate to large numbers of training examples
- Expensive inference procedures for each example
- For concreteness, we start with gradient-based optimization
- Single machine with multiple processors
  - Exploit shared memory for parameters, lexicons, feature caches, etc.
  - Maintain one master copy of model parameters

Page 5:

Single-Processor Batch Learning

[Diagram legend: Dataset: D; Processors: P_i; Parameters: θ_t]

Page 6:

Single-Processor Batch Learning

[Diagram: processor P1 starts at time 0 with the initial parameters θ0.]

Page 7:

Single-Processor Batch Learning

[Diagram: P1 computes g = calc(D, θ0), where calc(D, θ) calculates the gradient g on data D using parameters θ.]

Page 8:

Single-Processor Batch Learning

[Diagram: P1 then computes θ1 = up(θ0, g), where up updates θ0 using gradient g to obtain θ1.]

Page 9:

Single-Processor Batch Learning

[Diagram: the cycle repeats; P1 computes g = calc(D, θ1), and so on.]
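
In code, the calc/up cycle above is plain batch gradient descent. A minimal Python sketch, assuming a toy least-squares objective (the concrete `calc`, `up`, data, and step size are illustrative stand-ins, not the talk's implementation):

```python
import numpy as np

def calc(D, theta):
    """calc(D, theta): sum of per-example gradients of a toy least-squares loss."""
    X, y = D
    return X.T @ (X @ theta - y)

def up(theta, g, step=1e-3):
    """up(theta, g): gradient-descent update of the parameters."""
    return theta - step * g

# Single-processor batch learning: one full-data gradient per update.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
theta = np.zeros(5)
for t in range(50):
    g = calc((X, y), theta)   # g = calc(D, theta_t)
    theta = up(theta, g)      # theta_{t+1} = up(theta_t, g)
```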

Page 10:

Parallel Batch Learning

- Divide data into parts, compute gradient on parts in parallel
- One processor updates parameters

[Diagram legend: Dataset: D = D1 ∪ D2 ∪ D3; Processors: P_i; Parameters: θ_t; Gradient: g = g1 + g2 + g3]

[Diagram: P1, P2, and P3 compute g1 = calc(D1, θ0), g2 = calc(D2, θ0), and g3 = calc(D3, θ0) in parallel.]

Page 11:

Parallel Batch Learning

[Diagram, continued: the partial gradients are summed and one processor applies the update θ1 = up(θ0, g).]

Page 12:

Parallel Batch Learning

[Diagram, continued: the cycle repeats; g1 = calc(D1, θ1), g2 = calc(D2, θ1), g3 = calc(D3, θ1) are computed in parallel, followed by θ2 = up(θ1, g).]
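
A sketch of the parallel-batch variant on the same toy setup: the data are split into three parts, a small thread pool computes the partial gradients, and a single update is applied (the thread pool, the interleaved split, and the toy objective are my own illustrative choices):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Toy setup, as in the single-processor sketch.
def calc(D, theta):              # sum of per-example gradients (least squares)
    X, y = D
    return X.T @ (X @ theta - y)

def up(theta, g, step=1e-3):     # gradient-descent update
    return theta - step * g

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
theta = np.zeros(5)

# D = D1 ∪ D2 ∪ D3: an interleaved split of the data.
parts = [(X[i::3], y[i::3]) for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    for t in range(50):
        # Each "processor" P_i computes g_i = calc(D_i, theta_t) in parallel.
        grads = list(pool.map(lambda Di: calc(Di, theta), parts))
        # One processor forms g = g1 + g2 + g3 and applies the update.
        theta = up(theta, sum(grads))
```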

Page 13:

Parallel Synchronous Mini-Batch Learning
Finkel, Kleeman, and Manning (2008)

- Same architecture, just more frequent updates

[Diagram legend: Mini-batches: B_t = B1_t ∪ B2_t ∪ B3_t; Processors: P_i; Parameters: θ_t; Gradient: g = g1 + g2 + g3]

[Diagram: at each step t the processors compute g1 = c(B1_t, θ_{t-1}), g2 = c(B2_t, θ_{t-1}), and g3 = c(B3_t, θ_{t-1}) in parallel, then a single update θ_t = u(θ_{t-1}, g) is applied before the next step begins; the figure shows steps t = 1, 2, 3.]
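
The synchronous mini-batch variant only changes what each worker sees per step: a freshly sampled mini-batch B_i^t rather than a fixed shard of D, with a barrier before every update. A sketch under the same toy assumptions (mini-batch size, step size, and sampling scheme are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Toy setup, as in the earlier sketches.
def calc(D, theta):
    X, y = D
    return X.T @ (X @ theta - y)

def up(theta, g, step=1e-2):
    return theta - step * g

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
theta = np.zeros(5)

def sample_batch(size=4):
    idx = rng.integers(0, len(y), size=size)
    return X[idx], y[idx]

with ThreadPoolExecutor(max_workers=3) as pool:
    for t in range(500):
        # B_t = B1_t ∪ B2_t ∪ B3_t: one small mini-batch per processor.
        batches = [sample_batch() for _ in range(3)]
        # Every processor must finish its batch before the single update.
        grads = list(pool.map(lambda Bi: calc(Bi, theta), batches))
        theta = up(theta, sum(grads))   # g = g1 + g2 + g3
```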

Page 14:

Parallel Asynchronous Mini-Batch Learning
Nedic, Bertsekas, and Borkar (2001)

[Diagram legend: Mini-batches: B_j; Processors: P_i; Parameters: θ_t; Gradients: g_k]

Page 15:

Parallel Asynchronous Mini-Batch Learning

[Diagram: P1, P2, and P3 start computing g1 = c(B1, θ0), g2 = c(B2, θ0), and g3 = c(B3, θ0).]

Page 16:

Parallel Asynchronous Mini-Batch Learning

[Diagram: P1 finishes first and applies its update, θ1 = u(θ0, g1).]

Page 17:

Parallel Asynchronous Mini-Batch Learning

[Diagram: P1 immediately starts on the next mini-batch, g1 = c(B4, θ1).]

Page 18:

Parallel Asynchronous Mini-Batch Learning

[Diagram: P2 finishes and applies θ2 = u(θ1, g2); its gradient was computed at the older parameters θ0.]

Page 19:

Parallel Asynchronous Mini-Batch Learning

[Diagram: P2 starts g2 = c(B5, θ2).]

Page 20:

Parallel Asynchronous Mini-Batch Learning

[Diagram: P3 finishes and applies θ3 = u(θ2, g3).]

Page 21:

Parallel Asynchronous Mini-Batch Learning

[Diagram: P3 starts g3 = c(B6, θ3) while updates keep interleaving with gradient computations: θ4 = u(θ3, g1), θ5 = u(θ4, ...), and so on.]

- Gradients computed using stale parameters
- Increased processor utilization
  - Only idle time caused by the lock for updating parameters
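
A minimal sketch of the asynchronous scheme on the same toy setup: one shared master copy of θ, each worker thread repeatedly sampling a mini-batch, computing its gradient against a possibly stale snapshot, and holding a lock only while applying its update (thread layout, batch size, and step size are my own choices, not the talk's implementation):

```python
import threading
import numpy as np

# Toy setup, as in the earlier sketches.
def calc(D, theta):
    X, y = D
    return X.T @ (X @ theta - y)

def up(theta, g, step=1e-2):
    return theta - step * g

data_rng = np.random.default_rng(0)
X, y = data_rng.normal(size=(1000, 5)), data_rng.normal(size=1000)

theta = np.zeros(5)              # single master copy of the parameters
theta_lock = threading.Lock()    # guards parameter updates only

def worker(seed, steps=300, batch_size=4):
    global theta
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.integers(0, len(y), size=batch_size)
        snapshot = theta.copy()                # possibly stale parameters
        g = calc((X[idx], y[idx]), snapshot)   # g_k computed on mini-batch B_j
        with theta_lock:                       # the only idle time: the update lock
            theta = up(theta, g)

threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(3)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```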

Page 22:

Theoretical Results

- How does the use of stale parameters affect convergence?
- Convergence results exist for convex optimization using stochastic gradient descent:
  - Convergence guaranteed when the maximum delay is bounded (Nedic, Bertsekas, and Borkar, 2001)
  - Convergence rates linear in the maximum delay (Langford, Smola, and Zinkevich, 2009)
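
In those convex analyses, the asynchronous update is modeled as stochastic gradient descent with a delayed gradient; written out (my paraphrase of the cited setting, not a formula from the slides):

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta_t \,\nabla f_t\!\left(\theta_{t-\tau_t}\right),
\qquad 0 \le \tau_t \le \kappa
```

Here f_t is the objective term for the mini-batch whose gradient is applied at step t, η_t is the step size, and τ_t is how stale the parameters were when that gradient was computed; the cited guarantees assume the delay bound κ is finite, with rates that degrade roughly linearly in κ.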

Page 23:

Experiments

Task                                 Model        Method                       Convex?  |D|   |θ|    m
Named-Entity Recognition             CRF          Stochastic Gradient Descent  Y        15k   1.3M   4
Word Alignment                       IBM Model 1  Stepwise EM                  Y        300k  14.2M  10k
Unsupervised Part-of-Speech Tagging  HMM          Stepwise EM                  N        42k   2M     4

- To compare algorithms, we use wall clock time (with a dedicated 4-processor machine)
- m = mini-batch size

Page 24:

Experiments

Task                      Model  Method                       Convex?  |D|  |θ|   m
Named-Entity Recognition  CRF    Stochastic Gradient Descent  Y        15k  1.3M  4

- CoNLL 2003 English data
- Label each token with entity type (person, location, organization, or miscellaneous) or non-entity
- We show convergence in F1 on development data

Page 25:

Asynchronous Updating Speeds Convergence

[Plot: F1 vs. wall clock time (hours) on NER development data. Curves: Asynchronous (4 processors), Synchronous (4 processors), Single-processor. All use a mini-batch size of 4.]

Page 26:

Comparison with Ideal Speed-up

[Plot: F1 vs. wall clock time (hours). Curves: Asynchronous (4 processors), Ideal.]

Page 27:

Why Does Asynchronous Converge Faster?

- Processors are kept in near-constant use
- Synchronous SGD leads to idle processors → need for load-balancing

Page 28:

[Two plots: F1 vs. wall clock time (hours). One compares Synchronous (4 processors), Synchronous (2 processors), and Single-processor; the other compares Asynchronous (4 processors), Asynchronous (2 processors), and Single-processor.]

Clearer improvement for asynchronous algorithms when increasing the number of processors

Page 29:

Artificial Delays

[Plot: F1 vs. wall clock time (hours). Curves: Asynchronous, no delay; Asynchronous, µ = 5; Single-processor, no delay; Asynchronous, µ = 10; Asynchronous, µ = 20.]

- After completing a mini-batch, 25% chance of delaying
- Delay (in seconds) sampled from max(N(µ, (µ/5)²), 0)
- Avg. time per mini-batch = 0.62 s
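
The delay injection described on this slide can be reproduced in a few lines (a sketch; the talk does not show its implementation, and the function name is mine):

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def maybe_delay(mu, p=0.25):
    """With probability p, sleep for max(N(mu, (mu/5)^2), 0) seconds."""
    if rng.random() < p:
        time.sleep(max(rng.normal(mu, mu / 5), 0.0))

# Called after finishing each mini-batch, e.g. maybe_delay(5) for the µ = 5 condition.
```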

Page 30:

Experiments

Task            Model        Method       Convex?  |D|   |θ|    m
Word Alignment  IBM Model 1  Stepwise EM  Y        300k  14.2M  10k

- Given parallel sentences, draw links between words:

    konnten sie es übersetzen ?
    could you translate it ?

- We show convergence in log-likelihood (convergence in AER is similar)

Page 31:

Stepwise EM
(Sato and Ishii, 2000; Cappe and Moulines, 2009)

- Similar to stochastic gradient descent in the space of sufficient statistics, with a particular scaling of the update
- More efficient than incremental EM (Neal and Hinton, 1998)
- Found to converge much faster than batch EM (Liang and Klein, 2009)
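
Concretely, stepwise EM interpolates running expected sufficient statistics toward those of each mini-batch. In the formulation of Cappe and Moulines (2009) and Liang and Klein (2009) the update is roughly (my summary; the stepsize schedule shown is a common choice from that literature, not necessarily the one used in this talk):

```latex
\mu^{(k+1)} \;=\; (1 - \eta_k)\,\mu^{(k)} \;+\; \eta_k\,\bar{s}\!\left(B_k;\ \theta(\mu^{(k)})\right),
\qquad \eta_k = (k+2)^{-\alpha},\quad \alpha \in (0.5, 1]
```

Here μ^(k) are the accumulated sufficient statistics, s̄(B_k; θ) are the statistics the E-step computes on mini-batch B_k under the current parameters, and θ(μ) is the M-step that turns statistics back into parameters.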

Page 32:

Word Alignment Results

[Plot: log-likelihood vs. wall clock time (minutes). Curves: Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor). For stepwise EM, mini-batch size = 10,000.]

Page 33:

Word Alignment Results

[Same plot as Page 32.]

Asynchronous is no faster than synchronous!

Page 34:

Word Alignment Results

[Same plot as Page 33.]

Page 35:

Comparing Mini-Batch Sizes

[Plot: log-likelihood vs. wall clock time (minutes). Curves: Asynch. (m = 10,000), Synch. (m = 10,000), Asynch. (m = 1,000), Synch. (m = 1,000), Asynch. (m = 100), Synch. (m = 100).]

Page 36:

Comparing Mini-Batch Sizes

[Same plot as Page 35.]

Asynchronous is faster when using small mini-batches

Page 37:

Comparing Mini-Batch Sizes

[Same plot as Page 35; annotation: "Error from asynchronous updating".]

Page 38:

Word Alignment Results

[Same plot as Page 32. For stepwise EM, mini-batch size = 10,000.]

Page 39:

Comparison with Ideal Speed-up

[Plot: log-likelihood vs. wall clock time (minutes). Curves: Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor), Ideal. For stepwise EM, mini-batch size = 10,000.]

Page 40:

MapReduce?

- We also ran these algorithms on a large MapReduce cluster (M45 from Yahoo!)
- Batch EM
  - Each iteration is one MapReduce job, using 24 mappers and 1 reducer
- Asynchronous Stepwise EM
  - 4 mini-batches processed simultaneously, each run as a MapReduce job
  - Each uses 6 mappers and 1 reducer

Page 41:

MapReduce?

[Two plots: log-likelihood vs. wall clock time (minutes). One shows Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor); the other shows Asynch. Stepwise EM (MapReduce), Batch EM (MapReduce).]

Page 42:

MapReduce?

[Same plots as Page 41.]

Page 43:

Experiments

Task                                 Model  Method       Convex?  |D|  |θ|  m
Unsupervised Part-of-Speech Tagging  HMM    Stepwise EM  N        42k  2M   4

- Bigram HMM with 45 states
- We plot convergence in likelihood and many-to-1 accuracy

Page 44:

Part-of-Speech Tagging Results

[Two plots vs. wall clock time (hours): log-likelihood (×10⁶) and accuracy (%). Curves: Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor). Mini-batch size = 4 for stepwise EM.]

Page 45:

Comparison with Ideal

[Two plots vs. wall clock time (hours): log-likelihood (×10⁶) and accuracy (%). Curves: Asynch. Stepwise EM (4 processors), Ideal.]

Page 46:

Comparison with Ideal

[Same plots as Page 45.]

Asynchronous better than ideal?

Page 47:

Conclusions and Future Work

- Asynchronous algorithms speed convergence and do not introduce additional error
- Effective for unsupervised learning and non-convex objectives
- If your problem works well with small mini-batches, try this!
- Future work
  - Theoretical results for the non-convex case
  - Explore effects of increasing the number of processors
  - New architectures (maintain multiple copies of θ)

Page 48:

Thanks!