Distributed Perceptron
Introducing "Distributed Training Strategies for the Structured Perceptron" by R. McDonald, K. Hall & G. Mann, NAACL 2010
2010-10-06 / 2nd seminar for State-of-the-Art NLP
Distributed training of perceptrons in a theoretically proven way
- Naive distribution strategy fails: parameter mixing (or averaging)
- A simple modification: iterative parameter mixing
- Proofs & experiments: convergence, convergence speed, NER experiments, dependency parsing experiments
Timeline
1958 F. Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms
1962 H.D. Block and A.B. Novikoff (independently): the perceptron convergence theorem for the separable case
1999 Y. Freund & R.E. Schapire: the voted perceptron, with a bound on the generalization error for the inseparable case
2002 M. Collins: generalization to the structured prediction problem
2010 R. McDonald et al.: parallelization with parameter mixing and synchronization
A new parallelization strategy is required for distributed perceptrons
Gradient-based batch training algorithms have been parallelized in the MapReduce style, and parameter mixing works for maximum entropy models:
- Divide the training data into a number of shards
- Train separate models on the shards
- Take the average of the models' weights
Perceptrons? Non-convex objective function; simple parameter mixing doesn't work
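To make the strategy concrete, here is a minimal Python sketch of naive parameter mixing (not code from the paper; the data representation, where each example maps candidate labels to feature vectors, and all function names are illustrative assumptions):

```python
import numpy as np

def train_one_shard(shard, n_features, epochs=10):
    """Vanilla structured perceptron trained on a single shard.
    Each example is (feats, gold): feats maps a candidate label y to
    its feature vector f(x, y); gold is the correct label."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for feats, gold in shard:
            # argmax over candidate labels (Python's max keeps the
            # first maximal element, which fixes the tie-breaking).
            pred = max(feats, key=lambda y: w @ feats[y])
            if pred != gold:
                w += feats[gold] - feats[pred]  # perceptron update
    return w

def parameter_mix(shards, n_features, mix=None):
    """Naive parameter mixing: train S independent perceptrons on the
    S shards, then return a weighted average of their weights."""
    ws = [train_one_shard(s, n_features) for s in shards]
    mix = mix if mix is not None else [1.0 / len(ws)] * len(ws)
    return sum(mu * w for mu, w in zip(mix, ws))
```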
Parameter mixing (averaging) fails (1/6)
Parameter mixing: train S perceptrons on S shards of the training data, then take a weighted average of their weights
Parameter mixing (averaging) fails (2/6)
Counterexample feature space (label-0 features on the left, label-1 features on the right):
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]
Shard 1: (x1,1, 0), (x1,2, 1)
Shard 2: (x2,1, 0), (x2,2, 1)
Preview of the consequence: mixing two per-shard local optima fails. Smaller shards make the algorithm easier to fool, because splitting the data multiplies the (zero-)initializations and tie-breakings.
Parameter mixing (averaging) fails (3/6)
Counterexample feature space, shard 1 only:
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
Shard 1: (x1,1, 0), (x1,2, 1)
w1 := [0 0 0 0 0 0]   {initialization}
w1·f(x1,1, 0)ᵀ ≦ w1·f(x1,1, 1)ᵀ   {tie broken toward the wrong label 1 → update}
w1 := [1 1 0 0 0 0] - [0 0 0 1 1 0] = [1 1 0 -1 -1 0]
w1·f(x1,2, 0)ᵀ ≦ w1·f(x1,2, 1)ᵀ   {tie broken toward the correct label 1 → no update}
Parameter mixing (averaging) fails (4/6)
Counterexample feature space, shard 2 only:
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]
Shard 2: (x2,1, 0), (x2,2, 1)
w2 := [0 0 0 0 0 0]   {initialization}
w2·f(x2,1, 0)ᵀ ≦ w2·f(x2,1, 1)ᵀ   {tie broken toward the wrong label 1 → update}
w2 := [0 1 1 0 0 0] - [0 0 0 0 1 1] = [0 1 1 0 -1 -1]
w2·f(x2,2, 0)ᵀ ≦ w2·f(x2,2, 1)ᵀ   {tie broken toward the correct label 1 → no update}
Parameter mixing (averaging) fails (5/6)
Counterexample feature space:
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]
Shard 1: (x1,1, 0), (x1,2, 1) ... w1 = [1 1 0 -1 -1 0]
Shard 2: (x2,1, 0), (x2,2, 1) ... w2 = [0 1 1 0 -1 -1]
Mixed weight: μ1·w1 + μ2·w2 = [μ1 1 μ2 -μ1 -1 -μ2]   (with μ1 + μ2 = 1)
Parameter mixing (averaging) fails (6/6)
Counterexample feature space, scored by the mixed weight w = [μ1 1 μ2 -μ1 -1 -μ2]:
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]   ... scores μ1+1, -μ1-1
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]   ... scores μ2, -μ2
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]   ... scores μ2+1, -μ2-1
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]   ... scores μ1, -μ1
The mixed weight doesn't separate positives and negatives: the left-hand (label-0) vectors always score at least as high, w·f(·, 0)ᵀ ≧ w·f(·, 1)ᵀ, so the label-1 examples (x1,2, 1) and (x2,2, 1) are misclassified.
But a separating weight vector exists: [-1 2 -1 1 -2 1]
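A quick numeric check of the counterexample (a sketch, not from the paper): it hard-codes the per-shard weights w1 and w2 reached under the adversarial tie-breaking traced above, then verifies that every mixture misclassifies the label-1 examples while [-1 2 -1 1 -2 1] separates all the data.

```python
import numpy as np

# Feature vectors f(x_{i,j}, y) of the counterexample.
f = {("x11", 0): np.array([1, 1, 0, 0, 0, 0]),
     ("x11", 1): np.array([0, 0, 0, 1, 1, 0]),
     ("x12", 0): np.array([0, 0, 1, 0, 0, 0]),
     ("x12", 1): np.array([0, 0, 0, 0, 0, 1]),
     ("x21", 0): np.array([0, 1, 1, 0, 0, 0]),
     ("x21", 1): np.array([0, 0, 0, 0, 1, 1]),
     ("x22", 0): np.array([1, 0, 0, 0, 0, 0]),
     ("x22", 1): np.array([0, 0, 0, 1, 0, 0])}
gold = {"x11": 0, "x12": 1, "x21": 0, "x22": 1}

# Per-shard weights produced by the adversarial tie-breaking traced above.
w1 = np.array([1, 1, 0, -1, -1, 0])    # shard 1
w2 = np.array([0, 1, 1, 0, -1, -1])    # shard 2

for mu1 in (0.25, 0.5, 0.75):          # any mixing weights, mu1 + mu2 = 1
    w = mu1 * w1 + (1 - mu1) * w2      # = [mu1, 1, mu2, -mu1, -1, -mu2]
    # Label 0 always scores at least as high, so both y=1 examples lose.
    errors = [x for x in gold
              if w @ f[(x, gold[x])] <= w @ f[(x, 1 - gold[x])]]
    print(mu1, errors)                 # -> ['x12', 'x22'] every time

# Yet the data are linearly separable, e.g. by u = [-1 2 -1 1 -2 1]:
u = np.array([-1, 2, -1, 1, -2, 1])
assert all(u @ f[(x, gold[x])] > u @ f[(x, 1 - gold[x])] for x in gold)
```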
Iterative parameter mixing
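In iterative parameter mixing, every epoch each shard runs one perceptron pass (the paper's OneEpochPerceptron) starting from the current mixed weights, and the resulting weight vectors are re-mixed with weights μi,n and redistributed. A minimal sketch under the same illustrative data representation as the earlier snippet (the shard loop here is sequential, standing in for a parallel map step with the mixing as the reduce step):

```python
import numpy as np

def one_epoch_perceptron(shard, w):
    """One perceptron pass over a shard, starting from the mixed
    weights w; returns the new weights and the error count k."""
    w, k = w.copy(), 0
    for feats, gold in shard:
        pred = max(feats, key=lambda y: w @ feats[y])
        if pred != gold:
            w += feats[gold] - feats[pred]
            k += 1
    return w, k

def iterative_parameter_mix(shards, n_features, n_epochs=10,
                            error_proportional=True):
    """Iterative parameter mixing: after every epoch the per-shard
    weights are mixed and redistributed as the next starting point."""
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        ws, ks = zip(*(one_epoch_perceptron(s, w) for s in shards))
        if sum(ks) == 0:
            break                            # no errors anywhere: converged
        if error_proportional:               # Section 4.3 weighting
            mix = [k / sum(ks) for k in ks]  # mu_{i,n} = k_{i,n} / sum_j k_{j,n}
        else:
            mix = [1.0 / len(ws)] * len(ws)  # uniform mixing
        w = sum(mu * wi for mu, wi in zip(mix, ws))
    return w
```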
Convergence theorem of iterative parameter mixing (1/4)
Assumptions:
- u: separating weight vector, with ‖u‖ = 1
- γ: margin; γ ≦ u·(f(xt, yt) - f(xt, y')) for all t and all y' ≠ yt
- R: maxt,y' |f(xt, yt) - f(xt, y')|
- ki,n: the number of updates (errors) that occur in the n-th epoch of the i-th OneEpochPerceptron
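For concreteness, γ and R can be computed from data in the illustrative representation used above (a sketch; `margin_and_radius` is a hypothetical helper, and u is any unit-norm separating vector):

```python
import numpy as np

def margin_and_radius(data, u):
    """gamma = min over (t, wrong y') of u.(f(x_t, y_t) - f(x_t, y'));
    R = max over (t, wrong y') of |f(x_t, y_t) - f(x_t, y')|."""
    diffs = [feats[gold] - feats[y]
             for feats, gold in data
             for y in feats if y != gold]
    gamma = min(u @ d for d in diffs)
    radius = max(np.linalg.norm(d) for d in diffs)
    return gamma, radius
```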
Convergence theorem of iterative parameter mixing (2/4)
Lower bound on u·w(avg,N) in terms of the number of errors in each epoch
Each update adds f(xt, yt) - f(xt, y') to some per-shard weight vector, and by the definition of the margin, γ ≦ u·(f(xt, yt) - f(xt, y')); the convex mixing weights μi,n preserve these gains, so by induction on n:
u·w(avg,N) ≧ Σn Σi μi,n ki,n γ
Convergence theorem of iterative parameter mixing (3/4)
Upper bound on |w(avg,N)|² in terms of the number of errors in each epoch
Each update w := w + Δ with Δ = f(xt, yt) - f(xt, y') increases |w|² by at most R²: the cross term w·Δ ≦ 0 because the prediction y' = argmaxy w·f(xt, y) scores at least as high as yt, and |Δ| ≦ R from the definition of R; so by induction on n:
|w(avg,N)|² ≦ Σn Σi μi,n ki,n R²
Convergence theorem of iterative parameter mixing (4/4)
Combining the two bounds (the first step uses ‖u‖ = 1 and Cauchy–Schwarz):
|w(avg,N)|² ≧ (u·w(avg,N))² ≧ (Σn Σi μi,n ki,n γ)² = (Σn Σi μi,n ki,n)² γ²
|w(avg,N)|² ≦ (Σn Σi μi,n ki,n) R²
⇒ (Σn Σi μi,n ki,n)² γ² ≦ (Σn Σi μi,n ki,n) R²
⇒ (Σn Σi μi,n ki,n) γ² ≦ R²
⇒ Σn Σi μi,n ki,n ≦ R²/γ²
The weighted total number of updates is therefore bounded, so iterative parameter mixing converges on separable data.
Convergence speed is predicted in two ways (1/2)
Theorem 3 implies that with uniform mixing weights, the worst-case number of errors (when the bound holds with equality) is proportional to the number of shards S,
implying that we cannot benefit from the parallelization very much:
#(errors per epoch) can be multiplied by S, while the time required per epoch is only reduced to 1/S.
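Spelling out the substitution behind this reading (a worked step, not copied from the paper):

```latex
% Uniform mixing: \mu_{i,n} = 1/S. Substituting into the bound just derived,
\sum_{n}\sum_{i} \tfrac{1}{S}\, k_{i,n} \le \frac{R^2}{\gamma^2}
\quad\Longrightarrow\quad
\sum_{n}\sum_{i} k_{i,n} \le \frac{S\,R^2}{\gamma^2}
% The total number of errors may grow S-fold in the worst case,
% cancelling the S-fold per-epoch speedup.
```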
Convergence speed is predicted in two ways (2/2) [Section 4.3]
With error-proportional mixing weights, μi,n = ki,n / Σj kj,n, the number of epochs Ndist is bounded by Ndist ≦ R²/γ² (the derivation uses geometric mean ≦ arithmetic mean)
Worst case (when the bound holds with equality): the same number of epochs as the vanilla perceptron, and even then each epoch is S times faster because of the parallelization
Ndist doesn't depend on the number of shards, implying that we can benefit well from parallelization
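One elementary way to spell out the per-epoch step behind this bound (a reconstruction assuming the ki,n are non-negative integers; the paper's own derivation goes through the mean inequality noted above):

```latex
% Error-proportional mixing: \mu_{i,n} = k_{i,n} / \sum_j k_{j,n}.
% In any epoch with at least one error (k_{i,n}^2 \ge k_{i,n} for
% non-negative integers):
\sum_i \mu_{i,n} k_{i,n}
  = \frac{\sum_i k_{i,n}^2}{\sum_j k_{j,n}}
  \ge \frac{\sum_i k_{i,n}}{\sum_j k_{j,n}}
  = 1
% Each epoch before convergence thus contributes at least 1 to
% \sum_n \sum_i \mu_{i,n} k_{i,n} \le R^2/\gamma^2, giving
% N_{dist} \le R^2/\gamma^2, independent of the number of shards S.
```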
Experiments
Comparison:
- Serial (All Data)
- Serial (Sub Sampling): use only one shard
- Parallel (Parameter Mix)
- Parallel (Iterative Parameter Mix)
Settings: number of shards = 10 (see the paper for more details)
NER experiments: faster & better, close to averaged perceptrons
Iterative parameter mixing is faster and more accurate than serial training (non-averaged case).
Iterative parameter mixing is faster than, and similarly accurate to, serial training (averaged case).
Dependency parsing experiments: similar improvements
Different shard counts: the more shards, the slower the convergence
High parallelism leads to slower convergence, at a rate somewhere between the two theoretical predictions.
Conclusions
Distributed training of the structured perceptron via simple parameter mixing strategies
Guaranteed to converge and to separate the data (if separable)
Results in fast and accurate classifiers
Trade-off: higher parallelism can mean slower convergence
(+ also applicable to the online passive-aggressive algorithm)
Presenter's comments
Parameter synchronization can be slow, especially when the feature space or the number of epochs is large
Analysis of the generalization error (for the inseparable case)? Relation to the voted perceptron?
Voted perceptron: weights hypotheses by survival time; distributed perceptron: weights them by the number of updates
Relation to Bayes point machines?