Distributed Perceptron
Introducing "Distributed Training Strategies for the Structured Perceptron" by R. McDonald, K. Hall & G. Mann, NAACL 2010
2010-10-06 / 2nd seminar for State-of-the-Art NLP
Distributed training of perceptrons in a theoretically proven way
- Naive distribution strategy fails: parameter mixing (or averaging)
- A simple modification: iterative parameter mixing
- Proofs & experiments: convergence, convergence speed, NER experiments, dependency parsing experiments
Timeline
1958 F. Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms
1962 H.D. Block and A.B. Novikoff (independently): the perceptron convergence theorem for the separable case
1999 Y. Freund & R.E. Schapire: the voted perceptron, with a bound on the generalization error for the inseparable case
2002 M. Collins: generalization to the structured prediction problem
2010 R. McDonald et al.: parallelization with parameter mixing and synchronization
A new parallelization strategy is required for distributed perceptrons
Gradient-based batch training algorithms have been parallelized in the MapReduce style, and parameter mixing works for maximum entropy models:
- Divide the training data into a number of shards
- Train separate models on the shards
- Take the average of the models' weights
Perceptrons? Non-convex objective function; simple parameter mixing doesn't work
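To make the strategy concrete, here is a minimal Python sketch of naive parameter mixing (not code from the paper; the data representation, where each example maps candidate labels to feature vectors, and all function names are illustrative assumptions):

```python
import numpy as np

def train_one_shard(shard, n_features, epochs=10):
    """Vanilla structured perceptron trained on a single shard.
    Each example is (feats, gold): feats maps a candidate label y to
    its feature vector f(x, y); gold is the correct label."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for feats, gold in shard:
            # argmax over candidate labels (Python's max keeps the
            # first maximal element, which fixes the tie-breaking).
            pred = max(feats, key=lambda y: w @ feats[y])
            if pred != gold:
                w += feats[gold] - feats[pred]  # perceptron update
    return w

def parameter_mix(shards, n_features, mix=None):
    """Naive parameter mixing: train S independent perceptrons on the
    S shards, then return a weighted average of their weights."""
    ws = [train_one_shard(s, n_features) for s in shards]
    mix = mix if mix is not None else [1.0 / len(ws)] * len(ws)
    return sum(mu * w for mu, w in zip(mix, ws))
```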
Parameter mixing (averaging) fails (1/6)
Parameter mixing: train S perceptrons on S shards of the training data, then take a weighted average of their weights
Parameter mixing (averaging) fails (2/6)
Counterexample feature space (label-0 features on the left, label-1 features on the right):
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]
Shard 1: (x1,1, 0), (x1,2, 1)
Shard 2: (x2,1, 0), (x2,2, 1)
Preview of the consequence: mixing two per-shard local optima fails. Smaller shards make the algorithm easier to fool, because splitting the data multiplies the (zero-)initializations and tie-breakings.
Parameter mixing (averaging) fails (3/6)
Counterexample feature space, shard 1 only:
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
Shard 1: (x1,1, 0), (x1,2, 1)
w1 := [0 0 0 0 0 0]   {initialization}
w1·f(x1,1, 0)ᵀ ≦ w1·f(x1,1, 1)ᵀ   {tie broken toward the wrong label 1 → update}
w1 := [1 1 0 0 0 0] - [0 0 0 1 1 0] = [1 1 0 -1 -1 0]
w1·f(x1,2, 0)ᵀ ≦ w1·f(x1,2, 1)ᵀ   {tie broken toward the correct label 1 → no update}
Parameter mixing (averaging) fails (4/6)
Counterexample feature space, shard 2 only:
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]
Shard 2: (x2,1, 0), (x2,2, 1)
w2 := [0 0 0 0 0 0]   {initialization}
w2·f(x2,1, 0)ᵀ ≦ w2·f(x2,1, 1)ᵀ   {tie broken toward the wrong label 1 → update}
w2 := [0 1 1 0 0 0] - [0 0 0 0 1 1] = [0 1 1 0 -1 -1]
w2·f(x2,2, 0)ᵀ ≦ w2·f(x2,2, 1)ᵀ   {tie broken toward the correct label 1 → no update}
Parameter mixing (averaging) fails (5/6)
Counterexample feature space:
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]
Shard 1: (x1,1, 0), (x1,2, 1) ... w1 = [1 1 0 -1 -1 0]
Shard 2: (x2,1, 0), (x2,2, 1) ... w2 = [0 1 1 0 -1 -1]
Mixed weight: μ1·w1 + μ2·w2 = [μ1 1 μ2 -μ1 -1 -μ2]   (with μ1 + μ2 = 1)
Parameter mixing (averaging) fails (6/6)
Counterexample feature space, scored by the mixed weight w = [μ1 1 μ2 -μ1 -1 -μ2]:
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]   ... scores μ1+1, -μ1-1
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]   ... scores μ2, -μ2
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]   ... scores μ2+1, -μ2-1
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]   ... scores μ1, -μ1
The mixed weight doesn't separate positives and negatives: the left-hand (label-0) vectors always score at least as high, w·f(·, 0)ᵀ ≧ w·f(·, 1)ᵀ, so the label-1 examples (x1,2, 1) and (x2,2, 1) are misclassified.
But a separating weight vector exists: [-1 2 -1 1 -2 1]
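A quick numeric check of the counterexample (a sketch, not from the paper): it hard-codes the per-shard weights w1 and w2 reached under the adversarial tie-breaking traced above, then verifies that every mixture misclassifies the label-1 examples while [-1 2 -1 1 -2 1] separates all the data.

```python
import numpy as np

# Feature vectors f(x_{i,j}, y) of the counterexample.
f = {("x11", 0): np.array([1, 1, 0, 0, 0, 0]),
     ("x11", 1): np.array([0, 0, 0, 1, 1, 0]),
     ("x12", 0): np.array([0, 0, 1, 0, 0, 0]),
     ("x12", 1): np.array([0, 0, 0, 0, 0, 1]),
     ("x21", 0): np.array([0, 1, 1, 0, 0, 0]),
     ("x21", 1): np.array([0, 0, 0, 0, 1, 1]),
     ("x22", 0): np.array([1, 0, 0, 0, 0, 0]),
     ("x22", 1): np.array([0, 0, 0, 1, 0, 0])}
gold = {"x11": 0, "x12": 1, "x21": 0, "x22": 1}

# Per-shard weights produced by the adversarial tie-breaking traced above.
w1 = np.array([1, 1, 0, -1, -1, 0])    # shard 1
w2 = np.array([0, 1, 1, 0, -1, -1])    # shard 2

for mu1 in (0.25, 0.5, 0.75):          # any mixing weights, mu1 + mu2 = 1
    w = mu1 * w1 + (1 - mu1) * w2      # = [mu1, 1, mu2, -mu1, -1, -mu2]
    # Label 0 always scores at least as high, so both y=1 examples lose.
    errors = [x for x in gold
              if w @ f[(x, gold[x])] <= w @ f[(x, 1 - gold[x])]]
    print(mu1, errors)                 # -> ['x12', 'x22'] every time

# Yet the data are linearly separable, e.g. by u = [-1 2 -1 1 -2 1]:
u = np.array([-1, 2, -1, 1, -2, 1])
assert all(u @ f[(x, gold[x])] > u @ f[(x, 1 - gold[x])] for x in gold)
```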
Iterative parameter mixing
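In iterative parameter mixing, every epoch each shard runs one perceptron pass (the paper's OneEpochPerceptron) starting from the current mixed weights, and the resulting weight vectors are re-mixed with weights μi,n and redistributed. A minimal sketch under the same illustrative data representation as the earlier snippet (the shard loop here is sequential, standing in for a parallel map step with the mixing as the reduce step):

```python
import numpy as np

def one_epoch_perceptron(shard, w):
    """One perceptron pass over a shard, starting from the mixed
    weights w; returns the new weights and the error count k."""
    w, k = w.copy(), 0
    for feats, gold in shard:
        pred = max(feats, key=lambda y: w @ feats[y])
        if pred != gold:
            w += feats[gold] - feats[pred]
            k += 1
    return w, k

def iterative_parameter_mix(shards, n_features, n_epochs=10,
                            error_proportional=True):
    """Iterative parameter mixing: after every epoch the per-shard
    weights are mixed and redistributed as the next starting point."""
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        ws, ks = zip(*(one_epoch_perceptron(s, w) for s in shards))
        if sum(ks) == 0:
            break                            # no errors anywhere: converged
        if error_proportional:               # Section 4.3 weighting
            mix = [k / sum(ks) for k in ks]  # mu_{i,n} = k_{i,n} / sum_j k_{j,n}
        else:
            mix = [1.0 / len(ws)] * len(ws)  # uniform mixing
        w = sum(mu * wi for mu, wi in zip(mix, ws))
    return w
```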
Convergence theorem of iterative parameter mixing (1/4)
Assumptions:
- u: separating weight vector, with ‖u‖ = 1
- γ: margin; γ ≦ u·(f(xt, yt) - f(xt, y')) for all t and all y' ≠ yt
- R: maxt,y' |f(xt, yt) - f(xt, y')|
- ki,n: the number of updates (errors) that occur in the n-th epoch of the i-th OneEpochPerceptron
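For concreteness, γ and R can be computed from data in the illustrative representation used above (a sketch; `margin_and_radius` is a hypothetical helper, and u is any unit-norm separating vector):

```python
import numpy as np

def margin_and_radius(data, u):
    """gamma = min over (t, wrong y') of u.(f(x_t, y_t) - f(x_t, y'));
    R = max over (t, wrong y') of |f(x_t, y_t) - f(x_t, y')|."""
    diffs = [feats[gold] - feats[y]
             for feats, gold in data
             for y in feats if y != gold]
    gamma = min(u @ d for d in diffs)
    radius = max(np.linalg.norm(d) for d in diffs)
    return gamma, radius
```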
Convergence theorem of iterative parameter mixing (2/4)
Lower bound on u·w(avg,N) in terms of the number of errors in each epoch
Each update adds f(xt, yt) - f(xt, y') to some per-shard weight vector, and by the definition of the margin, γ ≦ u·(f(xt, yt) - f(xt, y')); the convex mixing weights μi,n preserve these gains, so by induction on n:
u·w(avg,N) ≧ Σn Σi μi,n ki,n γ
Convergence theorem of iterative parameter mixing (3/4)
Upper bound on |w(avg,N)|² in terms of the number of errors in each epoch
Each update w := w + Δ with Δ = f(xt, yt) - f(xt, y') increases |w|² by at most R²: the cross term w·Δ ≦ 0 because the prediction y' = argmaxy w·f(xt, y) scores at least as high as yt, and |Δ| ≦ R from the definition of R; so by induction on n:
|w(avg,N)|² ≦ Σn Σi μi,n ki,n R²
Convergence theorem of iterative parameter mixing (4/4)
Combining the two bounds (the first step uses ‖u‖ = 1 and Cauchy–Schwarz):
|w(avg,N)|² ≧ (u·w(avg,N))² ≧ (Σn Σi μi,n ki,n γ)² = (Σn Σi μi,n ki,n)² γ²
|w(avg,N)|² ≦ (Σn Σi μi,n ki,n) R²
⇒ (Σn Σi μi,n ki,n)² γ² ≦ (Σn Σi μi,n ki,n) R²
⇒ (Σn Σi μi,n ki,n) γ² ≦ R²
⇒ Σn Σi μi,n ki,n ≦ R²/γ²
The weighted total number of updates is therefore bounded, so iterative parameter mixing converges on separable data.
Convergence speed is predicted in two ways (1/2)
Theorem 3 implies that with uniform mixing weights, the worst-case number of errors (when the bound holds with equality) is proportional to the number of shards S,
implying that we cannot benefit from the parallelization very much:
#(errors per epoch) can be multiplied by S, while the time required per epoch is only reduced to 1/S.
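Spelling out the substitution behind this reading (a worked step, not copied from the paper):

```latex
% Uniform mixing: \mu_{i,n} = 1/S. Substituting into the bound just derived,
\sum_{n}\sum_{i} \tfrac{1}{S}\, k_{i,n} \le \frac{R^2}{\gamma^2}
\quad\Longrightarrow\quad
\sum_{n}\sum_{i} k_{i,n} \le \frac{S\,R^2}{\gamma^2}
% The total number of errors may grow S-fold in the worst case,
% cancelling the S-fold per-epoch speedup.
```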
Convergence speed is predicted in two ways (2/2) [Section 4.3]
With error-proportional mixing weights, μi,n = ki,n / Σj kj,n, the number of epochs Ndist is bounded by Ndist ≦ R²/γ² (the derivation uses geometric mean ≦ arithmetic mean)
Worst case (when the bound holds with equality): the same number of epochs as the vanilla perceptron, and even then each epoch is S times faster because of the parallelization
Ndist doesn't depend on the number of shards, implying that we can benefit well from parallelization
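One elementary way to spell out the per-epoch step behind this bound (a reconstruction assuming the ki,n are non-negative integers; the paper's own derivation goes through the mean inequality noted above):

```latex
% Error-proportional mixing: \mu_{i,n} = k_{i,n} / \sum_j k_{j,n}.
% In any epoch with at least one error (k_{i,n}^2 \ge k_{i,n} for
% non-negative integers):
\sum_i \mu_{i,n} k_{i,n}
  = \frac{\sum_i k_{i,n}^2}{\sum_j k_{j,n}}
  \ge \frac{\sum_i k_{i,n}}{\sum_j k_{j,n}}
  = 1
% Each epoch before convergence thus contributes at least 1 to
% \sum_n \sum_i \mu_{i,n} k_{i,n} \le R^2/\gamma^2, giving
% N_{dist} \le R^2/\gamma^2, independent of the number of shards S.
```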
Experiments
Comparison:
- Serial (All Data)
- Serial (Sub Sampling): use only one shard
- Parallel (Parameter Mix)
- Parallel (Iterative Parameter Mix)
Settings: number of shards = 10 (see the paper for more details)
NER experiments: faster & better, close to averaged perceptrons
Iterative parameter mixing is faster and more accurate than serial training (non-averaged case).
Iterative parameter mixing is faster than, and similarly accurate to, serial training (averaged case).
Dependency parsing experiments: similar improvements
Different shard counts: the more shards, the slower the convergence
High parallelism leads to slower convergence, at a rate somewhere between the two theoretical predictions.
Conclusions
Distributed training of the structured perceptron via simple parameter mixing strategies
Guaranteed to converge and to separate the data (if separable)
Results in fast and accurate classifiers
Trade-off: higher parallelism can mean slower convergence
(+ also applicable to the online passive-aggressive algorithm)
Presenter's comments
Parameter synchronization can be slow, especially when the feature space or the number of epochs is large
Analysis of the generalization error (for the inseparable case)? Relation to the voted perceptron?
Voted perceptron: weights hypotheses by survival time; distributed perceptron: weights them by the number of updates
Relation to Bayes point machines?