Technical Tricks of Vowpal Wabbit
http://hunch.net/~vw/
John Langford, Columbia, Data-Driven Modeling, April 16
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Goals of the VW project

1 State of the art in scalable, fast, efficient Machine Learning. VW is (by far) the most scalable public linear learner, and plausibly the most scalable anywhere.
2 Support research into new ML algorithms. ML researchers can deploy new algorithms on an efficient platform efficiently. BSD open source.
3 Simplicity. No strange dependencies, currently only 9437 lines of code.
4 It just works. A package in debian & R. Otherwise, users just type "make", and get a working system. At least a half-dozen companies use VW.
Demonstration
vw -c rcv1.train.vw.gz --exact_adaptive_norm --power_t 1 -l 0.5
The basic learning algorithm

Learn w such that fw(x) = w · x predicts well.

1 Online learning with strong defaults.
2 Every input source but library.
3 Every output sink but library.
4 In-core feature manipulation for ngrams, outer products, etc. Custom is easy.
5 Debugging with readable models & audit mode.
6 Different loss functions: squared, logistic, ...
7 ℓ1 and ℓ2 regularization.
8 Compatible LBFGS-based batch-mode optimization.
9 Cluster parallel.
10 Daemon deployable.
The tricks

| Basic VW | Newer Algorithmics | Parallel Stuff |
| --- | --- | --- |
| Feature Caching | Adaptive Learning | Parameter Averaging |
| Feature Hashing | Importance Updates | Nonuniform Average |
| Online Learning | Dim. Correction | Gradient Summing |
| Implicit Features | L-BFGS | Hadoop AllReduce |
| | Hybrid Learning | |

We'll discuss Basic VW and algorithmics, then Parallel.
Feature Caching
Compare: time vw rcv1.train.vw.gz --exact_adaptive_norm --power_t 1
Feature Hashing
(Diagram: a conventional learner maps each feature string to an index through a string → index dictionary held in RAM, then looks up the weight; VW hashes the string directly to a weight index.)
Most algorithms use a hashmap to change a word into an index for a weight. VW uses a hash function which takes almost no RAM, is ~10x faster, and is easily parallelized.
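As a sketch of the hashing trick (VW's real implementation is murmurhash-based C++; the hash function and bit width below are illustrative assumptions):

```python
import hashlib

BITS = 18  # 2^18 weight slots; VW's default table size differs

def feature_index(feature, bits=BITS):
    """Hash a feature string straight to a weight index.
    No string -> index dictionary is ever stored in RAM."""
    # hashlib is deterministic across runs, unlike built-in hash().
    h = int(hashlib.md5(feature.encode()).hexdigest(), 16)
    return h & ((1 << bits) - 1)

weights = [0.0] * (1 << BITS)

def predict(features):
    """Linear prediction over hashed feature indices."""
    return sum(weights[feature_index(f)] for f in features)
```

Collisions are tolerated rather than avoided; the hashing papers cited in the bibliography study their effect.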
The spam example [WALS09]

1 3.2 × 10⁶ labeled emails.
2 433,167 users.
3 ~ 40 × 10⁶ unique features.

How do we construct a spam filter which is personalized, yet uses global information?

Answer: Use hashing to predict according to: ⟨w, φ(x)⟩ + ⟨w, φᵤ(x)⟩
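A minimal sketch of the personalized predictor ⟨w, φ(x)⟩ + ⟨w, φᵤ(x)⟩, assuming (hypothetically) that the user-specific copy φᵤ(x) is made by prefixing each token with the user id before hashing, so both copies share one weight vector:

```python
import hashlib

BITS = 20  # illustrative table size

def h(s):
    """Deterministic hash of a string into 2^BITS slots."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) & ((1 << BITS) - 1)

def hashed_indices(tokens, user):
    """Global copy phi(x) plus a user-personalized copy phi_u(x),
    both living in the same 2^BITS-slot weight vector."""
    idx = [h(t) for t in tokens]                 # global features
    idx += [h(user + "_" + t) for t in tokens]   # user-specific features
    return idx

def predict(w, tokens, user):
    return sum(w[i] for i in hashed_indices(tokens, user))
```

The global copy pools evidence across all users; the personalized copy lets a user's own mail override it.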
Results
(baseline = global only predictor)
Basic Online Learning

Start with ∀i: wᵢ = 0. Repeatedly:

1 Get example x ∈ (−∞, ∞)*.
2 Make prediction ŷ = Σᵢ wᵢxᵢ, clipped to the interval [0, 1].
3 Learn truth y ∈ [0, 1] with importance I, or goto (1).
4 Update wᵢ ← wᵢ + 2η(y − ŷ)I xᵢ and go to (1).
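The loop above can be sketched in a few lines (illustrative Python, not VW's code; feature indices are assumed to be pre-hashed integers):

```python
def train(examples, eta=0.5, n_weights=2**18):
    """Importance-weighted online squared-loss updates, following the
    four steps above. Each example is (features dict {index: value},
    label y in [0, 1], importance I)."""
    w = [0.0] * n_weights
    for x, y, imp in examples:
        # Predict, clipped to [0, 1].
        yhat = min(1.0, max(0.0, sum(w[i] * v for i, v in x.items())))
        # Update only the nonzero features of this example.
        for i, v in x.items():
            w[i] += eta * 2 * (y - yhat) * imp * v
    return w
```

Only the features present in an example are touched, which is what makes a single pass over sparse data cheap.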
Reasons for Online Learning

1 Fast convergence to a good predictor.
2 It's RAM efficient. You need store only one example in RAM rather than all of them. ⇒ Entirely new scales of data are possible.
3 Online Learning algorithm = Online Optimization Algorithm. Online Learning Algorithms ⇒ the ability to solve entirely new categories of applications.
4 Online Learning = ability to deal with drifting distributions.
Implicit Outer Product

Sometimes you care about the interaction of two sets of features (ad features × query features, news features × user features, etc.). Choices:

1 Expand the set of features explicitly, consuming n² disk space.
2 Expand the features dynamically in the core of your learning algorithm.

Option (2) is ~10x faster. You need to be comfortable with hashes first.
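A sketch of option (2): generate one hashed index per cross-feature pair on the fly, never materializing the outer product on disk. (VW combines the two feature hashes numerically rather than re-hashing strings, so the string concatenation below is an illustrative assumption.)

```python
import hashlib

BITS = 18
MASK = (1 << BITS) - 1

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) & MASK

def quadratic_indices(ad_feats, query_feats):
    """Cross two namespaces dynamically: one hashed weight index per
    (ad feature, query feature) pair."""
    return [h(a + "^" + q) for a in ad_feats for q in query_feats]
```

The pairs exist only transiently inside the learner's inner loop; disk and RAM cost stay linear in the input.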
The tricks
Next: algorithmics.
Adaptive Learning [DHS10, MS10]

For example t, let g_{i,t} = 2(ŷ − y)x_{i,t}.

New update rule: wᵢ ← wᵢ − η · g_{i,t+1} / √(Σ_{t′=1}^{t} g²_{i,t′})

Common features stabilize quickly. Rare features can have large updates.
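A per-feature sketch of this adaptive rule (dict-based Python for squared loss; not VW's internals):

```python
import math

def adagrad_update(w, g_sq, x, y, eta=0.1):
    """One adaptive update: each weight gets its own effective learning
    rate eta / sqrt(sum of its squared past gradients), per the rule
    above. w and g_sq are dicts from feature index to float."""
    yhat = sum(w.get(i, 0.0) * v for i, v in x.items())
    for i, v in x.items():
        g = 2 * (yhat - y) * v                  # gradient of squared loss
        g_sq[i] = g_sq.get(i, 0.0) + g * g      # accumulate g^2 per feature
        w[i] = w.get(i, 0.0) - eta * g / math.sqrt(g_sq[i])
    return yhat
```

A feature seen on every example accumulates a large denominator and settles down; a rare feature keeps a large step size.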
Learning with importance weights [KL11]
(Diagrams: for an example with importance weight 6, one gradient step scaled by 6, i.e. −6η(∇ℓ)ᵀx, can push wᵀ_{t+1}x past the label y, while six successive steps of −η(∇ℓ)ᵀx do not; the importance-aware update moves the prediction by s(h)·‖x‖², the closed-form limit of many infinitesimal steps.)
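For squared loss ℓ(p, y) = ½(y − p)², the importance-aware update has a closed form: the limit of taking the importance weight h as infinitely many infinitesimal gradient steps. The naive alternative scales one step by h and can overshoot. The specific closed form below is derived under gradient flow for this loss and is an assumption of the sketch:

```python
import math

def naive_update(w, x, y, eta, h):
    """Scale one gradient step by importance weight h (can overshoot y)."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * h * (y - p) * xi for wi, xi in zip(w, x)]

def aware_update(w, x, y, eta, h):
    """Importance-aware update for squared loss: the prediction decays
    exponentially toward y and never crosses it, however large h is."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    xx = sum(xi * xi for xi in x)
    scale = (y - p) * (1 - math.exp(-eta * h * xx)) / xx
    return [wi + scale * xi for wi, xi in zip(w, x)]
```

With a large eta·h the naive step sails past the label while the aware step saturates just below it.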
Robust results for unweighted problems
(Scatter plots of test performance, standard update on one axis vs. importance-aware update on the other, one point per learning rate: astro - logistic loss; spam - quantile loss; rcv1 - squared loss; webspam - hinge loss.)
Dimensional Correction

Gradient of squared loss = ∂(f_w(x) − y)²/∂wᵢ = 2(f_w(x) − y)xᵢ, and we change weights in the negative gradient direction:

wᵢ ← wᵢ − η ∂(f_w(x) − y)²/∂wᵢ

But the gradient has intrinsic problems. wᵢ naturally has units of 1/xᵢ, since doubling xᵢ implies halving wᵢ to get the same prediction. ⇒ The update rule has mixed units!

A crude fix: divide the update by Σᵢ xᵢ². It helps much!

This is scary! The problem optimized is min_w Σ_{x,y} (f_w(x) − y)² / Σᵢ xᵢ² rather than min_w Σ_{x,y} (f_w(x) − y)². But it works.
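A small check of the crude fix: dividing the update by Σᵢ xᵢ² makes the change in prediction invariant to rescaling the features (a sketch for squared loss, starting from zero weights):

```python
def corrected_update(w, x, y, eta=0.5):
    """One gradient step on squared loss, divided by sum_i x_i^2 so that
    the resulting change in prediction has consistent units."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    xx = sum(xi * xi for xi in x)
    return [wi - eta * 2 * (p - y) * xi / xx for wi, xi in zip(w, x)]
```

The change in prediction works out to −2η(p − y) regardless of how the features are scaled, which is exactly the unit consistency the slide asks for.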
LBFGS [Nocedal80]

Batch(!) second-order algorithm. Core idea = efficient approximate Newton step.

H = ∂²(f_w(x) − y)² / ∂wᵢ∂wⱼ = Hessian.

Newton step: w⃗ → w⃗ + H⁻¹g⃗.

Newton fails: you can't even represent H. Instead build up an approximate inverse Hessian from terms of the form Δw Δwᵀ / (Δwᵀ Δg), where Δw is a change in weights w and Δg is a change in the loss gradient g.
Hybrid Learning
Online learning is GREAT for getting to a good solution fast. LBFGS is GREAT for getting a perfect solution.

Use Online Learning, then LBFGS.

(Plots: test auPRC vs. iteration on two tasks, comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.)
The tricks
Next: Parallel.
Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point.
Terascale Linear Learning ACDL11

Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = Σᵢ wᵢxᵢ?

2.1T sparse features
17B examples
16M parameters
1K nodes

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ we beat all possible single-machine linear learning algorithms.
Compare: Other Supervised Algorithms in the Parallel Learning book

(Bar chart, "Speed per method": features/second on a log scale from 100 to 10⁹, parallel and single-machine methods distinguished. Methods include RBF-SVM MPI?-500 RCV1; Ensemble Tree MPI-128 Synthetic; RBF-SVM TCP-48 MNIST 220K; Decision Tree MapRed-200 Ad-Bounce; Boosted DT MPI-32 Ranking; Linear Threads-2 RCV1; Linear Hadoop+TCP-1000 Ads.)
MPI-style AllReduce

(Diagrams: seven nodes holding the values 1 through 7 are arranged into a binary tree; partial sums flow up the tree (reducing, steps 1 and 2), then the total, 28, is broadcast back down (broadcast, step 1), so every node ends in the final state holding 28.)

AllReduce = Reduce + Broadcast

Properties:

1 Easily pipelined so no latency concerns.
2 Bandwidth ≤ 6n.
3 No need to rewrite code!
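The reduce-then-broadcast pattern can be simulated in a few lines; the array-as-binary-tree layout (node i's children at 2i+1 and 2i+2) is an illustrative assumption, not VW's network code:

```python
def allreduce_sum(values):
    """Simulate AllReduce over a binary tree: sum partial results up
    the tree (reduce), then hand the total back to every node
    (broadcast)."""
    n = len(values)
    partial = list(values)
    # Reduce: process nodes from leaves toward the root, so each
    # child's partial sum is final before its parent reads it.
    for i in reversed(range(n)):
        for c in (2 * i + 1, 2 * i + 2):
            if c < n:
                partial[i] += partial[c]
    # Broadcast: the root's total reaches every node.
    return [partial[0]] * n

print(allreduce_sum([7, 2, 3, 4, 6, 5, 1]))  # [28, 28, 28, 28, 28, 28, 28]
```

In the real system each node only talks to its parent and children, which is what makes the pattern pipeline-friendly.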
An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max):

1 While (examples left): do online update.
2 AllReduce(weights)
3 For each weight: w ← w/n

Other algorithms implemented:

1 Nonuniform averaging for online learning
2 Conjugate Gradient
3 LBFGS
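A single-process sketch of one averaging round, with the AllReduce of weights simulated by an explicit sum (train_pass is a stand-in for any online learner returning a weight vector):

```python
def parallel_average(shards, train_pass):
    """One round of the averaging algorithm above: each 'node' does an
    online pass over its data shard, then weight vectors are summed
    (the AllReduce) and divided by the node count n."""
    n = len(shards)                                  # n = AllReduce(1)
    models = [train_pass(shard) for shard in shards]
    dim = len(models[0])
    summed = [sum(m[j] for m in models) for j in range(dim)]  # AllReduce(weights)
    return [s / n for s in summed]                   # w <- w/n
```

Each node keeps the averaged vector as its starting point for the next pass, so no code beyond the AllReduce call changes.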
What is Hadoop AllReduce?

1 A "Map" job moves the program to the data.
2 Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing AllReduce. Failures autorestart on a different node with identical data.
3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all data once.
Approach Used

1 Optimize hard so few data passes are required.
  1 Normalized, adaptive, safe, online gradient descent.
  2 L-BFGS.
  3 Use (1) to warmstart (2).
2 Use map-only Hadoop for process control and error recovery.
3 Use AllReduce code to sync state.
4 Always save input examples in a cachefile to speed later passes.
5 Use the hashing trick to reduce input complexity.

Open source in Vowpal Wabbit 6.1. Search for it.
Robustness & Speedup
(Plot: speedup vs. number of nodes, from 10 to 100, showing Average_10, Min_10, and Max_10 against a linear reference.)
Splice Site Recognition
(Plots: test auPRC vs. iteration on two tasks, comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.)
Splice Site Recognition
(Plot: auPRC vs. effective number of passes over the data, comparing L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.)
To learn more
The wiki has tutorials, examples, and help: https://github.com/JohnLangford/vowpal_wabbit/wiki

Mailing list: [email protected]

Various discussion: http://hunch.net, the Machine Learning (Theory) blog.
Bibliography: Original VW
Caching: L. Bottou, Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.
Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
Hashing: Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S.V.N. Vishwanathan, Hash Kernels for Structured Data, AISTAT 2009.
Hashing: K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009.
![Page 73: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/73.jpg)
Bibliography: Algorithmics
L-BFGS J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773–782, 1980.
Adaptive H. B. McMahan and M. Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
Adaptive J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.
Safe N. Karampatziakis and J. Langford, Online Importance Weight Aware Updates, UAI 2011.
![Page 74: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/74.jpg)
Bibliography: Parallel
grad sum C. Teo, Q. Le, A. Smola, and S.V.N. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.
avg. 1 G. Mann et al., Efficient large-scale distributed training of conditional maximum entropy models, NIPS 2009.
avg. 2 K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for Distributed Optimization, LCCC 2010.
ov. avg M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
P. online D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.
D. Mini 1 O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, http://arxiv.org/abs/1012.1367
D. Mini 2 A. Agarwal and J. Duchi, Distributed delayed stochastic optimization, http://arxiv.org/abs/1009.0571
![Page 75: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/75.jpg)
Vowpal Wabbit Goals for Future Development
1 Native learning reductions. Just like more complicated losses. In development now.
2 Librarification, so people can use VW in their favorite language.
3 Other learning algorithms, as interest dictates.
4 Various further optimizations. (Allreduce can be improved by a factor of 3...)
![Page 76: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/76.jpg)
Reductions
Goal: minimize ℓ on D
Algorithm for optimizing ℓ₀/₁
Transform D into D′
Transform h with small ℓ₀/₁(h, D′) into Rh with small ℓ(Rh, D),
such that if h does well on (D′, ℓ₀/₁), Rh is guaranteed to do well on (D, ℓ).
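This guarantee is usually stated as a regret bound. A schematic sketch (not the precise theorem statement from any one paper), with c denoting the reduction's regret multiplier:

```latex
% Schematic regret-transform guarantee: a reduction R with regret
% multiplier c converts regret \epsilon on the induced problem
% (D', \ell_{0/1}) into regret at most c\,\epsilon on the original
% problem (D, \ell).
\mathrm{reg}_{\ell_{0/1}}(h, D') \le \epsilon
\;\Longrightarrow\;
\mathrm{reg}_{\ell}(R_h, D) \le c\,\epsilon
```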
![Page 77: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/77.jpg)
The transformation
R = transformer from complex examples to simple examples.
R⁻¹ = transformer from simple predictions to a complex prediction.
![Page 78: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/78.jpg)
example: One Against All
Create k binary regression problems, one per class.
For class i predict "Is the label i or not?"
(x, y) ↦ (x, 1(y = 1)), (x, 1(y = 2)), …, (x, 1(y = k))
Multiclass prediction: evaluate all the classifiers and choose the largest scoring label.
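The prediction step can be sketched as below. This is a minimal illustration, not VW's actual oaa.cc: `BinaryScorer` is a hypothetical stand-in for the base binary learner, scoring "is the label of x equal to cls?".

```cpp
#include <cstddef>
#include <vector>

// Hypothetical base-learner interface for illustration.
using BinaryScorer = double (*)(const std::vector<double>& x, std::size_t cls);

// One-against-all prediction: evaluate all k binary classifiers and
// return the largest-scoring label in 1..k (ties go to the lowest label).
std::size_t oaa_predict(const std::vector<double>& x, std::size_t k,
                        BinaryScorer score) {
    std::size_t best = 1;
    double best_score = score(x, 1);
    for (std::size_t i = 2; i <= k; ++i) {
        double s = score(x, i);
        if (s > best_score) { best_score = s; best = i; }
    }
    return best;
}
```

Training is the mirror image: each multiclass example (x, y) is fed to classifier i with binary label 1(y = i).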
![Page 79: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/79.jpg)
The code: oaa.cc
// Parses reduction-specific flags.
void parse_flags(size_t s, void (*base_l)(example*), void (*base_f)())

// Implements R and R⁻¹ using base_l.
void learn(example* ec)

// Cleans any temporary state and calls base_f.
void finish()

The important point: anything fitting this interface is easy to code in VW now, including all forms of feature diddling and creation. And reductions inherit all the input/output/optimization/parallelization of VW!
![Page 80: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/80.jpg)
Reductions implemented
1 One-Against-All (--oaa <k>). The baseline multiclass reduction.
2 Cost-Sensitive One-Against-All (--csoaa <k>). Predicts the cost of each label and minimizes the cost.
3 Weighted All-Pairs (--wap <k>). An alternative to --csoaa with better theory.
4 Cost-Sensitive One-Against-All with Label-Dependent Features (--csoaa_ldf). As csoaa, but features are not shared between labels.
5 WAP with Label-Dependent Features (--wap_ldf).
6 Sequence Prediction (--sequence <k>). A simple implementation of Searn and DAgger for sequence prediction. Uses a cost-sensitive predictor.
![Page 81: Technical Tricks of Vowpal Wabbit](https://reader031.vdocument.in/reader031/viewer/2022013114/5549f299b4c90518488b5521/html5/thumbnails/81.jpg)
Reductions to Implement
[Diagram: "Regret Transform Reductions", a map from learning tasks (Mean Regression, Quantile Regression, AUC Ranking, Classification, IW Classification, k-way Regression, k-Classification, k-Partial Label, k-cost Classification, T-step RL with State Visitation, T-step RL with Demonstration Policy, Dynamic Models, Unsupervised by Self Prediction) to algorithm names (Quicksort, Filter Tree, PECOC, Probing, Costing, Quanting, ECT, Searn, PSDP, Offset Tree), with each arrow labeled by its regret multiplier (1, 4, k−1, k/2, k, Tk, Tk ln T, or "??" where unknown).]