Cluster Parallel
-
8/11/2019 Cluster Parallel
Goals for future from last year
1 Finish scaling up. I want a kilonode program.
2 Native learning reductions. Just like more complicated losses.
3 Other learning algorithms, as interest dictates.
4 Persistent daemonization
-
Some design considerations
Hadoop compatibility: Widely available, scheduling and robustness
Iteration-friendly: Lots of iterative learning algorithms exist
Minimum code overhead: Don't want to rewrite learning algorithms from scratch
Balance communication/computation: Imbalance on either side hurts the system
Scalable: John has nodes aplenty
-
Current system provisions
Hadoop-compatible AllReduce
Various parameter averaging routines
Parallel implementations of adaptive GD, CG, and L-BFGS
Robustness and scalability tested up to 1K nodes and thousands of node hours
-
Basic invocation on single machine
./spanning_tree
../vw --total 2 --node 0 --unique_id 0 -d $1 --span_server localhost > node_0 2>&1 &
../vw --total 2 --node 1 --unique_id 0 -d $1 --span_server localhost
killall spanning_tree
-
Command-line options
--span_server: Location of server for setting up spanning tree
--unique_id (=0): Unique id for cluster parallel job
--total (=1): Total number of nodes used in cluster parallel job
--node (=0): Node id in cluster parallel job
-
Basic invocation on a non-Hadoop cluster
Spanning-tree server: Runs on cluster gateway, organizes communication
./spanning_tree
Worker nodes: Each worker node runs VW
./vw --span_server <server> --total <total> --node <node> --unique_id <id> -d <data>
-
Basic invocation in a Hadoop cluster
Spanning-tree server: Runs on cluster gateway, organizes communication
./spanning_tree
Map-only jobs: Map-only job launched on each node using Hadoop streaming
hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.job.map.memory.mb=2500 -input <input> -output <output> -file vw -file runvw.sh -mapper runvw.sh -reducer NONE
Each mapper runs VW
Model stored in <output>/model on HDFS
runvw.sh calls VW, used to modify VW arguments
-
mapscript.sh example
# Hadoop streaming has no way to specify the number of mappers directly,
# so we compute a split size that yields the desired number
total=   # size of the input data
mapsize=`expr $total / $nmappers`
maprem=`expr $total % $nmappers`
mapsize=`expr $mapsize + $maprem`
./spanning_tree   # start the spanning-tree server on the gateway
# Note the min.split.size argument, which controls the number of mappers
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -Dmapred.min.split.size=$mapsize \
  -Dmapred.map.tasks.speculative.execution=true \
  -input $in_directory -output $out_directory \
  -file ../vw -file runvw.sh -mapper runvw.sh -reducer NONE
-
Communication and computation
Two main additions in cluster-parallel code:
Hadoop-compatible AllReduce communication
New and old optimization algorithms modified for AllReduce
-
Communication protocol
Spanning-tree server runs as a daemon and listens for connections
Workers connect via TCP with a node id and job id
Two workers with the same job id and node id are duplicates; the faster one is kept (speculative execution)
Available as mapper environment variables in Hadoop:
mapper=`printenv mapred_task_id | cut -d "_" -f 5`
mapred_job_id=`echo $mapred_job_id | tr -d 'job_'`
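The duplicate rule above can be sketched as a first-registration-wins map. This is a minimal illustration, not VW's actual server code; all names are hypothetical.

```python
# Sketch of the duplicate rule: the server keys workers by
# (job_id, node_id) and keeps only the first -- i.e. fastest -- worker
# to register; later speculative duplicates are rejected.
def register(workers, job_id, node_id, addr):
    """Try to register a worker; return True if it won the slot."""
    key = (job_id, node_id)
    if key in workers:
        return False        # a faster duplicate already holds the slot
    workers[key] = addr
    return True
```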
-
Communication protocol contd.
Each worker connects to the spanning-tree server
Server creates a spanning tree on the n nodes, communicates parent and children to each node
Node connects to parent and children via TCP
AllReduce run on the spanning tree
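One way the server could lay out the tree is as a complete binary tree over node ids; this is a sketch under that assumption, since the slides do not specify the actual tree shape.

```python
# Sketch of one possible layout: a complete binary tree over node ids
# 0..total-1 (the real server's tree shape may differ).
def tree_links(node, total):
    """Return (parent, children) of `node` in a complete binary tree."""
    parent = None if node == 0 else (node - 1) // 2
    children = [c for c in (2 * node + 1, 2 * node + 2) if c < total]
    return parent, children
```

For example, with 7 nodes, node 0 is the root with children 1 and 2, and nodes 3 to 6 are leaves.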
-
AllReduce
Every node begins with a number (vector)
[Figure: binary tree with root node 1, children 2 and 3, and leaves 4, 5, 6, 7; each node holds its own number]
Extends to other functions: max, average, gather, . . .
-
AllReduce
Every node begins with a number (vector)
[Figure: same tree after the leaves report up; node 2 holds 11 (2+4+5), node 3 holds 16 (3+6+7), the root still holds 1]
Extends to other functions: max, average, gather, . . .
-
AllReduce
Every node begins with a number (vector)
[Figure: the root now holds the total 28 (1+11+16)]
Extends to other functions: max, average, gather, . . .
-
AllReduce
Every node begins with a number (vector)
Every node ends up with the sum
[Figure: the total 28 is broadcast back down the tree; every node now holds 28]
Extends to other functions: max, average, gather, . . .
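The two phases above (reduce up to the root, broadcast back down) can be simulated in a few lines. This is a toy sketch of the idea, not VW's implementation; node numbering matches the slides (root 1, leaves 4 to 7).

```python
# Toy simulation of tree AllReduce (sum): values flow up the tree to the
# root, then the grand total is broadcast back down so every node ends
# with the same sum.
def allreduce_sum(values, children):
    def reduce_up(node):                     # phase 1: sum the subtree
        return values[node] + sum(reduce_up(c) for c in children.get(node, ()))
    grand_total = reduce_up(1)
    return {node: grand_total for node in values}   # phase 2: broadcast

children = {1: (2, 3), 2: (4, 5), 3: (6, 7)}
values = {n: n for n in range(1, 8)}
result = allreduce_sum(values, children)     # every node holds 28
```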
-
AllReduce Examples
Counting: n = allreduce(1)
Average: avg = allreduce(n_i) / allreduce(1)
Non-uniform averaging: weighted_avg = allreduce(n_i * w_i) / allreduce(w_i)
Gather: node_array = allreduce({0, 0, ..., 1, ..., 0}), with the 1 in position i
Current code provides 3 routines:
accumulate(): Computes vector sums
accumulate_scalar(): Computes scalar sums
accumulate_avg(): Computes weighted and unweighted averages
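The four idioms above can be written against a toy allreduce that elementwise-sums one vector per node. The `allreduce` helper here is hypothetical, standing in for VW's accumulate routines.

```python
# Toy allreduce: elementwise sum of equal-length vectors, one per node.
def allreduce(per_node):
    return [sum(xs) for xs in zip(*per_node)]

n_i = [10.0, 20.0, 30.0, 40.0]          # one local value per node
w_i = [1.0, 1.0, 1.0, 5.0]              # per-node weights
nodes = len(n_i)

n = allreduce([[1.0]] * nodes)[0]                              # counting
avg = allreduce([[x] for x in n_i])[0] / n                     # average
wavg = (allreduce([[x * w] for x, w in zip(n_i, w_i)])[0]
        / allreduce([[w] for w in w_i])[0])                    # weighted avg
gather = allreduce([[n_i[i] if j == i else 0.0 for j in range(nodes)]
                    for i in range(nodes)])                    # one-hot gather
```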
-
Machine learning with AllReduce
Previously: Single-node SGD, multiple passes over data
Parallel: Each node runs SGD, averages parameters after every pass (or more often!)
Code change:
if (global.span_server != "") {
  if (global.adaptive)
    accumulate_weighted_avg(global.span_server, params->reg);
  else
    accumulate_avg(global.span_server, params->reg, 0);
}
Weighted averages are computed for adaptive updates, which weight features differently
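The averaging step can be sketched as follows. Names and shapes are illustrative, not VW's internals: plain averaging replaces each node's weights with the mean, while the adaptive variant scales each feature by a per-node, per-feature importance.

```python
# Sketch of parameter averaging across nodes: unweighted (accumulate_avg)
# when `scales` is None, per-feature weighted (accumulate_weighted_avg)
# otherwise. `weights[i][j]` is feature j's weight on node i.
def average_params(weights, scales=None):
    n, d = len(weights), len(weights[0])
    if scales is None:
        scales = [[1.0] * d for _ in range(n)]
    return [sum(weights[i][j] * scales[i][j] for i in range(n))
            / sum(scales[i][j] for i in range(n))
            for j in range(d)]
```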
-
Machine learning with AllReduce contd.
L-BFGS requires gradients and loss values
One call to AllReduce for each
Parallel, synchronized L-BFGS updates
Same with CG, with another AllReduce operation for the Hessian
Extends to many other common algorithms
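One synchronized iteration can be sketched like this: each node contributes a local gradient and loss, both are summed as in AllReduce, and every node applies the identical update. A plain gradient step stands in for the L-BFGS direction; the function is illustrative, not VW's code.

```python
# One synchronized batch-optimizer step. `local_grads` and `local_losses`
# hold each node's contribution; summing them mimics two AllReduce calls.
def synchronized_step(w, local_grads, local_losses, lr=0.1):
    grad = [sum(col) for col in zip(*local_grads)]   # allreduce(gradient)
    loss = sum(local_losses)                          # allreduce(loss)
    w = [wj - lr * gj for wj, gj in zip(w, grad)]     # identical update
    return w, loss
```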
-
Communication and computation
Two main additions in cluster-parallel code:
Hadoop-compatible AllReduce communication
New and old optimization algorithms modified for AllReduce
Hybrid optimization for rapid convergence
-
Hybrid optimization for rapid convergence
SGD converges fast initially, but is slow to squeeze out the final bit of precision
L-BFGS converges rapidly towards the end, once in a good region
Each node performs a few local SGD iterations, averaging after every pass
Switch to L-BFGS with synchronized iterations using AllReduce
Two calls to VW
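The hybrid schedule above can be sketched as a small driver: a few averaged SGD passes warm-start the weights, then synchronized batch steps (L-BFGS in VW) finish the job, matching the two calls to VW. The callables and names here are illustrative.

```python
# Hybrid optimization driver: warm start with averaged SGD passes, then
# switch to synchronized batch steps.
def hybrid_optimize(w, sgd_pass, batch_step, sgd_passes=2, batch_steps=10):
    for _ in range(sgd_passes):
        w = sgd_pass(w)      # local SGD + parameter averaging per pass
    for _ in range(batch_steps):
        w = batch_step(w)    # update from AllReduce-summed gradients/losses
    return w
```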
-
Speedup
Near linear speedup
[Figure: speedup (1x to 10x) vs. number of nodes (10 to 100), showing near-linear scaling]
-
Hadoop helps
Naive implementation is driven by the slowest node
Speculative execution ameliorates the problem
Table: Distribution of computing time (in seconds) over 1000 nodes. The first three columns are quantiles. The first row is without speculative execution; the second row is with it.

                       5%   50%   95%   Max   Comm. time
Without spec. exec.    29   34    60    758   26
With spec. exec.       29   33    49    63    10
-
Fast convergence
auPRC curves for two tasks, higher is better
[Figure: two panels of auPRC vs. iteration, comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS. Left panel: auPRC roughly 0.2 to 0.55 over 50 iterations; right panel: auPRC roughly 0.466 to 0.484 over 20 iterations]
-
Conclusions
AllReduce is quite general, yet easy to use for machine learning
Marriage with Hadoop great for robustness
Hybrid optimization strategies effective for rapid convergence
John gets his kilonode program