Matrix Factorization via SGD
BACKGROUND
Recovering latent factors in a matrix
Given an n × m ratings matrix V (n users, m movies), with V[i,j] = user i's rating of movie j, approximate V by the product of two rank-r factors:
V ≈ W H
where W is n × r (row i holds user i's latent factors, e.g. (x_i, y_i) for r = 2) and H is r × m (column j holds movie j's latent factors, e.g. (a_j, b_j)).
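A minimal numpy sketch of this model (all sizes and names here are illustrative assumptions, not from the talk): build the two factors and read a predicted rating off their product.

    import numpy as np

    n, m, r = 1000, 500, 10            # n users, m movies, rank r (illustrative)
    rng = np.random.default_rng(0)

    W = rng.normal(size=(n, r))        # row i: user i's latent factors
    H = rng.normal(size=(r, m))        # column j: movie j's latent factors

    V_hat = W @ H                      # n x m matrix of predicted ratings
    i, j = 3, 7
    assert np.isclose(V_hat[i, j], W[i, :] @ H[:, j])   # single-entry prediction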
MF VIA SGD
Matrix factorization as SGD
Write the loss as a sum over the N observed entries, L(W, H) = Σ_(i,j) loss(V[i,j], W[i,:], H[:,j]). Each SGD step samples one observed entry (i,j) and takes a step along the local gradient:
(W, H) ← (W, H) − ε_t ∇ loss(V[i,j], W[i,:], H[:,j])
where ε_t is the step size; the local gradient, scaled up by N, approximates the full gradient.
Key claim: for a uniformly sampled entry, the expected value of the scaled local gradient equals the true gradient of L, so the stochastic updates follow the full-batch gradient on average.
What loss functions are possible? Any loss that decomposes entry-wise, e.g. squared loss with or without regularization.
ALS = alternating least squares (an alternative optimizer for the squared-loss objective, solving for W and H in turn).
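As a concrete illustration, here is a hedged single-machine sketch of this update with squared loss; regularization is omitted, and the function name, step size, and constants are assumptions rather than the talk's code.

    import random
    import numpy as np

    def sgd_mf(ratings, n, m, r=10, step=0.01, epochs=20):
        """ratings: list of observed (i, j, V[i,j]) triples."""
        rng = np.random.default_rng(0)
        W = 0.1 * rng.normal(size=(n, r))    # random init, not zero
        H = 0.1 * rng.normal(size=(r, m))
        for _ in range(epochs):
            random.shuffle(ratings)          # visit entries in random order
            for i, j, v in ratings:
                err = W[i, :] @ H[:, j] - v  # residual on this entry
                # local gradient of squared loss wrt the touched row/column;
                # scaled by N it is an unbiased estimate of the full gradient
                gW = err * H[:, j]
                gH = err * W[i, :]
                W[i, :] -= step * gW
                H[:, j] -= step * gH
        return W, H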
DISTRIBUTED MF VIA SGD
talk pilfered from …..
KDD 2011
NAACL 2010
Parallel Perceptrons
• Simplest idea:
  – Split data into S "shards"
  – Train a perceptron on each shard independently
    • weight vectors are w(1), w(2), …
  – Produce some weighted average of the w(i)'s as the final result (see the sketch below)
(Diagram) Parallelizing perceptrons: split the instances/labels into example subsets 1…3, compute vk's on the subsets (vk-1, vk-2, vk-3), then combine them into vk by some sort of weighted averaging.
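A sketch of this one-shot scheme, under the assumption of uniform averaging weights (the slide deliberately leaves the weighting open); all names are illustrative.

    import numpy as np

    def train_perceptron(X, y, w0, passes=1):
        """Classic perceptron; X: (n, d) array, labels y in {-1, +1}."""
        w = w0.copy()
        for _ in range(passes):
            for x, label in zip(X, y):
                if label * (w @ x) <= 0:     # mistake-driven update
                    w += label * x
        return w

    def one_shot_average(shards, d):
        """shards: list of (X, y) pairs, one per shard; d: feature dimension."""
        ws = [train_perceptron(X, y, np.zeros(d)) for X, y in shards]  # independent
        return np.mean(ws, axis=0)           # "some sort of weighted averaging"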
Parallel Perceptrons – take 2
Idea: do the simplest possible thing iteratively.
• Split the data into shards
• Let w = 0
• For n = 1, …
  • Train a perceptron on each shard with one pass, starting with w
  • Average the weight vectors (somehow) and let w be that average (see the sketch below)
Extra communication cost:
• redistributing the weight vectors
• done less frequently than if fully synchronized, more frequently than if fully parallelized
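A sketch of this iterative variant, reusing train_perceptron from the sketch above; uniform averaging is again an assumption.

    # assumes numpy as np and train_perceptron from the previous sketch
    def iterative_average(shards, d, outer_iters=10):
        w = np.zeros(d)                      # let w = 0
        for _ in range(outer_iters):         # for n = 1, ...
            # one pass over each shard, all starting from the shared w
            ws = [train_perceptron(X, y, w, passes=1) for X, y in shards]
            w = np.mean(ws, axis=0)          # average, then redistribute w
        return w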
All-Reduce: a collective operation that leaves every worker holding the same aggregate (here, the averaged weight vector)
(Diagram) Parallelizing perceptrons – take 2: starting from the previous w, split the instances/labels into example subsets 1…3, compute local w's on the subsets (w-1, w-2, w-3), then combine them into a new w by some sort of weighted averaging; that w feeds the next pass.
Similar to McDonald et al. (NAACL 2010) with perceptron learning
Slow convergence…
More detail…
• Initialize W, H randomly
  – not at zero
• Choose a random ordering (random sort) of the points in a stratum in each "sub-epoch"
• Pick the strata sequence by permuting rows and columns of M, and using M'[k,i] as the column index of row i in sub-epoch k
• Use "bold driver" to set the step size (see the sketch below):
  – increase the step size when the loss decreases (in an epoch)
  – decrease the step size when the loss increases
• Implemented in Hadoop and R/Snowfall
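A sketch of the bold-driver heuristic from the list above; the growth and decay factors are illustrative assumptions, not values from the talk.

    def bold_driver(step, prev_loss, loss, grow=1.05, shrink=0.5):
        """Increase the step size after an epoch where the loss decreased;
        cut it sharply after an epoch where the loss increased."""
        return step * grow if loss < prev_loss else step * shrink

    # called once per epoch: step = bold_driver(step, prev_loss, loss)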
Results (plots):
• Wall clock time vs. number of epochs: in-memory implementation in R/snow, 8 nodes, 64 cores
• Varying rank: 100 epochs for all
• Wall clock time on Hadoop: one map-reduce job per epoch
• Hadoop scalability: Hadoop process setup time starts to dominate