Alpine Spark Implementation - Technical

Multinomial Logistic Regression with Apache Spark
DB Tsai, Machine Learning Engineer
May 1st, 2014


DESCRIPTION

Alpine Data Labs presents a deep dive into our implementation of Multinomial Logistic Regression with Apache Spark. Machine Learning Engineer DB Tsai takes us through the technical implementation details step by step. First, he explains how the state of the art of Machine Learning on Hadoop is not fulfilling the promise of Big Data. Next, he explains how Spark is a perfect match for machine learning through its in-memory caching capability, demonstrating up to 100x performance improvement. Third, he walks through each aspect of multinomial logistic regression and how it is developed with the Spark APIs. Fourth, he demonstrates the extension to multinomial logistic regression (MLOR) and its training parameters. Fifth, he benchmarks MLOR with 11M rows, 123 features, and 11% non-zero elements on a 5-node Hadoop cluster. Finally, he shows Alpine's unique visual environment with Spark and verifies the performance with the job tracker. In conclusion, Alpine supports the state-of-the-art Cloudera and Pivotal Hadoop clusters and performs at a level that far exceeds its next nearest competitor.

TRANSCRIPT

Page 1: Alpine Spark Implementation - Technical

Multinomial Logistic Regression with Apache Spark

DB Tsai, Machine Learning Engineer

May 1st, 2014

Page 2: Alpine Spark Implementation - Technical

Machine Learning with Big Data

● Hadoop MapReduce solutions

● MapReduce is scaling well for batch processing.

● However, lots of machine learning algorithms are iterative by nature.

● There are lots of tricks people do, like training with subsamples of data and then averaging the models. Why big data with approximation?


Page 3: Alpine Spark Implementation - Technical

Lightning-fast cluster computing

● Empower users to iterate through the data by utilizing the in-memory cache.

● Logistic regression runs up to 100x faster than Hadoop M/R in memory.

● We're able to train the exact model without doing any approximation.

Page 4: Alpine Spark Implementation - Technical

Binary Logistic Regression

$$d = \frac{a x_1 + b x_2 + c x_0}{\sqrt{a^2 + b^2}}, \quad \text{where } x_0 = 1$$

$$P(y=1 \mid \vec{x}, \vec{w}) = \frac{\exp(d)}{1 + \exp(d)} = \frac{\exp(\vec{x}\,\vec{w})}{1 + \exp(\vec{x}\,\vec{w})}$$

$$P(y=0 \mid \vec{x}, \vec{w}) = \frac{1}{1 + \exp(\vec{x}\,\vec{w})}$$

$$\log \frac{P(y=1 \mid \vec{x}, \vec{w})}{P(y=0 \mid \vec{x}, \vec{w})} = \vec{x}\,\vec{w}$$

$$w_0 = \frac{c}{\sqrt{a^2+b^2}}, \quad w_1 = \frac{a}{\sqrt{a^2+b^2}}, \quad w_2 = \frac{b}{\sqrt{a^2+b^2}}, \quad \text{where } w_0 \text{ is called the intercept}$$

Page 5: Alpine Spark Implementation - Technical

Training Binary Logistic Regression

● Maximum Likelihood estimation: from training data and labels

$$X = (\vec{x}_1, \vec{x}_2, \vec{x}_3, \ldots), \quad Y = (y_1, y_2, y_3, \ldots)$$

we want to find the $\vec{w}$ that maximizes the likelihood of the data, defined by

$$L(\vec{w}, \vec{x}_1, \ldots, \vec{x}_N) = P(y_1 \mid \vec{x}_1, \vec{w})\, P(y_2 \mid \vec{x}_2, \vec{w}) \cdots P(y_N \mid \vec{x}_N, \vec{w})$$

● We can take the log of the equation and equivalently minimize its negative; the (negative) log-likelihood becomes the loss function:

$$\ell(\vec{w}, \vec{x}) = \log P(y_1 \mid \vec{x}_1, \vec{w}) + \log P(y_2 \mid \vec{x}_2, \vec{w}) + \cdots + \log P(y_N \mid \vec{x}_N, \vec{w})$$

Page 6: Alpine Spark Implementation - Technical

Optimization

● First Order Minimizers require the loss and the gradient of the loss function:

– Gradient Descent: $\vec{w}_{n+1} = \vec{w}_n - \gamma \vec{G}$, where $\gamma$ is the step size

– Limited-memory BFGS (L-BFGS)

– Orthant-Wise Limited-memory Quasi-Newton (OWLQN)

– Coordinate Descent (CD)

– Trust Region Newton Method (TRON)

● Second Order Minimizers require the loss, gradient, and Hessian of the loss function:

– Newton-Raphson: $\vec{w}_{n+1} = \vec{w}_n - H^{-1}\vec{G}$, quadratic convergence. Fast!

● Ref: Journal of Machine Learning Research 11 (2010) 3183-3234, Chih-Jen Lin et al.
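As a concrete illustration of the first-order update rule above, here is a minimal Scala sketch of a single gradient descent step; `gradientOfLoss` is a hypothetical placeholder for whatever loss gradient the model provides, not an API from Spark.

```scala
// Minimal sketch of the first-order update w_{n+1} = w_n - gamma * G.
// `gradientOfLoss` is a hypothetical placeholder for the model's gradient function.
def gradientDescentStep(
    weights: Array[Double],
    gradientOfLoss: Array[Double] => Array[Double],
    stepSize: Double): Array[Double] = {
  val g = gradientOfLoss(weights)                          // G at the current weights
  weights.zip(g).map { case (w, gi) => w - stepSize * gi } // w_{n+1} = w_n - gamma * G
}
```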

Page 7: Alpine Spark Implementation - Technical

Problem of Second Order Minimizer

● Scale horizontally (in the number of training samples) by leveraging Spark to parallelize this iterative optimization process.

● Don't scale vertically (in the number of training features): the dimension of the Hessian is

$$\dim(H) = \left[(k-1)(n+1)\right]^2, \quad \text{where } k \text{ is the number of classes and } n \text{ is the number of features}$$

● Recent applications from document classification and computational linguistics are of this type.

Page 8: Alpine Spark Implementation - Technical

L-BFGS

● It's a quasi-Newton method.

● The Hessian matrix of second derivatives doesn't need to be evaluated directly.

● The Hessian matrix is approximated using gradient evaluations.

● It converges much faster than the default optimizer in Spark, Gradient Descent.

● We love open source! Alpine Data Labs contributed our L-BFGS to Spark, and it's already merged in SPARK-1157.

Page 9: Alpine Spark Implementation - Technical

Training Binary Logistic Regression

$$\ell(\vec{w}, \vec{x}) = \sum_{k=1}^{N} \log P(y_k \mid \vec{x}_k, \vec{w})$$

$$= \sum_{k=1}^{N} y_k \log P(y_k=1 \mid \vec{x}_k, \vec{w}) + (1 - y_k) \log P(y_k=0 \mid \vec{x}_k, \vec{w})$$

$$= \sum_{k=1}^{N} y_k \log \frac{\exp(\vec{x}_k \vec{w})}{1 + \exp(\vec{x}_k \vec{w})} + (1 - y_k) \log \frac{1}{1 + \exp(\vec{x}_k \vec{w})}$$

$$= \sum_{k=1}^{N} y_k\, \vec{x}_k \vec{w} - \log\big(1 + \exp(\vec{x}_k \vec{w})\big)$$

Gradient:

$$G_i(\vec{w}, \vec{x}) = \frac{\partial \ell(\vec{w}, \vec{x})}{\partial w_i} = \sum_{k=1}^{N} y_k x_{ki} - \frac{\exp(\vec{x}_k \vec{w})}{1 + \exp(\vec{x}_k \vec{w})}\, x_{ki}$$

Hessian:

$$H_{ij}(\vec{w}, \vec{x}) = \frac{\partial^2 \ell(\vec{w}, \vec{x})}{\partial w_i\, \partial w_j} = -\sum_{k=1}^{N} \frac{\exp(\vec{x}_k \vec{w})}{\big(1 + \exp(\vec{x}_k \vec{w})\big)^2}\, x_{ki}\, x_{kj}$$
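To make the formulas above concrete, here is a hedged Scala sketch of the per-sample loss and gradient; the sign is flipped so it returns the negative log-likelihood contribution expected by a minimizer. It is an illustrative helper, not MLlib's Gradient API.

```scala
// Sketch: per-sample binary logistic loss and gradient from the formulas above.
// loss_k = log(1 + exp(x_k.w)) - y_k * (x_k.w)   (negative log-likelihood term)
// grad_i = (exp(x_k.w) / (1 + exp(x_k.w)) - y_k) * x_ki
def lossAndGradient(x: Array[Double], y: Double,
                    w: Array[Double]): (Double, Array[Double]) = {
  val margin = x.zip(w).map { case (xi, wi) => xi * wi }.sum // x_k . w
  val loss   = math.log1p(math.exp(margin)) - y * margin
  val p      = math.exp(margin) / (1.0 + math.exp(margin))   // P(y = 1 | x, w)
  val grad   = x.map(xi => (p - y) * xi)
  (loss, grad)
}
```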

Page 10: Alpine Spark Implementation - Technical

Overfitting

$$P(y=1 \mid \vec{x}, \vec{w}) = \frac{\exp(z d)}{1 + \exp(z d)} = \frac{\exp(\vec{x}\,\vec{w})}{1 + \exp(\vec{x}\,\vec{w})}$$

Page 11: Alpine Spark Implementation - Technical

Regularization

● The loss function becomes

$$\ell_{total}(\vec{w}, \vec{x}) = \ell_{model}(\vec{w}, \vec{x}) + \ell_{reg}(\vec{w})$$

● The loss function of the regularizer doesn't depend on the data. Common regularizers are

– L2 Regularization: $\ell_{reg}(\vec{w}) = \lambda \sum_{i=1}^{N} w_i^2$

– L1 Regularization: $\ell_{reg}(\vec{w}) = \lambda \sum_{i=1}^{N} |w_i|$

● The L1 norm is not differentiable at zero!

● The gradient and Hessian of the regularizer simply add to those of the model:

$$\vec{G}(\vec{w}, \vec{x})_{total} = \vec{G}(\vec{w}, \vec{x})_{model} + \vec{G}(\vec{w})_{reg}$$

$$\bar{H}(\vec{w}, \vec{x})_{total} = \bar{H}(\vec{w}, \vec{x})_{model} + \bar{H}(\vec{w})_{reg}$$
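A small Scala sketch of how the L2 regularizer folds into the total loss and gradient; `lambda` and the helper name are illustrative, and the L1 case is omitted since its subgradient at zero needs special handling (e.g. OWLQN).

```scala
// Sketch: add L2 regularization, l_reg(w) = lambda * sum_i w_i^2, to the model's
// loss and gradient. The Hessian contribution would simply be 2 * lambda * I.
def addL2Regularization(modelLoss: Double, modelGrad: Array[Double],
                        w: Array[Double], lambda: Double): (Double, Array[Double]) = {
  val regLoss   = lambda * w.map(wi => wi * wi).sum
  val totalGrad = modelGrad.zip(w).map { case (g, wi) => g + 2.0 * lambda * wi }
  (modelLoss + regLoss, totalGrad)
}
```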

Page 12: Alpine Spark Implementation - Technical

Mini School of Spark APIs

● map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

● reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. No "key" thing here, compared with Hadoop M/R.

Page 13: Alpine Spark Implementation - Technical

Example – No More “Word Count”! Let's compute the mean of numbers.

[Diagram: the values 3.0 and 2.0 sit on Executor 1, and 1.0 and 5.0 on Executor 2. map turns each number into a (value, 1) pair: (3.0, 1), (2.0, 1), (1.0, 1), (5.0, 1). reduce sums the pairs, e.g. (3.0, 1) + (2.0, 1) = (5.0, 2), and after the shuffle the final reduce yields (11.0, 4), i.e. a mean of 11.0 / 4 = 2.75.]
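The same mean-of-numbers example, written out as a hedged Spark sketch with map(func) and reduce(func); it assumes an existing SparkContext named `sc` (e.g. in the Spark shell).

```scala
// Sketch of the diagram above: map each value to a (value, 1) pair and reduce by
// summing both components. Assumes an existing SparkContext `sc`.
val numbers = sc.parallelize(Seq(3.0, 2.0, 1.0, 5.0), 2)      // two partitions
val (sum, count) = numbers
  .map(x => (x, 1L))                                          // (3.0, 1), (2.0, 1), ...
  .reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) } // (11.0, 4)
val mean = sum / count                                        // 2.75
```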

Page 14: Alpine Spark Implementation - Technical

● The previous example using map(func) will create a new tuple object for every element, which may cause Garbage Collection issues.

● Like the combiner in Hadoop MapReduce, for each tuple emitted from map(func) there is no guarantee that they will be combined locally in the same executor by reduce(func). This may increase the traffic in the shuffle phase.

● In Hadoop, we address this with an in-mapper combiner or by aggregating the result in global variables whose scope is the entire partition. The same approach can be used in Spark.

Page 15: Alpine Spark Implementation - Technical

Mini School of Spark APIs

● mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T.

● This API allows us to have global variables scoped to the entire partition to aggregate the result locally and efficiently.

Page 16: Alpine Spark Implementation - Technical

Better Mean of Numbers Implementation

[Diagram: with mapPartitions, each executor initializes partition-scoped variables `var counts = 0; var sum = 0.0` and loops over its values. Executor 1 folds 3.0 and 2.0 into (sum = 5.0, counts = 2) and emits (5.0, 2); Executor 2 folds 1.0 and 5.0 into (sum = 6.0, counts = 2) and emits (6.0, 2). After the shuffle, reduce yields (11.0, 4).]
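A hedged Scala sketch of the mapPartitions version of the diagram above, again assuming an existing SparkContext `sc`; each partition keeps one local (sum, count) accumulator and emits a single tuple.

```scala
// Sketch: one (sum, counts) accumulator per partition, then a single reduce.
val numbers = sc.parallelize(Seq(3.0, 2.0, 1.0, 5.0), 2)
val (sum, count) = numbers
  .mapPartitions { iter =>
    var counts = 0L
    var sum = 0.0
    iter.foreach { x => counts += 1; sum += x }               // the "loop" in the diagram
    Iterator.single((sum, counts))                            // (5.0, 2) and (6.0, 2)
  }
  .reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) } // (11.0, 4)
val mean = sum / count
```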

Page 17: Alpine Spark Implementation - Technical

More Idiomatic Scala Implementation

● aggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

● zeroValue is a neutral "zero value" with type U for initialization. This is analogous to `var counts = 0; var sum = 0.0` in the previous example.

● seqOp is a function taking (U, T) => U, where U is the aggregator initialized as zeroValue and T is each line of values in the RDD. This is analogous to mapPartitions in the previous example.

● combOp is a function taking (U, U) => U. This essentially combines the results between different executors. The functionality is the same as reduce(func) in the previous example.
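The same mean computation with aggregate, as a hedged sketch (again assuming `sc`); zeroValue, seqOp, and combOp map one-to-one onto the roles described above.

```scala
// Sketch: mean of numbers with aggregate. zeroValue = (0.0, 0L) plays the role of
// the local variables, seqOp folds each element into the accumulator, and combOp
// merges accumulators across executors (same functionality as reduce(func)).
val numbers = sc.parallelize(Seq(3.0, 2.0, 1.0, 5.0), 2)
val (sum, count) = numbers.aggregate((0.0, 0L))(
  { case ((s, c), x)          => (s + x, c + 1) },            // seqOp
  { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }         // combOp
)
val mean = sum / count                                        // 2.75
```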

Page 18: Alpine Spark Implementation - Technical

Approach of Parallelization in Spark

The work is split between the Spark Driver JVM and the Spark Executor JVMs over time:

Driver:

1) Find available resources in the cluster and launch the executor JVMs. Also initialize the weights.

2) Ask the executors to load the data into the executors' JVMs.

3) Ask the executors to compute the loss and gradient of each training sample (each row) given the current weights. Get the aggregated results after the Reduce Phase in the executors.

5) If regularization is enabled, compute the loss and gradient of the regularizer in the driver, since it doesn't depend on the training data but only on the weights. Add them to the results from the executors.

6) Plug the loss and gradient from the model and the regularizer into the optimizer to get the new weights. If the differences in weights and losses are larger than the criteria, GO BACK TO 3).

7) Finish the model training!

Executors:

2) Try to load the data into memory. If the data is bigger than memory, it will be partially cached. (The data locality from the source is taken care of by Spark.)

3) Map Phase: compute the loss and gradient of each row of training data locally, given the weights obtained from the driver. Can either emit each result or sum them up in local aggregators.

4) Reduce Phase: sum up the losses and gradients emitted from the Map Phase.

(While the driver handles steps 5-7, the executors are taking a rest!)
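Putting the steps together, here is a hedged, end-to-end Scala sketch of the driver loop with plain gradient descent standing in for the optimizer; the tiny inline dataset, the fixed iteration count, the step size, and `lambda` are all illustrative values, and `sc` is assumed to exist.

```scala
// Hedged sketch of steps 1-7 with gradient descent as the optimizer.
val data = sc.parallelize(Seq(
  (1.0, Array(1.0, 0.3, 2.1)),        // (label, features); features(0) = 1.0 is the intercept term
  (0.0, Array(1.0, -1.2, 0.4))
)).cache()                            // step 2: load (and cache) the data on the executors

var weights = Array.fill(3)(0.0)      // step 1: initialize the weights in the driver
val stepSize = 1.0
val lambda = 0.1                      // L2 regularization parameter

for (iteration <- 1 to 100) {         // in practice, stop on a convergence criterion (step 6)
  // Steps 3-4: map phase computes per-row gradients, reduce phase sums them up.
  val gradientSum = data.map { case (y, x) =>
    val margin = x.zip(weights).map { case (xi, wi) => xi * wi }.sum
    val p = math.exp(margin) / (1.0 + math.exp(margin))
    x.map(xi => (p - y) * xi)
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

  // Step 5: regularizer gradient is computed in the driver (it needs no data).
  val totalGradient = gradientSum.zip(weights).map { case (g, w) => g + 2.0 * lambda * w }

  // Step 6: optimizer update; here a single gradient descent step.
  weights = weights.zip(totalGradient).map { case (w, g) => w - stepSize * g }
}
// Step 7: training finished; `weights` holds the model.
```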

Page 19: Alpine Spark Implementation - Technical

Step 3) and 4)

● This is the implementation of step 3) and 4) in MLlib before Spark 1.0.

● gradient can have implementations of Logistic Regression, Linear Regression, SVM, or any customized cost function.

● Each training data point will create a new "grad" object after gradient.compute.
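Since the slide describes the pre-Spark-1.0 MLlib pattern in prose, here is a hedged Scala sketch of that map/reduce pattern, not the verbatim MLlib source; `Gradient` below is a simplified stand-in whose compute method returns a brand-new gradient array (the "grad" object) for every training row.

```scala
import org.apache.spark.rdd.RDD

// Simplified stand-in for MLlib's gradient abstraction (NOT the real API):
// compute returns a fresh (gradient, loss) pair for each row.
trait Gradient extends Serializable {
  def compute(features: Array[Double], label: Double,
              weights: Array[Double]): (Array[Double], Double)
}

// Steps 3) and 4) in the map/reduce style: one new "grad" object per row,
// plus more new arrays created while reducing.
def gradientAndLoss(data: RDD[(Double, Array[Double])],
                    weights: Array[Double],
                    gradient: Gradient): (Array[Double], Double) =
  data.map { case (label, features) =>
    gradient.compute(features, label, weights)
  }.reduce { case ((g1, l1), (g2, l2)) =>
    (g1.zip(g2).map { case (a, b) => a + b }, l1 + l2)
  }
```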

Page 20: Alpine Spark Implementation - Technical

Step 3) and 4) with mapPartitions
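As a hedged illustration of the approach this slide names (not the verbatim MLlib source), here is a sketch of steps 3) and 4) with mapPartitions, reusing the simplified `Gradient` stand-in from the previous sketch: one mutable accumulator per partition instead of one new object per row.

```scala
// Sketch: steps 3) and 4) with mapPartitions. The per-partition arrays act as the
// "global variables with partition scope" described earlier.
def gradientAndLossByPartition(data: org.apache.spark.rdd.RDD[(Double, Array[Double])],
                               weights: Array[Double],
                               gradient: Gradient): (Array[Double], Double) =
  data.mapPartitions { iter =>
    val gradSum = Array.fill(weights.length)(0.0)
    var lossSum = 0.0
    iter.foreach { case (label, features) =>
      val (g, l) = gradient.compute(features, label, weights)
      var i = 0
      while (i < g.length) { gradSum(i) += g(i); i += 1 }
      lossSum += l
    }
    Iterator.single((gradSum, lossSum))        // one tuple emitted per partition
  }.reduce { case ((g1, l1), (g2, l2)) =>
    (g1.zip(g2).map { case (a, b) => a + b }, l1 + l2)
  }
```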

Page 21: Alpine Spark Implementation - Technical

Step 3) and 4) with aggregate

● This is the implementation of step 3) and 4) in MLlib in the coming Spark 1.0.

● No unnecessary object creation! This is helpful when we're dealing with training data with large numbers of features. GC will not kick in within the executor JVMs.
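As a hedged illustration of the aggregate-based approach (not the MLlib 1.0 source), the sketch below inlines the binary logistic gradient so the per-partition accumulator is updated in place and no per-row objects are created.

```scala
// Sketch: steps 3) and 4) with aggregate. zeroValue is the (gradSum, lossSum)
// accumulator; seqOp folds each row into it in place; combOp merges partial
// results across executors.
def gradientAndLossByAggregate(data: org.apache.spark.rdd.RDD[(Double, Array[Double])],
                               weights: Array[Double]): (Array[Double], Double) =
  data.aggregate((Array.fill(weights.length)(0.0), 0.0))(
    { case ((gradSum, lossSum), (label, features)) =>       // seqOp
      var margin = 0.0
      var i = 0
      while (i < features.length) { margin += features(i) * weights(i); i += 1 }
      val p = math.exp(margin) / (1.0 + math.exp(margin))
      i = 0
      while (i < features.length) { gradSum(i) += (p - label) * features(i); i += 1 }
      (gradSum, lossSum + math.log1p(math.exp(margin)) - label * margin)
    },
    { case ((g1, l1), (g2, l2)) =>                           // combOp
      var i = 0
      while (i < g1.length) { g1(i) += g2(i); i += 1 }
      (g1, l1 + l2)
    }
  )
```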

Page 22: Alpine Spark Implementation - Technical

Extension to Multinomial Logistic Regression

● In Binary Logistic Regression

$$\log \frac{P(y=1 \mid \vec{x}, \vec{w})}{P(y=0 \mid \vec{x}, \vec{w})} = \vec{x}\,\vec{w}$$

● For a K-class multinomial problem, where the labels range over [0, K-1], we can generalize it via

$$\log \frac{P(y=1 \mid \vec{x}, \bar{w})}{P(y=0 \mid \vec{x}, \bar{w})} = \vec{x}\,\vec{w}_1, \quad \log \frac{P(y=2 \mid \vec{x}, \bar{w})}{P(y=0 \mid \vec{x}, \bar{w})} = \vec{x}\,\vec{w}_2, \quad \ldots, \quad \log \frac{P(y=K-1 \mid \vec{x}, \bar{w})}{P(y=0 \mid \vec{x}, \bar{w})} = \vec{x}\,\vec{w}_{K-1}$$

● The model weights $\bar{w} = (\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_{K-1})^T$ become a (K-1)(N+1) matrix, where N is the number of features.

● The resulting class probabilities are

$$P(y=0 \mid \vec{x}, \bar{w}) = \frac{1}{1 + \sum_{i=1}^{K-1} \exp(\vec{x}\,\vec{w}_i)}, \quad P(y=1 \mid \vec{x}, \bar{w}) = \frac{\exp(\vec{x}\,\vec{w}_1)}{1 + \sum_{i=1}^{K-1} \exp(\vec{x}\,\vec{w}_i)}, \quad \ldots, \quad P(y=K-1 \mid \vec{x}, \bar{w}) = \frac{\exp(\vec{x}\,\vec{w}_{K-1})}{1 + \sum_{i=1}^{K-1} \exp(\vec{x}\,\vec{w}_i)}$$
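To make the generalization concrete, a small Scala sketch that computes the class probabilities defined above; `w` holds the K-1 weight vectors, class 0 acts as the pivot class, and the function name is illustrative only.

```scala
// Sketch: class probabilities for K-class multinomial logistic regression.
// w(0), ..., w(K-2) are the K-1 weight vectors; class 0 is the pivot.
def classProbabilities(x: Array[Double], w: Array[Array[Double]]): Array[Double] = {
  val expMargins = w.map(wi => math.exp(x.zip(wi).map { case (xj, wj) => xj * wj }.sum))
  val denom = 1.0 + expMargins.sum
  (1.0 / denom) +: expMargins.map(_ / denom)   // P(y=0), P(y=1), ..., P(y=K-1)
}
```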

Page 23: Alpine Spark Implementation - Technical

Training Multinomial Logistic Regression

Define $\alpha(y_k) = 1$ if $y_k = 0$, and $\alpha(y_k) = 0$ if $y_k \neq 0$. Then

$$\ell(\bar{w}, \vec{x}) = \sum_{k=1}^{N} \log P(y_k \mid \vec{x}_k, \bar{w})$$

$$= \sum_{k=1}^{N} \alpha(y_k) \log P(y=0 \mid \vec{x}_k, \bar{w}) + (1 - \alpha(y_k)) \log P(y_k \mid \vec{x}_k, \bar{w})$$

$$= \sum_{k=1}^{N} \alpha(y_k) \log \frac{1}{1 + \sum_{i=1}^{K-1} \exp(\vec{x}_k \vec{w}_i)} + (1 - \alpha(y_k)) \log \frac{\exp(\vec{x}_k \vec{w}_{y_k})}{1 + \sum_{i=1}^{K-1} \exp(\vec{x}_k \vec{w}_i)}$$

$$= \sum_{k=1}^{N} (1 - \alpha(y_k))\, \vec{x}_k \vec{w}_{y_k} - \log\Big(1 + \sum_{i=1}^{K-1} \exp(\vec{x}_k \vec{w}_i)\Big)$$

Gradient (the first index $i$ is for classes and the second index $j$ is for features):

$$G_{ij}(\bar{w}, \vec{x}) = \frac{\partial \ell(\bar{w}, \vec{x})}{\partial w_{ij}} = \sum_{k=1}^{N} (1 - \alpha(y_k))\, x_{kj}\, \delta_{i, y_k} - \frac{\exp(\vec{x}_k \vec{w}_i)}{1 + \sum_{i'=1}^{K-1} \exp(\vec{x}_k \vec{w}_{i'})}\, x_{kj}$$

Hessian:

$$H_{ij,lm}(\bar{w}, \vec{x}) = \frac{\partial^2 \ell(\bar{w}, \vec{x})}{\partial w_{ij}\, \partial w_{lm}}$$

Page 24: Alpine Spark Implementation - Technical

a9a Dataset Benchmark

[Plot: Logistic Regression with the a9a dataset (11M rows, 123 features, 11% non-zero elements); 16 executors on an INTEL Xeon E3-1230v3, 32GB memory * 5 nodes Hadoop 2.0.5-alpha cluster. Series: L-BFGS Dense Features, L-BFGS Sparse Features, GD Dense Features, GD Sparse Features. X-axis: Seconds; Y-axis: Log-Likelihood / Number of Samples.]

Page 25: Alpine Spark Implementation - Technical

a9a Dataset Benchmark

[Plot: Logistic Regression with the a9a dataset (11M rows, 123 features, 11% non-zero elements); 16 executors on an INTEL Xeon E3-1230v3, 32GB memory * 5 nodes Hadoop 2.0.5-alpha cluster. Series: L-BFGS, GD. X-axis: Iterations; Y-axis: Log-Likelihood / Number of Samples.]

Page 26: Alpine Spark Implementation - Technical

rcv1 Dataset Benchmark

[Plot: Logistic Regression with the rcv1 dataset (6.8M rows, 677,399 features, 0.15% non-zero elements); 16 executors on an INTEL Xeon E3-1230v3, 32GB memory * 5 nodes Hadoop 2.0.5-alpha cluster. Series: L-BFGS Sparse Vector, GD Sparse Vector. X-axis: Seconds; Y-axis: Log-Likelihood / Number of Samples.]

Page 27: Alpine Spark Implementation - Technical

news20 Dataset Benchmark

[Plot: Logistic Regression with the news20 dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements); 16 executors on an INTEL Xeon E3-1230v3, 32GB memory * 5 nodes Hadoop 2.0.5-alpha cluster. Series: L-BFGS Sparse Vector, GD Sparse Vector. X-axis: Seconds; Y-axis: Log-Likelihood / Number of Samples.]

Page 28: Alpine Spark Implementation - Technical

Alpine Demo

Page 29: Alpine Spark Implementation - Technical

Alpine Demo

Page 30: Alpine Spark Implementation - Technical

Alpine Demo

Page 31: Alpine Spark Implementation - Technical

Conclusion

● Alpine supports the state-of-the-art Cloudera CDH5.

● Spark runs programs up to 100x faster than Hadoop.

● Spark turns iterative big data machine learning problems into single-machine problems.

● MLlib provides lots of state-of-the-art machine learning implementations, e.g. K-means, linear regression, logistic regression, ALS, collaborative filtering, and Naive Bayes.

● Spark provides easy-to-use APIs in Java, Scala, or Python, and interactive Scala and Python shells.

● Spark is an Apache project, and it's the most active big data open source project nowadays.