Implementation of Linear Regression and Logistic Regression on Spark


Parallel implementation of ML algorithms on Spark

Dalei Li, EIT Digital

https://github.com/lidalei/LinearLogisticRegSpark

1

Overview

• Linear regression + l2 regularization

• Normal equation

• Logistic regression + l2 regularization

• Gradient descent

• Newton’s method

• Hyper-parameter optimization

• Experiments

2

Tools

• IntelliJ + sbt

• Scala 2.11.8 + Spark 2.0.1

3

Linear regression

• Problem formulation

• Closed-form solution

• Computation reformulation

4

Linear regression

• Data set - UCI YearPredictionMSD, text file

• 515,345 songs, (90 audio numerical features, year)

• Core computation - normal equation terms and RMSE

5

Implemented as outer product + vector addition.
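A sketch of that reformulation (Breeze-based; the function name and types are my assumptions, not necessarily the repo's actual code): X^T X is a sum of per-row outer products x x^T and X^T y a sum of label-scaled rows, so both terms can be accumulated in a single pass, and the small d x d system (X^T X + lambda I) theta = X^T y is then solved on the driver.

import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// Accumulate X^T X and X^T y in one pass over (features, label) rows.
// In-place += avoids allocating a new d x d matrix per row; each task
// gets its own deserialized copy of the zero value, so mutation is safe.
def normalEquationTerms(data: RDD[(DenseVector[Double], Double)], d: Int)
    : (DenseMatrix[Double], DenseVector[Double]) =
  data.treeAggregate((DenseMatrix.zeros[Double](d, d), DenseVector.zeros[Double](d)))(
    { case ((xtx, xty), (x, y)) => (xtx += x * x.t, xty += x * y) },
    { case ((a1, b1), (a2, b2)) => (a1 += a2, b1 += b2) }
  )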

Workflow

6

Read file (Spark SQL text) → RegexTokenizer → StandardScaler (center data) → solve normal equation (add l2 regularization; LAPACK) → Evaluation (RMSE)

Validation

7

Spark ML linear regression with the normal-equation solver vs. my implementation (both with 0.1 l2 regularization)

Randomly split the data set into 70% train + 30% test. The RMSEs on the test set are also nearly identical, with less than 0.5% difference.
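The Spark ML side of the comparison presumably looks like this minimal sketch (trainDF with "features"/"label" columns is an assumption; solver "normal" supports only l2, so elasticNetParam is 0):

import org.apache.spark.ml.regression.LinearRegression

// Spark ML baseline: normal-equation solver with 0.1 l2 regularization.
val lr = new LinearRegression()
  .setSolver("normal")
  .setRegParam(0.1)
  .setElasticNetParam(0.0)
val model = lr.fit(trainDF)
println(s"RMSE = ${model.summary.rootMeanSquaredError}")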

Logistic regression

• Problem formulation

• Gradient descent

• Newton’s method

• Computation reformulation - gradient and Hessian matrix

8

Logistic regression

• Data set - UCI HIGGS, CSV file

• 11 million instances, (21+7 numerical features, binary label)

• Core computation - gradient and Hessian matrix

9

treeReduce reduces the pressure of the final aggregation on the driver.
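A minimal sketch of that pattern for the logistic gradient and Hessian (Breeze-based; names are assumptions): per instance, the gradient contribution is (sigmoid(theta^T x) - y) x and the Hessian contribution is p(1 - p) x x^T, and treeAggregate sums them tree-wise across partitions.

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid
import org.apache.spark.rdd.RDD

case class Instance(features: DenseVector[Double], label: Double)

// One pass computes both terms; partial sums are merged in a tree so the
// driver only combines a few d x d matrices instead of one per partition.
def gradientAndHessian(data: RDD[Instance], theta: DenseVector[Double])
    : (DenseVector[Double], DenseMatrix[Double]) = {
  val d = theta.length
  data.treeAggregate((DenseVector.zeros[Double](d), DenseMatrix.zeros[Double](d, d)))(
    { case ((g, h), Instance(x, y)) =>
      val p = sigmoid(theta dot x)
      (g += x * (p - y), h += x * x.t * (p * (1.0 - p)))
    },
    { case ((g1, h1), (g2, h2)) => (g1 += g2, h1 += h2) }
  )
}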

Workflow

10

Read file (Spark SQL csv) → VectorAssembler → DF to RDD (Scala case class Instance(features, label)) → gradient descent / Newton's method (gradient: add l2 regularization; Newton's: append an all-one column) → Evaluation (cross entropy, confusion matrix)

Validation

11

Spark ML logistic regression with L-BFGS vs. my implementation of Newton’s method

Randomly split the data set into 70% train + 30% test. The learned thetas are almost identical; the last element is the bias term.
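For reference, a minimal sketch of the Spark ML baseline in the comparison (the regularization value and column names are assumptions; Spark ML's LogisticRegression optimizes with L-BFGS by default):

import org.apache.spark.ml.classification.LogisticRegression

// Spark ML baseline: L-BFGS under the hood; elasticNetParam 0 makes
// regParam a pure l2 penalty. trainDF has "features"/"label" columns.
val lr = new LogisticRegression()
  .setRegParam(0.1)
  .setElasticNetParam(0.0)
  .setMaxIter(100)
val model = lr.fit(trainDF)
// Compare against the hand-written Newton's method: the last theta
// element corresponds to the bias, i.e., Spark ML's intercept.
println(model.coefficients, model.intercept)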

Hyper-parameter optimization

• Grid search to find optimal hyper-parameters with the best generalization error

• Estimate generalization error

• k-Fold cross validation

12

A hyper-parameter is a parameter used in the training process but is not part of the classifier itself. It controls which parameters can, or tend to, be selected. For example, polynomial expansion makes it possible to learn a non-linear relationship between the label and the features.

Grid search

• Grid - [polynomial expansion degree] x [l2 regularization]

• Polynomial expansion is a memory killer

• Degree 3 on 7 features results in 119 features

• Be careful with exploiting parallelism

13

To increase temporal locality, accesses to a data frame are clustered in time.

Polynomial expansion does not include the constant column.
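A minimal sketch of the grid and the expansion step, using Spark ML's PolynomialExpansion (column names are assumptions):

import org.apache.spark.ml.feature.PolynomialExpansion

// Grid: polynomial degree x l2 regularization, values from the slide.
val grid = for {
  degree <- Seq(1, 2, 3)
  reg    <- Seq(0.0, 0.001, 0.01, 0.1, 0.5)
} yield (degree, reg)

// Degree 3 on 7 input features yields 119 output features; no constant
// column is added, matching the note above.
val pe = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(3)
val expanded = pe.transform(df)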

K-Fold

14

DF (Spark SQL data frame) → persist, randomSplit → map => [([train_i], test)] as [([DF], DF)] → map => [(train, test)] as [(union[DF], DF)]
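A minimal sketch of that mapping, assuming k equal-weight splits of a persisted data frame:

import org.apache.spark.sql.DataFrame

// [([DF], DF)] => [(union[DF], DF)]: split once into k folds, then pair
// each fold (test) with the union of the remaining folds (train).
def kFold(df: DataFrame, k: Int): Seq[(DataFrame, DataFrame)] = {
  df.persist()  // the folds are read k times
  val folds = df.randomSplit(Array.fill(k)(1.0 / k))
  folds.indices.map { i =>
    val train = folds.indices.filter(_ != i).map(j => folds(j)).reduce(_ union _)
    (train, folds(i))
  }
}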

15

[Figure: k-Fold cross validation combined with polynomial expansion (PE)]

Experiments

16

Spark 2.0.2 standalone mode

3 cores + 5GB mem per executor; an exact copy of the read-in file on each worker node

http://spark.apache.org/docs/latest/cluster-overview.html

In total, we have 3 physical machines with 12GB mem + 8 cores each.

• Driver - executes the Scala program

• Worker - executes tasks

• Executor - each application runs one or more processes on a worker node

• Job - triggered by an action

• Task - a unit of work executed on an executor; the number of tasks is tied to the number of partitions >= number of blocks (128MB each). If set manually, use 2-4 partitions for each CPU in your cluster.

• Stage - a set of tasks

Local file - the same path and content must exist on each worker node.
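A minimal sketch of overriding the defaults when building the session (the values follow the setup above; the app name is an assumption):

import org.apache.spark.sql.SparkSession

// Raise the 1GB default executor memory and pin 3 cores per executor.
val spark = SparkSession.builder()
  .appName("LinearLogisticRegSpark")
  .master("spark://b2.lxd:7077")
  .config("spark.executor.memory", "5g")
  .config("spark.executor.cores", "3")
  .getOrCreate()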

Performance test

• ML settings

• Logistic regression on HIGGS

• Train-test split, 70% + 30%

• Only 7 high level features were used

• Test unit 1 - 100 iterations of full gradient descent + training error on the training set, initial learning rate 0.001, l2 regularization 0.1 (sketched after this list)

• Test unit 2 - make predictions on the test set and compute the confusion matrix
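As a sketch of what test unit 1 measures (the 1/sqrt(t) learning-rate decay and the gradient helper are assumptions, not the repo's confirmed schedule):

import breeze.linalg.DenseVector

// 100 full-batch steps: theta <- theta - eta_t * (grad + lambda * theta),
// with initial learning rate 0.001 and l2 regularization 0.1.
var theta = DenseVector.zeros[Double](numFeatures)  // numFeatures is assumed
for (t <- 1 to 100) {
  val grad = gradient(instances, theta) + theta * 0.1  // gradient() is hypothetical
  theta = theta - grad * (0.001 / math.sqrt(t.toDouble))
}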

17

Performance and speedup curve

18

[Figure: training time (s) and training speedup vs. configuration]

Training speedup: local 1, 1 executor 1.822, 2 executors 2.372, 3 executors 2.693, 4 executors 3.641, 5 executors 4.43

Running time vs. #executors (average of 2 runs). Except for local mode, all tests have enough memory.

Local mode does not have enough memory, so the data cannot be persisted in memory. Thus, the running time is much higher.

Adding more executors reduces the running time roughly linearly.

Grid search

• 10% of the original data, i.e., 1.1 million instances, 7 high-level features only

• Grid

• Polynomial degrees - 1, 2, 3

• l2 regularization - 0, 0.001, 0.01, 0.1, 0.5

• 3-Fold cross validation

• 100 iterations of gradient descent with initial learning rate 0.01

• 2 executors with 10GB mem + 5 cores each

• Result - 4400s training time, final test accuracy 62.4%

19

Confusion matrix: truePositive: 117605, trueNegative: 88664, falsePositive: 66529, falseNegative: 57786

Conclusion

• Persist data that is used more than once (incl. when the computation branches); see the sketch after this list

• Change the default cluster settings, e.g., the default executor memory is 1GB

• Make use of Spark UI to find bottlenecks

• Use Spark built-in functions if possible

• They are good examples when implementing missing functions

• Don’t use accumulators in a transformation unless only approximations are needed

• Always start from small data to debug faster

• Future work - strictly obey the train-test split
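A minimal sketch of the persist-then-branch pattern from the first conclusion (the storage level is an assumption):

import org.apache.spark.storage.StorageLevel

// Both branches of the split read df; persisting avoids recomputing it.
df.persist(StorageLevel.MEMORY_AND_DISK)
val Array(train, test) = df.randomSplit(Array(0.7, 0.3), seed = 42L)
// ... fit on train, evaluate on test ...
df.unpersist()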

20

Q&A

• Thank you!

• Useful links

• Master - spark://ip:7077, e.g., spark://b2.lxd:7077

• Cluster - http://ip:8080/

• Spark UI - http://ip:4040/

• https://spark.apache.org/docs/latest/programming-guide.html

• http://spark.apache.org/docs/latest/submitting-applications.html; package a jar with sbt package

21

Backup slides

22

Training time vs. # executors

23

[Figure: training time (s) and test accuracy vs. configuration (local, 1-5 executors)]

Spark UI

24

Jobs timeline

Spark UI

25

Executor summary

Numerical stability

26
