Machine learning at scale with Apache Spark


Martin Zapletal @zapletal_martin
Cake Solutions @cakesolutions

Machine learning at scale with Apache Spark

Scaling computation

● Analytics tools with poor scalability and integration
● Manual processes
● Slow iterations
● Not suitable for large amounts of data

● We want fast iteration, reliability, integration

● Serial implementation
● Parallel
● GPUs
● Distributed

Scaling neural networks

Perceptron

● Basic building block of neural networks

a = f(Σ(yᵢ · wᵢ) + b), where yᵢ are the inputs, wᵢ the weights, b the bias and f the activation function (see the sketch below)
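A minimal sketch of that activation in Scala; the weights, bias and the choice of a sigmoid activation below are illustrative assumptions, not values from the slides:

// Perceptron activation: a = f(sum(y_i * w_i) + b).
// Weights, bias and sigmoid are illustrative assumptions.
object Perceptron {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def activate(inputs: Seq[Double], weights: Seq[Double], bias: Double): Double = {
    val z = inputs.zip(weights).map { case (y, w) => y * w }.sum + bias
    sigmoid(z)
  }

  def main(args: Array[String]): Unit =
    // Example: two inputs with hand-picked weights.
    println(activate(Seq(1.0, 0.0), Seq(0.5, -0.3), bias = 0.1))
}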

Artificial neural network

● Network training
○ Many “optimal” solutions
○ Optimization and training techniques - L-BFGS, backpropagation, batch and online gradient descent, Downpour SGD, Sandblaster L-BFGS, … (a gradient descent sketch follows below)
○ Vanishing gradient, amplifying parameters, ...
○ New methods for large networks - deep learning
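For reference, a single batch gradient descent step looks roughly like this. A hedged sketch assuming a linear model with squared-error loss; none of the listed methods is implemented exactly this way:

// One batch gradient descent step for a linear model with squared-error loss.
// The learning rate and loss choice are illustrative assumptions.
def gradientDescentStep(
    weights: Array[Double],
    data: Seq[(Array[Double], Double)], // (features, label)
    learningRate: Double): Array[Double] = {
  val gradient = Array.fill(weights.length)(0.0)
  data.foreach { case (x, y) =>
    val prediction = x.zip(weights).map { case (xi, wi) => xi * wi }.sum
    val error = prediction - y
    for (i <- weights.indices) gradient(i) += error * x(i)
  }
  // Move against the averaged gradient.
  weights.zip(gradient).map { case (w, g) => w - learningRate * g / data.size }
}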

XOR

(Diagram: a small feed-forward network trained to compute XOR, annotated with its learned weights and biases.)

Output 2.613296075440797E-4 for input Vector(0, 0)
Output 0.9989222606269823 for input Vector(0, 1)
Output 0.9995952194411893 for input Vector(1, 0)
Output 4.0074182099155245E-7 for input Vector(1, 1)

Scaling computation

● Different programming models, different languages, different levels

● Sequential
○ R, Matlab, Python, Scala
● Parallel
○ Theano, Torch, Caffe, TensorFlow, Deeplearning4j

(Chart: elapsed times for 20 PageRank iterations [3, 4])

Machine learning

● Linear algebra
● Vectors, matrices, vector spaces, matrix transformations, eigenvectors/eigenvalues
● Many machine learning algorithms are optimization problems
● Goal is to solve them in reasonable (bounded) time
● Goal is not always to find the best possible model (data size and feature engineering vs. algorithm/model complexity)
● Goal is to solve them reliably, at scale, support application needs and improve

[5]

Distributed environment

● Asynchronous and unreliable

● CAP theorem
○ Consistency
○ Availability
○ Partition tolerance

Consistency, time and order in distributed systems

● A sequential program always has one total order of operations

● No order guarantees in a distributed system
● At-most-once: messages may be lost
● At-least-once: messages may be duplicated but not lost
● Exactly-once - in practice usually approximated by at-least-once delivery with idempotent handling (see the sketch below)
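A minimal sketch of that approximation: at-least-once delivery combined with an idempotent handler that de-duplicates on a message id. Class and method names are illustrative assumptions:

// At-least-once delivery plus idempotent handling: duplicates are dropped
// by remembering already-processed message ids. Illustrative sketch only.
class IdempotentHandler(process: String => Unit) {
  private val seen = scala.collection.mutable.Set.empty[Long]
  def handle(messageId: Long, payload: String): Unit =
    if (seen.add(messageId)) process(payload) // add returns false for duplicates
}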

Failure in distributed system

● Node failures, network partitions, message loss, split brains, inconsistencies

● Microsoft's data centers: an average failure rate of 5.2 devices and 40.8 links per day, with a median time to repair of approximately five minutes (and a maximum of one week)

● Google, new cluster over one year: five rack issues with 40-80 machines seeing 50 percent packet loss; eight network maintenance events (four of which might cause ~30-minute random connectivity losses); three router failures (resulting in the need to pull traffic immediately for an hour)

● CENIC: 500 isolating network partitions with median durations of 2.7 and 32 minutes and 95th percentiles of 19.9 minutes and 3.7 days, respectively, for software and hardware problems

[6]

Failure in distributed system

● MongoDB: a partition separated a primary from its two secondaries. Two hours later the old primary rejoined and rolled back everything written on the new primary.

● A network partition isolated the Redis primary from all secondaries. Every API call caused the billing system to recharge customer credit cards automatically, resulting in 1.1 percent of customers being overbilled over a period of 40 minutes.

● A partition caused inconsistency in the MySQL database. Because foreign key relationships were not consistent, GitHub showed private repositories on the wrong users' dashboards and incorrectly routed some newly created repositories.

● For several seconds, Elasticsearch is happy to believe two nodes in the same cluster are both primaries, will accept writes on both of those nodes, and later discard the writes to one side.

● RabbitMQ lost ~35% of acknowledged writes under those conditions.
● Redis threw away 56% of the writes it told us succeeded.
● In Riak, last-write-wins resulted in dropping 30-70% of writes, even with the strongest consistency settings.
● MongoDB “strictly consistent” reads see stale versions of documents, but they can also return garbage data from writes that never should have occurred.

[6]

Algorithm parallelization

(Diagram: input data partitioned across multiple parallel computations.)

Algorithm parallelization

[7]

Neural network parallelism

[8]

import tensorflow as tf

# Weight initialisation: small random normal values.
def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

# A single-hidden-layer network: sigmoid hidden layer, linear output (logits).
def model(X, w_h, w_o):
    h = tf.nn.sigmoid(tf.matmul(X, w_h))
    return tf.matmul(h, w_o)

# MNIST-sized placeholders: 784 input pixels, 10 output classes.
X = tf.placeholder("float", [None, 784])
Y = tf.placeholder("float", [None, 10])

w_h = init_weights([784, 625])
w_o = init_weights([625, 10])

py_x = model(X, w_h, w_o)

# Softmax cross-entropy cost and plain gradient descent (TensorFlow 0.x API).
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)
predict_op = tf.argmax(py_x, 1)

sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

sess.run(train_op, …)
sess.run(predict_op, …)

[9, 10]

Model parallelism

[11]

(Diagram: one model partitioned across machines 1-4, each machine holding a part of the network.)

Data parallelism

[11]

(Diagram: model replicas, each trained on a different shard of the data.)

Parameter server

● Model and data parallelism
● Failures and slow machines
● Additional stochasticity due to asynchrony (relaxed consistency, out-of-date parameters, ordering not guaranteed, …)

(A synchronous data-parallel sketch follows below.)

[11]
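The data-parallel half of this picture can be approximated on Spark by computing gradients per partition and combining them on the driver, which then plays the role of a synchronous parameter server. A hedged sketch, not the Downpour SGD or Sandblaster implementations from [11]; the gradient function is an assumed input:

// Synchronous data-parallel gradient descent over an RDD: each partition
// computes partial gradients, the driver averages them and updates the
// broadcast weights. Sketch only.
import org.apache.spark.rdd.RDD

def trainDataParallel(
    data: RDD[(Array[Double], Double)],
    initial: Array[Double],
    iterations: Int,
    learningRate: Double,
    gradient: (Array[Double], (Array[Double], Double)) => Array[Double]): Array[Double] = {
  var weights = initial
  for (_ <- 1 to iterations) {
    val bcWeights = data.sparkContext.broadcast(weights)
    val (gradSum, count) = data
      .map(example => (gradient(bcWeights.value, example), 1L))
      .reduce { case ((g1, n1), (g2, n2)) =>
        (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)
      }
    weights = weights.zip(gradSum).map { case (w, g) => w - learningRate * g / count }
  }
  weights
}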

Examples

“Their network for face detection from YouTube comprised millions of neurons and 1 billion connection weights. They trained it on a dataset of 10 million 200x200 pixel RGB images to learn 20,000 object categories. The training simulation ran for three days on a cluster of 1,000 servers totaling 16,000 CPU cores. Each instantiation of the network spanned 170 servers.”

Google.

“We demonstrate near-perfect weak scaling on a 16 rack IBM Blue Gene/Q (262144 CPUs, 256 TB memory), achieving an unprecedented scale of 256 million neurosynaptic cores containing 65 billion neurons and 16 trillion synapses“

TrueNorth, part of project IBM SyNAPSE. [11, 12]

Examples

[13]

Architecture

(Diagram: training and testing pipelines - Data → Preprocessing → Features → Training / Testing → Error %.)

Data processing pipeline

● Whole lifecycle of data

● Data processing
● Data stores
● Integration
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning

[14]

CQRS

(Diagram: clients send Commands to the write-side DB; a denormalise/precompute step builds the read-side DB that serves Queries.)

Kappa architecture

(Diagram: Batch-Pipeline - all your data lands in HDFS via Flume/Sqoop and is processed with Hive, Impala and Oozie.)

(Diagram: Kappa architecture - all your data flows through Kafka into a stream processor such as Spark, which maintains views in SQL/NoSQL stores queried by clients.)

Lambda Architecture

(Diagram: Lambda architecture - all your data is fed to both a batch layer and a fast stream layer; results are merged into a serving layer/serving DB that answers queries.)

[15, 16]

Apache Spark

Apache Spark

● In-memory dataflow distributed data processing framework for streaming and batch

● Distributes computation using a higher-level API
● Load balancing
● Moves computation to data
● Fault tolerant

Spark distributed programming model

● Resilient Distributed Datasets
● Fault tolerance
● Caching
● Serialization
● Transformations
○ Lazy, form the DAG
○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ...
● Actions
○ Execute the DAG, retrieve the result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● Accumulators, Broadcast Variables (see the sketch below)
● SQL
● Integration
● Streaming
● Machine Learning
● Graph Processing
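A small sketch of the broadcast variables and accumulators mentioned above, against the Spark 1.x RDD API used throughout these slides; the lookup table and exercise names are made up:

// Broadcast a read-only lookup table to all executors and count
// unmatched records with an accumulator. Illustrative sketch only.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-accumulators").setMaster("local[4]"))
val lookup = sc.broadcast(Map("bench press" -> 0, "bicep curl" -> 1))
val misses = sc.accumulator(0, "unknown exercises")
val encoded = sc.parallelize(Seq("bench press", "dead lift", "bicep curl"))
  .map { name =>
    lookup.value.get(name) match {
      case Some(id) => id
      case None     => misses += 1; -1
    }
  }
encoded.collect().foreach(println)
println(s"unknown exercises: ${misses.value}")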

Distributed computation

● Spark streaming
● Computing, processing, transforming, analytics

[17]

(DAG: textFile → map → map → reduceByKey → collect)

sc.textFile("counts")
  .map(line => line.split("\t"))
  .map(word => (word(0), word(1).toInt))
  .reduceByKey(_ + _)
  .collect()

[18]

RDD

Graph lineage

● Master and worker failures

val data2a = data2
  .map(x => x.label -> x.features)

val dataa = data
  .map(x => x.label -> x.features)
  .union(data2a)
  .cache()

val data3a = data3
  .map(x => x.label -> x.features)

val datab = dataa
  .join(data3a, 4)
  .cache()
  .mapPartitions(it => it.map(x => x._1 + 1 -> x._2))
  .groupByKey(4)
  .reduceByKey((it1, it2) => it1 ++ it2)
  .collect()

Optimizations

● Multiple phases
● Catalyst (query plans can be inspected with explain(), as sketched below)

[19]
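Catalyst's phases can be inspected directly: explain(true) prints the parsed, analysed and optimised logical plans and the physical plan for a DataFrame query. A small sketch, assuming an existing SparkContext sc and illustrative column names:

// Show Catalyst's plans for a simple query; filtering and column pruning
// are applied by the optimizer before execution. Illustrative data.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val sessions = Seq(("bench press", 10), ("bicep curl", 12), ("dead lift", 8))
  .toDF("exercise", "intensity")
sessions
  .filter($"intensity" > 9)
  .select($"exercise")
  .explain(true) // prints parsed, analysed, optimised logical plans and the physical plan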

Optimizations

[20]

(Diagram: Spark master, Spark workers and Cassandra nodes.)

Optimizations

● CPU and memory bottlenecks, not IO
● Project Tungsten
○ Explicit memory management and binary processing
○ Cache-aware computation
○ Code generation
● Daytona GraySort 100TB benchmark won by Apache Spark
○ Optimized memory layout, shuffle algorithm, ...

[20]

MLlib

● Data types
● Basic statistics (see the sketch below)
○ summary statistics, correlations, stratified sampling, hypothesis testing, streaming significance testing, random data generation
● Classification and regression
○ SVMs, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression, multilayer perceptron classifier, one-vs-rest classifier, survival regression
● Collaborative filtering
○ alternating least squares (ALS)
● Clustering
○ k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), bisecting k-means, streaming k-means
● Dimensionality reduction
○ singular value decomposition (SVD), principal component analysis (PCA)
● Feature extraction and transformation
○ TF-IDF, word2vec, normalizers, scaling
● Frequent pattern mining
○ FP-growth, association rules, PrefixSpan
● Evaluation metrics
● PMML model export
● Optimization (developer)
○ stochastic gradient descent, limited-memory BFGS (L-BFGS)
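As a small example of the basic statistics entry above, MLlib's Statistics.colStats computes column-wise summary statistics over an RDD of vectors. A sketch with made-up data, assuming an existing SparkContext sc:

// Column-wise summary statistics over an RDD[Vector]; the data is made up.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))
val summary = Statistics.colStats(observations)
println(summary.mean)        // column-wise means
println(summary.variance)    // column-wise variances
println(summary.numNonzeros) // column-wise non-zero counts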

Example application

Muvr

[21]

7 × Dumbbell Alternating Bicep Curl

Muvr architecture

Reactive

● Responsive
● Resilient
● Elastic
● Message driven

Muvr

● Classify finished (and in-progress) exercises
● Gather data for improved classification
● Predict next exercises
● Predict weights, intensity
● Design a schedule of exercises and improvements (personal trainer)
● Monitor exercise quality

Scaling model training

// Train a multilayer perceptron classifier with Spark ML.
val sc = new SparkContext("local[4]", "NN")
val data = ...
val layers = Array[Int](inputSize, 250, 50, outputSize)
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
val model = trainer.fit(data)

// Print the predictions (collected to the driver) and evaluate precision.
val result = model.transform(data)
result.select("prediction").collect().foreach(println)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))

Scaling model training

● Deeplearning4j, Neon, TensorFlow on Spark

(Diagram: models 1, 2 and 3 trained in parallel; the best model is selected.)

# Imports added here assume Nervana Neon's (v1.x) module layout.
from neon.backends import gen_backend
from neon.initializers import Uniform, Constant
from neon.layers import Conv, Pooling, Affine, Dropout, GeneralizedCost
from neon.models import Model
from neon.optimizers import GradientDescentMomentum
from neon.transforms import Rectlin, Logistic, CrossEntropyMulti

# Two convolution + pooling blocks followed by fully connected layers.
init_norm = Uniform(low=-0.1, high=0.1)
bias_init = Constant(val=1.0)

layers = []
layers.append(Conv(
    fshape=(1, 3, 16),
    init=init_norm,
    bias=bias_init,
    activation=Rectlin()))
layers.append(Pooling(
    op="max",
    fshape=(2, 1),
    strides=2))
layers.append(Conv(
    fshape=(1, 3, 32),
    init=init_norm,
    bias=bias_init,
    activation=Rectlin()))
layers.append(Pooling(
    op="max",
    fshape=(2, 1),
    strides=2))
layers.append(Affine(
    nout=100,
    init=init_norm,
    bias=bias_init,
    activation=Rectlin()))
layers.append(Dropout(
    name="do_2",
    keep=0.9))
layers.append(Affine(
    nout=dataset.num_labels,
    init=init_norm,
    bias=bias_init,
    activation=Logistic()))

return Model(layers=layers)

# Training setup: CPU backend, cross-entropy cost, SGD with momentum.
backend = gen_backend(
    backend='cpu',
    batch_size=self.batch_size,
    rng_seed=self.random_seed,
    stochastic_round=False)
# backend = gen_backend(rng_seed=0, gpu='cudanet')

cost = GeneralizedCost(
    name='cost',
    costfunc=CrossEntropyMulti())
optimizer = GradientDescentMomentum(
    learning_rate=self.lrate,
    momentum_coef=0.9)

model.fit(
    dataset.train(),
    optimizer=optimizer,
    num_epochs=self.max_epochs,
    cost=cost,
    callbacks=callbacks)

sc \
    .cassandraTable(conf["cassandra"]["data_keyspace"], conf["cassandra"]["data_table"]) \
    .select("user_id", "model_id", "file_name", "time", "x", "y", "z", "exercise") \
    .spanBy("user_id", "model_id") \
    .map(train_model_for_user) \
    .saveToCassandra(conf["cassandra"]["model_keyspace"], conf["cassandra"]["model_table"])

[22]

// Per-user model selection with a Spark ML pipeline and grid search.
val events = sc.eventTable().cache().toDF()
val lr = new LinearRegression()
val pipeline = new Pipeline().setStages(Array(
  new UserFilter(), new ZScoreNormalizer(),
  new IntensityFeatureExtractor(), lr))
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept, Array(true, false))
  .build()

getEligibleUsers(events, sessionEndedBefore)
  .map { user =>
    val trainValidationSplit = new TrainValidationSplit()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)
    val model = trainValidationSplit.fit(
      events,
      ParamMap(ParamPair(userIdParam, user)))
    val testData = // Prepare test data.
    val predictions = model.transform(testData)
    submitResult(user, predictions, config)
  }

Queries and analytics

// Three equivalent formulations: Spark SQL, the DataFrame API and plain RDDs.
val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val exerciseDeviations = events
  .filterClass[EntireResistanceExerciseSession]
  .flatMap(_.deviations)

val deviationsFrequency = sqlContext.sql(
  """SELECT planned.exercise, hour(time), COUNT(1)
     FROM exerciseDeviations
     WHERE planned.exercise = 'bench press'
     GROUP BY planned.exercise, hour(time)""")

val deviationsFrequency2 = exerciseDeviationsDF
  .where(exerciseDeviationsDF("planned.exercise") === "bench press")
  .groupBy(
    exerciseDeviationsDF("planned.exercise"),
    exerciseDeviationsDF("time"))
  .count()

val deviationsFrequency3 = exerciseDeviations
  .filter(_.planned.exercise == "bench press")
  .groupBy(d => (d.planned.exercise, d.time.getHours))
  .map(d => (d._1, d._2.size))

Clustering

def toVector(user: User): mllib.linalg.Vector =
  Vectors.dense(
    user.frequency,
    user.performanceIndex,
    user.improvementIndex)

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val users: RDD[User] = events.filterClass[User]

val kmeans = new KMeans()
  .setK(5)
  .set...
val clusters = kmeans.run(users.map(toVector))

Recommendations

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val ratings = events
  .filterClass[EntireResistanceExerciseSession]
  .flatMap(session =>
    session.sets.flatMap(set =>
      set.sets.map(
        exercise => (session.id.id, exercise.exercise))))
  .groupBy(e => e)
  .map(g =>
    Rating(normalize(g._1._1), normalize(g._1._2),
      normalize(g._2.size)))

val model = new ALS().run(ratings)
val predictions = model.predict(recommend)

(Sparse user × exercise rating matrix over bench press, bicep curl and dead lift; each user has rated only some exercises - user 1: 5, 2; user 2: 4, 3; user 3: 5, 2; user 4: 3, 1 - and the missing entries are predicted.)

Graph analysis

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val connections = events.filterClass[Connections]

val vertices: RDD[(VertexId, Long)] =
  connections.map(c => (c.id, 1L))
val edges: RDD[Edge[Long]] = connections
  .flatMap(c => c.connections
    .map(Edge(c.id, _, 1L)))

val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.0001).vertices

Conclusions

● Scaling systems, data pipelines and machine learning

● Reactive
○ Elasticity
○ Resilience
○ Responsiveness
○ Message driven

Questions

Thank you

● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Martin Zapletal @zapletal_martin

References

[1] http://arxiv.org/abs/1112.6209

[2] SuperComputing 2012, part of the IBM SyNAPSE project

[3] http://www.csie.ntu.edu.tw/~cjlin/talks/twdatasci_cjlin.pdf

[4] http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/

[5] https://www.tensorflow.org/versions/master/tutorials/mnist/beginners/index.html

[6] https://queue.acm.org/detail.cfm?id=2655736

[7] http://fa.bianp.net/blog/2013/isotonic-regression/

[8] http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5

[9] https://github.com/nlintz/TensorFlow-Tutorials/blob/master/3_net.py

[10] https://www.tensorflow.org/

[11] http://static.googleusercontent.com/media/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf

[12] https://www.quora.com/How-big-is-the-largest-feedforward-neural-network-ever-trained-and-what-for

[13] http://static.googleusercontent.com/media/research.google.com/en//archive/unsupervised_icml2012.pdf

[14] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/

[15] http://malteschwarzkopf.de/research/assets/google-stack.pdf

[16] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf

[17] https://twitter.com/tsantero/status/695013012525060097

[18] http://www.slideshare.net/LisaHua/spark-overview-37479609

[19] https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

[20] https://kayousterhout.github.io/trace-analysis/

[21] https://github.com/muvr

[22] https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html

Twitter: @cakesolutions
Tel: 0845 617 1200

Email: enquiries@cakesolutions.net
