Post on 19-Jan-2017

Real Time Machine Learning Visualization with Spark

Chester Chen, Director of Engineering, Alpine Data

March 13, 2016

COMPANY CONFIDENTIAL

Who am I?
• Director of Engineering at Alpine Data
• Founder and Organizer of SF Big Analytics Meetup (3500+ members)
• Previous Employment:
  – Architect / Director at Tinga, Symantec, AltaVista, Ascent Media, ClearStory Systems, WebWare
• Experience with Spark
  – Exposed to Spark since Spark 0.6
  – Architect for Alpine Spark Integration on Spark 1.1, 1.3 and 1.5.x
• Hadoop Distribution
  – CDH, HDP and MapR

Alpine Data at a Glance
Enterprise Scale Predictive Analytics with deep experience in Machine Learning, Data Science, and Distributed Data Architectures

Industry Innovations and IP
• Broad patents awarded for in-cluster and in-database machine learning (2012)
• First web-based solution for end-to-end predictive analytics (2012)
• Created industry-first integrated Analytics Services Platform (2013)
• First predictive analytics solution to be certified on Spark (2014)
• Launched Touchpoints, industry-first predictive applications service layer (2015)

Global brand names in Financial Services, Telco/Media, Healthcare, Manufacturing, Public Sector and Retail
Visionary in the Gartner Magic Quadrant for Advanced Analytics


Real Time ML Visualization with Spark: What is Spark?

Lightning-fast cluster computing
http://spark.apache.org/

Iris data set, K-Means clustering with K=3

[Figure: scatter plot of sepal width vs. petal length, showing Cluster 0, Cluster 1, Cluster 2 and the centroids]

[Figure: the same clustering, with a "distance" annotation]

What is K-Means?
• Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector,
• k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk}
• The clusters are determined by minimizing the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point in a cluster to that cluster's center. In other words, the objective is to find

  $$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

• where μi is the mean of points in Si.
• https://en.wikipedia.org/wiki/K-means_clustering
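As a rough illustration of this objective, the sketch below runs Lloyd's iteration in plain Scala (no Spark) and records the within-cluster sum of squares after every iteration, which is the quantity a cost-per-iteration plot would track. The object name `KMeansSketch` and all of its functions are made up for this example; they are not part of Spark MLlib or of the code shown in this talk.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Within-cluster sum of squares for the given centers.
  def cost(data: Seq[Point], centers: Seq[Point]): Double =
    data.map(p => sqDist(p, centers(closest(p, centers)))).sum

  // One Lloyd step: assign each point to its nearest center,
  // then move every center to the mean of its assigned points.
  def step(data: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = data.groupBy(p => closest(p, centers))
    centers.indices.map { i =>
      groups.get(i) match {
        case Some(ps) => ps.transpose.map(col => col.sum / ps.size).toVector
        case None     => centers(i) // an empty cluster keeps its old center
      }
    }
  }

  // Runs a fixed number of iterations and returns the cost after each one.
  def costs(data: Seq[Point], init: Seq[Point], iterations: Int): Seq[Double] =
    Iterator.iterate(init)(step(data, _))
      .drop(1)
      .take(iterations)
      .map(cost(data, _))
      .toList
}
```

With the four one-dimensional points 0, 2, 10, 12 and initial centers 0 and 2, `KMeansSketch.costs(data, init, 3)` yields 24.0, 4.0, 4.0: the cost drops and then flattens, which is the kind of signal that suggests training could stop early.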

Visualization: Cost

[Figure: line chart "Cost vs Iteration"; x-axis: iteration (0 to 25), y-axis: cost (34 to 38.5)]

Real Time ML Visualization – Why?
• Use Cases
  – Use visualization to determine whether to end the training early
• Need a way to visualize the training process, including convergence, clustering or residual plots, etc.
• Need a way to stop the training and save the current model
• Need a way to disable or enable the visualization

Real Time ML Visualization with Spark

DEMO

How to Enable Real Time ML Visualization?
• A callback interface for Spark machine learning algorithms to send messages
  – Algorithms decide when and what message to send
  – Algorithms don't care how the message is delivered
• A task channel to handle the message delivery from Spark Driver to Spark Client
  – It doesn't care about the content of the message or who sent the message
• The message is delivered from Spark Client to Browser
  – We use HTML5 Server-Sent Events (SSE) and HTTP Chunked Response (PUSH)
  – Pull is possible, but requires a message queue
• Visualization using JavaScript frameworks Plot.ly and D3

Spark Job in Yarn-Cluster mode

[Diagram 1, components shown: Application Host (Command Line, Rest API, Servlet); Spark Client; Hadoop Cluster with a Yarn container running the Spark Driver (Spark Job, Spark Context, Spark ML algorithm)]

[Diagram 2, components shown: Application Host (Command Line, Rest API, Servlet); Spark Client; Hadoop Cluster; Spark Job containing App Context, Spark ML Algorithms, ML Listener, Message Logger]

[Diagram 3, components shown: Browser; Web/Rest API Server; Application Host; Spark Client; Akka; Hadoop Cluster; Spark Job containing App Context, Spark ML Algorithms, ML Listener, Message Logger]

Enable Real Time ML Visualization

[Diagram, components shown: Browser (Plotly, D3); Web Server / Rest API Server; Spark Client; Hadoop Cluster; Spark Job containing App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener. Connection labels: SSE, Chunked Response, Akka]


Machine Learning Listeners

Callback Interface: ML Listener

trait MLListener {
  def onMessage(message: => Any): Unit
}

Callback Interface: MLListenerSupport

trait MLListenerSupport {
  // rest of code
  def sendMessage(message: => Any): Unit = {
    if (enableListener) {
      listeners.foreach(l => l.onMessage(message))
    }
  }
}
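The by-name parameter (`message: => Any`) means the message is only built if a listener actually reads it. The following self-contained sketch shows that behaviour. The listener list, the `enabled` flag, and the `ListenerDemo` object are assumptions standing in for the elided "rest of code"; only `MLListener`, `onMessage`, `sendMessage`, `addListener`, `enableListener` and `listenerEnabled` appear in the slides.

```scala
import scala.collection.mutable.ListBuffer

trait MLListener {
  def onMessage(message: => Any): Unit
}

// Assumed internals: a mutable listener list and an on/off flag.
trait MLListenerSupport {
  private val listeners = ListBuffer.empty[MLListener]
  @volatile private var enabled = false

  def addListener(l: MLListener): this.type = { listeners += l; this }
  def enableListener(on: Boolean): this.type = { enabled = on; this }
  def listenerEnabled(): Boolean = enabled

  // The by-name argument is not evaluated here; it is passed on
  // unevaluated and only built when a listener reads it.
  def sendMessage(message: => Any): Unit =
    if (enabled) listeners.foreach(l => l.onMessage(message))
}

object ListenerDemo {
  // Returns (how many times the message was built, messages received).
  def run(enabled: Boolean): (Int, List[Any]) = {
    var built = 0
    val received = ListBuffer.empty[Any]
    val algo = new MLListenerSupport {}
    algo.enableListener(enabled).addListener(new MLListener {
      def onMessage(message: => Any): Unit = { received += message; () }
    })
    algo.sendMessage { built += 1; "iteration 1, cost 37.2" }
    (built, received.toList)
  }
}
```

`ListenerDemo.run(false)` never evaluates the message block, so an algorithm pays nothing for building statistics when visualization is switched off; `ListenerDemo.run(true)` builds and delivers it once.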

KMeansExt: KMeans with MLListener

class KMeansExt private (…)
  extends Serializable with Logging with MLListenerSupport {
  ...
}

case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)

private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
  ...
  while (!stopIteration && iteration < maxIterations && !activeRuns.isEmpty) {
    ...
    if (listenerEnabled()) {
      sendMessage(KMeansCoreStats(…))
    }
    ...
  }
}

KMeans Spark Job Setup

val kMeans = new KMeansExt()
  .setK(numClusters)
  .setEpsilon(epsilon)
  .setMaxIterations(maxIterations)
  .enableListener(enableVisualization)
  .addListener(new KMeansListener(...))

appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))

kMeans.run(vectors)

KMeans ML Listener

class KMeansListener(columnNames: List[String],
                     data: RDD[Vector],
                     logger: MessageLogger) extends MLListener {

  // sampling the data

  def onMessage(message: => Any): Unit = message match {
    case coreStats: KMeansCoreStats =>
      // use the KMeans model of the current iteration to predict sample
      // cluster indexes

      // construct a message consisting of sample, cost, iteration and centroids

      // use logger to send the message out
  }
}

ML Task Observer
• Receives commands from the user to update a running Spark job
• Once it receives an UpdateTask command from a notify call, it performs the necessary update operation

trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}

class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger) extends TaskObserver {
  // implement notify
}
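The slides leave both `UpdateTaskCmd` and the body of `notify` open. One possible shape, purely as an assumption, is a small set of commands that flip flags the training loop checks on every iteration. In the sketch below, `StopTraining`, `SetVisualization`, `TrainingControl`, `ControlObserver` and `ObserverDemo` are hypothetical names invented for illustration; only `TaskObserver`, `notify` and `UpdateTaskCmd` come from the talk.

```scala
// Hypothetical command set; the talk does not show how UpdateTaskCmd is defined.
sealed trait UpdateTaskCmd
case object StopTraining extends UpdateTaskCmd
final case class SetVisualization(enabled: Boolean) extends UpdateTaskCmd

trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}

// Minimal stand-in for the flags a running algorithm would check.
class TrainingControl {
  @volatile var stopIteration: Boolean = false
  @volatile var listenerOn: Boolean = true
}

// Translates user commands into flag changes on the running job.
class ControlObserver(control: TrainingControl) extends TaskObserver {
  def notify(task: UpdateTaskCmd): Unit = task match {
    case StopTraining         => control.stopIteration = true
    case SetVisualization(on) => control.listenerOn = on
  }
}

object ObserverDemo {
  // Simulates a training loop that receives `cmd` during iteration `at`
  // and returns how many iterations actually ran.
  def iterationsRun(maxIterations: Int, at: Int, cmd: UpdateTaskCmd): Int = {
    val control = new TrainingControl
    val observer = new ControlObserver(control)
    var iteration = 0
    while (!control.stopIteration && iteration < maxIterations) {
      iteration += 1
      if (iteration == at) observer.notify(cmd)
    }
    iteration
  }
}
```

In this sketch a stop command delivered during iteration 5 of 20 ends the loop after 5 iterations, while toggling visualization leaves the iteration count untouched. In the real system the command would arrive asynchronously over the task channel rather than from inside the loop.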

Logistic Regression MLListener

class LogisticRegression(…) extends MLListenerSupport {
  def train(data: RDD[(Double, Vector)]): LogisticRegressionModel = {
    // initialization code
    val (rawWeights, loss) = OWLQN.runOWLQN(…)
    generateLORModel(…)
  }
}

object OWLQN extends Logging {
  def runOWLQN(/*args*/, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
    val costFun = new CostFun(data, mlSupport, IterationState(), /*other args */)
    val states: Iterator[lbfgs.State] =
      lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector)
    …
  }
}

In the cost function:

override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
  val shouldStop = mlSupport.exists(_.stopIteration)
  if (!shouldStop) {
    …
    // send the current iteration state only if a listener is enabled
    mlSupport.filter(_.listenerEnabled()).foreach { s =>
      s.sendMessage((iState.iteration, w, loss))
    }
    …
  } else {
    …
  }
}


Task Communication Channel

Task Channel: Akka Messaging

[Diagram, components shown: Spark Application; Application Context; Actor System; Messager Actor; Task Channel Actor; SparkContext; Spark tasks. Connection labels: Akka]

[Diagram: the architecture diagram from "Enable Real Time ML Visualization", repeated]

Push To The Browser

HTTP Chunked Response and SSE

[Diagram: the architecture diagram from "Enable Real Time ML Visualization", repeated]

HTML5 Server-Sent Events (SSE)
• Server-Sent Events (SSE) is one-way messaging, from server to browser
  – An event is when a web page automatically gets an update from the server
• Register an event source (JavaScript)
    var source = new EventSource(url);
• The callback: onmessage (the payload arrives in message.data)
    source.onmessage = function(message) {...}
• Data format: each line starts with "data: " and a blank line ends the event
    data: {\n
    data: "key": "value",\n
    data: }\n\n
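To make the wire format concrete, here is a minimal sketch of an encoder that turns a payload into one SSE event. `SseFrame` and `encode` are names chosen for this example, and the sketch assumes the payload uses `\n` line breaks; it covers only the `data:` field, not `event:`, `id:` or `retry:`.

```scala
object SseFrame {
  // Encodes one event: every payload line gets a "data: " prefix,
  // and a trailing blank line tells the browser to dispatch the event.
  def encode(payload: String): String =
    payload.split("\n", -1).map(line => s"data: $line\n").mkString + "\n"
}
```

A multi-line JSON payload therefore becomes several `data:` lines followed by one empty line, and the browser's `EventSource` joins those lines with newlines again before handing them to `onmessage`.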

HTTP Chunked Response
• Spray Rest Server supports Chunked Response

// a blank line ("\n\n") ends each SSE event, so every chunk is terminated with one
val responseStart = HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n\n"))
requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)

val nextChunk = MessageChunk(s"data: $r \n\n")
requestCtx.responder ! nextChunk.withAck(Messages.Ack)

requestCtx.responder ! MessageChunk(s"data: Finished \n\n")
requestCtx.responder ! ChunkedMessageEnd

Push vs. Pull

Push
• Pros
  – The data is streamed (pushed) to the browser via chunked response
  – There is no need for a data queue, but data can be lost if it is not consumed
  – Multiple pages can be pushed at the same time, which allows multiple visualization views
• Cons
  – With a slow network, a slow browser and fast data iterations, the data might all show up in the browser at once, rather than as a nice iteration-by-iteration display
  – If you control the chunked response by network acknowledgement, the visualization may not show up at all, because the data is not pushed while acknowledgements are slow

Push vs. Pull

Pull
• Pros
  – Messages do not get lost, since they can be temporarily stored in the message queue
  – The visualization renders at an even pace
• Cons
  – Need to periodically send requests to the server for updates
  – Needs a message queue to hold messages until they are consumed
  – Hard to support rendering on multiple pages with a simple message queue
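The trade-off in the pull approach can be seen in a minimal in-memory buffer. The sketch below is an assumption about what a "simple message queue" might look like, not the system described in this talk; `PullQueue`, `offer` and `poll` are hypothetical names.

```scala
import scala.collection.mutable

// A simple buffer for the pull approach: the job side offers messages,
// and a page drains them whenever it polls.
class PullQueue[A] {
  private val queue = mutable.Queue.empty[A]

  // Called for every training message; nothing is dropped while no page polls.
  def offer(message: A): Unit = synchronized { queue.enqueue(message); () }

  // Called when a page polls: returns and removes everything buffered so far.
  def poll(): List[A] = synchronized {
    val drained = queue.toList
    queue.clear()
    drained
  }
}
```

Messages survive until the next poll, which matches the "does not get lost" point. But because `poll()` removes what it returns, a second page polling right after the first sees nothing, which is why supporting several views would need something beyond this, such as a cursor per page.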

Visualization: Plot.ly + D3

[Figures: an Alpine workflow, and plots titled "Cost vs. Iteration" (shown twice), "ArrTime vs. Distance" and "ArrTime vs. DepTime"]

Use Plot.ly to render graph

function showCost(dataParsed) {
  var costTrace = { … };
  var data = [ costTrace ];
  var costLayout = {
    xaxis: {…},
    yaxis: {…},
    title: …
  };
  Plotly.newPlot('cost', data, costLayout);
}

Real Time ML Visualization: Summary
• Training a machine learning model involves a lot of experimentation, so we need a way to visualize the training process.
• We presented a system to enable real time machine learning visualization with Spark:
  – Gives visibility into the training of a model
  – Allows us to monitor the convergence of the algorithms during training
  – Can stop the iterations when convergence is good enough

Thank You

Chester Chen
chester@alpinenow.com

LinkedIn: https://www.linkedin.com/in/chester-chen-3205992
SlideShare: http://www.slideshare.net/ChesterChen/presentations
Demo video: https://youtu.be/DkbYNYQhrao