Apache Spark Machine Learning Decision Trees

© 2016 MapR Technologies 10-1 © 2016 MapR Technologies Machine Learning with Apache Spark Carol McDonald, Solution Architect Strata + Hadoop World March 2016

Upload: carol-mcdonald

Post on 14-Jan-2017


Page 1: Apache Spark Machine Learning Decision Trees


Machine Learning with Apache Spark
Carol McDonald, Solution Architect
Strata + Hadoop World, March 2016

Page 2: Apache Spark Machine Learning Decision Trees


Page 3: Apache Spark Machine Learning Decision Trees


Agenda

• Brief overview of
– Classification
– Clustering
– Collaborative Filtering

• Predicting Flight Delays using a Decision Tree

Page 4: Apache Spark Machine Learning Decision Trees


What is MLlib?

Spark SQL
• Structured data
• Querying with SQL/HQL
• DataFrames

Spark Streaming
• Processing of live streams
• Micro-batching

MLlib
• Machine learning
• Multiple types of ML algorithms

GraphX
• Graph processing
• Graph-parallel computations

Spark Core
• RDD transformations and actions
• Task scheduling
• Memory management
• Fault recovery
• Interacting with storage systems

Page 5: Apache Spark Machine Learning Decision Trees


MLlib Algorithms and Utilities

Algorithms and Utilities | Description
Basic statistics | Includes summary statistics, correlations, hypothesis testing, random data generation
Classification and regression | Includes methods for linear models, decision trees and Naïve Bayes
Collaborative filtering | Supports model-based collaborative filtering using the alternating least squares (ALS) algorithm
Clustering | Supports K-means clustering
Dimensionality reduction | Supports dimensionality reduction on the RowMatrix class; singular value decomposition (SVD) and principal component analysis (PCA)
Feature extraction and transformation | Contains several classes for common feature transformations

Page 6: Apache Spark Machine Learning Decision Trees


Examples of ML Algorithms

Machine Learning

Supervised
• Classification
– Naïve Bayes
– SVM
– Random Decision Forests
• Regression
– Linear
– Logistic

Unsupervised
• Clustering
– K-means
• Dimensionality reduction
– Principal Component Analysis
– SVD

Page 9: Apache Spark Machine Learning Decision Trees


Three Categories of Techniques for Machine Learning

• Classification
• Clustering
• Collaborative Filtering (Recommendation)

Page 10: Apache Spark Machine Learning Decision Trees


Machine Learning: Classification

• Identifies the category for an item

Page 11: Apache Spark Machine Learning Decision Trees


Classification: Definition

Form of ML that:
• Identifies which category an item belongs to
• Uses supervised learning algorithms
– Data is labeled

Examples pictured: fraud, sentiment

Page 12: Apache Spark Machine Learning Decision Trees


If It Walks/Swims/Quacks Like a Duck … Then It Must Be a Duck

Features: walks, swims, quacks
Labels: ducks, not ducks

Page 13: Apache Spark Machine Learning Decision Trees


Building and Deploying a Classifier Model

(Image reference: O’Reilly, Learning Spark)

Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"

Page 14: Apache Spark Machine Learning Decision Trees


Building and Deploying a Classifier Model

(Image reference: O’Reilly, Learning Spark)

Training Data -> Featurization -> Feature Vectors

Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"
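
Featurization, the step that turns each message into a numeric feature vector, can be sketched in plain Scala without Spark. This is only one possible approach, assuming a bag-of-words count with the hashing trick; the function name `featurize` and the vector size of 16 are illustrative choices, not taken from the slides.

```scala
// Map a message to a fixed-length vector of word counts (hashing trick).
// Each word is hashed to a slot; collisions simply add up in the same slot.
def featurize(text: String, numFeatures: Int): Array[Double] = {
  val vec = Array.fill(numFeatures)(0.0)
  text.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { word =>
    // floorMod keeps the index non-negative even for negative hash codes
    val idx = Math.floorMod(word.hashCode, numFeatures)
    vec(idx) += 1.0
  }
  vec
}

val spamVector = featurize("free money now!", 16)
val hamVector = featurize("how are you?", 16)
```

Every message ends up as a vector of the same length, which is what a training algorithm needs regardless of how long the original text was.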

Page 15: Apache Spark Machine Learning Decision Trees


Building and Deploying a Classifier Model

(Image reference: O’Reilly, Learning Spark)

Training Data -> Featurization -> Feature Vectors -> Training -> Model

Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"

Page 16: Apache Spark Machine Learning Decision Trees


Building and Deploying a Classifier Model

(Image reference: O’Reilly, Learning Spark)

Training Data -> Featurization -> Feature Vectors -> Training -> Model -> Model Evaluation -> Best Model

Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"

Page 17: Apache Spark Machine Learning Decision Trees


Machine Learning: Clustering

Page 18: Apache Spark Machine Learning Decision Trees


Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity

Page 19: Apache Spark Machine Learning Decision Trees


Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity
– Search results grouping
– Grouping of customers
– Anomaly detection
– Text categorization

Page 20: Apache Spark Machine Learning Decision Trees


Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm

1. Initialize the coordinates of the cluster centers (centroids)
2. Assign all points to the nearest centroid
3. Update each centroid to the center of its assigned points
4. Repeat until the stopping conditions are met
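
The four steps above can be sketched in plain Scala, without Spark, to make the loop concrete. This is a minimal illustration and not MLlib's implementation: the names `kMeans`, `sqDist` and `mean`, the 2-D points, the starting centroids and the stopping rule (centroids stop moving, or `maxIter` is reached) are all assumptions made for the example.

```scala
type Point = (Double, Double)

// Squared Euclidean distance between two points
def sqDist(a: Point, b: Point): Double = {
  val dx = a._1 - b._1
  val dy = a._2 - b._2
  dx * dx + dy * dy
}

// Mean of a non-empty group of points
def mean(ps: Seq[Point]): Point =
  (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

def kMeans(points: Seq[Point], initial: Seq[Point], maxIter: Int): Seq[Point] = {
  var centroids = initial                 // step 1: initial centroids
  var changed = true
  var iter = 0
  while (changed && iter < maxIter) {     // step 4: repeat until stable or maxIter
    // step 2: group points by the index of their nearest centroid
    val assigned = points.groupBy(p => centroids.indices.minBy(i => sqDist(p, centroids(i))))
    // step 3: move each centroid to the mean of its points (unchanged if it has none)
    val updated = centroids.indices.map(i => assigned.get(i).map(mean).getOrElse(centroids(i)))
    changed = updated != centroids
    centroids = updated
    iter += 1
  }
  centroids
}

// Hypothetical data: two visibly separate groups of three points each
val points = Seq((1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0))
val centers = kMeans(points, initial = Seq((0.0, 0.0), (10.0, 10.0)), maxIter = 20)
```

With this data the loop settles after two passes, with one centroid at the mean of each group.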

Page 21: Apache Spark Machine Learning Decision Trees


Three Categories of Techniques for Machine Learning

• Classification
• Clustering
• Collaborative Filtering (Recommendation)

Page 22: Apache Spark Machine Learning Decision Trees


Collaborative Filtering with Spark
• Recommend items
– (Filtering)
• Based on user preferences data
– (Collaborative)

User Item Rating Matrix

        A   B   C
Ted     4   5   5
Carol       5   5
Bob         5   ?

Page 23: Apache Spark Machine Learning Decision Trees


Train a Model to Make Predictions

• Ted and Carol like movies B and C
• Bob likes movie B, what might he like?
• Bob likes movie B, predict C

Training Data -> Algorithm -> Model
New Data -> Model -> Predictions

User Item Rating Matrix

        A   B   C
Ted     4   5   5
Carol       5   5
Bob         5   ?
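
To make the "Bob likes movie B, predict C" idea concrete, here is a small plain-Scala sketch over the rating matrix above. Note that it deliberately swaps in a much simpler technique than the ALS algorithm MLlib provides: a similarity-weighted average of other users' ratings. The names `similarity` and `predictRating` are hypothetical.

```scala
// Ratings from the user-item matrix; missing cells are simply absent
val ratings: Map[String, Map[String, Double]] = Map(
  "Ted"   -> Map("A" -> 4.0, "B" -> 5.0, "C" -> 5.0),
  "Carol" -> Map("B" -> 5.0, "C" -> 5.0),
  "Bob"   -> Map("B" -> 5.0)
)

// Cosine similarity over the items both users have rated (0.0 if none in common)
def similarity(u: Map[String, Double], v: Map[String, Double]): Double = {
  val common = (u.keySet intersect v.keySet).toSeq
  if (common.isEmpty) 0.0
  else {
    val dot = common.map(i => u(i) * v(i)).sum
    val normU = math.sqrt(common.map(i => u(i) * u(i)).sum)
    val normV = math.sqrt(common.map(i => v(i) * v(i)).sum)
    dot / (normU * normV)
  }
}

// Similarity-weighted average of the ratings other users gave to the item
def predictRating(user: String, item: String): Option[Double] = {
  val neighbours = for {
    (other, rs) <- ratings.toSeq
    if other != user
    r <- rs.get(item).toSeq
  } yield (similarity(ratings(user), rs), r)
  val weight = neighbours.map(_._1).sum
  if (weight == 0.0) None
  else Some(neighbours.map { case (s, r) => s * r }.sum / weight)
}

val bobOnC = predictRating("Bob", "C")
```

Because Bob shares only movie B with Ted and Carol, both count as equally similar, and the prediction is the plain average of their ratings for C. A matrix this small cannot show the benefit of a model-based method such as ALS; the sketch only illustrates the "users who agree on B probably agree on C" intuition.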

Page 24: Apache Spark Machine Learning Decision Trees


Predict Flight Delays

Page 25: Apache Spark Machine Learning Decision Trees


Use Case: Flight Data
• Predict if a flight is going to be delayed
• Use Decision Tree for prediction
• Used for Classification and Regression
• Represents tree with nodes
• Binary decision at each node
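
The "tree with nodes, binary decision at each node" description can be pictured as a small recursive data structure. The sketch below is plain Scala and is not MLlib's internal representation; the types `Node`, `Leaf`, `Split`, the function `classify`, and the thresholds in the example tree are all hypothetical.

```scala
// A binary decision tree as an algebraic data type
sealed trait Node
case class Leaf(prediction: Double) extends Node
// Go left when features(feature) <= threshold, otherwise go right
case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

// Walk from the root to a leaf, making one binary decision per node
def classify(node: Node, features: Array[Double]): Double = node match {
  case Leaf(p) => p
  case Split(f, t, l, r) =>
    if (features(f) <= t) classify(l, features) else classify(r, features)
}

// Made-up tree: feature 0 = scheduled departure time, feature 1 = day of week (0 to 6)
val tree: Node = Split(0, 1603.0,
  Leaf(0.0),                              // early departures: not delayed
  Split(1, 3.0, Leaf(0.0), Leaf(1.0)))    // late departures: delayed only late in the week

val earlyFlight = classify(tree, Array(900.0, 5.0))
val lateFriday = classify(tree, Array(1700.0, 5.0))
```

Training a tree amounts to choosing the feature and threshold at each `Split`; prediction is just the walk shown in `classify`.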

Page 26: Apache Spark Machine Learning Decision Trees


Flight Data

Page 27: Apache Spark Machine Learning Decision Trees


Parse Input

// Define the schema
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

// Parse one comma-separated line into a Flight
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
    line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}

// Load the file into an RDD
val rdd = sc.textFile("flights.csv")

// Create an RDD of Flight objects
val flightsRDD = rdd.map(parseFlight).cache()
// Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,13.0,385.0,2475))

Page 28: Apache Spark Machine Learning Decision Trees


Building and Deploying a Classifier Model

Training Data -> Featurization -> Feature Vectors

Training Data
• Delayed: Friday, LAX, AA
• Not Delayed: Wednesday, BNA, Delta

Page 29: Apache Spark Machine Learning Decision Trees


Classification Learning Problem - Features

• Label: delayed or not delayed (delayed if the departure delay is greater than 40 minutes)

• Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}

Page 30: Apache Spark Machine Learning Decision Trees


Transform non-numeric features into numeric values

// Create a map of airline -> number
var carrierMap: Map[String, Int] = Map()
var index: Int = 0
flightsRDD.map(flight => flight.carrier).distinct.collect.foreach(x => {
  carrierMap += (x -> index)
  index += 1
})
carrierMap.toString
// String = Map(DL -> 5, US -> 9, AA -> 6, UA -> 4...)

// Create a map of destination airport -> number
var destMap: Map[String, Int] = Map()
var index2: Int = 0
flightsRDD.map(flight => flight.dest).distinct.collect.foreach(x => {
  destMap += (x -> index2)
  index2 += 1
})
destMap.toString
// Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175 ...

// Note: originMap (origin airport -> number), used when the features are defined,
// is not shown on the slide; it is presumably built the same way from flight.origin
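
The same category-to-number mapping can be written without mutable variables. The sketch below uses plain Scala collections and a made-up list of carrier codes standing in for the collected carrier column; `carrierIndex` and `encoded` are illustrative names, not part of the original code.

```scala
// Hypothetical sample of carrier codes, standing in for the collected carrier column
val carriers = Seq("AA", "DL", "AA", "UA", "US", "DL")

// distinct keeps first-seen order; zipWithIndex pairs each category with an integer id
val carrierIndex: Map[String, Int] = carriers.distinct.zipWithIndex.toMap

// Encode each record's carrier as a Double, the type a feature vector holds
val encoded: Seq[Double] = carriers.map(c => carrierIndex(c).toDouble)
```

The ids carry no numeric meaning, which is why the model later has to be told which features are categorical and how many categories each one has.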

Page 31: Apache Spark Machine Learning Decision Trees


Classification Learning Problem - Features

• Label: delayed or not delayed (delayed if the departure delay is greater than 40 minutes)

• Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}

• MLlib data types:
– Vector: contains the feature data points
– LabeledPoint: contains the feature vector and the label

Page 32: Apache Spark Machine Learning Decision Trees


Define the features, Create LabeledPoint with Vector

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Define the features array: the label first, then the eight features
val mlprep = flightsRDD.map(flight => {
  val monthday = flight.dofM.toInt - 1 // category
  val weekday = flight.dofW.toInt - 1 // category
  val crsdeptime1 = flight.crsdeptime.toInt
  val crsarrtime1 = flight.crsarrtime.toInt
  val carrier1 = carrierMap(flight.carrier) // category
  val crselapsedtime1 = flight.crselapsedtime.toDouble
  val origin1 = originMap(flight.origin) // category
  val dest1 = destMap(flight.dest) // category
  val delayed = if (flight.depdelaymins.toDouble > 40) 1.0 else 0.0
  Array(delayed.toDouble, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble,
    crsarrtime1.toDouble, carrier1.toDouble, crselapsedtime1.toDouble,
    origin1.toDouble, dest1.toDouble)
})
mlprep.take(1)
// Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))

// Wrap each array as a LabeledPoint: label = x(0), features = x(1) to x(8)
val mldata = mlprep.map(x => LabeledPoint(x(0), Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8))))
mldata.take(1)
// Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))

Page 33: Apache Spark Machine Learning Decision Trees


Build Model

Split data into:
• Training data RDD (80%)
• Test data RDD (20%)

(Diagram labels: Data, Training Set, Test Set, Build Model)

Page 34: Apache Spark Machine Learning Decision Trees


Split Data

// Randomly split the RDD into training data (80%) and test data (20%)
val splits = mldata.randomSplit(Array(0.8, 0.2))
val trainingData = splits(0).cache()
val testData = splits(1).cache()
testData.take(1)
// Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
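
A random 80/20 split can be sketched in plain Scala to show what happens to each record. This is an illustration under stated assumptions, not Spark's implementation: the helper name `splitRandomly`, the per-record random draw and the fixed seed of 42 are choices made for the example.

```scala
import scala.util.Random

// Send each record to the training set with probability trainFraction, else to the test set.
// A fixed seed makes the split repeatable across runs.
def splitRandomly[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // Draw once per record, then separate by the result of the draw
  val tagged = data.toVector.map(x => (x, rng.nextDouble() < trainFraction))
  (tagged.filter(_._2).map(_._1), tagged.filterNot(_._2).map(_._1))
}

val records = (1 to 1000).toSeq
val split = splitRandomly(records, trainFraction = 0.8, seed = 42L)
val trainingSet = split._1
val testSet = split._2
```

Because each record is drawn independently, the two sets are only approximately 80% and 20% of the data, and every record lands in exactly one of them.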

Page 35: Apache Spark Machine Learning Decision Trees


Build Model

• Training set with labels: build a model

(Diagram labels: Data, Training Set, Test Set, Build Model)

Page 36: Apache Spark Machine Learning Decision Trees


Use Case: Flight Data
• Predict if a flight is going to be delayed
• Use Decision Tree for prediction
• Used for Classification and Regression
• Represents tree with nodes
• Binary decision at each node

Page 37: Apache Spark Machine Learning Decision Trees


Build Model

import org.apache.spark.mllib.tree.DecisionTree

// Set the number of categories for each categorical feature (feature index -> count)
var categoricalFeaturesInfo = Map[Int, Int]()
categoricalFeaturesInfo += (0 -> 31) // dofM: 31 categories
categoricalFeaturesInfo += (1 -> 7) // dofW: 7 categories
categoricalFeaturesInfo += (4 -> carrierMap.size) // number of carriers
categoricalFeaturesInfo += (6 -> originMap.size) // number of origin airports
categoricalFeaturesInfo += (7 -> destMap.size) // number of dest airports

val numClasses = 2
val impurity = "gini"
val maxDepth = 9
val maxBins = 7000

// Call DecisionTree.trainClassifier with the training data; it returns the model
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
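
The `impurity = "gini"` setting names the measure used to judge candidate splits. As a rough illustration of what that measure computes, here is a plain-Scala sketch of Gini impurity, 1 minus the sum of squared class proportions, and of the weighted impurity of a binary split. The function names `gini` and `splitImpurity` are hypothetical, and this is not MLlib's code.

```scala
// Gini impurity of a set of labels: 1 - sum over classes of (class proportion)^2
def gini(labels: Seq[Double]): Double =
  if (labels.isEmpty) 0.0
  else {
    val n = labels.size.toDouble
    1.0 - labels.groupBy(identity).values.toSeq.map(g => math.pow(g.size / n, 2)).sum
  }

// Size-weighted impurity of the two sides of a candidate split; lower is better
def splitImpurity(left: Seq[Double], right: Seq[Double]): Double = {
  val n = (left.size + right.size).toDouble
  if (n == 0) 0.0
  else (left.size / n) * gini(left) + (right.size / n) * gini(right)
}

val mixed = gini(Seq(0.0, 0.0, 1.0, 1.0))                      // evenly mixed labels
val cleanSplit = splitImpurity(Seq(0.0, 0.0), Seq(1.0, 1.0))   // each side is pure
val uselessSplit = splitImpurity(Seq(0.0, 1.0), Seq(0.0, 1.0)) // each side is still mixed
```

A split that separates delayed from non-delayed flights perfectly scores 0, while one that leaves both sides evenly mixed scores the same as not splitting at all.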

Page 38: Apache Spark Machine Learning Decision Trees


Build Model

// Print out the decision tree
model.toDebugString
// 0=dofM 4=carrier 3=crsarrtime1 6=origin
// res20: String = DecisionTreeModel classifier of depth 9 with 919 nodes
//   If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,30.0})
//    If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
//     If (feature 3 <= 1603.0)
//      If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
//       If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...

Page 39: Apache Spark Machine Learning Decision Trees


Get Predictions

(Diagram: Test Data without label -> Model -> Predict Delay or Not)

Page 40: Apache Spark Machine Learning Decision Trees


Get Predictions

// Get predictions: create an RDD of (test label, test prediction)
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
labelAndPreds.take(1)
// (label, prediction)
// Array((0.0,0.0))

Page 41: Apache Spark Machine Learning Decision Trees


Test Model

// Get instances where label != prediction
val wrongPrediction = labelAndPreds.filter { case (label, prediction) => label != prediction }

val wrong = wrongPrediction.count()
// res35: Long = 11040

val ratioWrong = wrong.toDouble / testData.count()
// ratioWrong: Double = 0.3157443157443157
// (about 31.6% of the test predictions are wrong)
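
The error ratio is one number; it can help to break the wrong predictions down by type. The plain-Scala sketch below works on a small, made-up list of (label, prediction) pairs standing in for the collected results, and computes the error rate together with precision and recall for the "delayed" class. All values and names here are illustrative and are not results from the flight data.

```scala
// Hypothetical (label, prediction) pairs; 1.0 = delayed, 0.0 = not delayed
val pairs = Seq((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0),
                (0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 0.0))

// Same measure as ratioWrong: share of pairs where label and prediction differ
val wrongCount = pairs.count { case (label, pred) => label != pred }
val errorRate = wrongCount.toDouble / pairs.size
val accuracy = 1.0 - errorRate

// Confusion-matrix cells, treating 1.0 (delayed) as the positive class
val truePos = pairs.count { case (l, p) => l == 1.0 && p == 1.0 }
val falsePos = pairs.count { case (l, p) => l == 0.0 && p == 1.0 }
val falseNeg = pairs.count { case (l, p) => l == 1.0 && p == 0.0 }

// Precision: of the flights predicted delayed, how many really were
val precision = truePos.toDouble / (truePos + falsePos)
// Recall: of the flights that really were delayed, how many were caught
val recall = truePos.toDouble / (truePos + falseNeg)
```

When one class is much rarer than the other, precision and recall can tell a different story from the overall error rate, so it is worth looking at all three.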

Page 42: Apache Spark Machine Learning Decision Trees


To Learn More:
• Download example code
– https://github.com/caroljmcdonald/sparkmldecisiontree
• Read explanation of example code
– https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
• Engage with us!
– https://www.mapr.com/blog/author/carol-mcdonald
– https://community.mapr.com

Page 43: Apache Spark Machine Learning Decision Trees


Q & A

Engage with us!
• @mapr
• mapr-technologies
• https://www.mapr.com/blog/author/carol-mcdonald