Apache Spark Machine Learning: Decision Trees
TRANSCRIPT
© 2016 MapR Technologies
Machine Learning with Apache Spark
Carol McDonald, Solution Architect
Strata + Hadoop World, March 2016
Agenda
• Brief overview of
– Classification
– Clustering
– Collaborative Filtering
• Predicting Flight Delays using a Decision Tree
What is MLlib?
Spark SQL
• Structured data
• Querying with SQL/HQL
• DataFrames
Spark Streaming
• Processing of live streams
• Micro-batching
MLlib
• Machine learning
• Multiple types of ML algorithms
GraphX
• Graph processing
• Graph-parallel computations
Spark Core
• RDD transformations and actions
• Task scheduling
• Memory management
• Fault recovery
• Interacting with storage systems
MLlib Algorithms and Utilities
Algorithms and Utilities | Description
Basic statistics | Includes summary statistics, correlations, hypothesis testing, random data generation
Classification and regression | Includes methods for linear models, decision trees and Naïve Bayes
Collaborative filtering | Supports model-based collaborative filtering using the alternating least squares (ALS) algorithm
Clustering | Supports K-means clustering
Dimensionality reduction | Supports dimensionality reduction on the RowMatrix class: singular value decomposition (SVD) and principal component analysis (PCA)
Feature extraction and transformation | Contains several classes for common feature transformations
Examples of ML Algorithms
Machine Learning
Supervised
• Classification
– Naïve Bayes
– SVM
– Random Decision Forests
• Regression
– Linear
– Logistic
Unsupervised
• Clustering
– K-means
• Dimensionality reduction
– Principal Component Analysis
– SVD
Three Categories of Techniques for Machine Learning
• Classification
• Clustering
• Collaborative Filtering (Recommendation)
Machine Learning: Classification
Identifies category for item
Classification: Definition
Form of ML that:
• Identifies which category an item belongs to
• Uses supervised learning algorithms
– Data is labeled
Examples pictured: fraud, sentiment
If It Walks/Swims/Quacks Like a Duck … Then It Must Be a Duck
Features: walks, swims, quacks
Categories: ducks, not ducks
Building and Deploying a Classifier Model
Image reference: O’Reilly Learning Spark
Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"
Training Data → Featurization → Feature Vectors → Training → Model → Model Evaluation → Best Model
Machine Learning: Clustering
Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity
– Search results grouping
– Grouping of customers
– Anomaly detection
– Text categorization
Clustering: Example
• Group similar objects
• Use the MLlib K-means algorithm
1. Initialize the coordinates of each cluster center (centroid)
2. Assign every point to its nearest centroid
3. Update each centroid to the center (mean) of the points assigned to it
4. Repeat steps 2 and 3 until a stopping condition is met
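The four K-means steps can be sketched in plain Scala, without Spark. This is an illustration only, not MLlib's implementation: the names KMeansSketch, nearest, mean and run are hypothetical, points are limited to two dimensions, and the stopping condition is assumed to be "centroids stop moving or an iteration limit is reached".

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two 2-D points
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to p
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Step 3: mean of a non-empty group of points
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Steps 2 to 4: `centroids` holds the initial guesses from step 1.
  // Stops when the centroids no longer move or maxIter passes have run.
  @annotation.tailrec
  def run(points: Seq[Point], centroids: Vector[Point], maxIter: Int): Vector[Point] = {
    val groups = points.groupBy(p => nearest(p, centroids))
    // A centroid with no assigned points keeps its previous position
    val updated = centroids.indices
      .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
      .toVector
    if (maxIter <= 1 || updated == centroids) updated
    else run(points, updated, maxIter - 1)
  }
}
```

For example, KMeansSketch.run(Seq((0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)), Vector((0.0, 0.0), (10.0, 10.0)), 10) settles on one centroid per pair of nearby points. In MLlib the equivalent work is done on an RDD of vectors by KMeans.train.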
Collaborative Filtering with Spark
• Recommend items
– (Filtering)
• Based on user preferences data
– (Collaborative)
User Item Rating Matrix
User | A | B | C
Ted | 4 | 5 | 5
Carol | - | 5 | 5
Bob | - | 5 | ?
© 2016 MapR Technologies 10-23
Train a Model to Make Predictions
• Ted and Carol like movies B and C
• Bob likes movie B; what might he like?
• Bob likes movie B, so predict C
Training Data → Algorithm → Model
New Data → Model → Predictions
User Item Rating Matrix
User | A | B | C
Ted | 4 | 5 | 5
Carol | - | 5 | 5
Bob | - | 5 | ?
Predict Flight Delays
Use Case: Flight Data
• Predict whether a flight is going to be delayed
• Use a decision tree for prediction
– Used for classification and regression
– Represents the model as a tree of nodes
– Binary decision at each node
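To make "binary decision at each node" concrete, here is a minimal plain-Scala sketch of a tree that is walked from the root to a leaf. It is not MLlib's DecisionTreeModel: the types Node, Leaf and Split, the feature positions and the thresholds are all hypothetical, and the tree is written by hand rather than learned. It also handles only threshold tests, whereas MLlib can additionally split categorical features by set membership.

```scala
object TreeSketch {
  sealed trait Node
  // Leaf: holds the predicted label (1.0 = delayed, 0.0 = not delayed)
  case class Leaf(prediction: Double) extends Node
  // Internal node: binary test "features(feature) <= threshold"
  case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Follow one branch per node until a leaf is reached
  @annotation.tailrec
  def predict(node: Node, features: Array[Double]): Double = node match {
    case Leaf(p) => p
    case Split(f, t, l, r) =>
      if (features(f) <= t) predict(l, features) else predict(r, features)
  }

  // Hypothetical hand-built tree over (scheduled arrival time, weekday index)
  val tree: Node =
    Split(0, 1603.0, Leaf(0.0), Split(1, 5.0, Leaf(0.0), Leaf(1.0)))
}
```

With this made-up tree, TreeSketch.predict(TreeSketch.tree, Array(1700.0, 6.0)) follows the right branch twice and ends at the "delayed" leaf.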
Flight Data
// Define the schema
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

// Parse one comma-separated line into a Flight
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
    line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}

// Load the file into an RDD
val rdd = sc.textFile("flights.csv")
// Create an RDD of Flight objects (named flightsRDD, as used in the later steps)
val flightsRDD = rdd.map(parseFlight).cache()
// Sample record:
// Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,13.0,385.0,2475))
Parse Input
Building and Deploying a Classifier Model
Training Data
• Delayed: Friday, LAX, AA
• Not delayed: Wednesday, BNA, Delta
Training Data → Featurization → Feature Vectors
Classification Learning Problem - Features
• Label: delayed or not delayed (delayed if the departure delay is more than 40 minutes)
• Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
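// Create a map of airline carrier -> number
var carrierMap: Map[String, Int] = Map()
var index: Int = 0
flightsRDD.map(flight => flight.carrier).distinct.collect.foreach(x => {
  carrierMap += (x -> index); index += 1
})
carrierMap.toString
// String = Map(DL -> 5, US -> 9, AA -> 6, UA -> 4 ...)

// Create a map of origin airport -> number
// (not shown on the original slide; built the same way because originMap is used in later steps)
var originMap: Map[String, Int] = Map()
var index1: Int = 0
flightsRDD.map(flight => flight.origin).distinct.collect.foreach(x => {
  originMap += (x -> index1); index1 += 1
})

// Create a map of destination airport -> number
var destMap: Map[String, Int] = Map()
var index2: Int = 0
flightsRDD.map(flight => flight.dest).distinct.collect.foreach(x => {
  destMap += (x -> index2); index2 += 1
})
destMap.toString
// String = Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175 ...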
Transform non-numeric features into numeric values
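The same string-to-number mapping can be written without mutable counters. The sketch below is plain Scala with a hypothetical helper named CategoryIndex.index. It sorts the distinct values first, which is an assumption added here so that the numbering is reproducible; the numbering produced by the slide code depends on the order in which collect returns the values.

```scala
object CategoryIndex {
  // Give each distinct value an integer index, assigned in sorted order
  def index(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap
}
```

For example, CategoryIndex.index(Seq("DL", "AA", "UA", "AA")) maps AA to 0, DL to 1 and UA to 2. On the flight RDD the same idea would read flightsRDD.map(_.carrier).distinct.collect.sorted.zipWithIndex.toMap.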
Classification Learning Problem - Features
• Label: delayed or not delayed (delayed if the departure delay is more than 40 minutes)
• Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
• MLlib data types:
– Vector: contains the feature values of a data point
– LabeledPoint: contains a feature vector and its label
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Define the features array: the label comes first, followed by the eight features
val mlprep = flightsRDD.map(flight => {
  val monthday = flight.dofM.toInt - 1 // category
  val weekday = flight.dofW.toInt - 1 // category
  val crsdeptime1 = flight.crsdeptime.toInt
  val crsarrtime1 = flight.crsarrtime.toInt
  val carrier1 = carrierMap(flight.carrier) // category
  val crselapsedtime1 = flight.crselapsedtime.toDouble
  val origin1 = originMap(flight.origin) // category
  val dest1 = destMap(flight.dest) // category
  // Label: 1.0 if the departure delay is more than 40 minutes, otherwise 0.0
  val delayed = if (flight.depdelaymins.toDouble > 40) 1.0 else 0.0
  Array(delayed.toDouble, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble,
    crsarrtime1.toDouble, carrier1.toDouble, crselapsedtime1.toDouble,
    origin1.toDouble, dest1.toDouble)
})
mlprep.take(1)
// Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))

// Create a LabeledPoint from the label x(0) and a dense vector of the features x(1) to x(8)
val mldata = mlprep.map(x => LabeledPoint(x(0),
  Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8))))
mldata.take(1)
// Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
Define the features, Create LabeledPoint with Vector
Build Model
Split data into:
• Training data RDD (80%)
• Test data RDD (20%)
Data → Training Set + Test Set; Training Set → Build Model
// Randomly split the RDD into training data (80%) and test data (20%)
val splits = mldata.randomSplit(Array(0.8, 0.2))
// Named trainingData and testData to match the later steps
val trainingData = splits(0).cache()
val testData = splits(1).cache()
testData.take(1)
// Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
Split Data
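A rough picture of what a weighted random split does, sketched in plain Scala: each element is sent to the first part with probability 0.8, so the 80/20 proportions hold only approximately. The helper SplitSketch.split and its seed parameter are hypothetical, and this is an assumption-level illustration rather than Spark's actual sampling code.

```scala
import scala.util.Random

object SplitSketch {
  // Each element goes to the first part with probability `fraction`,
  // otherwise to the second part, so the part sizes are approximate.
  // A fixed seed makes the split repeatable.
  def split[A](data: Seq[A], fraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // toVector forces a strict collection so each element is tested exactly once
    data.toVector.partition(_ => rng.nextDouble() < fraction)
  }
}
```

For example, SplitSketch.split(1 to 1000, 0.8, 42L) returns two parts that together hold all 1000 elements, with roughly 800 in the first. Spark's randomSplit also accepts an optional seed as its second argument, which helps when a run needs to be reproduced.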
Build Model
Use the training set, with labels, to build a model
Data → Training Set + Test Set; Training Set → Build Model
import org.apache.spark.mllib.tree.DecisionTree

// Set the number of categories for each categorical feature
// (feature index -> number of categories)
var categoricalFeaturesInfo = Map[Int, Int]()
categoricalFeaturesInfo += (0 -> 31) // dofM: 31 categories
categoricalFeaturesInfo += (1 -> 7) // dofW: 7 categories
categoricalFeaturesInfo += (4 -> carrierMap.size) // number of carriers
categoricalFeaturesInfo += (6 -> originMap.size) // number of origin airports
categoricalFeaturesInfo += (7 -> destMap.size) // number of dest airports
val numClasses = 2
val impurity = "gini"
val maxDepth = 9
val maxBins = 7000

// Call DecisionTree.trainClassifier with the training data; it returns the model
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
Build Model
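The impurity setting "gini" refers to Gini impurity, which for a set of labels is 1 minus the sum of the squared class proportions. A set with a single class scores 0, and an even two-class mix scores 0.5. The plain-Scala sketch below computes it for a list of labels; the name GiniSketch.gini is hypothetical, and this is not how MLlib computes it internally.

```scala
object GiniSketch {
  // Gini impurity: 1 - sum over classes of (class proportion)^2
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.size.toDouble
      val sumOfSquares = labels.groupBy(identity).values.map { group =>
        val p = group.size / n // proportion of this class
        p * p
      }.sum
      1.0 - sumOfSquares
    }
}
```

For example, GiniSketch.gini(Seq(0.0, 0.0, 1.0, 1.0)) is 0.5 and GiniSketch.gini(Seq(0.0, 0.0, 0.0, 0.0)) is 0.0. A tree learner generally prefers the split whose child nodes have the lowest weighted impurity.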
// Print out the decision tree
model.toDebugString
// Feature indices: 0 = dofM, 4 = carrier, 3 = crsarrtime1, 6 = origin
// res20: String = DecisionTreeModel classifier of depth 9 with 919 nodes
//   If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,30.0})
//    If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
//     If (feature 3 <= 1603.0)
//      If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
//       If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...
Build Model
Get Predictions
Test Data (without label) → Model → Predict: Delayed or Not
// Get predictions: create an RDD of (test label, test prediction) pairs
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
labelAndPreds.take(1)
// (label, prediction)
// Array((0.0,0.0))
Get Predictions
// Get the instances where label != prediction
val wrongPrediction = labelAndPreds.filter { case (label, prediction) => label != prediction }
val wrong = wrongPrediction.count()
// Long = 11040
val ratioWrong = wrong.toDouble / testData.count()
// ratioWrong: Double = 0.3157443157443157
Test Model
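The ratio of wrong predictions is a single number. When far more flights are on time than delayed, it can help to break the errors down further. The plain-Scala sketch below counts true and false positives and negatives from (label, prediction) pairs and derives error rate, precision and recall. The names EvalSketch and Confusion are hypothetical, and the code assumes labels and predictions are exactly 0.0 or 1.0.

```scala
object EvalSketch {
  // Outcome counts for a binary classifier (1.0 = delayed, 0.0 = not delayed)
  case class Confusion(tp: Long, fp: Long, tn: Long, fn: Long) {
    def total: Long = tp + fp + tn + fn
    // Same quantity as the wrong-prediction ratio: wrong / total
    def errorRate: Double = (fp + fn).toDouble / total
    // Of the flights predicted delayed, the share that really were delayed
    def precision: Double = tp.toDouble / (tp + fp)
    // Of the flights that really were delayed, the share that was predicted
    def recall: Double = tp.toDouble / (tp + fn)
  }

  // Fold over (label, prediction) pairs, incrementing one counter per pair
  def confusion(pairs: Seq[(Double, Double)]): Confusion =
    pairs.foldLeft(Confusion(0, 0, 0, 0)) {
      case (c, (1.0, 1.0)) => c.copy(tp = c.tp + 1)
      case (c, (0.0, 1.0)) => c.copy(fp = c.fp + 1)
      case (c, (0.0, 0.0)) => c.copy(tn = c.tn + 1)
      case (c, _)          => c.copy(fn = c.fn + 1) // label 1.0, prediction 0.0
    }
}
```

On a small collected sample this could be called as EvalSketch.confusion(labelAndPreds.take(1000).toSeq); the sample size of 1000 is arbitrary.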
To Learn More:
• Download example code
– https://github.com/caroljmcdonald/sparkmldecisiontree
• Read explanation of example code
– https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
• Engage with us!
– https://www.mapr.com/blog/author/carol-mcdonald
– https://community.mapr.com
Q & A
Engage with us!
• @mapr
• https://www.mapr.com/blog/author/carol-mcdonald
• mapr-technologies