Big Data Science in Scala
Anastasia Lieva, Data Scientist, @lievAnastazia
KDnuggets polls: most popular tools in data science (2014, 2015, 2016)
1. R
2. Python
3. SQL
Context: Real Time Bidding
Raw requests: 100 000 requests per second
4 terabytes per day
R, Python, SQL
Scala: Spark (ML / DataFrame / SQL), Smile, Saddle
Preprocessing → Machine Learning → Evaluation
Problem: optimize the click rate of delivered ads
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information
Algorithm: Random Forest
Averaging the decisions from all the trees
[Figure: two example decision trees with Yes / No leaves. Tree 1 nodes: os (Android, iOS), category (Games, Music), city (Paris, Nantes). Tree 2 nodes: adType (Banner, Video), adSize (320x50, 480x320), weekDay (Saturday, Monday).]
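The "averaging the decisions from all the trees" step can be sketched in plain Scala. This is a minimal illustration, not the Spark or Smile implementation: the two "trees" are hand-written rules, and the names `ForestAveragingSketch`, `osTree`, `adTree` and `clickProbability` are hypothetical.

```scala
object ForestAveragingSketch {
  // A request is a bag of categorical features, e.g. "os" -> "iOS".
  type Request = Map[String, String]
  // A "tree" here is just a function returning 1.0 (click) or 0.0 (no click).
  type Tree = Request => Double

  // Two hand-written stand-ins for trained trees (hypothetical rules).
  val osTree: Tree = r =>
    if (r.get("os").contains("iOS") && r.get("city").contains("Paris")) 1.0 else 0.0
  val adTree: Tree = r =>
    if (r.get("adType").contains("banner") && r.get("adSize").contains("320x50")) 1.0 else 0.0

  // Forest prediction: the mean of the individual tree decisions.
  def clickProbability(forest: Seq[Tree], request: Request): Double = {
    require(forest.nonEmpty, "forest must contain at least one tree")
    forest.map(tree => tree(request)).sum / forest.size
  }
}
```

With one tree voting "click" and the other voting "no click", the estimated click probability is the mean of the two votes.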
Raw data
{ "time":"2016-06-09T0:25:28Z", "bidfloor":2.88, "appOrSite":"app", "adType":"banner", "categories":"games,news,football", "carrier":"208-10", "os":"iOS", "connectionType":3, "coords":[48.929256439208984, 2.4255824089050293], "adSize":[320, 50], "exchange":"xxxxx", [...], "clicked":true}
Sample: 13 GB
Os            MaxPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Converted to numeric features:

Os   MaxPrice  Time
3.0  6.0       1.0
5.0  3.0       5.0
1.0  2.0       3.0
Preprocessing: Spark ml
Extraction: extracting features from “raw” data (TF-IDF, Spark SQL)
Transformation: scaling, converting, or modifying features (Bucketizer, StringIndexer, IndexToString, VectorAssembler)
Selection: selecting a subset from a larger set of features (ChiSqSelector)
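As an illustration of a transformation step, here is a minimal plain-Scala sketch of what a bucketizer does: it maps a continuous value (for example a bid floor) to the index of the interval it falls in. It mirrors the documented behaviour of Spark's Bucketizer (buckets are [x, y), except the last one, which also includes y) but it is not the Spark API; `BucketizerSketch` and `bucketize` are hypothetical names, and the split points used in the example are made up.

```scala
object BucketizerSketch {
  // Returns the index i such that splits(i) <= value < splits(i + 1).
  // The last bucket also includes its upper bound.
  def bucketize(splits: Vector[Double], value: Double): Int = {
    require(splits.length >= 2, "need at least two split points")
    require(splits.sliding(2).forall { case Seq(a, b) => a < b },
      "splits must be strictly increasing")
    require(value >= splits.head && value <= splits.last,
      s"value $value is outside the splits")
    if (value == splits.last) splits.length - 2
    else splits.lastIndexWhere(_ <= value)
  }
}
```

For example, with split points 0, 1, 5 and 10, a price of 2.88 lands in bucket 1 and a price of 7.3 in bucket 2.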
Preprocessing: Saddle
Array-backed, specialized data structures
Pandas-like operations:
● dealing with missing values
● index transformation tools
● extracting, slicing, mapping row/column wise
● groupBy / join / concat
● sorting / pivoting
Learning: Spark ml
DataFrame-based API
Pipeline interface
● Classification
● Regression
● Linear methods
● Decision trees
● Tree ensembles
Pipeline: TF-IDF → String Indexer → Assembler → Random Forest → Evaluation
Compare performance: Spark
Learning: Smile
Array-backed API
● Classification
● Regression
● Linear methods
● Decision trees
● Tree ensembles
★ Visualisation
★ Missing values imputation
★ Association rule mining
★ Manifold learning
★ Multi-dimensional scaling
★ Feature selection and dimensionality reduction
Preprocessing: Saddle
Create dataframe and balance the data
Preprocessing: Spark ml
Create dataframe and balance the data
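The slides do not say how the data was balanced. One common option, shown here as an assumption, is to downsample the majority class (usually the non-clicked impressions) to the size of the minority class. The sketch uses only the Scala standard library; `BalanceSketch`, `Impression` and `downsample` are hypothetical names.

```scala
import scala.util.Random

object BalanceSketch {
  final case class Impression(os: String, maxPrice: Double, clicked: Boolean)

  // Keep every row of the minority class and a random sample of the
  // majority class of the same size.
  def downsample(rows: Seq[Impression], seed: Long): Seq[Impression] = {
    val (clicked, notClicked) = rows.partition(_.clicked)
    val (minority, majority) =
      if (clicked.size <= notClicked.size) (clicked, notClicked) else (notClicked, clicked)
    val rng = new Random(seed)
    minority ++ rng.shuffle(majority).take(minority.size)
  }
}
```

The fixed seed keeps the sample reproducible between runs, which makes the later comparison of libraries easier to repeat.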
Preprocessing: Spark ml
Index categorical data

timestamp   os             osIdx
1465037789  iOS            1
1464983457  Windows Phone  2
1465019529  Android        0
1464974567  iOS            1
1465018552  Android        0
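The indexing shown in the table can be reproduced with a small plain-Scala sketch. Spark's StringIndexer orders labels by descending frequency by default; the alphabetical tie-break used here is an assumption of this sketch, chosen so the result is deterministic. `StringIndexerSketch` and `fitIndex` are hypothetical names, not the Spark API.

```scala
object StringIndexerSketch {
  // Assign index 0 to the most frequent label, 1 to the next, and so on.
  // Ties are broken alphabetically so the result is deterministic.
  def fitIndex(labels: Seq[String]): Map[String, Int] =
    labels
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }
      .zipWithIndex
      .toMap
}
```

Applied to the os column above (iOS, Windows Phone, Android, iOS, Android), this yields Android = 0, iOS = 1 and Windows Phone = 2, matching the osIdx column.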
Preprocessing: Saddle
Index categorical data
Preprocessing: Saddle
Split randomly into test and train sets and convert to the input type needed by the Smile RF implementation
Preprocessing: Spark ml
Conversion and sampling
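The random train/test split can be sketched with the Scala standard library alone. This is a minimal illustration, assuming a simple shuffle-and-cut strategy; it is not the Saddle or Spark code from the talk, and `SplitSketch` and `trainTestSplit` are hypothetical names.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle with a fixed seed, then cut at the requested train fraction.
  def trainTestSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0, "trainFraction must be in [0, 1]")
    val shuffled = new Random(seed).shuffle(rows)
    val cut = math.round(rows.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }
}
```

Every row ends up in exactly one of the two sets, so the model is never evaluated on rows it was trained on.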
Learning: construct the classifier and set hyperparameters (Spark ml, Smile)
Learning: train the model and predict on the test dataframe (Spark ml, Smile)
Learning: evaluate the model (Spark ml, Smile)
Compare Spark and Smile Random Forest: classification metrics
[Figure: classification metrics for Spark and Smile, grouped into "the higher the better" and "the lower the better".]
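The transcript does not name the metrics that were compared. As an assumption, the sketch below uses accuracy as an example of a "higher is better" metric and log loss as an example of a "lower is better" metric, computed from predicted click probabilities and true labels. `MetricsSketch`, `accuracy` and `logLoss` are hypothetical names.

```scala
object MetricsSketch {
  // Each element is (predicted click probability, actual label).
  type Scored = Seq[(Double, Boolean)]

  // Share of rows where the thresholded prediction equals the label.
  def accuracy(scored: Scored, threshold: Double = 0.5): Double = {
    require(scored.nonEmpty, "need at least one prediction")
    scored.count { case (p, y) => (p >= threshold) == y }.toDouble / scored.size
  }

  // Mean negative log-likelihood of the true labels.
  def logLoss(scored: Scored): Double = {
    require(scored.nonEmpty, "need at least one prediction")
    val eps = 1e-15 // clip probabilities so that log stays finite
    val losses = scored.map { case (p, y) =>
      val q = math.min(math.max(p, eps), 1 - eps)
      if (y) -math.log(q) else -math.log(1 - q)
    }
    losses.sum / losses.size
  }
}
```

Computing both kinds of metric on the same test set for each library is one way to make such a comparison like for like.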
Compare Spark and Smile Random Forest: running time on 13 GB
[Figure: running time in minutes.]
Compare preprocessing: Spark vs Saddle
My List[tools] for THIS project:
Preprocessing: Spark
Machine Learning (Random Forest): Smile
Your Option[tools] for YOUR project:
Spark
SMILE
Saddle