Big Data Science in Scala

Anastasia Lieva, Data Scientist, @lievAnastazia

KDnuggets polls (2014, 2015, 2016): most popular tools in data science

1. R
2. Python
3. SQL

Context: Real Time Bidding

Raw requests: 100 000 requests per second

4 terabytes per day

R, Python, SQL

Scala:

● Spark (ML / DataFrame / SQL)
● Smile
● Saddle

Spark, Saddle, Smile

Preprocessing, Machine Learning, Evaluation

Problem: optimize the click rate of delivered ads

We want to estimate the probability that an ad will be clicked, depending on:

● request configuration
● proposed creative
● user history
● third-party information

Algorithm: Random Forest

Averaging the decisions from all the trees

[Figure: two example decision trees. Tree 1 splits on os (Android, iOS), category (Games, Music) and city (Paris, Nantes). Tree 2 splits on adType (Banner, Video), adSize (320x50, 480x320) and weekDay (Saturday, Monday). Leaves are Yes / No.]
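A minimal plain-Scala sketch of "averaging the decisions from all the trees". The `Request` fields, the two hand-written trees and their split rules are hypothetical stand-ins for trained trees, not the model from the talk; real libraries may combine tree outputs differently.

```scala
object ForestAveraging {
  // Hypothetical, simplified request: only the fields used by the toy trees below.
  final case class Request(os: String, city: String, adType: String, weekDay: String)

  // A "tree" is modelled as a function from a request to a vote: 1.0 = click, 0.0 = no click.
  type Tree = Request => Double

  // Two hand-written toy trees (assumed split rules, for illustration only).
  val osCityTree: Tree = r =>
    if (r.os == "Android") { if (r.city == "Paris") 1.0 else 0.0 }
    else 0.0

  val adTypeDayTree: Tree = r =>
    if (r.adType == "Banner") { if (r.weekDay == "Saturday") 1.0 else 0.0 }
    else 1.0

  // The forest's estimate is the mean of the individual tree votes.
  def clickProbability(forest: Seq[Tree], r: Request): Double = {
    require(forest.nonEmpty, "forest must contain at least one tree")
    forest.map(tree => tree(r)).sum / forest.size
  }
}
```

With more trees, the mean of the votes moves from a coarse yes/no towards a usable click-probability estimate.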

Raw data

{ "time":"2016-06-09T0:25:28Z", "bidfloor":2.88, "appOrSite":"app", "adType":"banner", "categories":"games,news,football", "carrier":"208-10", "os":"iOS", "connectionType":3, "coords":[48.929256439208984, 2.4255824089050293], "adSize":[320, 50], "exchange":"xxxxx", [...], "clicked":true}

Sample of 13 GB

Os            MaxPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Encoded as numeric features:

Os   MaxPrice  Time
3.0  6.0       1.0
5.0  3.0       5.0
1.0  2.0       3.0

Preprocessing: Spark ml

● Extraction: extracting features from "raw" data (TF-IDF, SparkSQL)
● Transformation: scaling, converting, or modifying features (Bucketizer, StringIndexer, IndexToString, VectorAssembler)
● Selection: selecting a subset from a larger set of features (ChiSqSelector)
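As an illustration of the "transformation" step, here is a plain-Scala sketch of bucketizing a continuous feature such as a price. It follows the half-open bucket convention that Spark's Bucketizer documents (each bucket is [lower, upper), and the last bucket also includes its upper bound), but it is a standalone stand-in, not the Spark API; the split points in the usage note are hypothetical.

```scala
object BucketSketch {
  // Returns the index of the bucket containing x.
  // Buckets are [splits(i), splits(i + 1)); the last bucket also includes its upper bound.
  def bucketOf(splits: Vector[Double], x: Double): Int = {
    require(splits.length >= 2, "need at least two split points")
    require(splits.zip(splits.tail).forall { case (a, b) => a < b },
      "splits must be strictly increasing")
    require(x >= splits.head && x <= splits.last, s"value $x is outside the splits")
    if (x == splits.last) splits.length - 2
    else splits.lastIndexWhere(_ <= x)
  }
}
```

For example, with the assumed splits `Vector(0.0, 2.0, 5.0, 10.0)`, the prices 2.88 and 4.55 fall into bucket 1 and 7.3 falls into bucket 2.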

Preprocessing: Saddle

Array-backed, specialized data structures

Pandas-like operations:

● dealing with missing values
● index transformation tools
● extracting, slicing, mapping row/column-wise
● groupBy / join / concat
● sorting / pivoting

Learning: Spark ml

DataFrame-based API:

● Classification
● Regression
● Linear methods
● Decision trees
● Tree ensembles

Pipeline interface:

TF-IDF → String Indexer → Vector Assembler → Random Forest → Evaluation

Compare performance: Spark

Learning: Smile

Array-backed API:

● Classification
● Regression
● Linear methods
● Decision trees
● Tree ensembles

★ Visualisation
★ Missing values imputation
★ Association rule mining
★ Manifold learning
★ Multi-dimensional scaling
★ Feature selection and dimensionality reduction

Preprocessing: Saddle

Create dataframe and balance the data

Preprocessing: Spark ml

Create dataframe and balance the data
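The slide does not say how the data was balanced. One common option, shown here as an assumption in plain Scala, is to downsample the larger class (typically the non-clicked requests) to the size of the smaller one. `Row` is a hypothetical minimal record, not a Spark or Saddle type.

```scala
import scala.util.Random

object Balancing {
  // Hypothetical minimal row: a feature vector plus the click label.
  final case class Row(features: Vector[Double], clicked: Boolean)

  // Downsample the larger class to the size of the smaller one.
  // A fixed seed keeps the sample reproducible.
  def balance(rows: Seq[Row], seed: Long): Seq[Row] = {
    val rng = new Random(seed)
    val (clicked, notClicked) = rows.partition(_.clicked)
    val n = math.min(clicked.size, notClicked.size)
    rng.shuffle(clicked).take(n) ++ rng.shuffle(notClicked).take(n)
  }
}
```

Downsampling discards data, which is usually acceptable when the majority class is very large; oversampling or class weights are alternatives.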

Preprocessing: Spark ml

Index categorical data

timestamp   os             osIdx
1465037789  iOS            1
1464983457  Windows Phone  2
1465019529  Android        0
1464974567  iOS            1
1465018552  Android        0
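The osIdx values above are consistent with frequency-based indexing: the most frequent label gets index 0. A plain-Scala sketch of that idea follows; it is not the Spark StringIndexer itself, and the alphabetical tie-break is an assumption chosen so that the sketch reproduces the table.

```scala
object CategoryIndexing {
  // Builds a label -> index map: the most frequent label gets 0,
  // and labels with equal frequency are ordered alphabetically (assumed tie-break).
  def fitIndex(labels: Seq[String]): Map[String, Int] =
    labels
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .zipWithIndex
      .map { case ((label, _), idx) => label -> idx }
      .toMap
}
```

Applied to the five os values in the table, this yields Android = 0, iOS = 1, Windows Phone = 2.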

Preprocessing: Saddle

Index categorical data

Preprocessing: Saddle

Split randomly into test and train sets and convert to the input type needed by the Smile RF implementation
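A plain-Scala sketch of the random split and the conversion step. It does not use Saddle or Smile; the target shape (an array of feature arrays plus an integer label array) is an assumption about what an array-backed learner expects, and the names `split` and `toArrays` are hypothetical.

```scala
import scala.util.Random

object TrainTestSplit {
  // Shuffles with a fixed seed, then cuts at trainFraction.
  def split[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0, "trainFraction must be in [0, 1]")
    val shuffled = new Random(seed).shuffle(rows)
    val cut = math.round(shuffled.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }

  // Converts (features, label) rows into an array of feature arrays and a label array
  // (assumed input shape for an array-backed classifier).
  def toArrays(rows: Seq[(Vector[Double], Int)]): (Array[Array[Double]], Array[Int]) =
    (rows.map(_._1.toArray).toArray, rows.map(_._2).toArray)
}
```

Fixing the seed makes the split reproducible, which matters when comparing two libraries on the same train and test sets.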

Preprocessing: Spark ml

Conversion and sampling

Learning: construct classifier and set hyperparameters (Smile, Spark ml)

Learning: train model and predict on test dataframe (Spark ml, Smile)

Learning: evaluate model (Spark ml, Smile)

Compare Spark and Smile Random Forest: classification metrics

[Chart: classification metrics for Spark and Smile, grouped into "the higher the better" and "the lower the better".]
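The chart itself is not recoverable, so which metrics were plotted is unknown. As an assumed illustration of the two groups, this plain-Scala sketch computes a few standard binary classification metrics from predictions and labels: accuracy, precision and recall (higher is better) and false positive rate (lower is better).

```scala
object BinaryMetrics {
  // Counts of true/false positives and negatives.
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int)

  def confusion(predicted: Seq[Boolean], actual: Seq[Boolean]): Confusion = {
    require(predicted.nonEmpty, "need at least one prediction")
    require(predicted.size == actual.size, "predictions and labels must align")
    val pairs = predicted.zip(actual)
    Confusion(
      tp = pairs.count { case (p, a) => p && a },
      fp = pairs.count { case (p, a) => p && !a },
      tn = pairs.count { case (p, a) => !p && !a },
      fn = pairs.count { case (p, a) => !p && a }
    )
  }

  // Higher is better.
  def accuracy(c: Confusion): Double =
    (c.tp + c.tn).toDouble / (c.tp + c.fp + c.tn + c.fn)
  def precision(c: Confusion): Double =
    if (c.tp + c.fp == 0) 0.0 else c.tp.toDouble / (c.tp + c.fp)
  def recall(c: Confusion): Double =
    if (c.tp + c.fn == 0) 0.0 else c.tp.toDouble / (c.tp + c.fn)

  // Lower is better.
  def falsePositiveRate(c: Confusion): Double =
    if (c.fp + c.tn == 0) 0.0 else c.fp.toDouble / (c.fp + c.tn)
}
```

Because clicks are rare, accuracy alone can look good for a model that never predicts a click, which is one reason to report precision and recall alongside it.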

Compare Spark and Smile Random Forest: running time on 13 GB

[Chart: running time in minutes.]

Compare preprocessing: Spark vs Saddle

My List[tools] for THIS project:

● Preprocessing: Spark
● Machine Learning (Random Forest): Smile

Your Option[tools] for YOUR project:

Spark

SMILE

Saddle
