Big Data Science in Scala

Anastasia Lieva, Data Scientist, @lievAnastazia

Posted on 21-Apr-2017

Page 1: Big Data Science in Scala

Big Data Science in Scala

Anastasia Lieva, Data Scientist, @lievAnastazia

Page 2: Big Data Science in Scala

KDnuggets polls (2014, 2015, 2016): most popular tools in data science

1. R
2. Python
3. SQL

Page 3: Big Data Science in Scala

Context: Real Time Bidding

Raw requests: 100 000 requests per second

4 terabytes per day
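The two figures above imply an average request size. A rough back-of-the-envelope sketch, assuming decimal terabytes and that the quoted rate is sustained for the whole day (neither assumption is stated on the slide):

```scala
object RtbVolume {
  val requestsPerSecond: Long = 100000L      // from the slide
  val bytesPerDay: Double = 4e12             // 4 TB, assuming decimal terabytes
  val secondsPerDay: Long = 24L * 60 * 60    // 86,400

  // Requests per day if the quoted rate were sustained all day (an assumption).
  val requestsPerDay: Long = requestsPerSecond * secondsPerDay

  // Implied average size of one raw request, in bytes.
  val avgBytesPerRequest: Double = bytesPerDay / requestsPerDay
}
```

Under these assumptions the result is about 8.64 billion requests per day, at a few hundred bytes each.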

Page 4: Big Data Science in Scala

R

Python

SQL

Scala

Page 5: Big Data Science in Scala

R

Python

SQL

Scala

Spark (ML / DataFrame / SQL)

SMILE

Saddle

Page 6: Big Data Science in Scala

Spark: Preprocessing, Machine Learning, Evaluation

Saddle: Preprocessing

Smile: Machine Learning, Evaluation

Page 7: Big Data Science in Scala

Problem: Optimize the click rate of delivered ads

We want to estimate the probability that an ad will be clicked, depending on:

● request configuration
● proposed creative
● user history
● third-party information

Page 8: Big Data Science in Scala

Algorithm: Random Forest

Averaging the decisions from all the trees

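[Figure: two example decision trees. One splits on os (Android, iOS), then on category (Games, Music) or city (Paris, Nantes). The other splits on adType (Banner, Video), then on adSize (320x50, 480x320) or weekDay (Saturday, Monday). Leaves are Yes / No.]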
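The averaging step can be sketched in plain Scala. This is only an illustration: the two trees below are hand-written and hypothetical (their leaf values are invented for the example), whereas a real random forest learns its splits from data. The feature names follow the ones shown on the slide.

```scala
object ForestSketch {
  // A request is a bag of categorical features, e.g. "os" -> "iOS".
  type Request = Map[String, String]
  // A tree maps a request to a vote: 1.0 for "click", 0.0 for "no click".
  type Tree = Request => Double

  // Hypothetical tree splitting on os, then on city or category.
  val osTree: Tree = r =>
    if (r.get("os").contains("iOS")) {
      if (r.get("city").contains("Paris")) 1.0 else 0.0
    } else {
      if (r.get("category").contains("Games")) 1.0 else 0.0
    }

  // Hypothetical tree splitting on adType, then on adSize or weekDay.
  val adTree: Tree = r =>
    if (r.get("adType").contains("Banner")) {
      if (r.get("adSize").contains("320x50")) 1.0 else 0.0
    } else {
      if (r.get("weekDay").contains("Saturday")) 1.0 else 0.0
    }

  // The forest's estimate is the average of the individual tree votes.
  def clickProbability(forest: Seq[Tree], r: Request): Double =
    forest.map(tree => tree(r)).sum / forest.size
}
```

With these two trees, a request on which one tree votes "click" and the other votes "no click" gets an estimated click probability of 0.5.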

Page 9: Big Data Science in Scala

Raw data

{ "time":"2016-06-09T0:25:28Z", "bidfloor":2.88, "appOrSite":"app", "adType":"banner", "categories":"games,news,football", "carrier":"208-10", "os":"iOS", "connectionType":3, "coords":[48.929256439208984, 2.4255824089050293], "adSize":[320, 50], "exchange":"xxxxx", [...], "clicked":true}

Sample of 13 GB

Page 10: Big Data Science in Scala

Os            MaxPrice  Time
Android       7.3       2016-06-09T0:25:28Z
iOS           4.55      2016-05-09T14:23:12Z
WindowsPhone  2.89      2016-06-09T11:35:11Z

Page 14: Big Data Science in Scala

Os            MaxPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Page 15: Big Data Science in Scala

Os            MaxPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Os   MaxPrice  Time
3.0  6.0       1.0
5.0  3.0       5.0
1.0  2.0       3.0

Page 16: Big Data Science in Scala

Preprocessing: Spark ml

Extraction: Extracting features from “raw” data

Transformation: Scaling, converting, or modifying features

Selection: Selecting a subset from a larger set of features

Page 17: Big Data Science in Scala

Preprocessing: Spark ml

Extraction: Extracting features from “raw” data (TF-IDF, SparkSQL)

Transformation: Scaling, converting, or modifying features (Bucketizer, StringIndexer, IndexToString, VectorAssembler)

Selection: Selecting a subset from a larger set of features (ChiSqSelector)
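As an illustration of one transformation step, here is a plain-Scala sketch of bucketizing a continuous feature such as a price. It is not the Spark implementation; it only mimics the convention that n + 1 split points define n half-open buckets, with the last bucket also including its upper bound. The split points used in the usage note are hypothetical.

```scala
object BucketizeSketch {
  // splits must be strictly increasing; n + 1 splits define n buckets.
  // Each bucket is [lower, upper), except the last, which also includes its upper bound.
  def bucketize(splits: Vector[Double], value: Double): Int = {
    require(splits.size >= 2, "need at least two split points")
    require(value >= splits.head && value <= splits.last, s"$value is outside the splits")
    if (value == splits.last) splits.size - 2
    else splits.lastIndexWhere(_ <= value)
  }
}
```

With splits 0, 1, 3 and 10, a price of 2.88 falls into bucket 1 and a price of 7.3 into bucket 2.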

Page 18: Big Data Science in Scala

Preprocessing: Saddle

Array-backed, specialized data structures

Pandas-like operations:

● dealing with missing values
● index transformation tools
● extracting, slicing, mapping row/column wise
● groupBy / join / concat
● sorting / pivoting
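To make two of these operations concrete without relying on any particular Saddle API, here is a sketch using only Scala standard collections. The `Row` type and both helper functions are hypothetical; Saddle provides its own data structures for this kind of work.

```scala
object FrameOpsSketch {
  // Hypothetical row type; None marks a missing price.
  final case class Row(os: String, maxPrice: Option[Double])

  // Dealing with missing values: replace missing prices with the mean of the observed ones.
  def fillWithMean(rows: Seq[Row]): Seq[Row] = {
    val observed = rows.flatMap(_.maxPrice)
    val mean = if (observed.isEmpty) 0.0 else observed.sum / observed.size
    rows.map(r => r.copy(maxPrice = Some(r.maxPrice.getOrElse(mean))))
  }

  // groupBy: average price per os. Expects rows whose prices were already filled.
  def meanPriceByOs(rows: Seq[Row]): Map[String, Double] =
    rows.groupBy(_.os).map { case (os, group) =>
      val prices = group.flatMap(_.maxPrice)
      os -> prices.sum / prices.size
    }
}
```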

Page 19: Big Data Science in Scala

Learning: Spark ml

Dataframe-based API

● Classification
● Regression
● Linear Methods
● Decision Trees
● Tree ensembles

Page 20: Big Data Science in Scala

Learning: Spark ml

Dataframe-based API

● Classification
● Regression
● Linear Methods
● Decision Trees
● Tree ensembles

Pipeline interface

TF-IDF → StringIndexer → Assembler → Random Forest → Evaluation
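The idea behind a pipeline interface can be sketched in a few lines of plain Scala: each stage is fitted on the output of the stages before it, and the fitted stages are then applied in order to new data. Every name below (`Estimator`, `Transformer`, `fitPipeline`, `Center`, `ScaleByMaxAbs`) is defined here for illustration only; this is not Spark's API, and the two stages are simple stand-ins for the ones named on the slide.

```scala
object PipelineSketch {
  type Row = Map[String, Double]
  type Data = Vector[Row]

  trait Transformer { def transform(data: Data): Data }
  trait Estimator { def fit(data: Data): Transformer }

  // Fit each stage on the data as transformed by the previously fitted stages.
  def fitPipeline(stages: Seq[Estimator], data: Data): Transformer = {
    val fitted = stages.foldLeft((Vector.empty[Transformer], data)) {
      case ((done, current), stage) =>
        val model = stage.fit(current)
        (done :+ model, model.transform(current))
    }._1
    new Transformer {
      def transform(d: Data): Data = fitted.foldLeft(d)((acc, t) => t.transform(acc))
    }
  }

  // Hypothetical stage: learns the mean of a column and subtracts it.
  final class Center(col: String) extends Estimator {
    def fit(data: Data): Transformer = {
      val mean = data.map(_(col)).sum / data.size
      new Transformer {
        def transform(d: Data): Data = d.map(r => r.updated(col, r(col) - mean))
      }
    }
  }

  // Hypothetical stage: learns the largest absolute value of a column and divides by it.
  // Assumes the column is not all zeros.
  final class ScaleByMaxAbs(col: String) extends Estimator {
    def fit(data: Data): Transformer = {
      val maxAbs = data.map(r => math.abs(r(col))).max
      new Transformer {
        def transform(d: Data): Data = d.map(r => r.updated(col, r(col) / maxAbs))
      }
    }
  }
}
```

The point of the design is that the fitted pipeline is itself a single transformer, so the same preprocessing learned on the training set is replayed unchanged on the test set.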

Page 21: Big Data Science in Scala

Compare performance: Spark

Page 22: Big Data Science in Scala

Learning: Smile

Array-backed API

● Classification
● Regression
● Linear Methods
● Decision Trees
● Tree ensembles

Page 23: Big Data Science in Scala

Learning: Smile

● Classification
● Regression
● Linear Methods
● Decision Trees
● Tree ensembles

★ Visualisation
★ Missing Values Imputation
★ Association Rule Mining
★ Manifold learning
★ Multi-dimensional scaling
★ Feature selection and dimensionality reduction

Page 24: Big Data Science in Scala

Preprocessing: Saddle

Create dataframe and balance the data

Page 25: Big Data Science in Scala

Preprocessing: Spark ml

Create dataframe and balance the data
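The transcript does not preserve the code from these slides, so the balancing method actually used is unknown. One common approach is to downsample the majority class, sketched here in plain Scala; the function name and signature are hypothetical.

```scala
import scala.util.Random

object BalanceSketch {
  // Keep every minority-class example and an equally sized
  // random sample of the majority class.
  def downsample[A](rows: Seq[A], seed: Long)(isPositive: A => Boolean): Seq[A] = {
    val (pos, neg) = rows.partition(isPositive)
    val (minority, majority) = if (pos.size <= neg.size) (pos, neg) else (neg, pos)
    val rng = new Random(seed) // fixed seed so the sample is reproducible
    minority ++ rng.shuffle(majority).take(minority.size)
  }
}
```

For example, applied to 100 rows of which 10 are clicks, this keeps the 10 clicks and 10 randomly chosen non-clicks.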

Page 26: Big Data Science in Scala

Preprocessing: Spark ml

Index categorical data

timestamp   os             osIdx
1465037789  iOS            1
1464983457  Windows Phone  2
1465019529  Android        0
1464974567  iOS            1
1465018552  Android        0
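The indices in the table are consistent with frequency-based indexing, which is what Spark's StringIndexer does by default: the most frequent label gets index 0. The plain-Scala sketch below reproduces that idea; breaking ties alphabetically is an assumption made here so that the result is deterministic.

```scala
object IndexSketch {
  // Most frequent label gets index 0; ties are broken alphabetically (an assumption).
  def fitIndex(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap
}
```

Fitted on the five os values from the table, the sketch assigns Android to 0, iOS to 1 and Windows Phone to 2, matching the osIdx column.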

Page 27: Big Data Science in Scala

Preprocessing: Saddle

Index categorical data

Page 28: Big Data Science in Scala

Preprocessing: Saddle

Split randomly into test and train sets and convert to the input type needed by the Smile RF implementation
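A random split and a conversion to arrays can be sketched with the Scala standard library alone. The helper names are hypothetical, and the target shape (an array of feature arrays plus an array of integer labels) is an assumption about what an array-backed API expects; the transcript does not show the exact input type.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle with a fixed seed, then cut at trainFraction.
  def trainTestSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0, "trainFraction must be in [0, 1]")
    val shuffled = new Random(seed).shuffle(rows)
    shuffled.splitAt(math.round(rows.size * trainFraction).toInt)
  }

  // Array-backed form: one feature array per example, plus one integer label per example.
  def toArrays(rows: Seq[(Array[Double], Int)]): (Array[Array[Double]], Array[Int]) =
    (rows.map(_._1).toArray, rows.map(_._2).toArray)
}
```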

Page 29: Big Data Science in Scala

Preprocessing: Spark ml

Conversion and sampling

Page 30: Big Data Science in Scala

Learning: Construct classifier and set hyperparameters

Spark ml

Smile

Page 31: Big Data Science in Scala

Learning: Train model and predict on test dataframe

Spark ml

Smile

Page 32: Big Data Science in Scala

Learning: Evaluate model

Spark ml

Smile
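The transcript does not show which metrics were computed. As an illustration, the usual binary classification metrics can be derived from predicted and actual click labels as follows; `Metrics` and `evaluate` are hypothetical names, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```scala
object EvalSketch {
  final case class Metrics(accuracy: Double, precision: Double, recall: Double, f1: Double)

  // Compare predicted and actual click labels pairwise.
  def evaluate(predicted: Seq[Boolean], actual: Seq[Boolean]): Metrics = {
    require(predicted.size == actual.size && predicted.nonEmpty, "need two equally sized, non-empty inputs")
    val pairs = predicted.zip(actual)
    val tp = pairs.count { case (p, a) => p && a }.toDouble   // true positives
    val fp = pairs.count { case (p, a) => p && !a }.toDouble  // false positives
    val fn = pairs.count { case (p, a) => !p && a }.toDouble  // false negatives
    val tn = pairs.count { case (p, a) => !p && !a }.toDouble // true negatives
    val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp / (tp + fn)
    val f1 = if (precision + recall == 0) 0.0 else 2 * precision * recall / (precision + recall)
    Metrics((tp + tn) / pairs.size, precision, recall, f1)
  }
}
```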

Page 33: Big Data Science in Scala

Compare Spark and Smile Random Forest

[Chart: classification metrics, marked "the higher the better" or "the lower the better"]

Page 34: Big Data Science in Scala

Compare Spark and Smile Random Forest

[Chart: running time on 13 GB, in minutes]

Page 35: Big Data Science in Scala

Compare preprocessing: Spark vs Saddle

Page 36: Big Data Science in Scala

My List[tools] for THIS project:

Preprocessing: Spark

Machine Learning (Random Forest): Smile

Page 37: Big Data Science in Scala

Your Option[tools] for YOUR project:

Spark

SMILE

Saddle