data science on big data. pragmatic approach

Post on 31-Jul-2015

68 Views

Category:

Data & Analytics

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

 Data science using Big Data. Pragmatic

approach.

Pavel Mesentsev Dmitri Babaev

Tinkoff Bank

Machine learning in business

Other ML tasks examples

Size of data

~ 1 mln obj

~ 1 bln obj

~ 1 trln obj

Small Data

Big Data

Huge Data

Predictive models design

Small data case

Big data case

Big data case

Apache Spark case

How does its works

How does its works

How does its works

How does its works

How does its works

How does its works

How does its works

Spark Sql

Spark Sql

Spark Sql

Spark Sql

Spark Sql

Spark Sql

Spark Sql

Spark Sql

Spark Sql

Apache Spark case

Apache Spark in

Large Scale Machine Learning

LSML issues

• Too much samples to classify

• Training data does not fit in memory

• Too much training samples

• Too much models to train

Why not MLLib?• MLLib is less stable

• too few algorithms comparing to scikit-learn

• ML pipelines are not so mature than in scikit-learn

• e. g. there is no simple way to use logistic regression for feature selection

• MLLib python API falls behind Java/Scala API

• MLLib is actively developed and may be feasible choice in near future

Spark + scikit-learn = ?• Parallel training

• meta-parameter grid search

• parallel one-vs-rest for multi-class models

• same features but different targets

• parallel bagging and ensembles

• parallel learning for multi-step classification

• Parallel prediction

Distributed learning

Distributed learning

Distributed learning

Distributed learning

Distributed learning

Distributed learning

Distributed learning

Distributed learning

Distributed learning

Distributed prediction

Distributed prediction

Distributed prediction

Distributed prediction

Distributed prediction

Distributed prediction

Distributed prediction

Models management

•A complex ML task can be expressed as scikit-learn

pipeline

• e. g. feature scaling, then LR feature selection then

GBT classification

• Trained models can be stored on HDFS / S3 along with

training report and loaded for prediction

Alternative approaches to LSML•partial / online learning

• e. g. stochastic gradient descent

•distributed stochastic gradient descent

• is implemented in MLLib

Spark is everywhere

•Mahout on spark aka Samsara

•H20 Sparkling water

Questions?

Pavel Mezentsev pavel@mezentsev.orgDmitri Babaev dmitri.lb@gmail.com

top related