Data Science on Big Data: a Pragmatic Approach
Post on 08-Jan-2017
LSML (large-scale machine learning) issues
• Too many samples to classify
• Training data does not fit in memory
• Too many training samples
• Too many models to train
Why not MLLib?
• MLLib is less stable
• too few algorithms compared to scikit-learn
• ML pipelines are not as mature as in scikit-learn
  • e.g. there is no simple way to use logistic regression for feature selection
• the MLLib Python API falls behind the Java/Scala API
• MLLib is actively developed and may become a feasible choice in the near future
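As a minimal sketch of the scikit-learn idiom the slide alludes to, an L1-penalised logistic regression can drive uninformative coefficients to zero and act as a feature selector via `SelectFromModel` (the dataset and parameters below are illustrative, not from the talk):

```python
# Sketch: logistic regression as a feature selector in scikit-learn.
# The L1 penalty zeroes out coefficients of uninformative features,
# and SelectFromModel keeps only the columns with nonzero weights.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # fewer columns survive
```

The selected matrix can then feed any downstream classifier, which is the kind of composition MLLib had no simple equivalent for at the time of the talk.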
Spark + scikit-learn = ?
• Parallel training
  • meta-parameter grid search
  • parallel one-vs-rest for multi-class models
    • same features but different targets
  • parallel bagging and ensembles
  • parallel learning for multi-step classification
• Parallel prediction
Models management
• A complex ML task can be expressed as a scikit-learn pipeline
  • e.g. feature scaling, then LR feature selection, then GBT classification
• Trained models can be stored on HDFS / S3 along with a training report and loaded later for prediction
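A sketch of the exact pipeline the slide describes: feature scaling, then L1 logistic-regression feature selection, then gradient-boosted-tree classification. Pickling to bytes stands in for the HDFS / S3 storage step; the data and hyperparameters are illustrative:

```python
# Sketch: scale -> LR feature selection -> GBT, as one scikit-learn Pipeline.
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=15, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5))),
    ("classify", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# serialise the trained pipeline; in production these bytes would be
# written to HDFS or S3 next to the training report
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)
preds = restored.predict(X)
```

Because the whole multi-step task is one picklable object, storing, versioning, and reloading it for prediction needs no extra machinery.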
Alternative approaches to LSML
• partial / online learning
  • e.g. stochastic gradient descent
• distributed stochastic gradient descent
  • average the sub-gradients to get the full gradient
  • implemented in MLLib
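The averaging step can be illustrated with a toy example: each simulated "worker" computes the gradient of a squared loss on its own equal-sized data partition, and the driver averages the sub-gradients to recover the full-batch gradient. All names here are illustrative; MLLib performs this aggregation across executors:

```python
# Toy distributed gradient descent: average per-partition sub-gradients.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=400)

# 4 simulated workers, each holding an equal-sized slice of the data
partitions = np.array_split(np.arange(400), 4)

def sub_gradient(w, idx):
    # gradient of mean squared error on one partition only
    Xi, yi = X[idx], y[idx]
    return 2.0 * Xi.T @ (Xi @ w - yi) / len(idx)

w = np.zeros(3)
for _ in range(200):
    grads = [sub_gradient(w, idx) for idx in partitions]
    # mean of per-partition mean gradients equals the full-batch gradient
    # here because every partition has the same size
    w -= 0.1 * np.mean(grads, axis=0)
```

Only the short sub-gradient vectors cross the network each step, which is what makes the scheme practical when the training data itself cannot fit on one machine.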
Thank you! Questions?
Dmitri Babaev <dmitri.lb@gmail.com>
Pavel Mezentsev <pavel@mezentsev.org>