The Big Data Dead Valley Dilemma and Much More [email protected] Founder QMining @fraka6

Upload: francis-pieraut

Post on 25-May-2015


TRANSCRIPT

Page 1: The big data dead valley dilemma and much more

The Big Data Dead Valley Dilemma and Much More

[email protected] Founder QMining

@fraka6

Page 2: The big data dead valley dilemma and much more

Unhidden Agenda

● Big Data Big Picture

● Big Data Dead Valley Dilemma

● Elastic Map Reduce (EMR) numbers

● Scaling Learning (MPI & hadoop)

Page 3: The big data dead valley dilemma and much more

Big Data =

Lots of Data (obvious)

+

CPU-bound (forgotten)

Page 4: The big data dead valley dilemma and much more

Big Data =

Lots of Data (obvious)

-

IO-bound (reality)

Page 5: The big data dead valley dilemma and much more

IO-bound

CPU < 100%

Data

● HD/bus speed

● Network

● File server
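One rough way to see the "CPU < 100%" symptom is to compare CPU time with wall-clock time for a task. The sketch below is only an illustration: the helper `cpu_utilization` and the two toy tasks are hypothetical names, and `time.sleep` merely stands in for waiting on a disk, a network, or a file server.

```python
import time

def cpu_utilization(task):
    """Return CPU time divided by wall-clock time for one call of task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0

def simulated_io_wait():
    # Stand-in for waiting on disk, bus, network, or a file server.
    time.sleep(0.2)

def cpu_heavy():
    # Pure computation: nothing to wait for.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

io_ratio = cpu_utilization(simulated_io_wait)
cpu_ratio = cpu_utilization(cpu_heavy)
print(f"IO-like task : {io_ratio:.0%} CPU")
print(f"CPU-like task: {cpu_ratio:.0%} CPU")
```

A ratio far below 100% suggests the job spends most of its time waiting for data rather than computing.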

Page 6: The big data dead valley dilemma and much more

Big Data Scalability (e.g., Hadoop)

= Cluster

+

Locality + node-failure handling (data kept close to the CPU)

Page 7: The big data dead valley dilemma and much more

The Big Data Dilemma

Page 8: The big data dead valley dilemma and much more

Big Data Dead Valley

[Figure: techno maturity / risk (vertical axis) vs. enterprise size (horizontal axis); labels: SMB, Enterprise, Start-ups]

Page 9: The big data dead valley dilemma and much more

Big Data =

SMALL MARKET

(B2B vs B2C)

Page 10: The big data dead valley dilemma and much more

Small Market... hmm?

Page 11: The big data dead valley dilemma and much more

WHY?????

Maturity

Data, process, QA, infra, talent, $, long-term vision

Page 12: The big data dead valley dilemma and much more

Data -> Analytics -> BI -> Big Data -> Data Mining -> ML

Page 13: The big data dead valley dilemma and much more

Data Access & Quality

User data privacy, IT outsourcing protection, Data Quality

Page 14: The big data dead valley dilemma and much more

Enterprise Slowness

1. Boston CXO Forum, 24 October: Best Practice on Global Innovation (IBM, EMC, P&G, Intuit)

Exploit vs. explore - M&A

2. Brad Feld (Managing Director at Foundry Group)

Hierarchy vs. network

Page 15: The big data dead valley dilemma and much more

Big Data Dead Valley

[Figure: techno maturity / risk (vertical axis) vs. enterprise maturity (horizontal axis); labels: SMB, Enterprise, Start-ups]

Page 17: The big data dead valley dilemma and much more

QMarketing example

Leveraging Hadoop

● map = hits to sessions

● reduce = sessions to ROI
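The map/reduce split on this slide can be imitated in plain Python, without Hadoop. Everything below is an assumption made for illustration: the hit record layout, the `COST` table, the function names, and the ROI formula (revenue minus cost, divided by cost). The slides do not describe QMarketing's actual implementation.

```python
from collections import defaultdict

# Hypothetical hit records: (visitor_id, channel, revenue_from_this_hit)
HITS = [
    ("v1", "PPC", 0.0),
    ("v1", "PPC", 30.0),
    ("v2", "Email Campaign", 0.0),
    ("v3", "Organic", 20.0),
    ("v2", "Email Campaign", 50.0),
]

# Hypothetical spend per channel
COST = {"PPC": 20.0, "Email Campaign": 10.0, "Organic": 5.0}

def map_hits_to_sessions(hits):
    """Map step: group raw hits into sessions keyed by (visitor, channel)."""
    sessions = defaultdict(list)
    for visitor, channel, revenue in hits:
        sessions[(visitor, channel)].append(revenue)
    return sessions

def reduce_sessions_to_roi(sessions, cost):
    """Reduce step: sum session revenue per channel, then compute ROI."""
    revenue = defaultdict(float)
    for (_visitor, channel), revenues in sessions.items():
        revenue[channel] += sum(revenues)
    return {ch: (revenue[ch] - cost[ch]) / cost[ch] for ch in revenue}

sessions = map_hits_to_sessions(HITS)
roi = reduce_sessions_to_roi(sessions, COST)
for channel, value in sorted(roi.items()):
    print(f"{channel:15s} ROI = {value:.2f}")
```

On a real cluster the grouping by key would happen in the shuffle between the two phases; here it is folded into the map function to keep the sketch short.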

Page 18: The big data dead valley dilemma and much more

Online Marketing Management

Channel          % budget   ROI
----------------------------------------------
PPC              50%        ?
Organic          20%        ?
Email Campaign   20%        ?
Social Media     10%        ?

Page 19: The big data dead valley dilemma and much more

ROI Dashboard

Page 20: The big data dead valley dilemma and much more

All abstractions leak

Abstract -> Procrastinate!

http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Leverage" )

Page 21: The big data dead valley dilemma and much more

Minimize a Tower of Abstraction

Simplify & lower the layer of abstraction

Examples:

● Work on files, not a DB, if possible

● HD directly connected to the server

● Low-level Linux command lines (cut, grep, sed, etc.)

● High-level languages: Python

Abstraction = 20X benefits
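As a small illustration of the "work on files" and "grep/cut" points, here is a hedged Python sketch that filters lines and extracts one field, roughly what `grep PPC hits.tsv | cut -f3` does in a shell. The file name `hits.tsv`, the tab-separated layout, and the helper `grep_cut` are all hypothetical; an in-memory `io.StringIO` stands in for a real file.

```python
import io

def grep_cut(lines, pattern, field, sep="\t"):
    """Yield one field from every line that contains pattern (like grep | cut)."""
    for line in lines:
        if pattern in line:
            yield line.rstrip("\n").split(sep)[field]

# Stand-in for open("hits.tsv"); columns: date, channel, page (made-up data).
SAMPLE = io.StringIO(
    "2015-05-25\tPPC\t/home\n"
    "2015-05-25\tOrganic\t/pricing\n"
    "2015-05-26\tPPC\t/signup\n"
)

pages = list(grep_cut(SAMPLE, "PPC", 2))
print(pages)
```

Because the function consumes any iterable of lines, it streams through a file of arbitrary size without loading it into memory or going through a database layer.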

Page 22: The big data dead valley dilemma and much more

EMR vs AWS & S3 1.0

(no data locality optimization + network & ~IO-bound)

EMR = 45 min

AWS = 4 min

Page 23: The big data dead valley dilemma and much more

EMR vs AWS & S3 2.0

EMR = 5+10 min*

AWS = ~4 min

*30 min preprocessing ;)

EMR = 5+4 if (big files & compressed files)
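The numbers on these two slides can be turned into rough slowdown factors. The sketch below assumes that the two terms in "5+10" simply add up and that "~4 min" can be treated as 4; the slides state neither, so read the results as ballpark figures only.

```python
# Timings in minutes, copied from the slides.
EMR_V1 = 45
AWS_V1 = 4
EMR_V2_TERMS = (5, 10)   # assumption: the two terms add up to total run time
AWS_V2 = 4               # the slide says "~4"
PREPROCESSING = 30       # the footnoted preprocessing time

emr_v2 = sum(EMR_V2_TERMS)
slowdown_v1 = EMR_V1 / AWS_V1
slowdown_v2 = emr_v2 / AWS_V2
emr_v2_with_preprocessing = emr_v2 + PREPROCESSING

print(f"1.0: EMR is {slowdown_v1:.2f}x slower")
print(f"2.0: EMR is {slowdown_v2:.2f}x slower")
print(f"2.0 including preprocessing: {emr_v2_with_preprocessing} min")
```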

Page 24: The big data dead valley dilemma and much more

Scaling Machine Learning

● Scaling data preprocessing = Hadoop

● Small dataset = GPU

● Training with a big dataset = ??

Communication infrastructures = MPI & MapReduce (John Langford, http://hunch.net/?p=2094)

Page 25: The big data dead valley dilemma and much more

MPI allreduce
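Allreduce combines a value from every node (here, a sum) and leaves the combined result on all of them. The sketch below simulates that in a single Python process: nodes are arranged as a binary tree, partial sums flow up to the root, and the total is then copied back to every node. The function name `tree_allreduce` and the tree layout are illustrative assumptions; this is not MPI code and makes no claim about how any particular MPI library implements the operation.

```python
def tree_allreduce(node_values):
    """Simulate allreduce (element-wise sum) over nodes in a binary tree.

    node_values[i] is the vector held by node i; the children of node i
    are nodes 2*i + 1 and 2*i + 2.
    """
    n = len(node_values)
    partial = [list(v) for v in node_values]

    # Reduce phase: walk from the last node towards the root so that each
    # node has already absorbed its children before it reports to its parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] = [a + b for a, b in zip(partial[parent], partial[i])]

    # Broadcast phase: the root's total is handed back to every node.
    total = partial[0]
    return [list(total) for _ in range(n)]

vectors = [[1, 2], [3, 4], [5, 6], [7, 8]]
result = tree_allreduce(vectors)
print(result)
```

After the call every simulated node holds the same vector, which is the "all nodes same state" property that makes allreduce convenient for iterative learning.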

Page 29: The big data dead valley dilemma and much more

Hadoop vs MPI

MPI

● No fault tolerance by default

● Poor understanding of where the data is (manual split across nodes + bad communication & program complexity)

● Scale limited to ~100 nodes in practice (sharing unavoidable)

● Shared cluster -> slow-node issues appear before disk/node failures

MapReduce

● Setup and teardown costs are significant (interaction with the scheduler & communicating the program to a large number of nodes)

● Worst: MapReduce waits for free nodes + many MapReduce iterations are needed to reach high-quality predictions

● Flaw: requires refactoring code into map/reduce

Page 30: The big data dead valley dilemma and much more

Hadoop-compatible AllReduce - Vowpal Wabbit (Hadoop + MPI)

● MPI: AllReduce (all nodes end in the same state)

● MapReduce: conceptual simplicity

● MPI: no need to refactor code

● MapReduce: data locality (map only)

● MPI: ability to use local storage (or RAM): temp files on local disk + can be cached in RAM by the OS

● MapReduce: automatic cleanup of local resources (tmp files)

● MPI: fast optimization approaches remain within the conceptual scope: AllReduce = function call

● MapReduce: robustness (speculative execution to deal with slow nodes)
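The "AllReduce = function call" point can be sketched as follows: each node computes a gradient on its own data shard, one allreduce call sums the gradients, and every node applies the same update, so the training loop keeps its ordinary sequential shape. This is a single-process toy in plain Python. The model (one weight fitted to y = w * x), the shards, the learning rate, and all function names are assumptions for illustration, not Vowpal Wabbit's code.

```python
def allreduce_sum(vectors):
    """Toy allreduce: every node receives the element-wise sum of all vectors."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]

def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one node's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, steps=200, lr=0.05):
    """Gradient descent where each simulated node keeps its own copy of w."""
    weights = [0.0] * len(shards)
    for _ in range(steps):
        # Each node works only on its local shard (data locality).
        grads = [[local_gradient(w, s)] for w, s in zip(weights, shards)]
        # A single function call synchronizes all nodes.
        summed = allreduce_sum(grads)
        # Every node applies the same averaged gradient.
        weights = [w - lr * g[0] / len(shards) for w, g in zip(weights, summed)]
    return weights

# Two hypothetical shards drawn from y = 3 * x.
SHARDS = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
final_weights = train(SHARDS)
print(final_weights)
```

Since all nodes start from the same weight and apply identical updates, their copies stay identical, and no part of the loop had to be rewritten as separate map and reduce functions.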

Page 38: The big data dead valley dilemma and much more

Summary

● Big Data Big Picture

○ Big Data: cluster + IO-bound (locality)

● Big Data Dead Valley Dilemma (MMID)

○ Small market / maturity / data (access, quality) / slowness

● EMR (AWS) = slow

● Minimize the tower of abstraction

● Scaling MP: bottleneck = ML

○ MPI: no fault tolerance + where is the data?

○ Hadoop: slow setup & teardown + requires refactoring

○ Hadoop-compatible AllReduce

Page 40: The big data dead valley dilemma and much more

hum...

Questions?

[email protected]