
The Big Data Dead Valley Dilemma and Much More

francis@qmining.com Founder QMining

@fraka6

Unhidden Agenda

● Big Data Big Picture

● Big Data Dead Valley Dilemma

● Elastic Map Reduce (EMR) numbers

● Scaling Learning (MPI & Hadoop)

Big Data =

Lots of Data (evidence)

+

CPU bounded (forgotten)

Big Data =

Lots of Data (evidence)

-

IO bounded (reality)

IO bounded

CPU < 100% (the CPU waits on data)

● HD/Bus speed
● Network
● File server
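A quick way to check whether a job is IO bounded (a minimal sketch; the file path is hypothetical and the third-party psutil package is assumed): stream a big file through cheap CPU work and watch utilization. If the CPU sits well below 100%, the disk, bus, or network is the bottleneck.

import time
import zlib
import psutil  # third-party: pip install psutil

def scan_file(path, chunk_size=1 << 20):
    # Stream the file in 1 MB chunks and do cheap CPU work (CRC32).
    crc = 0
    psutil.cpu_percent()            # prime the counter (first call returns 0.0)
    start = time.time()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    elapsed = time.time() - start
    # Average CPU% since the priming call; well below 100 => IO bounded
    print(f"crc={crc:#010x}  {elapsed:.1f}s  cpu={psutil.cpu_percent():.0f}%")

scan_file("/data/big.log")          # hypothetical path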

Big Data Scalability (ex: Hadoop)

= Cluster

+

Locality + node-failure handling
(data kept local to the CPU that processes it)

The Big Data Dilemma

[Chart: Big Data Dead Valley. Y-axis: Techno Maturity / Risk; x-axis: Enterprise size; Start-ups, SMB and Enterprise plotted along the curve]

Big Data =

SMALL MARKET

(B2B vs B2C)

Small Market... hum?

WHY?????

Maturity
Data, Process, QA, infra, talent, $, long-term vision

Data -> Analytics -> BI -> Big Data -> Data Mining -> ML

Data Access & Quality

User data privacy, IT outsourcing protection, Data Quality

Enterprise Slowness

1. Boston CXO Forum, 24 October: Best Practices on Global Innovation (IBM, EMC, P&G, Intuit)
   Exploit vs Explore - M&A

2. Brad Feld (Managing Director at Foundry Group)
   Hierarchy vs network

[Chart: Big Data Dead Valley. Y-axis: Techno Maturity / Risk; x-axis: Enterprise Maturity; Start-ups, SMB and Enterprise plotted along the curve]

QMarketing example
Leveraging Hadoop (see the sketch below):
● map = hits to sessions
● reduce = sessions to ROI

Online Marketing Management

Channel          % budget   ROI
--------------------------------
PPC              50%        ?
Organic          20%        ?
Email Campaign   20%        ?
Social Media     10%        ?

ROI Dashboard
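A minimal Hadoop Streaming sketch of that map/reduce split (heavily simplified; the tab-separated hit format session_id, channel, revenue and the budget figures are assumptions for illustration). The mapper keys hits by session so Hadoop groups each session's hits together; the reducer aggregates sessions into per-channel revenue and reports ROI against the channel budget.

#!/usr/bin/env python
# roi.py - Hadoop Streaming sketch: hits -> sessions -> ROI per channel.
# Assumed hit line (tab-separated): session_id, channel, revenue
import sys
from collections import defaultdict

def mapper():
    # Key each hit by session_id; Hadoop sorts/groups by key, so all
    # hits of one session reach the reducer contiguously.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            print("\t".join(fields))

def reducer(budget):
    # Collapse sessions into per-channel revenue, then report ROI.
    revenue_by_channel = defaultdict(float)
    for line in sys.stdin:
        _session_id, channel, revenue = line.rstrip("\n").split("\t")
        revenue_by_channel[channel] += float(revenue)
    for channel, revenue in sorted(revenue_by_channel.items()):
        cost = budget.get(channel, 0.0)
        roi = (revenue - cost) / cost if cost else float("nan")
        print(f"{channel}\trevenue={revenue:.2f}\tROI={roi:.2%}")

if __name__ == "__main__":
    if sys.argv[1] == "map":
        mapper()
    else:  # "reduce" -- budget figures are placeholders
        reducer({"PPC": 5000.0, "Organic": 2000.0,
                 "Email Campaign": 2000.0, "Social Media": 1000.0})

With a single reducer this emits the whole dashboard in one pass; a real pipeline would first sessionize raw hits by (user, time window) and attribute each session to a channel.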

All abstractions leak
Abstract -> Procrastinate!

http://www.aleax.it/pycon_abst.pdf (Alex Martelli: "Abstraction as a Leverage")

Minimize a Tower of Abstraction
Simplify & lower the layers of abstraction

Examples:

● Work on files, not a DB, if possible (see the sketch below)
● HD directly connected to the server
● Low-level Linux command lines (cut, grep, sed, etc.)
● High-level languages: Python

Abstraction = 20X benefits
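A minimal illustration of that advice (file name and log format are hypothetical): per-channel hit counts straight off a flat log file with line-oriented Python; no database, no ORM, no framework. It is the moral equivalent of cut -f2 hits.log | sort | uniq -c.

from collections import Counter

# Count hits per channel from a flat TSV log (channel = 2nd column)
counts = Counter()
with open("hits.log") as f:          # hypothetical file
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1

for channel, n in counts.most_common():
    print(f"{n:8d}  {channel}")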

EMR vs AWS & S3 1.0
(no data-locality optimization + network => ~IO bounded)

EMR = 45 min    AWS = 4 min

EMR vs AWS & S3 2.0

EMR = 5+10 min*    AWS = ~4 min

*30 min preprocessing ;)
EMR = 5+4 min if (big files & compressed files)
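A hedged sketch of that "big files & compressed files" preprocessing (the directory layout and ~128 MB target are assumptions): pack many small logs into a few large gzip files before uploading, so the job is not dominated by per-file overhead and raw network IO.

import glob
import gzip
import os
import shutil

CHUNK = 128 * 1024 * 1024            # ~128 MB of raw log per output file
os.makedirs("packed", exist_ok=True)

out, out_idx, written = None, 0, 0
for path in sorted(glob.glob("logs/*.log")):     # hypothetical layout
    if out is None or written >= CHUNK:
        if out is not None:
            out.close()
        out_idx += 1
        written = 0
        out = gzip.open(f"packed/part-{out_idx:04d}.gz", "wb")
    with open(path, "rb") as src:
        shutil.copyfileobj(src, out)             # append raw bytes
    written += os.path.getsize(path)
if out is not None:
    out.close()

One caveat: gzip is not splittable, so each .gz file feeds exactly one mapper; keeping outputs around block size preserves parallelism.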

Scaling Machine Learning

● Scaling data preprocessing = Hadoop
● Small dataset = GPU
● Train with a big dataset = ??

Communication infrastructures = MPI & MapReduce
(John Langford: http://hunch.net/?p=2094)

MPI AllReduce
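A minimal sketch of the AllReduce primitive via the mpi4py bindings (the gradient values are placeholders): every node contributes a local vector and every node receives the element-wise sum, leaving all nodes in the same state after one call.

# Run with e.g.: mpirun -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each node computes a local gradient on its shard (placeholder values)
local_grad = np.full(3, float(rank))

# AllReduce: element-wise sum across nodes, result delivered everywhere
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

print(f"node {rank}: global gradient = {global_grad}")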

Hadoop vs MPI

MPI
● No fault tolerance by default
● Poor understanding of where the data is (manual split across nodes + bad communication & program complexity)
● Limited scale, ~100 nodes in practice (sharing unavoidable)
● Cluster is shared -> slow-node issues hit before disk/node failures

MapReduce
● Setup and teardown costs are significant (interaction with the scheduler & shipping the program to a large number of nodes)
● Worse: each MapReduce iteration waits for free nodes, and many iterations are needed to reach high-quality predictions
● Flaw: requires refactoring code into map/reduce

Hadoop-compatible AllReduce - Vowpal Wabbit (Hadoop + MPI)

● MPI: AllReduce (all nodes end in the same state)
● MapReduce: conceptual simplicity
● MPI: no need to refactor code
● MapReduce: data locality (map only)
● MPI: ability to use local storage (or RAM): temp files on local disk, which the OS can cache in RAM
● MapReduce: automatic cleanup of local resources (tmp files)
● MPI: fast optimization approaches remain within the conceptual scope: AllReduce = a function call (see the sketch below)
● MapReduce: robustness (speculative execution to deal with slow nodes)
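To make the "AllReduce = a function call" point concrete, a hedged sketch (not Vowpal Wabbit's actual code; the model, data, and learning rate are placeholders) of synchronous distributed SGD: the optimizer keeps its ordinary loop structure, with one AllReduce per step and no refactoring into map/reduce stages.

# Sketch: distributed SGD on y ~ 3x, run under mpirun -n <N>
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rng = np.random.default_rng(comm.Get_rank())

# Placeholder local shard of the training data
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

w = 0.0                               # model parameter, same on all nodes
for step in range(100):               # the loop shape is unchanged
    local_grad = np.array([np.mean(2.0 * (w * x - y) * x)])
    global_grad = np.empty(1)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    w -= 0.1 * global_grad[0] / size  # step on the averaged gradient

if comm.Get_rank() == 0:
    print(f"learned w = {w:.3f} (true value 3.0)")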

Summary

● Big Data Big Picture
○ Big Data: cluster + IO bounded (locality)
● Big Data Dead Valley Dilemma (MMID)
○ Small Market / Maturity / Data: access, quality / Slowness
● EMR (AWS) = slow
● Minimize the tower of abstraction
● Scaling ML: bottleneck = communication
○ MPI: no fault tolerance + where is the data?
○ Hadoop: slow setup & teardown + requires refactoring
○ Hadoop-compatible AllReduce

hum...

Questions?

francis@qmining.com
