BigML's take on Big Data

Geneva, October 12, 2012 BigML Inc, 2012

Upload: francisco-j-martin

Post on 22-Nov-2014

DESCRIPTION

BigML's take on Big Data. University of Geneva, October 12, 2012. In the "Big Data" era, rapidly and easily getting insights from your data or creating data-driven applications does not have to be painful. BigML shows how business managers, application developers, and data scientists can start building their own predictive models in a matter of minutes.

TRANSCRIPT

Page 1: BigML's take on Big Data

Page 2: BigML's take on Big Data

Agenda

• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API

Page 3: BigML's take on Big Data

Francisco J Martin

BigML:

• Co-founder and CEO
• Joined: January 2011
• Tasks: Product conceptualization, design, and architecture
• Develops: BigML middle-end and public API
• 1202 (19%) of commits to total BigML code base

Background:

• 5-year degree in Computer Science, UPV
• Ph.D. in Artificial Intelligence, UPC
• Postdoc (Machine Learning), Oregon State University
• Founder and CEO at iSOCO
• Founder and CEO at Strands
• Co-authored 6 patents acquired by Apple Inc
• Directly raised $75+MM in venture capital and cashed out additional $18+MM for early investors
• Directly sold and negotiated $30+MM in licenses

Page 4: BigML's take on Big Data

Neo, sooner or later you're going to realize, just as I did, that there's a difference between knowing the path, and walking the path

Academia vs the Real-world

Page 5: BigML's take on Big Data

Walking the data path

[Timeline figure. Years: 1996, 1999, 2002, 2004, 2011, 2012. Stages: Academia, iSOCO, Academia, Strands Inc, BigML Inc. Topics: 8-queen problem; Multi-agent Learning; Personalization; E-commerce; Recommender Systems; Music, video, fitness, finance; Intrusion Detection; Machine Learning; Large-scale Machine Learning; Everything; Data.]

Page 6: BigML's take on Big Data

BigML Status

• Founded in Jan 2011
• 9 FTE, 1 PT
• 5 Ph.Ds
• 4 patent applications

US Patent Application No. 61/557,826
For: METHODS FOR BUILDING AND USING DECISION TREES IN A DISTRIBUTED ENVIRONMENT
Filed: November, 2011

US Patent Application No. 61/555,615
For: VISUALIZATION AND INTERACTION WITH COMPACT REPRESENTATION OF DECISION TREES
Filed: November, 2011

US Patent Application No. 61/557,539
For: EVOLVING PARALLEL SYSTEM TO AUTOMATICALLY IMPROVE THE PERFORMANCE OF DISTRIBUTED SYSTEMS
Filed: November, 2011

US Patent Application No. 61/710,175
For: SYSTEM AND METHODS TO EXCHANGE ACTIONABLE PREDICTIVE MODELS IN A VIRTUAL MARKETPLACE
Filed: October, 2012

• Advisors and BA:

Page 7: BigML's take on Big Data

Beneath Hill 60

From the trenches

BigML Team

Page 8: BigML's take on Big Data

Agenda

• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API

Page 9: BigML's take on Big Data

Big Data

What is Big Data? What is a Data Scientist?

How not to start with Big Data? What is Data-driven Decision Making?

Page 10: BigML's take on Big Data

Trends

http://strata.oreilly.com/2011/08/building-data-startups.html

Page 11: BigML's take on Big Data

What’s Big Data?

Big Data means way too many different things to many different people

“when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it” George Dyson

Page 12: BigML's take on Big Data

What’s Big Data?

The 3 V’s

• Volume (big, enormous, huge, vast, immense, very large, etc.)
• Variety (heterogeneous, diverse, complex, multiple sources, sensors, etc.)
• Velocity (speed, dynamic real-time, streamed, etc.)

The 3 I’s

• Immediate: In the sense that you need to do something about it
• Intimidating: What if you do not?
• Ill-defined: What is it? Anyway?

Data matters!!!

Page 13: BigML's take on Big Data

Machine Learning

Even if we, human beings, are learning machines, we are really bad at processing small amounts of data

Machines are good at quickly processing huge amounts of data. Machine Learning can make them learn from data

Page 14: BigML's take on Big Data

It’s all about machine learning

It's as if the machines have been in training all their lives to adapt and make use of the Big Data now being thrown at them - a combination of Moore's Law and the cloud mixed in with Machine Learning finally makes it all possible. --- Jeff Bussgang

Forget plastics. It’s all about machine learning

http://www.youtube.com/watch?v=PSxihhBzCjk

Page 15: BigML's take on Big Data

Learning from Data

Based on Learning from Data by Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin

• Unknown Model f : X -> Y. Example: ideal credit approval formula
• Training Examples (x1, l1), (x2, l2), ..., (xN, lN). Example: historical records of credit customers
• Models M. Example: set of candidate credit approval formulas
• Learning Algorithm
• Final Model g ~ f. Example: learned credit approval formula

[Table sketch: rows x1 ... xN; columns f1, f2, ..., fn and label]
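
The learning setup on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate set M is a hypothetical family of income-threshold rules, the training records are made up, and the "learning algorithm" simply picks the candidate with the lowest training error.

```python
from typing import Callable, List, Tuple

# Hypothetical training example: (income in thousands, approved? 1 or 0).
Example = Tuple[float, int]


def make_threshold_rule(threshold: float) -> Callable[[float], int]:
    """One candidate model in M: approve (1) when x >= threshold."""
    return lambda x: 1 if x >= threshold else 0


def training_error(h: Callable[[float], int], examples: List[Example]) -> float:
    """Fraction of training examples that candidate h labels incorrectly."""
    return sum(1 for x, label in examples if h(x) != label) / len(examples)


def learn(examples: List[Example], thresholds: List[float]):
    """Learning algorithm: return the candidate with the lowest training error."""
    best = min(
        thresholds,
        key=lambda t: training_error(make_threshold_rule(t), examples),
    )
    return best, make_threshold_rule(best)


# Made-up historical records of credit customers.
examples = [(20.0, 0), (35.0, 0), (50.0, 1), (80.0, 1)]
best_threshold, g = learn(examples, thresholds=[10.0, 30.0, 40.0, 60.0])
```

Here `g` plays the role of the final model g ~ f: it is the member of the candidate set that agrees best with the training examples, not the unknown f itself.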

Page 16: BigML's take on Big Data

What’s Big Machine Learning?

Volume: What to do when data is too big to fit within the system memory of a single computer? Large-scale machine learning

Variety: Clean, refine, update, join, merge, aggregate, structure or deconstruct data until it matches the required input format or (why not) just generate/store data in the right format

Velocity: Stream Algorithms

Page 17: BigML's take on Big Data

...or you can deal with that!

Machine Learning

Page 18: BigML's take on Big Data

Does More Data beat Better Algorithms?

More features

More examples

More Data or Better Models. Xavier Amatriain

The Unreasonable Effectiveness of Data

Page 19: BigML's take on Big Data

What’s Big Data?

Global realization that learning from data (i.e., Machine Learning) can help us better analyze our past, understand our present, and predict our future. --- Francisco J Martin

Data Past Present Future

Page 20: BigML's take on Big Data

Big Data

What is Big Data? What is a Data Scientist?

How not to start with Big Data? What is Data-driven Decision Making?

Page 21: BigML's take on Big Data

Is Wikipedia right?

Really? Seriously?? Are you kidding me???

Page 22: BigML's take on Big Data

Data can’t be wrong?

Page 23: BigML's take on Big Data

McKinsey can’t be wrong

Critical Shortage Of “Data Scientist” Talent Predicted By 2018

Page 24: BigML's take on Big Data

HBR can’t be wrong

Page 25: BigML's take on Big Data

Wikipedia is right!

Page 26: BigML's take on Big Data

If Data Scientists don’t exist, can they be created?

Page 27: BigML's take on Big Data

The first Data Scientist

Computer Scientist

Mathematician

Statistician

Hans’ brain, the first Data Scientist

Page 28: BigML's take on Big Data

The magic formula

A data scientist is “part analyst, part artist.”

Anjul Bhambhri, Vice President of Big Data Products at IBM

Page 29: BigML's take on Big Data

Are Data Scientists super heroes?

Page 30: BigML's take on Big Data

http://photos.oregonlive.com/photo-essay/2012/06/ashton_eaton_sets_decathlon_wo.html

The most powerful human super hero

Page 31: BigML's take on Big Data

Are Data Scientists super heroes?

Events        | Decathlon World Record | High school World Record | World Record
100 m         | 10.21                  | 10.08                    | 9.58
Long Jump     | 8.23 m                 | 8.16 m                   | 8.95 m
Shot Put      | 14.20 m                | 20.65 m                  | 23.12 m
High Jump     | 2.05 m                 | 2.31 m                   | 2.45 m
400 m         | 46.70                  | 44.69                    | 43.18
110 m hurdles | 13.70                  | 13.74                    | 12.80
Discus throw  | 42.81 m                | 61.38 m                  | 74.08 m
Pole Vault    | 5.30 m                 | 5.56 m                   | 6.14 m
Javelin Throw | 58.87 m                | 73.74 m                  | 98.48 m
1500 m        | 4:14.48                | 3:38.26                  | 3:26.00

Page 32: BigML's take on Big Data

The Wikipedia is always right!

Page 33: BigML's take on Big Data

BigML’s Data Science Team

Areas:

• Machine Learning Research
• Large-scale and learning algorithm implementation
• Architecture, Software Design, Distributed Systems
• Infrastructure, Cloud-based Computing
• Design, Visualization, UI
• Business and Common Sense
• Product Design

Team:

• Tom Dietterich, PhD
• Charles Parker, PhD
• Adam Ashenfelter, MSc
• Jao, PhD
• Bea Garcia, BSc
• Poul Petersen, MSc
• Justin Donaldson, Ph.D.
• Francisco J Martin, PhD
• Oscar Rovira, MSc*
• Jos Verwoerd, MSc

Page 34: BigML's take on Big Data

Tom Dietterich, PhD

Charles Parker, PhD

Adam Ashenfelter, MSc

Jao, PhD

Bea Garcia, BSc

Poul Petersen, MSc

Justin Donaldson Ph.D.

Francisco J Martin, PhD

Oscar Rovira, MSc*

Jos Verwoerd, MSc

Take Away

So instead of trying to quickly create “mediocre data scientists”, universities should focus on creating excellent mathematicians, statisticians, computer scientists, software architects, designers, etc. who are fabulous team players.

Page 35: BigML's take on Big Data

Big Data

What is Big Data? What is a Data Scientist?

How not to start with Big Data? What is Data-driven Decision Making?

Page 36: BigML's take on Big Data

Iris Dataset

http://en.wikipedia.org/wiki/Iris_flower_data_set

Page 37: BigML's take on Big Data

Digesting Big Data

• Ingestion (capturing and storing)
• Digestion (processing)
• Absorption (deriving insights)
• Assimilation (making insights actionable)
• Egestion (reject bad data, wrong insights)

Too much attention!!!

Almost no attention!!!

Page 38: BigML's take on Big Data

Big Data meets Hadoop

• Hadoop has been excessively promoted as the way to make Big Data problems easy.
• There are quite a few vendors pushing different Hadoop flavors to the market.

However, Hadoop is complex, slow, expensive and batch

Page 39: BigML's take on Big Data

Running Hadoop on a cluster - The New IT sport of 2012

Big Data and Hadoop

Page 40: BigML's take on Big Data

Real-Time Hadoop?

Really? Seriously?? Are you kidding me???

Page 41: BigML's take on Big Data

Why not Hadoop?

• Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB)
• Iterative machine learning algorithms do not map trivially to MapReduce.
• Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s of GB of DRAM
• In terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.

Rowstron, A. et al, Nobody ever got fired for using Hadoop on a cluster, Microsoft Research, Cambridge, 2012

• Hadoop is bad at iterative algorithms: high job startup costs and awkward to retain state across iterations
• High sensitivity to skew: iteration speed bounded by slowest task.
• Potentially poor cluster utilization: must shuffle all data to a single reducer.

Large-Scale Machine Learning at Twitter, Jimmy Lin

Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger

Page 42: BigML's take on Big Data

Making Big Data Small

Hadoop

• Complex
• Slow
• Batch
• Expensive

Streaming Algorithms

• Simple
• Fast
• Real-time
• Cheap

Noel Welsh, Strata conference, London, October 2012

Page 43: BigML's take on Big Data

Self-imposed Shackles

Tackling Big Data with Hadoop on a cluster is like self-imposing shackles on your own project

Once a baby elephant accepts the limitation imposed on him it becomes a permanent belief, or in his case, a conditioned reaction. Now as the elephant grows into adulthood, he has the power to easily pull the stake out of the ground, but his conditioning has taught him that the effort will not only be futile, it will be painful as well.

http://www.selfgrowth.com/articles/Martinez1.html

Page 44: BigML's take on Big Data

Starting with Big Data

• Buy a few machines and set up a cluster.
• Install and run any flavor of Hadoop.
• Figure out how to implement complex map-reduce algorithms to compute a few analytics.

• Start with a very small data sample.
• Use free or cloud-based tools to build a first predictive model that you can understand.
• Check if the model gives you any practical insight.
• Use the model to generate predictions and see if it can improve your performance.
• Check how more data can improve the model.
• Check if more sophisticated models can beat your model.
• Iterate.
• Check if the volume, variety, and velocity of your data require a behind-the-firewall/cloud solution or a batch/stream solution.
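
One way to "start with a very small data sample" without loading the full data set is reservoir sampling, which draws a uniform random sample in a single pass. The slides do not name a sampling method, so the technique and the function name `reservoir_sample` are assumptions made for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k rows chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a large file or stream: 100,000 row identifiers.
small = reservoir_sample(range(100_000), k=100)
```

The sample can then be fed to whatever tool builds the first model; only the k sampled rows are ever held in memory.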

Page 45: BigML's take on Big Data

Big Data

What is Big Data? What is a Data Scientist?

How not to deal with Big Data? What is Data-driven Decision Making?

Page 46: BigML's take on Big Data

Data-Driven Decisions

http://www.nytimes.com/2011/04/24/business/24unboxed.html

Automated, data-driven decisions will significantly impact more industries than any other information system since “computers” were people

Page 47: BigML's take on Big Data

The “HiPPO” (Highest Paid Person’s Opinion) is dead

Page 48: BigML's take on Big Data

Predictive Analytics

Descriptive Analytics: Traditional, backward-looking business analytics

Predictive Analytics: Machine Learning

Page 49: BigML's take on Big Data

Predictive Model

“The goal of a predictive model is not to predict the future but to help you make a better decision in the present”

Taken from Paul Saffo, HBR

Page 50: BigML's take on Big Data

Analytics and Predictive Analytics combined with Experience & Intuition

Data-Driven Decision Making

Page 51: BigML's take on Big Data

It’s time to switch the attention

• Ingestion (capturing and storing)
• Digestion (processing)
• Absorption (deriving insights)
• Assimilation (making insights actionable)
• Egestion (reject bad data, wrong insights)

Less attention!!!

More attention!!!

More focus on the models and how to operationalize them than on the infrastructure to generate them

Page 52: BigML's take on Big Data

Take aways

• Big Data is just data
• It’s all about machine learning
• Try to excel in one of the data science disciplines
• Don’t shackle yourself to the wrong platform
• Trying to predict the future can help you make the right decision in the present
• Focus on evaluation and actionability of models and not on how they are built

Page 53: BigML's take on Big Data

Agenda

• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API

Page 54: BigML's take on Big Data

BigML Goal

Highly Scalable, Cloud-based Machine Learning Service

Simple, Easy-to-Use and Seamless-to-Integrate

Page 55: BigML's take on Big Data

...or you can deal with that!

BigML vs ML

BigML 1-click model

You can deal with this...

Page 56: BigML's take on Big Data

BigML vs Big Data

BigML 1-click model

You can deal with this...

...or you can deal with that!

Page 57: BigML's take on Big Data

How it Works

Page 58: BigML's take on Big Data

True

Machine Learning Made Easy

Page 59: BigML's take on Big Data

“Any fool can make something complicated. It takes a genius to make it simple.”

― Woody Guthrie

Simple is not easy

Page 60: BigML's take on Big Data

Fully Web based

Page 61: BigML's take on Big Data

RESTful API

Page 62: BigML's take on Big Data

Agenda

• Short intro
• The Big Data Revolution
• What is BigML? - Demo
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API

Page 63: BigML's take on Big Data

Agenda

• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API

Page 64: BigML's take on Big Data

BigML’s Software Architecture

• Middle-end [Apian]
• Backend [Wintermute]
• Infrastructure [Sauron]: Boto, Fabric
• Front-end [Neutronia]
• [Sky]
• [CuriousYellow]
• [Medusa]

Page 65: BigML's take on Big Data

BigML’s AWS-based Architecture

Page 66: BigML's take on Big Data

Why Tree Models?

• Highly scalable
• Graphically representable and interactive
• Easily understandable
• Easily translatable into rules, PMML, and code.
• Easily upgradable with ensembles: boosting, bagging, and random forests, etc.
• Top performers! http://www.niculescu-mizil.org/papers/empirical.icml06.pdf
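
The claim that trees are "easily translatable into rules" can be illustrated with a short sketch. The `Leaf` and `Split` classes and the example thresholds below are hypothetical and are not BigML's model format; the point is only that each root-to-leaf path reads directly as one IF ... THEN rule.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    prediction: str


@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"   # branch taken when feature <= threshold
    right: "Node"  # branch taken when feature > threshold


Node = Union[Leaf, Split]


def tree_to_rules(node: Node, conditions: Tuple[str, ...] = ()) -> List[str]:
    """Emit one rule per leaf by collecting the conditions along its path."""
    if isinstance(node, Leaf):
        antecedent = " AND ".join(conditions) or "TRUE"
        return [f"IF {antecedent} THEN {node.prediction}"]
    return (
        tree_to_rules(node.left, conditions + (f"{node.feature} <= {node.threshold}",))
        + tree_to_rules(node.right, conditions + (f"{node.feature} > {node.threshold}",))
    )


# Illustrative tree over the Iris data set; the thresholds are made up for the example.
tree = Split(
    "petal length", 2.45,
    Leaf("Iris-setosa"),
    Split("petal width", 1.75, Leaf("Iris-versicolor"), Leaf("Iris-virginica")),
)
rules = tree_to_rules(tree)
```

The same traversal could emit nested `if`/`else` source code instead of rule strings, which is the "and code" part of the bullet.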

Page 67: BigML's take on Big Data

BigML Histograms

BigML's trees and dataset summaries use histograms with the following traits:

• Streaming: Data is never kept in memory; only one pass over the data is needed to capture the distribution.
• Memory constrained: The less memory allocated, the lossier the compressed distribution.
• Dynamic: The histogram bins adjust themselves as they observe the data.
• Robust to ordered data: So it works even if the data stream is non-stationary.
• Merge friendly: For parallelization and distribution.

More... http://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/
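
The traits listed on this slide can be illustrated with a minimal bounded-bin streaming histogram. The slides do not show BigML's implementation, so this is a sketch under assumptions: it follows the general approach of keeping at most `max_bins` centroids and merging the two closest ones when the limit is exceeded, and the class and method names are hypothetical.

```python
from bisect import insort
from typing import List


class StreamingHistogram:
    """Bounded-memory histogram: at most max_bins [centroid, count] pairs."""

    def __init__(self, max_bins: int = 32) -> None:
        self.max_bins = max_bins
        self.bins: List[List[float]] = []  # kept sorted by centroid

    def insert(self, value: float) -> None:
        """Streaming: each value is seen once and then discarded."""
        insort(self.bins, [float(value), 1.0])
        self._shrink()

    def merge(self, other: "StreamingHistogram") -> None:
        """Merge friendly: fold in a histogram built on another data shard."""
        for centroid, count in other.bins:
            insort(self.bins, [centroid, count])
        self._shrink()

    def _shrink(self) -> None:
        """Dynamic bins: merge the two closest centroids until within budget."""
        while len(self.bins) > self.max_bins:
            i = min(
                range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0],
            )
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            total = n1 + n2
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / total, total]]

    def total(self) -> float:
        return sum(count for _, count in self.bins)

    def mean(self) -> float:
        return sum(c * n for c, n in self.bins) / self.total()


# Two shards summarized independently, then merged.
h1 = StreamingHistogram(max_bins=4)
for v in range(100):
    h1.insert(v)
h2 = StreamingHistogram(max_bins=4)
for v in range(100, 200):
    h2.insert(v)
h1.merge(h2)
```

A smaller `max_bins` uses less memory and gives a coarser summary, which is the "memory constrained" trade-off the slide describes; counts and the overall mean are preserved by the weighted merges, while finer details of the distribution are approximated.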

Page 68: BigML's take on Big Data

BigML Streaming Trees

BigML's trees are:

• CART: Classification & Regression Trees
• Grown breadth first: So partial trees are meaningful
• Built Hoeffding-style: So they consume streaming data and can split "early"
• Friendly for parallelization: Can work over multiple cores or multiple computers
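
"Hoeffding-style" splitting usually refers to the test used in Hoeffding trees: split as soon as the best candidate beats the runner-up by more than the Hoeffding bound for the number of examples seen so far. The slides do not give BigML's exact criterion, so the sketch below shows the textbook bound only; the function names and the example numbers are assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range is within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_early(best_gain: float, second_gain: float,
                    value_range: float, delta: float, n: int) -> bool:
    """Split once the observed gap between the two best candidates exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Same observed gains, different numbers of examples seen so far.
after_1k = can_split_early(0.30, 0.25, value_range=1.0, delta=1e-7, n=1_000)
after_10k = can_split_early(0.30, 0.25, value_range=1.0, delta=1e-7, n=10_000)
```

With these made-up numbers the gap of 0.05 is not yet trustworthy after 1,000 examples but is after 10,000, which is how a streaming tree can commit to a split before seeing all of the data.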

Page 69: BigML's take on Big Data

Growing a Streaming Tree

• Each split breaks the data into subsets.
• The split should make the subsets as distinct from one another as possible.
• Subsets are chosen to maximize information gain (classification) or minimize squared error (regression).
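
The two split criteria named above can be written out as a small worked example. This is a plain batch sketch of the standard definitions (entropy-based information gain and sum of squared errors), not BigML's streaming implementation; the function names are chosen for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of a collection of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent: Sequence[str], left: Sequence[str],
                     right: Sequence[str]) -> float:
    """Classification: entropy of the parent minus the weighted entropy of the subsets."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def squared_error(values: Sequence[float]) -> float:
    """Regression: sum of squared deviations from the mean of a non-empty subset."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


parent = ["a", "a", "b", "b"]
perfect_split = information_gain(parent, ["a", "a"], ["b", "b"])
useless_split = information_gain(parent, ["a", "b"], ["a", "b"])
```

A split that separates the classes completely recovers the full entropy of the parent, while a split that leaves both subsets as mixed as the parent gains nothing; for regression, the candidate with the lowest total `squared_error` across its subsets is preferred.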

Page 70: BigML's take on Big Data

Distributed Streaming Trees

Page 71: BigML's take on Big Data

Streaming Trees - Early Splits

Page 72: BigML's take on Big Data

Agenda

• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API

Page 73: BigML's take on Big Data

Automatic Evaluations

Page 74: BigML's take on Big Data

A marketplace for predictive models

Page 75: BigML's take on Big Data

“Any fool can make something complicated. It takes a genius to make it simple.”

― Woody Guthrie

Simple is not easy

Page 76: BigML's take on Big Data

True

Machine Learning Made Easy

Page 77: BigML's take on Big Data

Agenda

• Short intro
• The Big Data Revolution
• Demo
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API

Page 78: BigML's take on Big Data

Back to the trenches

Gallipoli

Page 79: BigML's take on Big Data

Good Reading

• Big Data Trends. David Feinleib. http://www.slideshare.net/bigdatalandscape/big-data-trends
• Hey Graduates: Forget Plastics - It's All About Machine Learning. Jeff Bussgang. http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html
• More Data or Better Models. Xavier Amatriain. http://technocalifornia.blogspot.ch/2012/07/more-data-or-better-models.html
• Making Big Data Small. Noel Welsh. http://strataconf.com/strataeu/public/schedule/detail/25984
• Data Killed the HiPPO star. Jeff Jordan, Andreessen Horowitz. http://gigaom.com/2012/02/18/data-killed-the-hippo-star/
• When There’s No Such Thing as Too Much Information. Steve Lohr. http://www.nytimes.com/2011/04/24/business/24unboxed.html
• Nobody ever got fired for using Hadoop on a cluster. Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, Andrew Douglas. http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
• Six Rules for Effective Forecasting. Paul Saffo. http://www.usc.edu/schools/annenberg/asc/projects/wkc/pdf/200912digitalleadership_saffo.pdf
• Large-scale Machine Learning at Twitter. Jimmy Lin and Alek Kolcz. http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf

Page 80: BigML's take on Big Data
