Danny Bickson - Python based predictive analytics with GraphLab Create

Upload: pydata

Post on 16-Apr-2017

Page 1: Danny Bickson - Python based predictive analytics with GraphLab Create

Dato Confidential

GraphLab Create Training
UvA School of Business

Danny Bickson, Co-Founder and VP EMEA
[email protected]

Page 2: Danny Bickson - Python based predictive analytics with GraphLab Create

Dato: We Intelligent Applications

Page 3: Danny Bickson - Python based predictive analytics with GraphLab Create

Business must be intelligent

Machine learning applications

• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis

Last decade: Data management

Last 5 years: Traditional analytics

Now: Intelligent apps

Page 4: Danny Bickson - Python based predictive analytics with GraphLab Create

Example Intelligent Applications
- images
- text
- graphs
- tabular data

Page 5: Danny Bickson - Python based predictive analytics with GraphLab Create

Creating a model pipeline

[Diagram labels: exploration, data, modeling]

Page 6: Danny Bickson - Python based predictive analytics with GraphLab Create

Creating a model pipeline

Ingest → Transform → Model → Deploy

Unstructured Data
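
The four stages named on this slide (ingest, transform, model, deploy) can be sketched in plain Python. This is only an illustration using the standard library, not GraphLab Create's API; the function names, the toy CSV data, and the threshold "model" are all hypothetical.

```python
import csv
import io
import json

# Hypothetical raw input; a real pipeline would read files or a database.
RAW = "user,age,clicked\nann,34,1\nbob,51,0\ncat,29,1\ndan,62,0\n"

def ingest(text):
    """Ingest: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: turn string fields into (feature, label) pairs."""
    return [(float(r["age"]), int(r["clicked"])) for r in rows]

def fit_model(examples):
    """Model: a toy classifier, one threshold halfway between the class means."""
    pos = [age for age, label in examples if label == 1]
    neg = [age for age, label in examples if label == 0]
    pos_mean = sum(pos) / len(pos)
    neg_mean = sum(neg) / len(neg)
    return {
        "threshold": (pos_mean + neg_mean) / 2,
        "positive_below": pos_mean < neg_mean,
    }

def predict(model, age):
    """Return 1 if the age falls on the 'positive' side of the threshold."""
    below = age < model["threshold"]
    return int(below == model["positive_below"])

def deploy(model):
    """Deploy: serialize the model so a serving process could load it."""
    return json.dumps(model)

model = fit_model(transform(ingest(RAW)))
served = json.loads(deploy(model))  # what a serving process would load
print(predict(served, 30), predict(served, 60))
```

Keeping each stage as a separate function makes it possible to rerun training on a schedule while the serving side only ever sees the serialized model.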

Page 7: Danny Bickson - Python based predictive analytics with GraphLab Create

The Dato Machine Learning Platform

Dato Predictive Services: Predictive Engine, REST Client, Model Mgmt

GraphLab Create: Machine Learning Toolkits, Canvas, SDK, SGraph, SFrame, Engine (sframe on GitHub)

Free for academic usage

Page 8: Danny Bickson - Python based predictive analytics with GraphLab Create

GraphLab Create Benefits

Page 9: Danny Bickson - Python based predictive analytics with GraphLab Create

Why use GraphLab Create?

• Efficient storage

GraphLab SFrame compressed column store:
• 20x smaller than pandas
• 2x smaller than gzip

[Chart: size on disk (the lower the better!)]
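
To get a feel for what a compressed column store does, the sketch below compresses the same toy table in a row-oriented and a column-oriented layout with `zlib`. It does not reproduce SFrame's on-disk format or the ratios quoted on the slide; the table contents are hypothetical, and which layout compresses better depends on the data.

```python
import zlib

# Hypothetical toy table: a low-cardinality text column and a small integer column.
rows = [("NL" if i % 3 else "US", i % 7) for i in range(10_000)]

# Row-oriented layout: values from different columns are interleaved.
row_bytes = "\n".join(f"{country},{n}" for country, n in rows).encode()

# Column-oriented layout: each column is stored contiguously.
country_bytes = "\n".join(country for country, _ in rows).encode()
number_bytes = "\n".join(str(n) for _, n in rows).encode()

row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(country_bytes)) + len(zlib.compress(number_bytes))

print("raw bytes:          ", len(row_bytes))
print("row layout, zlib:   ", row_compressed)
print("column layout, zlib:", col_compressed)
```

Running the same comparison on a sample of real data is a quick way to check how much a columnar layout would help for a given dataset.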

Page 10: Danny Bickson - Python based predictive analytics with GraphLab Create

No need for huge RAM!

[Chart: Effective Delay vs RAM; annotations: x2, x5]

Data size limited by disk size

My data is larger than my machine RAM
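
The idea of working from disk rather than RAM can be illustrated with a streaming aggregate that holds only one row in memory at a time. This is a minimal standard-library sketch, not how SFrame is implemented; the file contents and the `streaming_mean` helper are hypothetical.

```python
import csv
import os
import tempfile

def streaming_mean(path, column):
    """Compute the mean of one CSV column while reading the file row by row."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")

# Build a small hypothetical file; a real dataset could be far larger than RAM.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(1, 1001):
        writer.writerow([i, i])
    path = f.name

mean_value = streaming_mean(path, "value")
os.remove(path)  # clean up the temporary file
print(mean_value)
```

Memory use stays constant regardless of file length, so the limit on data size becomes the disk rather than RAM.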

Page 11: Danny Bickson - Python based predictive analytics with GraphLab Create

Comparison to sklearn

Try it here: http://blog.dato.com/how-fast-are-out-of-core-algorithms

Page 12: Danny Bickson - Python based predictive analytics with GraphLab Create

Summary of differences vs. sklearn

• Better multicore support

• Out of core implementation (working from disk)

• Automatic feature expansion

• Automatic parameter selection

• Support for model serving

• Additional algorithms

Page 13: Danny Bickson - Python based predictive analytics with GraphLab Create

Some of our Customers

Page 14: Danny Bickson - Python based predictive analytics with GraphLab Create

Dato on Coursera

40,000 students in 4 months
https://www.coursera.org/learn/ml-foundations

Specialization content:

● Machine Learning Foundations
● Regression
● Classification
● Clustering & Retrieval
● Recommendation Systems & Dimensionality Reduction
● Capstone: An Intelligent Application with Deep Learning

Page 15: Danny Bickson - Python based predictive analytics with GraphLab Create

Remco Frijling

Page 16: Danny Bickson - Python based predictive analytics with GraphLab Create

Page 17: Danny Bickson - Python based predictive analytics with GraphLab Create

Create an intelligent world!

Data Engineering
• Fast & scalable
• Rich data types
• Built for ML

Sophisticated ML
• App-oriented ML
• Scalable ML
• Extensibility

Deployment
• Batch & always-on
• RESTful interface
• Elastic & robust

[email protected]

Page 18: Danny Bickson - Python based predictive analytics with GraphLab Create

Appendix: Performance

Page 19: Danny Bickson - Python based predictive analytics with GraphLab Create

Confidential – Dato internal use only. ©2015 Dato, Inc.

Performance Highlights

Dato’s Platform outperforms other frameworks on most tasks: data munging, machine learning essentials, and graph analytics.

● Data Munging - SFrame, the columnar and out-of-core abstraction, enables tabular queries on a single node that are faster than or comparable to queries on 5-node clusters for systems like Spark & Redshift.

● Machine Learning - Unparalleled speed & accuracy for tasks including classification, recommendation, and deep learning on images compared to systems like MLlib, H2O, and scikit-learn.

● Graph Analysis - Orders of magnitude faster than comparable frameworks like GraphX & Giraph for common graph analytics tasks. Tasks complete in reasonable times (minutes) even on the world’s largest publicly available web graph. The only other known system to complete these tasks is one that runs on non-commodity, custom hardware.

Page 20: Danny Bickson - Python based predictive analytics with GraphLab Create

Machine Learning – Deep Learning

Digit recognition benchmark

[Chart: test error (0.60% to 0.85%) vs. hours (0 to 12); annotations: 4 min on 4 GPUs, 10 machines/80 cores]

Page 21: Danny Bickson - Python based predictive analytics with GraphLab Create

Graph Analytics - 1

Connected components in Twitter graph

Twitter: 41 million nodes, 1.4 billion edges

[Bar chart, axis 0 to 2250; labels and values in slide order]
GraphLab Create: 70 sec
GraphX: 251 sec
Giraph: 200 sec
Spark: 2,128 sec

Annotations: SGraph; 16 machines; 1 machine

Source(s): Gonzalez et al. (OSDI 2014)
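
For readers unfamiliar with the task being benchmarked, here is a small single-threaded sketch of connected components using union-find. It is not the algorithm used by SGraph or by any of the systems on the slide; the function name and the toy edge list are hypothetical.

```python
def connected_components(num_nodes, edges):
    """Label each node with the smallest node id in its connected component."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root under the smaller one so that
            # the root is always the minimum id in the component.
            parent[max(ra, rb)] = min(ra, rb)

    return [find(v) for v in range(num_nodes)]

# Toy undirected graph with 6 nodes: {0, 1, 2}, {3, 4} and the isolated node 5.
edges = [(0, 1), (1, 2), (3, 4)]
labels = connected_components(6, edges)
print(labels, "components:", len(set(labels)))
```

On graphs with billions of edges the edge list no longer fits in memory, which is why the benchmark compares distributed and out-of-core systems.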

Page 22: Danny Bickson - Python based predictive analytics with GraphLab Create

PageRank on Common Crawl Graph

3.5 billion nodes and 128 billion edges

[Chart: minutes per iteration (axis 0 to 10); labels: 1 machine, 16 machines, 256 CPUs, 16 CPUs, 16 machines, 300 machines]
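
The chart reports time per iteration, so it may help to see what one PageRank iteration does. The sketch below is a plain-Python power iteration on a toy graph; it is not GraphLab Create's implementation, and the damping factor of 0.85, the iteration count, and the graph itself are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts each iteration with the teleport share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                # Split this node's damped rank evenly across its out-links.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

# Hypothetical 4-node graph; every target also appears as a key.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get), round(sum(ranks.values()), 6))
```

Each pass of the outer loop touches every edge once, which is why per-iteration time scales with the 128 billion edges quoted on the slide.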

Page 23: Danny Bickson - Python based predictive analytics with GraphLab Create

Criteo Terabyte Click Prediction

4.4 billion rows, 13 features, ½ TB of data

[Chart: runtime (axis 0 to 4000) vs. #machines (0 to 16); annotations: linear speedup, 3630s, 225s]
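
The two runtime labels can be turned into a speedup figure. The calculation below assumes that 3630s is the runtime on 1 machine and 225s the runtime on 16 machines; that pairing is a reading of the chart labels and is not stated in the extracted text.

```python
# Assumed pairing of the chart labels (not confirmed by the slide text).
t_one_machine = 3630.0      # seconds, assumed to be the 1-machine runtime
t_sixteen_machines = 225.0  # seconds, assumed to be the 16-machine runtime
machines = 16

speedup = t_one_machine / t_sixteen_machines  # about 16.13
efficiency = speedup / machines               # about 1.008

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.3f}")
```

Under that assumption the speedup is roughly equal to the number of machines, which matches the "Linear Speedup" label.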

Page 24: Danny Bickson - Python based predictive analytics with GraphLab Create

Machine Learning – Logistic Reg. Accuracy

Dataset Source(s): LIBLINEAR binary classification datasets.

Page 25: Danny Bickson - Python based predictive analytics with GraphLab Create

Data Munging

SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

5 Nodes

1 Node

Source(s): https://amplab.cs.berkeley.edu/benchmark/, Armbrust et al. (SIGMOD 2015)

Dataset: Extracted from 775M visits to 90M documents in the Common Crawl corpus
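
The benchmark query is a simple filter-and-project scan. As a small runnable illustration of what it returns, the sketch below runs it against an in-memory SQLite table from Python's standard library. This is not how SFrame or the benchmarked systems execute it; the three rows and the threshold `x = 40` are hypothetical, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.example", 5), ("b.example", 120), ("c.example", 42)],
)

x = 40  # stands in for the X threshold in the slide's query
result = conn.execute(
    "SELECT pageURL, pageRank FROM rankings WHERE pageRank > ? ORDER BY pageURL",
    (x,),
).fetchall()
conn.close()

print(result)
```

Because the query reads only two columns and applies one predicate, it mostly measures scan throughput, which is where a columnar layout tends to help.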

Page 26: Danny Bickson - Python based predictive analytics with GraphLab Create

Appendix: Pricing & Deployment Scenarios

Page 27: Danny Bickson - Python based predictive analytics with GraphLab Create

• Subscription license which includes support and upgrades

• Licensed by user for Create & by machine for production use

• Training & technical services also available

• Discounts available for 10 or more users

Page 28: Danny Bickson - Python based predictive analytics with GraphLab Create

Deployment Scenarios

Key: GraphLab Create, Dato Predictive Services, Dato Distributed

“Getting Started”

GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis

GraphLab Create – installed on central team server
• Trains production models periodically (e.g. nightly)
• Generates predictions and records to data store

“Real-time Predictions”

GraphLab Create – installed on each team member machine
• Installed on team member laptops
• Working with data, ad-hoc analysis, training new models
• Deploys new models to Predictive Services deployment

GraphLab Create – installed on central team server
• Trains production models periodically (e.g. nightly)
• Deploys models to Dato Predictive Services

Dato Predictive Services – installed on central team cluster
• Hosting & serving deployed models
• REST API for application integration

“Scaling Up”

GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis
• Deploys models to Predictive Services
• Submits jobs to Distributed

Dato Distributed – installed on central team cluster
• Trains models in parallel on larger dataset periodically (e.g. nightly)
• Deploys newly trained models to Dato Predictive Services

Dato Predictive Services – installed on central team cluster
• Hosting deployed models
• REST API for application integration