Pivotal Data Labs – Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam, Senior Data Scientist, Pivotal Data Labs
15 Oct 2014
© Copyright 2013 Pivotal. All rights reserved.

DESCRIPTION

These slides give an overview of the technology and tools used by data scientists at Pivotal Data Labs. This includes procedural languages such as PL/Python, PL/R, PL/Java and PL/Perl, and MADlib, the parallel, in-database machine learning library. The slides also highlight the power and flexibility of the Pivotal platform, from embracing open source libraries in Python, R or Java to adopting new computing paradigms such as Spark on Pivotal HD.

TRANSCRIPT

Page 1: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Pivotal Data Labs – Technology and Tools in our Data Scientist’s Arsenal

Srivatsan Ramanujam Senior Data Scientist Pivotal Data Labs 15 Oct 2014

Page 2: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Agenda

- Pivotal: Technology and Tools Introduction
  - Greenplum MPP Database and Pivotal Hadoop with HAWQ
- Data Parallelism
  - PL/Python, PL/R, PL/Java, PL/C
- Complete Parallelism
  - MADlib
- Python and R Wrappers
  - PyMADlib and PivotalR
- Open Source Integration
  - Spark and PySpark examples
- Live Demos: Pivotal Data Science Tools in Action
  - Topic and Sentiment Analysis
  - Content Based Image Search

Page 3: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Technology and Tools

Page 4: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

MPP Architectural Overview

Think of it as multiple PostgreSQL servers:
- Master
- Segments/Workers

Rows are distributed across segments by a particular field (or randomly).

Page 5: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Implicit Parallelism – Procedural Languages

Page 6: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Data Parallelism – Embarrassingly Parallel Tasks

- Little or no effort is required to break the problem into a number of parallel tasks, and there is no dependency (or communication) between those parallel tasks.
- Example: the map() function in Python:

  >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  >>> map(lambda e: e*e, x)
  [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
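Because each element is independent, the same pattern distributes across local worker processes with no changes to the per-element logic. A minimal sketch using the standard library's multiprocessing.Pool (not part of the original slides; square is an illustrative helper):

```python
from multiprocessing import Pool

def square(e):
    # Each element is independent, so workers never need to communicate.
    return e * e

if __name__ == "__main__":
    x = list(range(1, 11))
    with Pool(4) as pool:              # 4 worker processes
        result = pool.map(square, x)   # chunks of x are squared in parallel
    print(result)                      # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

This is the same "split the rows, apply the function, collect the results" shape that the MPP database applies at cluster scale.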

Page 7: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

PL/X: X in {pgsql, R, Python, Java, Perl, C, etc.}

- Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
- The interpreter/VM of the language 'X' is installed on each node of the Greenplum Database cluster
- Data Parallelism: PL/X piggybacks on Greenplum/HAWQ's MPP architecture

[Figure: SQL enters at the Master Host (Master, with a Standby Master); the Interconnect links it to multiple Segment Hosts, each running several Segments.]

Page 8: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

User Defined Functions – PL/Python Example

- Procedural languages need to be installed on each database where they are used.
- The syntax is a normal Python function with the definition line replaced by a SQL wrapper; alternatively, think of it as a SQL User Defined Function with Python inside.

  CREATE FUNCTION pymax (a integer, b integer)
    RETURNS integer
  AS $$
    if a > b:
        return a
    return b
  $$ LANGUAGE plpythonu;

The CREATE FUNCTION ... AS $$ and $$ LANGUAGE plpythonu lines are the SQL wrapper; everything between them is normal Python.
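Since the body between the $$ markers is ordinary Python, one practical habit is to sanity-check UDF logic locally before loading it into the database; a sketch (the plain def stands in for the SQL wrapper):

```python
def pymax(a, b):
    # Identical body to the PL/Python UDF above.
    if a > b:
        return a
    return b

print(pymax(3, 7))   # 7
print(pymax(9, 2))   # 9
```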

Page 9: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Returning Results

- Postgres primitive types (int, bigint, text, float8, double precision, date, NULL, etc.)
- Composite types can be returned by creating a composite type in the database:

  CREATE TYPE named_value AS (
    name  text,
    value integer
  );

- Then you can return a list, tuple or dict (not a set) that mirrors the structure of the type:

  CREATE FUNCTION make_pair (name text, value integer)
    RETURNS named_value
  AS $$
    return [name, value]
    # or alternatively, as a tuple: return (name, value)
    # or as a dict: return {"name": name, "value": value}
    # or as an object with attributes .name and .value
  $$ LANGUAGE plpythonu;

- For functions which return multiple rows, prefix "setof" before the return type.

http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston

Page 10: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Returning More Results

You can return multiple rows by wrapping them in a sequence (tuple, list or set), an iterator or a generator.

Sequence:

  CREATE FUNCTION make_pair (name text)
    RETURNS SETOF named_value
  AS $$
    return ([name, 1], [name, 2], [name, 3])
  $$ LANGUAGE plpythonu;

Generator:

  CREATE FUNCTION make_pair (name text)
    RETURNS SETOF named_value
  AS $$
    for i in range(3):
        yield (name, i)
  $$ LANGUAGE plpythonu;

Page 11: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Accessing Packages

- On Greenplum DB: to be available, packages must be installed on the individual segment nodes.
  - Can use the "parallel ssh" tool gpssh to conda/pip install
  - Currently Greenplum DB ships with Python 2.6 (!)
- Then just import as usual inside the function:

  CREATE FUNCTION make_pair (name text)
    RETURNS SETOF named_value
  AS $$
    import numpy as np
    return ((name, i) for i in np.arange(3))
  $$ LANGUAGE plpythonu;

Page 12: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

UCI Auto MPG Dataset – A Toy Problem

Sample Data

- Sample Task: Aerodynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
- Solution: Build a linear regression model for each body style (hatchback, sedan) using the features bore, stroke, compression_ratio, horsepower and peak_rpm, with highway_mpg as the target label.
- This is a data parallel task which can be executed in parallel by simply piggybacking on the MPP architecture: one segment can build the model for hatchbacks while another builds the model for sedans.

http://archive.ics.uci.edu/ml/datasets/Auto+MPG
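The one-model-per-group pattern is easy to sketch outside the database. Below is a minimal pure-Python version (toy numbers, a single horsepower feature, and closed-form simple least squares instead of the multi-feature scikit-learn model) that fits one independent model per body style, just as each segment would fit its own group's rows:

```python
from collections import defaultdict

def fit_ols(points):
    # Closed-form simple linear regression: y = intercept + slope * x.
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical rows: (body_style, horsepower, highway_mpg)
rows = [
    ("hatchback", 70, 38), ("hatchback", 90, 33), ("hatchback", 110, 28),
    ("sedan", 80, 34), ("sedan", 100, 30), ("sedan", 120, 26),
]

groups = defaultdict(list)
for style, hp, mpg in rows:
    groups[style].append((hp, mpg))

# One independent model per group -- embarrassingly parallel.
models = {style: fit_ols(pts) for style, pts in groups.items()}
for style, (b0, b1) in sorted(models.items()):
    print(style, b0, b1)
```

In the database, the GROUP BY clause plays the role of the groups dict, and the segments run fit_ols-style work concurrently.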

Page 13: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Ridge Regression with scikit-learn in PL/Python

[Figure: code screenshot – a User Defined Type, a User Defined Aggregate, and a User Defined Function whose SQL wrapper encloses normal Python calling scikit-learn.]

Page 14: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

PL/Python + scikit-learn: Model Coefficients

[Figure: query screenshot – features are chosen and assembled into a feature vector, the UDF is invoked with one model per body style, and the output records the physical machine on the cluster on which each regression model was built.]

Page 15: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Parallelized R in Pivotal via PL/R: An Example

- With placeholders in SQL, write functions in the native R language
- Accessible, powerful modeling framework

http://pivotalsoftware.github.io/gp-r/

Page 16: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Parallelized R in Pivotal via PL/R: An Example

- Execute the PL/R function
- A plain and simple table is returned

http://pivotalsoftware.github.io/gp-r/

Page 17: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Parallelized R in Pivotal via PL/R: Parallel Bagged Decision Trees

- Each tree makes a prediction
- Aggregate and obtain the final prediction

http://pivotalsoftware.github.io/gp-r/

Page 18: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Complete Parallelism

Page 19: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Complete Parallelism – Beyond Data Parallel Tasks

- Data parallel computation via PL/X libraries only allows us to run 'n' models in parallel.
- This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to build a single model on all the available data.
- For this we use MADlib, an open source library of parallel, in-database machine learning algorithms.

Page 20: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

MADlib: Scalable, in-database Machine Learning

http://madlib.net

Page 21: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
- Sparse and Dense Solvers

Matrix Factorization
- Singular Value Decomposition (SVD)
- Low-Rank

Generalized Linear Models
- Linear Regression
- Logistic Regression
- Multinomial Logistic Regression
- Cox Proportional Hazards Regression
- Elastic Net Regularization
- Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
- Principal Component Analysis (PCA)
- Association Rules (Affinity Analysis, Market Basket)
- Topic Modeling (Parallel LDA)
- Decision Trees
- Ensemble Learners (Random Forests)
- Support Vector Machines
- Conditional Random Field (CRF)
- Clustering (K-means)
- Cross Validation

Descriptive Statistics
- Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
- Correlation
- Summary

Support Modules
- Array Operations
- Sparse Vectors
- Random Sampling
- Probability Functions

Page 22: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Linear Regression: Streaming Algorithm

- Finding linear dependencies between variables
- How to compute with a single scan?
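The single-scan trick: the ordinary least squares estimate depends on the data only through two fixed-size aggregates, both plain sums over rows, so they can be accumulated in one pass. Writing row $i$ of $X$ as the row vector $x_i$ (a worked statement of the standard closed form, not taken from the slides):

$$\hat{\beta} = (X^\top X)^{-1}\, X^\top y, \qquad X^\top X = \sum_i x_i^\top x_i, \qquad X^\top y = \sum_i x_i^\top y_i$$

Each summand involves only one row, which is exactly what makes the computation both streamable and, as the next slides show, parallelizable.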

Page 23: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Linear Regression: Parallel Computation

[Figure: X^T y = Σ_i x_i^T y_i – the product of X^T and y decomposes into a sum of per-row terms.]

Page 24: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Linear Regression: Parallel Computation

[Figure: the master combines partial products from the segments: X1^T y1 (Segment 1) + X2^T y2 (Segment 2) = X^T y (Master).]


Page 26: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Performing a linear regression on 10 million rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
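The segment/master split above can be sketched in a few lines of plain Python (a toy version with two simulated segments and one feature plus an intercept; partial_sums, merge and solve are illustrative names, not MADlib's API). Each "segment" accumulates its share of X^T X and X^T y; the "master" adds the partial sums and solves the 2x2 normal equations:

```python
def partial_sums(rows):
    # Per-segment pass: accumulate the entries of X^T X and X^T y
    # for feature vector x = (1, v) and target y.
    n = s_v = s_vv = s_y = s_vy = 0
    for v, y in rows:
        n += 1; s_v += v; s_vv += v * v; s_y += y; s_vy += v * y
    return (n, s_v, s_vv, s_y, s_vy)

def merge(a, b):
    # Master-side merge: the partial sums simply add across segments.
    return tuple(x + y for x, y in zip(a, b))

def solve(sums):
    # Final step on the master: solve the 2x2 normal equations
    # (X^T X) beta = X^T y via Cramer's rule.
    n, s_v, s_vv, s_y, s_vy = sums
    det = n * s_vv - s_v * s_v
    slope = (n * s_vy - s_v * s_y) / det
    intercept = (s_vv * s_y - s_v * s_vy) / det
    return intercept, slope

segment1 = [(1.0, 3.1), (2.0, 5.0)]   # rows living on segment 1
segment2 = [(3.0, 6.9), (4.0, 9.0)]   # rows living on segment 2
sums = merge(partial_sums(segment1), partial_sums(segment2))
print(solve(sums))   # close to (1.0, 2.0) for data generated near y = 1 + 2v
```

Because the aggregates are fixed-size regardless of row count, only a handful of numbers cross the interconnect, which is why the regression on 10 million rows finishes in seconds.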

Page 27: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Calling MADlib Functions: Fast Training, Scoring

- MADlib allows users to easily create models without moving data out of the system
  - Model generation
  - Model validation
  - Scoring (evaluation of) new data
- All the data can be used in one model
- Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
- Open source lets you tweak and extend methods, or build your own

  SELECT madlib.linregr_train(
      'houses',                    -- table containing training data
      'houses_linregr',            -- table in which to save results
      'price',                     -- column containing the dependent variable
      'ARRAY[1, tax, bath, size]'  -- features included in the model
  );

https://www.youtube.com/watch?v=Gur4FS9gpAg

Page 28: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Calling MADlib Functions: Fast Training, Scoring

  SELECT madlib.linregr_train(
      'houses',                     -- table containing training data
      'houses_linregr',             -- table in which to save results
      'price',                      -- column containing the dependent variable
      'ARRAY[1, tax, bath, size]',  -- features included in the model
      'bedroom'                     -- create multiple output models (one for each value of bedroom)
  );

https://www.youtube.com/watch?v=Gur4FS9gpAg

Page 29: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Calling MADlib Functions: Fast Training, Scoring

  SELECT madlib.linregr_train(
      'houses',
      'houses_linregr',
      'price',
      'ARRAY[1, tax, bath, size]'
  );

  SELECT houses.*,
         madlib.linregr_predict(            -- MADlib model scoring function
             ARRAY[1, tax, bath, size],
             m.coef
         ) AS predict
  FROM houses,                              -- table with data to be scored
       houses_linregr m;                    -- table containing the model

Page 30: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Python and R wrappers to MADlib

Page 31: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

PivotalR: Bringing MADlib and HAWQ to a familiar R interface

- Challenge: harness the familiarity of R's interface and the performance & scalability benefits of in-database analytics
- Simple solution: translate R code into SQL

PivotalR:

  d <- db.data.frame("houses")
  houses_linregr <- madlib.lm(price ~ tax + bath + size, data = d)

Generated SQL:

  SELECT madlib.linregr_train(
      'houses',
      'houses_linregr',
      'price',
      'ARRAY[1, tax, bath, size]'
  );

https://github.com/pivotalsoftware/PivotalR

Page 32: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

PivotalR: Bringing MADlib and HAWQ to a familiar R interface

- Challenge: harness the familiarity of R's interface and the performance & scalability benefits of in-database analytics
- Simple solution: translate R code into SQL

  # Build a regression model with a different
  # intercept term for each state (state = 1 as baseline).
  # Note that PivotalR supports automated
  # indicator coding a la as.factor().
  d <- db.data.frame("houses")
  houses_linregr <- madlib.lm(price ~ as.factor(state) + tax + bath + size, data = d)

https://github.com/pivotalsoftware/PivotalR

Page 33: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

PivotalR Design Overview

[Figure: PivotalR (no data here) translates R to SQL (1), sends the SQL to execute (2) via RPostgreSQL to the Database/Hadoop with MADlib (data lives here), and receives the computation results back (3).]

- Call MADlib's in-DB machine learning functions directly from R
- Syntax is analogous to native R functions
- Data doesn't need to leave the database
- All heavy lifting, including model estimation & computation, is done in the database

https://github.com/pivotalsoftware/PivotalR

Page 34: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

PyMADlib: Power of MADlib + Flexibility of Python

Current PyMADlib algorithms:
- Linear Regression
- Logistic Regression
- K-Means
- LDA

Extras:
- Support for categorical variables
- Pivoting

http://pivotalsoftware.github.io/pymadlib/

Page 35: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Visualization

Page 36: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Visualization

[Figure: logos of commercial and open source visualization tools.]

Page 37: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Hack one when needed – pandas_via_psql

[Figure: a SQL client querying the DB, with the results rendered locally via pandas.]

http://vatsan.github.io/pandas_via_psql/

Page 38: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Integration with Open Source – (Py)Spark Example

Page 39: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Apache Spark Project – Quick Overview

- Apache project, originated in the AMPLab at Berkeley
- Supported on Pivotal Hadoop 2.0!

http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 40: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

MapReduce vs. Spark

http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 41: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Data Parallelism in PySpark – A Simple Example

- Next we'll take the UCI automobile dataset example from PL/Python and demonstrate how to run it in PySpark

Page 42: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Scikit-learn on PySpark – UCI Auto Dataset Example

- This is in essence similar to the PL/Python example from the earlier slides, except we're using data stored on HDFS (Pivotal HD), with Spark as the platform in place of HAWQ/Greenplum

Page 43: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Large Scale Topic and Sentiment Analysis of Tweets

Social Media Demo

Page 44: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Pivotal GNIP Decahose Pipeline

[Figure: the Twitter Decahose (~55 million tweets/day) is ingested (source: http, sink: hdfs) into HDFS; PXF external tables enable parallel parsing of the JSON in the database; nightly cron jobs run topic analysis through MADlib pLDA and unsupervised sentiment analysis (PL/Python); results are visualized with D3.js.]

http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database

Page 45: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Data Science + Agile = Quick Wins

- The Team:
  - 1 Data Scientist
  - 2 Agile Developers
  - 1 Designer (part-time)
  - 1 Project Manager (part-time)
- Duration: 3 weeks!

Page 46: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Live Demo – Topic and Sentiment Analysis

Page 47: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Content Based Image Search

CBIR Live Demo

Page 48: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Content Based Information Retrieval - Task

http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database

Page 49: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

CBIR - Components

http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database

Page 50: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Live Demo – Content Based Image Search

http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database

Page 51: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Appendix

Page 52: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Acknowledgements

- Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan Radhakrishnan, Ronert Obst, Hai Qian, the MADlib Engineering Team, Sumedh Mungee, Girish Lingappa