Pivotal Data Labs – Technology and Tools in Our Data Scientists' Arsenal
DESCRIPTION
These slides give an overview of the technology and tools used by data scientists at Pivotal Data Labs. This includes procedural languages such as PL/Python, PL/R, PL/Java and PL/Perl, and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform, from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
TRANSCRIPT
1 © Copyright 2013 Pivotal. All rights reserved.
Pivotal Data Labs – Technology and Tools in our Data Scientist’s Arsenal
Srivatsan Ramanujam Senior Data Scientist Pivotal Data Labs 15 Oct 2014
Agenda
• Pivotal: Technology and Tools Introduction
  – Greenplum MPP Database and Pivotal Hadoop with HAWQ
• Data Parallelism – PL/Python, PL/R, PL/Java, PL/C
• Complete Parallelism – MADlib
• Python and R Wrappers – PyMADlib and PivotalR
• Open Source Integration – Spark and PySpark examples
• Live Demos – Pivotal Data Science Tools in Action
  – Topic and Sentiment Analysis
  – Content Based Image Search
Technology and Tools
MPP Architectural Overview – think of it as multiple PostgreSQL servers
[Diagram: a Master node coordinating multiple Segment/Worker nodes]
Rows are distributed across segments by a particular field (or randomly).
Implicit Parallelism – Procedural Languages
Data Parallelism – Embarrassingly Parallel Tasks
• Little or no effort is required to break the problem up into a number of parallel tasks, and there is no dependency (or communication) between those parallel tasks.
• Example: the map() function in Python (Python 2 shown; in Python 3, map() returns a lazy iterator):

>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> map(lambda e: e*e, x)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database cluster.
• Data Parallelism: PL/X piggybacks on Greenplum/HAWQ’s MPP architecture.
• Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C.

[Diagram: SQL arrives at the Master Host (backed by a Standby Master); the Interconnect links it to multiple Segment Hosts, each running several Segments]

PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}
User Defined Functions – PL/Python Example
• Procedural languages need to be installed in each database in which they are used.
• Syntax is like a normal Python function, with the function definition line replaced by a SQL wrapper. Alternatively: a SQL user-defined function with Python inside.

CREATE FUNCTION pymax (a integer, b integer)
RETURNS integer
AS $$
    if a > b:
        return a
    return b
$$ LANGUAGE plpythonu;
Returning Results
• Postgres primitive types (int, bigint, text, float8/double precision, date, NULL, etc.)
• Composite types can be returned by creating a composite type in the database:

CREATE TYPE named_value AS (
    name text,
    value integer
);

• Then you can return a list, tuple or dict (not a set) that mirrors the structure of the type:

CREATE FUNCTION make_pair (name text, value integer)
RETURNS named_value
AS $$
    return [name, value]
    # or as a tuple: return (name, value)
    # or as a dict:  return {"name": name, "value": value}
    # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;

• For functions that return multiple rows, declare the return type as SETOF.
http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston
Returning More Results
You can return multiple rows by wrapping them in a sequence (tuple, list or set), an iterator, or a generator:

Sequence:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    return ([name, 1], [name, 2], [name, 3])
$$ LANGUAGE plpythonu;

Generator:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    for i in range(3):
        yield (name, i)
$$ LANGUAGE plpythonu;
Accessing Packages
• On Greenplum DB: to be available, packages must be installed on the individual segment nodes.
  – Can use the “parallel ssh” tool gpssh to conda/pip install on all nodes.
  – Currently Greenplum DB ships with Python 2.6 (!)
• Then just import as usual inside the function:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    import numpy as np
    return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpythonu;
UCI Auto MPG Dataset – A Toy Problem
Sample Data
• Sample Task: Aerodynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
• Solution: Build a linear regression model for each body style (hatchback, sedan) using the features bore, stroke, compression_ratio, horsepower and peak_rpm, with highway_mpg as the target label.
• This is a data-parallel task that can be executed in parallel simply by piggybacking on the MPP architecture: one segment can build the model for hatchbacks while another builds the model for sedans.
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
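The per-group modeling idea above can be sketched locally in plain Python. This is a hedged illustration only: single-feature least squares stands in for the full multivariate fit, and the rows below are made up, not the real UCI data — the point is the grouping, which is what the MPP architecture parallelizes.

```python
# Sketch: one regression model per body style, as each segment would build it.

def fit_slope(xs, ys):
    """Closed-form simple linear regression: (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# (body_style, horsepower, highway_mpg) -- toy rows, not the real dataset
rows = [("hatchback", 70, 38), ("hatchback", 90, 34), ("hatchback", 110, 30),
        ("sedan", 80, 32), ("sedan", 100, 28), ("sedan", 120, 24)]

models = {}
for style in {r[0] for r in rows}:   # each group could run on its own segment
    xs = [hp for s, hp, _ in rows if s == style]
    ys = [mpg for s, _, mpg in rows if s == style]
    models[style] = fit_slope(xs, ys)
```

In the database the loop disappears: a GROUP BY on body style routes each group's rows to a segment, which runs the fit independently.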
Ridge Regression with scikit-learn in PL/Python

[Code screenshot in the original slide: a PL/Python user-defined function, with a user-defined type and user-defined aggregate, wrapping ordinary Python (scikit-learn) inside a SQL wrapper]
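The screenshot itself is not reproduced in the transcript. As a minimal sketch of the idea, here is a one-feature ridge fit in plain Python of the kind that could sit inside a PL/Python function body; the closed form below is a simplified, hypothetical stand-in for scikit-learn's Ridge and assumes a single feature with no intercept.

```python
# Sketch: ridge regression for one feature, no intercept.
# w = sum(x*y) / (sum(x^2) + alpha); alpha = 0 reduces to ordinary
# least squares, and larger alpha shrinks the coefficient toward zero.

def ridge_1d(xs, ys, alpha):
    return sum(x * y for x, y in zip(xs, ys)) / \
           (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]            # exactly y = 2x

w_ols = ridge_1d(xs, ys, alpha=0.0)  # unregularized fit recovers 2.0
w_reg = ridge_1d(xs, ys, alpha=3.0)  # regularized fit is shrunk below 2.0
```

Inside the real UDF this computation runs once per segment's group of rows, and the SQL wrapper handles moving coefficients back as a composite type.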
PL/Python + scikit-learn: Model Coefficients

[Screenshot in the original slide: a query that chooses the features, builds the feature vector, and invokes the UDF grouped by body style — one model per body style — with a column showing which physical machine on the cluster built each regression model]
Parallelized R in Pivotal via PL/R: An Example
• With placeholders in SQL, write functions in the native R language
• Accessible, powerful modeling framework
http://pivotalsoftware.github.io/gp-r/
Parallelized R in Pivotal via PL/R: An Example
• Execute the PL/R function
• A plain and simple table is returned
http://pivotalsoftware.github.io/gp-r/
Parallelized R in Pivotal via PL/R: Parallel Bagged Decision Trees

[Diagram: each tree makes a prediction; the predictions are aggregated to obtain the final prediction]
http://pivotalsoftware.github.io/gp-r/
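The bagging-and-aggregation step can be sketched in plain Python. This is an illustrative sketch only: a hypothetical mean-predicting "stump" stands in for the real decision trees the PL/R version trains on bootstrap samples.

```python
import random

random.seed(0)

def fit_stump(sample):
    """Hypothetical weak learner: always predicts the sample's mean target."""
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean

data = [(x, 2.0 * x) for x in range(1, 11)]   # toy (feature, target) pairs

# Bagging: train each model on a bootstrap resample (one per segment, in PL/R)
trees = [fit_stump([random.choice(data) for _ in data]) for _ in range(25)]

def predict(x):
    # Each tree makes a prediction; average them for the final prediction
    return sum(t(x) for t in trees) / len(trees)
```

In the database, each segment trains its trees on locally resampled rows and a final aggregate averages the per-tree predictions, exactly as the diagram shows.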
Complete Parallelism
Complete Parallelism – Beyond Data Parallel Tasks
• Data-parallel computation via the PL/X languages only lets us run ‘n’ independent models in parallel.
• This works great when we are building one model per value of the GROUP BY column, but we need parallelized algorithms to build a single model on all the available data.
• For this we use MADlib – an open source library of parallel, in-database machine learning algorithms.
MADlib: Scalable, in-database Machine Learning
http://madlib.net
MADlib In-Database Functions

Predictive Modeling Library
• Linear Systems: sparse and dense solvers
• Matrix Factorization: Singular Value Decomposition (SVD), low-rank
• Generalized Linear Models: linear regression, logistic regression, multinomial logistic regression, Cox proportional hazards regression, elastic net regularization, sandwich estimators (Huber–White, clustered, marginal effects)
• Machine Learning Algorithms: Principal Component Analysis (PCA), association rules (affinity analysis, market basket), topic modeling (parallel LDA), decision trees, ensemble learners (random forests), support vector machines, conditional random fields (CRF), clustering (k-means), cross validation

Descriptive Statistics
• Sketch-based estimators: CountMin (Cormode–Muthukrishnan), FM (Flajolet–Martin), MFV (most frequent values)
• Correlation, summary

Support Modules
• Array operations, sparse vectors, random sampling, probability functions
Linear Regression: Streaming Algorithm
• Finding linear dependencies between variables
• How to compute the fit with a single scan over the data?
Linear Regression: Parallel Computation

The key term of the normal equations decomposes into a sum over rows, and hence over segments:

X^T y = Σ_i x_i^T y_i

[Diagram across slides 24–25: Segment 1 computes X_1^T y_1 over its local rows and Segment 2 computes X_2^T y_2; the master sums the partial results, X_1^T y_1 + X_2^T y_2 = X^T y]
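The streaming, partial-sum computation above can be sketched in plain Python; lists stand in for the database's distributed rows, and a real implementation accumulates X^T X the same way before solving the normal equations.

```python
# Each "segment" scans its local rows exactly once, accumulating x_i * y_i;
# the master merges the per-segment partial sums, so no node needs all rows.

def partial_xty(rows):
    """One streaming scan over local (x_vector, y) rows."""
    acc = None
    for x, y in rows:
        term = [xi * y for xi in x]
        acc = term if acc is None else [a + t for a, t in zip(acc, term)]
    return acc

segment1 = [([1.0, 2.0], 3.0), ([0.0, 1.0], 1.0)]   # rows on segment 1
segment2 = [([2.0, 0.0], 2.0)]                      # rows on segment 2

# Merge step on the master: X1^T y1 + X2^T y2
merged = [a + b for a, b in zip(partial_xty(segment1), partial_xty(segment2))]

# Same result as a single centralized scan over all rows: X^T y
full = partial_xty(segment1 + segment2)
```

Because addition is associative and commutative, the merge order does not matter, which is what makes the algorithm embarrassingly mergeable across segments.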
Performing a linear regression on 10 million rows in seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
Calling MADlib Functions: Fast Training, Scoring
• MADlib allows users to easily create models without moving data out of the system
  – Model generation
  – Model validation
  – Scoring (evaluation of) new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(
    'houses',                     -- table containing training data
    'houses_linregr',             -- table in which to save results
    'price',                      -- column containing dependent variable
    'ARRAY[1, tax, bath, size]'   -- features included in the model
);
https://www.youtube.com/watch?v=Gur4FS9gpAg
Calling MADlib Functions: Fast Training, Scoring (grouped models)

SELECT madlib.linregr_train(
    'houses',                     -- table containing training data
    'houses_linregr',             -- table in which to save results
    'price',                      -- column containing dependent variable
    'ARRAY[1, tax, bath, size]',  -- features included in the model
    'bedroom'                     -- grouping column: one output model per value of bedroom
);
Calling MADlib Functions: Fast Training, Scoring (prediction)

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]'
);

SELECT houses.*,
       madlib.linregr_predict(       -- MADlib model scoring function
           ARRAY[1, tax, bath, size],
           m.coef
       ) AS predict
FROM houses,                         -- table with data to be scored
     houses_linregr m;               -- table containing the model
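The scoring step is just the dot product of the learned coefficients with a row's feature vector, which a plain-Python sketch makes explicit (the coefficients below are made-up toy values, not real model output):

```python
# Scoring = coefficient vector . feature vector; the leading 1 in
# ARRAY[1, tax, bath, size] pairs with the intercept coefficient.

def linregr_predict(features, coef):
    return sum(c * f for c, f in zip(coef, features))

coef = [50000.0, 10.0, 20000.0, 100.0]   # toy: intercept, tax, bath, size
row  = [1.0, 3000.0, 2.0, 1500.0]        # one house: ARRAY[1, tax, bath, size]

price_hat = linregr_predict(row, coef)
```

In the SQL above, this dot product runs in parallel over every row of the houses table, segment-local, with no data movement.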
Python and R wrappers to MADlib
PivotalR: Bringing MADlib and HAWQ to a Familiar R Interface
• Challenge: harness the familiarity of R's interface and the performance & scalability benefits of in-database analytics
• Simple solution: translate R code into SQL

PivotalR:

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                                  + bath
                                  + size, data = d)

Generated SQL:

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');
https://github.com/pivotalsoftware/PivotalR
PivotalR: Bringing MADlib and HAWQ to a Familiar R Interface
• The same translation handles indicator coding:

# Build a regression model with a different intercept term for each state
# (state = 1 as baseline). Note that PivotalR supports automated
# indicator coding a la as.factor().
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                                  + tax
                                  + bath
                                  + size, data = d)
https://github.com/pivotalsoftware/PivotalR
PivotalR Design Overview

[Diagram: (1) PivotalR translates R to SQL; (2) the SQL is sent via RPostgreSQL to the database/Hadoop cluster running MADlib, where the data lives; (3) only computation results come back to R — no data is held client-side]

• Call MADlib's in-database machine learning functions directly from R
• Syntax is analogous to native R functions
• Data doesn't need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
https://github.com/pivotalsoftware/PivotalR
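The R-to-SQL translation step can be illustrated with a toy function. This is a hypothetical sketch in Python for illustration only; the real PivotalR is R code that parses full formulas, handles as.factor() indicator coding, and covers many more MADlib functions.

```python
# Toy sketch of formula -> MADlib SQL translation, PivotalR-style.

def madlib_lm_sql(source, out, target, features):
    """Build the linregr_train call for a simple additive formula."""
    arr = "ARRAY[1, " + ", ".join(features) + "]"   # leading 1 = intercept
    return ("SELECT madlib.linregr_train("
            f"'{source}', '{out}', '{target}', '{arr}');")

sql = madlib_lm_sql("houses", "houses_linregr", "price",
                    ["tax", "bath", "size"])
```

The design point is that only this generated string crosses the wire; the model estimation itself happens where the data lives.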
http://pivotalsoftware.github.io/pymadlib/
PyMADlib: Power of MADlib + Flexibility of Python
• Current PyMADlib algorithms: linear regression, logistic regression, k-means, LDA
• Extras: support for categorical variables, pivoting
Visualization
Visualization
[Slide shows logos of commercial and open source visualization tools]
Hack one when needed – Pandas_via_psql
http://vatsan.github.io/pandas_via_psql/
Integration with Open Source – (Py)Spark Example
Apache Spark Project – Quick Overview
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
• Apache project, originated in the AMPLab at UC Berkeley
• Supported on Pivotal Hadoop 2.0!
MapReduce vs. Spark
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
Data Parallelism in PySpark – A Simple Example
• Next we’ll take the UCI automobile dataset example from PL/Python and demonstrate how to run it in PySpark
Scikit-Learn on PySpark – UCI Auto Dataset Example
• This is in essence similar to the PL/Python example from the earlier slide, except we’re using data stored on HDFS (Pivotal HD), with Spark as the platform in place of HAWQ/Greenplum
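The code screenshot is not in the transcript. A plain-Python sketch of the same shape follows — parse rows, group by body style, fit one model per group; in PySpark this would be a map / groupByKey / mapValues chain over an RDD. The CSV values and the trivial mean "model" are made up for illustration.

```python
# Mimics the PySpark pipeline with ordinary Python collections:
#   lines -> (key, record) pairs -> grouped -> one model per key

csv_lines = [
    "hatchback,70,38", "hatchback,90,34",
    "sedan,80,32", "sedan,100,28",
]

def parse(line):                       # map step
    style, hp, mpg = line.split(",")
    return style, (float(hp), float(mpg))

pairs = [parse(l) for l in csv_lines]

grouped = {}                           # groupByKey step
for key, rec in pairs:
    grouped.setdefault(key, []).append(rec)

def fit(records):                      # mapValues step: trivial "model"
    return sum(mpg for _, mpg in records) / len(records)   # mean mpg

models = {k: fit(v) for k, v in grouped.items()}
```

Swapping fit() for a scikit-learn estimator and the dict for an RDD gives the structure of the PySpark version: each partition's groups are modeled independently, just as segments do in Greenplum/HAWQ.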
Large Scale Topic and Sentiment Analysis of Tweets
Social Media Demo
Pivotal GNIP Decahose Pipeline

[Pipeline diagram: Twitter Decahose (~55 million tweets/day) → source: http, sink: hdfs → HDFS → PXF external tables → nightly cron jobs → parallel parsing of JSON → topic analysis through MADlib pLDA and unsupervised sentiment analysis (PL/Python) → D3.js]
http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database
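The transcript does not include the sentiment code; as an illustrative sketch only, here is a tiny lexicon-based scorer of the kind that could run per-tweet inside a PL/Python function. The word lists are hypothetical, and the demo's actual unsupervised method is not shown here.

```python
# Hypothetical lexicon scorer: (positive hits - negative hits) / token count.
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad"}

def sentiment(tweet):
    tokens = tweet.lower().split()
    score = sum(t in POSITIVE for t in tokens) - \
            sum(t in NEGATIVE for t in tokens)
    return score / len(tokens)

s1 = sentiment("I love this great phone")   # positive tweet
s2 = sentiment("terrible awful service")    # negative tweet
```

Wrapped as a PL/Python UDF, a scorer like this runs segment-local over the parsed tweets, so scoring 55 million tweets/day parallelizes for free on the MPP architecture.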
Data Science + Agile = Quick Wins
• The Team: 1 data scientist, 2 agile developers, 1 designer (part-time), 1 project manager (part-time)
• Duration: 3 weeks!
Live Demo – Topic and Sentiment Analysis
47 Pivotal Confidential–Internal Use Only
Content Based Image Search
CBIR Live Demo
Content Based Image Retrieval – Task
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
CBIR - Components
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
Live Demo – Content Based Image Search
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
Appendix
Acknowledgements
• Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan Radhakrishnan, Ronert Obst, Hai Qian, the MADlib engineering team, Sumedh Mungee, Girish Lingappa