Data Science as a Commodity: Use MADlib, R, & Other OSS Tools for Data Science (from Pivotal...
DESCRIPTION
Slides from the Pivotal Open Source Hub Meetup "Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!" As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of its practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos), along with their benefits and limitations, by sharing examples from Pivotal's data science engagements. By the end of this session we hope to have answered the questions: Where can I get started with data science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?

Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare, building models to derive insight and business value from their data.

TRANSCRIPT
BUILT FOR THE SPEED OF BUSINESS
© Copyright 2014 Pivotal. All rights reserved.
Data Science as a Commodity: How to use MADlib, R, and other Publicly Available and Open Source Tools for Data Science Pivotal OSS Meetups Sarah Aerni Pivotal Senior Data Scientist @itweetsarah [email protected]
January 28, 2014
What we will cover in today’s Meetup
• What is data science, big data, buzzword, buzzword?
• What are some examples of data science in action?
• What do I do at Pivotal?
• Who are our data scientists?
• Why is open source software important for data science?
• What tools does our team use? For NLP? For optimization? For regression?
• What do I do with loads of data?
• How can I create good models?
• What types of open source tools can I use to build models?
• How can I build a quick app?
• What can I do to get started analyzing text data?
• Which tools exist to create visualizations of my data that I can understand?
What we will not cover #notdatascience
Instead: Practical Data Science Tools #useful
http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science – Kaushik Das
Instead: Practical Data Science Tools #useful “At companies where there is no framework for operationalization of the models, PowerPoint is where models go to die!” – Hulya Farinas http://venturebeat.com/2013/12/03/how-to- revolutionize-healthcare-get-data-scientists-and-app-developers-together/
“The use of statistical and machine learning techniques on big multi-structured data — in a distributed computing environment — to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.” – Annika Jimenez http://blog.gopivotal.com/news-2/annika-jimenez-on-disruptive-data-science-at-the-strata-conference
DATA IS THE NEW CENTER OF GRAVITY
Data > Application
“BIG DATA IS THE NEW NORMAL”
“‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”
What Can “Small Data” Scientists Bring on Their “Big Data” Journey?
http://factspy.net/the-difference-between-geeks-vs-nerds/
Small Data: flat files, in-memory model building, databases, command-line tools
Big Data: HDFS, distributed computing, cloud computing, MapReduce, command-line tools

Many tools and approaches are being adapted to big data technologies.
Basic DS Tools: From Command-line to GUI
Ian Huston, Alex Kagoshima, Ronert Obst
• Quick-and-dirty tricks using command-line tools
– Fast feedback (interactive)
– Fast to process
– Easy to write, hard to read
– Background processing (screen)
• Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster
• Python and R
– RStudio
– IPython (IPython Notebook)
Favorite Python and R packages and resources

Python:
• NumPy
• SciPy
• scikit-learn (machine learning package)
• statsmodels
• pandas
• PyMC
• IPython (IPython Notebook)
• matplotlib
Ian Huston, Alex Kagoshima, Ronert Obst
Favorite Python and R packages, resources, and more

R:
• ggplot2
• reshape
• plyr
• Shiny
• Good support for time series analyses
• RStudio (weave)
• foreach, parallel
• CRAN task views
• parboost
Ian Huston, Alex Kagoshima, Ronert Obst
A New Platform for a New Era What do I do at Pivotal?
Cloud Fabric “The new OS”
Data Fabric “The new Database”
App Fabric “The new Middleware”
“The new Hardware”
DATA-DRIVEN APPLICATION DEVELOPMENT
Pivotal Big Data Technology: HAWQ
Think of it as multiple PostgreSQL servers
Segments/Workers
Master
Rows are distributed across segments by a particular field (or randomly)
Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
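The distribution policy above can be illustrated with a toy sketch in plain Python (this is not HAWQ's actual hash function, just the concept): rows are assigned to segments by hashing the distribution key modulo the number of segments.

```python
# Toy illustration of distributing rows across segments by a key.
# Real MPP databases use a fixed hash; Python's hash() stands in here.

NUM_SEGMENTS = 4

def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment id."""
    return hash(key) % num_segments

rows = [
    {"id": 1, "state": "CA", "price": 450_000},
    {"id": 2, "state": "TX", "price": 210_000},
    {"id": 3, "state": "CA", "price": 380_000},
]

segments = {i: [] for i in range(NUM_SEGMENTS)}
for row in rows:
    segments[segment_for(row["state"])].append(row)

# Rows sharing a key land on the same segment, which is what
# makes per-key operations local to one worker.
```

Distributing by a frequently joined or grouped column keeps those operations segment-local; distributing randomly balances the load instead.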
Performance Through Parallelism
• Automatic parallelization
– Load and query like any database
– Tables automatically distributed across nodes
• Analytics-oriented query optimization
• Scalable MPP architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
Data Science Tools for Big Data: Commercial and Open Source (or Free)
PL/R, PL/Python, PL/Java
Making sense of your “big data”
• Large volumes of data may be difficult to understand
– ~100 tables
– Tens of thousands of columns
• How do you build models that use all the data? Score all the data?
• Where do you focus your effort?
– Getting a rapid grasp of relevant fields is important
– Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data
– Columns with little or no variation, or only null values
• These functions exist in MADlib
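A minimal pure-Python sketch of the column profiling described above (MADlib's summary routines do this in-database at scale; the column names here are made up for illustration):

```python
# Profile columns to flag candidates for exclusion:
# all-null columns and columns with no variation.
# Toy in-memory data; in practice this scan runs in-database.

rows = [
    {"tax": 590,  "bath": 1, "region": "west", "unused": None},
    {"tax": 1050, "bath": 2, "region": "west", "unused": None},
    {"tax": 20,   "bath": 1, "region": "west", "unused": None},
]

def profile(rows):
    flagged = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        if not non_null:
            flagged[col] = "all null"
        elif len(set(non_null)) == 1:
            flagged[col] = "no variation"
    return flagged

# → {'region': 'no variation', 'unused': 'all null'}
```

Dropping such columns up front shrinks the feature space before any modeling begins.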
MADlib In-Database Functions: Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics
• Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
• Correlation
• Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
MADlib in Action: Regression on Billions of Rows
(Photo: drilling into the San Andreas Fault at Parkfield, California. Credit: Stephen H. Hickman, USGS)
Rashmi Raghu
• Input Data
– Tens of millions of rows from data collected at multiple drill testing sites
– Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit, and more
• Data Massaging and Review
– Rapid summarization of many columns of data to identify outliers and missing data and remove them from the analysis
– Used window functions to construct a moving average (smoothing) of all the features and the dependent variable
• Model
– Linear regression on the complete dataset
– K-means clustering to determine similarities of sites
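The moving-average smoothing step maps directly to a SQL window function (AVG(...) OVER (... ROWS k PRECEDING)); a plain-Python equivalent of that trailing window, on made-up readings, looks like:

```python
# Trailing moving average over sensor readings -- the same smoothing
# a SQL window function (AVG ... OVER ... ROWS k PRECEDING) performs.
from collections import deque

def moving_average(values, window=3):
    buf = deque(maxlen=window)   # trailing window of up to `window` rows
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

rate_of_penetration = [10.0, 12.0, 8.0, 14.0, 11.0]   # hypothetical readings
smoothed = moving_average(rate_of_penetration, window=3)
# ≈ [10.0, 11.0, 10.0, 11.33, 11.0]
```

In the database the same computation runs once per segment over distributed rows, so the smoothing scales with the cluster.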
Linear Regression: Streaming Algorithm
• Finding linear dependencies between variables
• How to compute with a single scan?
Linear Regression: Parallel Computation

Xᵀy = Σᵢ xᵢᵀyᵢ
Linear Regression: Parallel Computation

Each segment computes the product over its local rows; the master adds the partial results:

X₁ᵀy₁ + X₂ᵀy₂ = Xᵀy   (Segment 1 + Segment 2 → Master)
Performing a linear regression on 10 million rows in seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
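The per-segment accumulation sketched above can be written out in plain Python for the simple one-feature case (toy data; MADlib accumulates the full XᵀX and Xᵀy matrices the same way):

```python
# One-pass, partitioned computation of simple linear regression
# y ≈ b0 + b1*x via the normal equations. Each "segment" scans its
# partition once and emits partial sums; the master combines them.

segment1 = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (x, y) rows
segment2 = [(4.0, 8.1), (5.0, 9.8)]

def partial_sums(rows):
    # Local single-scan accumulation on one segment.
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return n, sx, sy, sxx, sxy

def combine(a, b):
    # The master simply adds the per-segment sums elementwise.
    return tuple(u + v for u, v in zip(a, b))

n, sx, sy, sxx, sxy = combine(partial_sums(segment1), partial_sums(segment2))

# Solve the 2x2 normal equations for intercept b0 and slope b1.
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n
# b1 ≈ 1.96, b0 ≈ 0.14
```

Because the combine step is just addition, adding segments speeds up the scan without changing the answer, which is the source of the linear scalability above.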
Calling MADlib Functions: Fast Training, Scoring
• MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(
    'houses',                       -- table containing training data
    'houses_linregr',               -- table in which to save results
    'price',                        -- column containing dependent variable
    'ARRAY[1, tax, bath, size]');   -- features included in the model
Adding a grouping column creates multiple output models (one for each value of bedroom):

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]',
    'bedroom');
Scoring new data against the saved model:

SELECT houses.*,
       madlib.linregr_predict(      -- MADlib model scoring function
           ARRAY[1, tax, bath, size],
           m.coef) AS predict
FROM houses,                        -- table with data to be scored
     houses_linregr m;              -- table containing the model
PivotalR: Bringing MADlib and HAWQ to a familiar R interface
• Challenge: harness the familiarity of R’s interface and the performance & scalability benefits of in-database analytics
• Simple solution: translate R code into SQL
Woo Jung

PivotalR:
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                            + bath
                            + size,
                            data = d)

SQL code generated:
SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');

http://gopivotal.github.io/PivotalR/
PivotalR also supports automated indicator variable coding a la as.factor():

d <- db.data.frame("houses")
# Build a regression model with a different
# intercept term for each state (state=1 as baseline).
houses_linregr <- madlib.lm(price ~ as.factor(state)
                            + tax
                            + bath
                            + size,
                            data = d)
PivotalR Design Overview
(Diagram: PivotalR translates R to SQL and sends it via RPostgreSQL to the database with MADlib; only computation results come back to R. No data lives in R; the data lives in the database.)
• Call MADlib’s in-DB machine learning functions directly from R
• Syntax is analogous to native R functions
• Data doesn’t need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
Woo Jung
http://gopivotal.github.io/PivotalR/
PivotalR: Current Features

MADlib Functionality
• Linear Regression
• Logistic Regression
• Elastic Net
• ARIMA
• Marginal Effects
• Cross Validation
• Bagging
• summary on model objects
• Automated indicator variable coding (as.factor)
• predict

• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview, sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim, names
• content
• And more ... (SQL wrapper)

http://gopivotal.github.io/PivotalR/
Shiny Showcase: Example Web Apps in R
• Users can choose input parameters with sliders, drop-downs, and text fields.
• HTML/JavaScript knowledge not required.
http://www.rstudio.com/shiny/
D3: Data-Driven Documents
http://d3js.org/
PyMADlib
• Python wrapper for MADlib
http://nbviewer.ipython.org/gist/vatsan/5275846
Procedural Languages in Big Data Science
• HAWQ & PL/X can take advantage of “data parallel” tasks by performing analyses in parallel (embarrassingly parallel tasks)
• Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
• Examples of ‘data parallel’ problems:
– Counting words in documents
– Genome-Wide Association Studies
– Studying network anomalies
(Diagram: a master server dispatches documents Doc1…DocM across segment servers over the network interconnect; each segment stems and counts its own documents in parallel, driven by SQL & R.)
http://gopivotal.github.io/gp-r/
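The word-counting example is easy to sketch as a data-parallel map/combine in plain Python (a thread pool stands in for the segment servers; in HAWQ the same shape runs as a PL/Python or PL/R function on each segment):

```python
# Data-parallel word counting: each worker counts its own partition
# of documents independently (no communication between workers), and
# the partial counts are merged at the end -- the "embarrassingly
# parallel" shape described above.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

partitions = [
    ["the cat sat", "the dog ran"],   # documents on "segment" 1
    ["the cat ran"],                  # documents on "segment" 2
]

def count_partition(docs):
    # Local work: tokenize and count words in one partition.
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_partition, partitions))

total = sum(partials, Counter())
# total["the"] == 3, total["cat"] == 2
```

Because the merge is a simple sum of counters, no worker ever needs to see another worker's documents, which is exactly what makes the task data parallel.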
Structure of input table for PL/R function
• Topology: hubs connected to multiple terminal points
• Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments
• Detecting anomalies within sub-networks on future observations

Column             Description
Network ID         ID of the network. 300K in total.
Topology           Array of integers defining the topology tree.
Network Readings   Array of readings from network terminal points over (say) a week.

(Diagram: example topology with a hub and terminal points A, B, C, D supplying terminal readings.)
Vivek Ramamurthy
Performance Analysis
(Plot: execution time in seconds vs. number of networks in thousands, showing near-linear growth.)
Vivek Ramamurthy

Number of networks   Time/network (ms)   Total time (seconds)
500                  6.604               3.30
1000                 3.637               3.64
5000                 2.822               14.11
10,000               2.356               23.56
50,000               2.160               108.02
100,000              2.142               214.20
150,000              2.162               324.29
200,000              2.142               428.48
250,000              2.138               534.69
300,000              2.132               639.85
Performance Analysis

R package used             optim       quadprog    Rsymphony   Rglpk
Single network in R        ~60 s       6.3 s       0.145 s     0.181 s
300K networks in PL/R      ~84 hrs     5.87 hrs    10.7 min    14.6 min
Time per network in PL/R   1005.2 ms   70.44 ms    2.13 ms     2.92 ms

Vivek Ramamurthy
COIN-OR: Computational Infrastructure for Operations Research
– Libraries for linear and non-linear programming, integer programming
– SYMPHONY: callable library in COIN-OR for solving mixed integer linear programs
http://www.coin-or.org/

GLPK: GNU Linear Programming Kit
– Used for large-scale LPs, MIPs, and related problems
http://www.gnu.org/software/glpk/

Vivek Ramamurthy
Natural Language Processing

Data sources:
• Text sources: documents, books, emails
• Speech: phone logs, conversations

NLP processing pipeline (common tasks/tools):
• Sentence detection
• Tokenization
• Morphological stemming
• Stop word removal
• Word-sense disambiguation
• Part-of-speech tagging
• Syntactic parsing
• Semantic role labeling
• Entity recognition
• Reference resolution
• Event processing
• …

Applications:
• Word clouds
• Topic modeling
• Sentiment analysis
• Machine translation
• Document classification
• Document summarization
• Language generation
• Search
• Question answering
• Information extraction

Niels Kasch
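The first few pipeline stages can be sketched in a few lines of plain Python (a toy tokenizer, stop list, and suffix-stripping "stemmer"; real pipelines would use NLTK, OpenNLP, or GPText):

```python
# Toy tokenization -> stop word removal -> crude suffix stemming.
# Illustrates the pipeline stages only; use NLTK/OpenNLP/GPText in practice.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are"}

def tokenize(text):
    # Lowercase and pull out word-like tokens.
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # Naive suffix stripping, nothing like a real Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The drills are penetrating the layered rocks"
tokens = [stem(t) for t in remove_stop_words(tokenize(text))]
# → ['drill', 'penetrat', 'layer', 'rock']
```

Each stage is a pure function over tokens, which is why the whole pipeline parallelizes per document in the data-parallel fashion described earlier.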
Open source tools for common NLP tasks

WORD CLOUDS
Relevant NLP tasks: tokenization, stop word removal, stemming/lemmatization
Open source software: GPText, Apache UIMA, OpenNLP (Java), NLTK (Python), WordNet, PyTagCloud

TOPIC MODELING / TEXT CLASSIFICATION
Relevant NLP tasks: language detection, tokenization, stop word removal, stemming/lemmatization
Open source software: MADlib (PLDA), gensim (LSA & LDA package for Python), https://code.google.com/p/language-detection/

INFORMATION EXTRACTION
Relevant NLP tasks: language detection, sentence detection, tokenization, syntactic parsing, entity extraction, relationship extraction
Open source software: GPText and MADlib, OpenNLP, NLTK, Stanford CoreNLP (incl. POS tagger, NER, parser, etc.)

Niels Kasch
Topic Analysis with MADlib pLDA
(Pipeline: Social Media → natural language processing with GPText (tokenizer; stemming and frequency filtering; filtering relevant content; aligning data) → dataset prepared for topic modeling)
Srivatsan Ramanujam
The MADlib topic model is then applied to the prepared dataset, producing topic compositions, topic clouds, and a topic graph.
Is there more? What’s next?
blog.gopivotal.com/tag/data-science-tech
blog.gopivotal.com/tag/data-science
BUILT FOR THE SPEED OF BUSINESS